rL296252 made large integer operation codegen significantly worse.

Hi,

I’m working on a workload where the bottleneck is cryptographic signature checks, or, in compiler terms, mostly large integer operations.

Looking at rL296252, the state of affairs in that area has degraded quite significantly; see test/CodeGen/X86/i256-add.ll for instance.

Is there some kind of work in progress here, and is it expected to get better? Because if not, that’s a big problem. It looks like the compiler now chooses to use pushfq/popfq in some cases rather than chaining adc instructions to propagate the carry in additions.

I hope this can get sorted out quickly. I’m happy to help if that is necessary.

Thanks,

Amaury SECHET

rL296252’s main change was to turn on alias analysis in the DAGCombiner. This should generally be a mild improvement to generated code, due to the relaxed memory constraints, modulo any downstream patterns that are no longer general enough. That looks to be the case here.

I’m going to leave this in for a little while longer to check that all the buildbots pass, but then I’ll revert it and make sure this test case looks more reasonable.

-Nirav

This patch only relaxes dependencies. It allows new orderings that were not previously possible, but the fact that we then actually get such suboptimal output likely indicates an issue elsewhere that this relaxation is exacerbating.

Some observations:

  1. For some reason, memop folding seems to be generating seriously non-optimal instructions. It is somehow causing there to be 7 adds in the output instead of 4: some with the store integrated, but also keeping the original adds without the store integrated. That’s no good, and it didn’t use to happen. I expect this is the main problem.

  2. The scheduler is then choosing an ordering that requires spilling eflags. I’m not sure why; possibly, because of the former, it has pushed itself into a corner where this appears to be required.

  3. Then, even if you need to spill, it’s a shame that the x86 backend isn’t tracking bits in the flag register separately… Thus, a definition and use of the carry bit requires saving/restoring the entire flags register, even if all you cared about was the one carry bit. That’s quite unfortunate, as saving/restoring just the carry bit would be a LOT cheaper than saving/restoring the entire register. I suspect it’d be possible to define a 1-bit subregister of eflags and mark the various carry-in ops as only using that. Might be worthwhile doing that, separately, even if fixing #1 makes this particular issue disappear for this test case.
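As a hypothetical sketch of that cheaper alternative (this is the standard setc/addb trick, not current compiler output, and the register choices are illustrative), preserving just CF across flag-clobbering code could look like:

```asm
        # AT&T syntax. Instead of pushfq ... popfq around code that
        # clobbers EFLAGS, save only the carry bit:
        setc    %al            # materialize CF into a byte register
        # ... flag-clobbering code ...
        addb    $255, %al      # 255 + al carries out iff al was nonzero,
                               # which restores CF from the saved byte
        adcq    %rdx, %rcx     # consume the restored carry
```

This trades a pushfq/popfq pair, and the stack traffic it implies, for two cheap ALU operations on a scratch byte register.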

Yes. I’m seeing that as well. Not clear what’s going on.

In any case, it looks to be unrelated to the alias analysis, so barring concerns I’m going to recommit the patch in the morning and let others take a look at this.

-Nirav

I see we’re missing an isel pattern for an add that produces a carry and does a memory RMW. I’m going to see if adding that helps anything.

Another problem seems to be that, despite the fact that we are making store instructions that produce flags, we aren’t transferring that flag production to the instructions that need the flags. So we replicate all of the add operations.

For this code, isel ends up creating an adc with a memory load and store, and then a separate adc with the same load but no store. This means isel is now replicating loads, which seems wrong. I suspect something is going wrong when merging input chains.

t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1
t71: i64 = add t2, Constant:i64<24>
t25: i64 = add t2, Constant:i64<8>
t24: i64,ch = load<LD8[%p]> t0, t2, undef:i64
t16: i64 = add t2, Constant:i64<16>
t38: i64,ch = load<LD8[%q]> t0, t4, undef:i64
t22: i64,ch = load<LD8[%p+24]> t0, t71, undef:i64
t26: i64,ch = load<LD8[%p+8]> t0, t25, undef:i64
t19: i64,ch = load<LD8[%p+16]> t0, t16, undef:i64
t69: i64 = add t4, Constant:i64<24>
t36: i64,ch = load<LD8[%q+24]> t0, t69, undef:i64
t39: i64 = add t4, Constant:i64<8>
t40: i64,ch = load<LD8[%q+8]> t0, t39, undef:i64
t29: i64 = add t4, Constant:i64<16>
t34: i64,ch = load<LD8[%q+16]> t0, t29, undef:i64
t79: i64,i32 = X86ISD::ADC t19, t34, t80:1
t80: i64,i32 = X86ISD::ADC t26, t40, t81:1
t81: i64,i32 = X86ISD::ADD t24, t38
t72: ch = TokenFactor t22:1, t38:1, t40:1, t34:1, t36:1
t78: i64,i32 = X86ISD::ADC t22, t36, t79:1
t73: ch = store<ST8[%p+24]> t72, t78, t71, undef:i64
t82: ch = TokenFactor t19:1, t38:1, t40:1, t34:1, t36:1
t83: ch = store<ST8[%p+16]> t82, t79, t16, undef:i64
t87: ch = TokenFactor t26:1, t38:1, t40:1, t34:1, t36:1
t88: ch = store<ST8[%p+8]> t87, t80, t25, undef:i64
t92: ch = TokenFactor t24:1, t38:1, t40:1, t34:1, t36:1
t93: ch = store<ST8[%p]> t92, t81, t2, undef:i64
t96: ch = TokenFactor t73, t83, t88, t93
t13: ch = X86ISD::RET_FLAG t96, TargetConstant:i32<0>

I think this might fix the problem. We aren’t counting the flag usage of the ADC in the foldable-chain-node check when verifying that the users of the load have only a single use.

--- a/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
+++ b/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
@@ -3345,7 +3345,7 @@ void SelectionDAGISel::SelectCodeCommon(SDNode *NodeToMatch,
       // a single use.
       bool HasMultipleUses = false;
       for (unsigned i = 1, e = NodeStack.size()-1; i != e; ++i)
-        if (!NodeStack[i].hasOneUse()) {
+        if (!NodeStack[i].getNode()->hasOneUse()) {
           HasMultipleUses = true;
           break;
         }