Codegen performance issue: LEA vs. INC.

Hi all.

I’m looking for an advice on how to deal with inefficient code generation for Intel Nehalem/Westmere architecture on 64-bit platform for the attached test.cpp (LLVM IR is in test.cpp.ll).

The inner loop has 11 iterations and eventually unrolled.

Test.lea.s is the assembly code of the outer loop. It simply has 11 loads, 11 FP add, 11 FP mull, 1 FP store and lea+mov for index computation, cmp and jump.

The problem is that lea is on critical path because it’s dispatched on the same port as all FP add operations (port 1).

Intel Architecture Code Analyzer (IACA) reports throughput for that assembly block is 12.95 cycles.

I made a short investigation and found that there is a pass in code gen that replaces index increment with lea.

Here is the snippet from llvm/lib/CodeGen/TwoAddressInstructionPass.cpp

if (MI.isConvertibleTo3Addr()) {

// This instruction is potentially convertible to a true

// three-address instruction. Check if it is profitable.

if (!regBKilled || isProfitableToConv3Addr(regA, regB)) {

// Try to convert it.

if (convertInstTo3Addr(mi, nmi, regA, regB, Dist)) {

++NumConvertedTo3Addr;

return true; // Done with this instruction.

}

}

}

regBKilled is false for my test case and isProfitableToConv3Addr is not even called.

I’ve made an experiment and left only

if (isProfitableToConv3Addr(regA, regB)) {

That gave me test.inc.s where lea replaced with inc+mov and this code is ~27% faster on my Westmere system. IACA throughput analysis gives 11 cycles for new block.

But the best performance I’ve got from switching scheduling algorithm from ILP to BURR (test.burr.s). It gives a few percent more vs. “ILP+INC” and I’m not sure why – it might be because test.burr.s has less instructions (no two moves that copy index) or it might be because additions scheduled differently. BURR puts loads and FP mul between additions, which are gathered at the end of the loop by ILP.

I didn’t run experiments on sandy bridge, but IACA gives 12.45 cycles for original code (test.lea.s), so I expect BURR to improve performance there too for the attached test case.

Unfortunately I’m familiar enough with the LLVM codegen code to make a good fix for this issue and I would appreciate any help.

Thanks,

Aleksey

test.cpp (555 Bytes)

test.cpp.ll (3.17 KB)

test.lea.s (832 Bytes)

test.inc.s (839 Bytes)

test.burr.s (801 Bytes)

This sounds like llvm.org/pr13320.

The two address pass is only concerned about register pressure. It sounds like it should be taught about profitability. In cases where profitability can only be determined with something machinetracemetric then it probably should live it to more sophisticated pass like regalloc.

In this case, we probably need a profitability target hook which knows about lea. We should also consider disabling it's dumb pseudo scheduling code when we enable MI scheduler.

Evan

The two address pass is only concerned about register pressure. It sounds like it should be taught about profitability. In cases where profitability can only be determined with something machinetracemetric then it probably should live it to more sophisticated pass like regalloc.

In this case, we probably need a profitability target hook which knows about lea. We should also consider disabling it's dumb pseudo scheduling code when we enable MI scheduler.

Sorry, I set this aside to look at closely and never got back to it.

The lea->cmp problem is fixed by switching to the MI scheduler. Please run with -mllvm -misched-bench to confirm.

Leaving the old ILP scheduler as the default continues to cause confusion. People have had plenty of time to evaluate the new scheduler. I’ll plan to switch the default for x86 on Monday.

Now, that doesn’t mean that your analysis of the 2-address pass is irrelevant. It’s just that the new pass order happens to work better. MI Scheduler also makes an effort to facilitate macro fusion. But for the record, the 2-address pass heuristics are clearly obsolete. As Rafael pointed out, that’s covered in PR13320. I’m honestly not even sure why we still use inc/dec in x86-64, saving a byte?

Long-term plan: ideally, some of the tricks the 2-address pass is doing would be done within the MI scheduler now where we track register pressure precisely and know the final location of instructions. The major hurdle in doing that is updating live intervals on-the-fly. repairIntervalsInRange is incomplete. That’s also the reason we can’t kill of the LiveVariables pass.

-Andy

The lea->cmp problem is fixed by switching to the MI scheduler. Please run with -mllvm -misched-bench to confirm.

I get the same output in the testcase in pr13320. The leaq is in
between the cmp and the jmp, preventing macro-fusion.

Cheers,
Rafael

Right, Aleksey's test case works now because the SD scheduler no longer reorders the inc with earlier uses of the loop counter.

The 2-address pass still does the wrong thing in pr13320. It would be nice to improve those heuristics. Also, the MI scheduler could correct the problem if it could create copies and split live ranges. That was part of the original MI scheduler plan but was never implemented. The implementation would look like this:

- DAG builder creates weak edges for vreg antidependence.
- Check for unresolved weak edge when scheduling the instruction.
- Create the copy an DAG node on -the-fly (not currently supported).
- Create a new live interval for the copy and shrink the original.

I'll copy this into the bug for future reference.

-Andy