Is there a reason for swapping SCEVs in the call to getMinusSCEV from CollectFixupsAndInitialFormulae (all targets) ?

Hi all,

In LSRInstance::CollectFixupsAndInitialFormulae() implementation, the function getMinusSCEV is called at a particular point as part of the code to replace/normalise 'x == y’ ICmp instructions into 'x - y == 0’. However, what the actual code does is replacing them by ‘y - x == 0’ instead.

This is the excerpt of code in the LSRInstance::CollectFixupsAndInitialFormulae function I am referring to:

// x == y → x - y == 0
const SCEV *N = SE.getSCEV(NV);
if (SE.isLoopInvariant(N, L) && isSafeToExpand(N, SE)) {
// S is normalized, so normalize N before folding it into S
// to keep the result normalized.
N = normalizeForPostIncUse(N, TmpPostIncLoops, SE);
Kind = LSRUse::ICmpZero;
S = SE.getMinusSCEV(N, S); // use instead S = SE.getMinusSCEV(S, N);
}

[ the comment “use instead" after the getMinusSCEV call is mine ]

This apparently unimportant difference has a remarkable effect latter on, when the optimal solution for the loop is computed in LSRInstance::SolveRecurse

In the function LSRInstance::SolveRecurse, given any two loop variants that are found to have the same cost, the first one is always chosen. This implies that the initial formula that was created in LSRInstance::CollectFixupsAndInitialFormulae() is the ones that will prevail by default (given same cost). The point that I want to highlight is that the current implementation of CollectFixupsAndInitialFormulae reverses the SCEVs found in the user source code rather than leaving them as found. This causes the following problem:

Consider the following code (please don’t take into account whether this makes or not something useful, it’s just to show the point) :

int asr_Lowert( int a, unsigned ammount )
{
if ( (ammount -= 8) >= 0 )
a = a >> 8;

while ( ammount-- )
a >>= 1;

return a;
}

The interesting part for this discussion is the ‘while’ loop. It gets compiled by Clang like this:

while.cond: ; preds = %while.body, %entry
%a.addr.1 = phi i32 [ %shr, %entry ], [ %shr1, %while.body ]
%ammount.addr.0 = phi i32 [ %sub, %entry ], [ %dec, %while.body ]
%tobool = icmp eq i32 %ammount.addr.0, 0
br i1 %tobool, label %while.end, label %while.body

while.body: ; preds = %while.cond
%dec = add i32 %ammount.addr.0, -1
%shr1 = ashr i32 %a.addr.1, 1
br label %while.cond

Clang mostly follows the source code and produces code that resembles it. It’s important to note is that variable ‘ ammount’ is decremented by 1 on each loop iteration as found on the user source code.

I am going to chose the x86 target because I really want this subject to gain some attention, but please note that this is a general subject that affects ALL targets. So, for the x86, the resultant assembly code is this:

.macosx_version_min 10, 12
.globl _asr_Lowert
_asr_Lowert:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset %ebp, -8
movl %esp, %ebp
.cfi_def_cfa_register %ebp
movl 8(%ebp), %eax
sarl $8, %eax
pushl $8
popl %ecx
subl 12(%ebp), %ecx
jmp LBB0_1
LBB0_2:
sarl %eax
incl %ecx
LBB0_1:
testl %ecx, %ecx
jne LBB0_2
popl %ebp
retl
.cfi_endproc

What I want to show is that, now, instead of ‘ammount’ being decremented, it is incremented. Actually, the compiler performs the oddity of setting the initial value of the ‘while’ variable to '8-ammount’ rather than ‘ ammount-8’. The compiler even needs to perform a ‘push’ and a ‘pop’ as a setup for that. This is all because the SCEVs where swapped in the LLVM implementation of LSRInstance::CollectFixupsAndInitialFormulae().

Now, let me show what happens if we implement the function as suggested in the comment, that is, by replacing the current code by S = SE.getMinusSCEV(S, N);

This is what we get now from LLVM for the same input:

.macosx_version_min 10, 12
.globl _asr_Lowert
_asr_Lowert:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset %ebp, -8
movl %esp, %ebp
.cfi_def_cfa_register %ebp
movl 8(%ebp), %eax
movl 12(%ebp), %ecx
addl $-8, %ecx
sarl $8, %eax
jmp LBB0_1
LBB0_2:
decl %ecx
sarl %eax
LBB0_1:
testl %ecx, %ecx
jne LBB0_2
popl %ebp
retl
.cfi_endproc

The code now uses decrement, and the initial ‘while’ variable value is ‘ ammount-8’ as we could expect from the user source code. The whole function is also shorter and it does not use potentially costly push/pop instructions.

I also want to point out, that this is not an isolated case at all. This happens all the time for small loops where costs of several variants are the same. In all cases, decrements in the user source code are replaced by positive increments with necessarily strange setup code (as shown in the first example above). Positive increments are replaced by decrements, which also may force extra code generation because the loop induction variable can’t be any longer used for useful purposes. The remarkable thing is that by swapping the SCEV expressions as we currently do, the compiler stops taking the user source code into account as a hint on what to do, and instead it may produce worse code in circumstances where the user may have already chosen a naturally better implementation, which the compiler just destroys by artificially replacing it by a different solution with a supposed identical cost.

Also, although this notably affects small loops with a few small-cost variants, it also affects more complex loops that may have two or more optimal solutions. Among the optimal solutions of the same cost, the algorithm will always chose the first one in the list, so again the one that will be chosen may be a reversed one with respect to the user code, unless it is a really different solution.

I do not understand the reason behind such reversing of loop expressions, and that’s the question that I post. I understand that we may want to always favour decrements (maybe for performance reasons in some targets), or that we may want to prioritise increments (maybe to keep induction variables usable in addressing modes in some targets), or even that we may want to do different things for different targets. But I do not quite understand the current implementation. It almost seems to me a typo in the LLVM source code that nobody has come across before. I have had my local copy of LLVM for some time with the code set as in the comment, and I have yet to find a case where it produces worse code than the original.

So, I really am interested in starting some informed discussion about this.

John Lluch

Too bad this did not go anywhere. I was looking forward to a discovery of this mystery.