There is a run time regression in Embench’s primecount benchmark since LLVM release 15.0. I have found that the offending change is ⚙ D129710 [SCEVExpander] Allow udiv with isKnownNonZero(RHS) + add vscale case .
I have made measurements with -target riscv32 -march=rv32imc -mabi=ilp32
and I’ve been using an instruction accurate simulator which handles all instruction as taking exactly one cycle. The number of cycles is ~50% more in LLVM 15.0 than in previous releases. But I think, this problem is backend independent.
Primecount counts the number of primes up to a certain constant, it contains a loop nest of depth 3 and there are goto
s to break out from the middle loop.
I could reduce the initial code (linked above) to this simplest case where there is still some meaningful difference in the run time:
int a[2];
int b() {
int c[2];
int d = 0;
a[0] = c[0] = ++d;
while (1) {
for (int e = 0; e < d; ++e) {
if (a[e] > 2)
goto f;
while (c[e] < 3)
c[e] += a[e];
}
break;
f:
++d;
}
return 3;
}
which is equivalent to the below code if we get rid of the goto
:
int a[2];
int b() {
int c[2];
int d = 0;
int exit_outer_loop = 0;
a[0] = c[0] = ++d;
while (!exit_outer_loop) {
for (int e = 0; e < d; ++e) {
if (a[e] > 2) {
++d;
exit_outer_loop = 0; // ensures the outer loop runs again
break; // exits the inner loop
}
while (c[e] < 3)
c[e] += a[e];
exit_outer_loop = 1; // ensures the outer loop terminates if inner loop completes
}
}
return 3;
}
What happens here is that the offending commit enables an optimization of the induction variable c[e]
in the innermost loop. During rewriteLoopExitValues
we move the calculation of the final exit value of the IV into the loop’s preheader.
; Preheader:
while.cond3.preheader: ; preds = %for.body
%arrayidx4 = getelementptr inbounds [2 x i32], ptr %c, i32 0, i32 %e.011
%arrayidx4.promoted = load i32, ptr %arrayidx4, align 4, !tbaa !4
br label %while.cond3
...
*** IR Dump After IndVarSimplifyPass on while.cond3 ***
; Preheader:
while.cond3.preheader: ; preds = %for.body
%arrayidx4 = getelementptr inbounds [2 x i32], ptr %c, i32 0, i32 %e.011
%arrayidx4.promoted = load i32, ptr %arrayidx4, align 4, !tbaa !4
%smax = call i32 @llvm.smax.i32(i32 %arrayidx4.promoted, i32 3)
%1 = sub i32 %smax, %arrayidx4.promoted
%umin = call i32 @llvm.umin.i32(i32 %1, i32 1)
%2 = sub i32 %1, %umin
%umax = call i32 @llvm.umax.i32(i32 %0, i32 1)
%3 = udiv i32 %2, %umax
%4 = add i32 %umin, %3
%5 = mul i32 %0, %4
br label %while.cond3
This was not happening before the offending commit because the SCEV expression that divides with 1 umax %0
was considered unsafe to expand (isSafeToExpand
). Now, after the offending commit, in rewriteLoopExitValues
we discover that this expansion has a high cost (isHighCostExpansion
), but since the loop can be deleted we do the expansion anyway. So, the offending commit is logically correct, but it reveals an underlying weakness.
I was thinking about as a possible solution, to consider the product of the trip counts of the containing loops and compare that to the gain we have with the expansion. More formally, do the expansion only if I * J * X < I * J * X'
, where I
and J
are the trip count of the outermost and middle loops. X
is the number of all operations before expansion in the innermost loop, X'
is the same, but after the expansion; this needs to calculate with the trip count of the innermost loop. The problem is that SCEV is not able to infer the loop trip count (backedge taken count neither) for any of the loops in the nest. This is perhaps because the induction variable of the innermost loop is dependent on the IV of an outer loop. Is SCEV capable of handling loop nests with dependent IVs? Any pointers on SCEV’s limitation could be helpful at this time. (Just out of curiosity, I was also experimenting to use Attributor
’s AAPotentialValues
to see if the dataflow framework could discover the value of c[e]
, unfortunately it could not. It reached the top.)
Any feed-back is more than welcome. Note, I’ve also created an issue with very similar content. [SCEV][IndVarSimplification] Run time regression in Embench/primecount benchmark · Issue #71424 · llvm/llvm-project · GitHub