Background:
For small loop bodies inside a nest whose trip counts are constant, the compiler may choose to unroll the nest completely, and the control-flow branches are fully eliminated by the time GVN finishes. Currently there is no CSE optimization after GVN. As a result, the induction variables are split apart and the base address is recomputed for each outer iteration, even though those computations could share the same base. A CSE pass is therefore needed to simplify these induction-variable expressions.
I suspect it would help if you present some example IR (and possibly source). On the face of it, shouldn’t LICM help with this? Or is this something that only happens when an entire loop nest is fully unrolled?
Thanks for your attention. The related case can be found in [SLP] gcc generate better code than clang base on stride offset · Issue #57278 · llvm/llvm-project · GitHub. For the following case, I compile with the command below, and in all.txt we can see that all the basic blocks are merged into one after GVN.
- cmd: clang -march=armv8.2-a -O3 -ffast-math test.c -S -mllvm -indvars-widen-indvars=false -mllvm -print-after-all &> all.txt
void foo(float *restrict fi, real *restrict f, int ci) {
/* Add accumulated i-forces to the force array */
for (int i = 0; i < UNROLLI; i++) {
for (int d = 0; d < DIM; d++) {
f[(ci*UNROLLI+i)*F_STRIDE+d] += fi[i*FI_STRIDE+d];
}
}
return;
}
- I tried adding a CSE pass after GVN; the SLP vectorizer then becomes active, and the final assembly is similar to the manually unrolled version.
+++ b/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
@@ -617,6 +617,7 @@ void PassManagerBuilder::addFunctionSimplificationPasses(
// Run instcombine after redundancy elimination to exploit opportunities
// opened up by them.
MPM.add(createInstructionCombiningPass());
+ MPM.add(createEarlyCSEPass(true /* Enable mem-ssa. */)); // Catch trivial redundancies
addExtensionsToPM(EP_Peephole, MPM);
It would be an interesting research topic if passes could return some information about what they did. Based on that information the pass manager could schedule the next pass. Instead of a static pass pipeline, we could have a dynamic pipeline.
E.g. the loop vectoriser says "I did a lot" or "I feel bad, there were no opportunities". In the first case, there may be opportunities to clean up the IR; in the latter case, you could continue with the normal pass schedule.
Also, regarding your loop-optimisation example: did the loop optimiser unroll loops or reorder loops? In the former case CSE might help; in the latter case you could just continue with the normal schedule.
Thanks @tschuett.
In my example, the trip counts of both the inner and outer loops are small, so the loop nest is completely unrolled. This is exactly the case you mentioned where CSE might help.
So we need a dynamic pipeline to reduce the negative impact on compile time; it would look something like:
bool Unrolled = <run the loop-unroll pass on the loop>
...
if (Unrolled) // only add the CSE pass when a loop was actually unrolled.
  MPM.add(createEarlyCSEPass(true /* Enable mem-ssa. */));