Spurious peeling in simple loop unrolling

Hi,

We’ve noticed some codegen regressions due to spurious peeling in the simple loop unrolling pass. This happens when the incoming values of the phi nodes causing the peeling are equivalent, since GVN only runs after the loop peeling. Ironically, this prevents vectorization on our target due to the loss of alignment in the loop body caused by the peeling. The spurious peeling seems to affect AArch64 as well, so it’s possible other targets may be affected.

Attached is an example test case and the corresponding codegen as of 2632ba6a358a62c5cbaddc141de81b756b68698f (with and without 35b3989a30eefa66cd6edca4c6e1ec061c05ad96 reverted). Compiling with -O3 -S -target aarch64-linux-gnu gives the attached 3 logs. Prerotate is just before loop rotation. The postrotate logfile shows the IR after the rotate loop pass and a few more passes have run (the logs after the rotate loop pass only show the loop body). Loop rotate introduces two elidable phis:

%33 = phi i8* [ %9, %22 ], [ %14, %32 ]

%34 = phi %class.HomemadeVector.0* [ %8, %22 ], [ %13, %32 ]

This leads to loop peeling inside simple loop unrolling to peel the first iteration of the loop. Prior to 35b3989a30eefa66cd6edca4c6e1ec061c05ad96 the phi would remain until GVN runs and would remove them since %8 and %13 are the same values. The same would happen to %9 and %14.

Running GVN earlier would fix this issue, but I suspect it would have other regressions. Does anyone know how to address this?

Best regards,

Thomas

spurious_loop_peeling.cpp (486 Bytes)

spurious_loop_peeling-35b3989a30eefa66cd6edca4c6e1ec061c05ad96-reverted.s (2.9 KB)

spurious_loop_peeling-2632ba6a358a62c5cbaddc141de81b756b68698f.s (3.24 KB)

spurious_loop_peeling-2632ba6a358a62c5cbaddc141de81b756b68698f_postpeeling.log (4.01 KB)

spurious_loop_peeling-2632ba6a358a62c5cbaddc141de81b756b68698f_postrotate.log (3.37 KB)

spurious_loop_peeling-2632ba6a358a62c5cbaddc141de81b756b68698f_prerotate.log (2.8 KB)

Hi,

Hi,

We’ve noticed some codegen regressions due to spurious peeling in the simple loop unrolling pass. This happens when the incoming values of the phi nodes causing the peeling are equivalent, since GVN only runs after the loop peeling. Ironically, this prevents vectorization on our target due to the loss of alignment in the loop body caused by the peeling. The spurious peeling seems to affect AArch64 as well, so it’s possible other targets may be affected.

Thanks for sharing the example. IIUC the problem here is that there is phi node which becomes invariant after the first iteration and which causes the loop to be peeled (small example here https://godbolt.org/z/Ej4Y19). In your case, the PHI node is actually redundant because the incoming values are actually the same, just different bit casts of the same value, so peeling does not really add any benefits.

Attached is an example test case and the corresponding codegen as of 2632ba6a358a62c5cbaddc141de81b756b68698f (with and without 35b3989a30eefa66cd6edca4c6e1ec061c05ad96 reverted). Compiling with -O3 -S -target aarch64-linux-gnu gives the attached 3 logs. Prerotate is just before loop rotation. The post IIUrotate logfile shows the IR after the rotate loop pass and a few more passes have run (the logs after the rotate loop pass only show the loop body). Loop rotate introduces two elidable phis:

%33 = phi i8* [ %9, %22 ], [ %14, %32 ]

%34 = phi %class.HomemadeVector.0* [ %8, %22 ], [ %13, %32 ]

This leads to loop peeling inside simple loop unrolling to peel the first iteration of the loop. Prior to 35b3989a30eefa66cd6edca4c6e1ec061c05ad96 the phi would remain until GVN runs and would remove them since %8 and %13 are the same values. The same would happen to %9 and %14.

Running GVN earlier would fix this issue, but I suspect it would have other regressions. Does anyone know how to address this?

Adjusting the position of GVN to address the issue at hand is probably not a good idea. I think there are at least a few small improvements we could make to improve the current situation:

  1. Only peel if the PHI that becomes invariant has any concrete users; if the only users of a phi that becomes invariant are things like llvm.assume, peeling should be very unlikely to be beneficial (note that currently peeling also seems to happily peel for completely unused phis)
  2. Instcombine before peeling already simplifies the IR so that both incoming values are bit casts of the same value. It probably would be trivial to also have instcombine simplify pointer phis if the incoming values striped by pointer casts are equal. (There might be some other reason why we are not doing this at the moment though).
  3. For targets very sensitive to the number of iterations, perhaps it would be worth adding a TTI hook to express that.
  4. Perhaps peeling should also be a bit more careful when a known trip count isn’t a power-of-2 (or some similar constraint) any more, but was before peeling.

Cheers,
Florian

Hi,

Oh right, there are multiple redundant phis! So that does indeed not apply in this case. It might still be beneficial in general.

  1. Instcombine before peeling already simplifies the IR so that both incoming values are bit casts of the same value. It probably would be trivial to also have instcombine simplify pointer phis if the incoming values striped by pointer casts are equal. (There might be some other reason why we are not doing this at the moment though).

As mentioned above, it’s not as simple as bitcast of the same pointer so this would not work here. One would have to go look at whether the loads are equivalent which is a more involved check.

That’s true, but I think instcombine already CSE’d the loads. So if we simplify such phis, the unnecessary peeling should not happen https://reviews.llvm.org/D98058

  1. For targets very sensitive to the number of iterations, perhaps it would be worth adding a TTI hook to express that.

In this case it’s not so much the number of iterations (it gets lower which is not a problem) but rather the code bloat and resulting loss of alignment from peeling.

Oh right.

  1. Perhaps peeling should also be a bit more careful when a known trip count isn’t a power-of-2 (or some similar constraint) any more, but was before peeling.

This would be a good idea in general indeed, but that would not solve the problem of pointless extra code bloat (code bloat itself would not be a problem at O2 if it resulted in better performance of course but here the peeling should not happen).

Sure it should not happen, but on most platforms the unnecessary peeling in this case should have a negligible impact.

Naive question: can GVN work in an incremental way (i.e. only process what changed)? If yes it could maybe run twice, where the second time would be more lightweight.

I don’t think so.

Cheers,
Florian

Hi,

I’m not sure I follow here. For your example (spurious_loop_peeling.cpp), it looks like there’s no peeling happening any more after the patch landed, at least when building for ARM64: https://godbolt.org/z/q6d6Kn . Is there anything else that’s going wrong?

Cheers,
Florian

2. Instcombine before peeling already simplifies the IR so that both incoming values are bit casts of the same value. It probably would be trivial to also have instcombine simplify pointer phis if the incoming values striped by pointer casts are equal. (There might be some other reason why we are not doing this at the moment though).

As mentioned above, it's not as simple as bitcast of the same pointer so this would not work here. One would have to go look at whether the loads are equivalent which is a more involved check.

That’s true, but I think instcombine already CSE’d the loads. So if we simplify such phis, the unnecessary peeling should not happen https://reviews.llvm.org/D98058

I tried the patch (thanks) but that did not remove any of the PHI (the 2 loads are still there and thus the bitcast don't appear to have the same source). I'll try tolook at InstCombine to see why loads are not CSE'd.

I’m not sure I follow here. For your example (spurious_loop_peeling.cpp), it looks like there’s no peeling happening any more after the patch landed, at least when building for ARM64: https://godbolt.org/z/q6d6Kn . Is there anything else that’s going wrong?

The testcase I sent is indeed fixed with your commit. However the code it is inspired from still shows unwanted peeling. I'm going to investigate what causes the difference.

Best regards,

Thomas

I tried the patch (thanks) but that did not remove any of the PHI (the 2 loads are still there and thus the bitcast don't appear to have the same source). I'll try tolook at InstCombine to see why loads are not CSE'd.

I’m not sure I follow here. For your example (spurious_loop_peeling.cpp), it looks like there’s no peeling happening any more after the patch landed, at least when building for ARM64: Compiler Explorer . Is there anything else that’s going wrong?

The testcase I sent is indeed fixed with your commit. However the code it is inspired from still shows unwanted peeling. I'm going to investigate what causes the difference.

Best regards,

Thomas

Sorry for the late reply. FYI the difference is because the original code is using pointer rather than reference parameter (see attachment). This leads to LICM not hoisting the load out of the outermost loop due to isSafeToExecuteUnconditionally returning false. This happens because the base pointer of the GEP used by the load is not sufficiently aligned. isDereferenceableAndAlignedPointer() from Loads.cpp calls Value::getPointerAlignment which returns an alignment of 1 and deduced that the alignment is not enough compared to the load requirement.

In my case however I know for certain that the this pointer is sufficiently aligned. Unfortunately I could not find a way to indicate it to the compiler. I tried to use __builtin_assume_aligned on the this pointer and use the return value for all access but that did not make any difference.

So to summarize:

Load whose base is a function parameter gets duplicated by loop rotate, LICM cannot hoist it out completely (it does get hoisted out of the innerloop) due to alignment issue which means a phi remains in the innerloop when loop peeling happens. This leads to code bloat and in our case lack of vectorization.

However clearly GVN thinks the load outside the loop is the same as the one in the loop and so the one in the loop can be removed. That seems inconsistent with the behaviour of LICM so I'm gonna try to look into this.

Best regards,

Thomas

spurious_loop_peeling2.cpp (656 Bytes)