[RFC] Add more powerful pass to set Load/Store alignment

Currently Load/Store alignment is only set in InstCombine pass based on computeKnownBits function.

computeKnownBits is stateless and recursive, therefore it is limited in how much analysis it can do and it will give up after 6 level or after 1 phi and 1 level. (I add reported the issue some time back here: Alignment analysis)

This cause the alignment information to often be pessimistic. On some target having a correct alignment information is important for performance. For instance on NVPTX load4/store4 instructions need to be aligned on 16bytes and the backend will be forced to scalarize those if the alignment cannot be proven.

I understand that the limitations of computeKnownBits are there to avoid stack overflow and prevent compile time increase.

I would like to add a new pass that does more expensive analysis in order to get a more accurate alignment information. This would need to be able to handle longer chain of instructions as well as cycles in the IR.
This can be done with minimal changes to computeKnownBits by allowing to pass KnownBits information already calculated to the function. Once we have that we can compute KnownBits for all the integer Values of a function in O(n) time if it doesn’t have cycles.

To handle cycles we can start by assuming that PhiNodes have all bits set to 0 and run the analysis until we reach a fix point. Then the complexity is based on the number of cycles in the IR.

I uploaded a draft of what I think the pass looks like:

In the patch I sent I only enabled it for NVPTX backend at the beginning of the codegeneration. Since it may be expensive and may not help all targets I thought it could be a good place to call but it really depends how useful this is overall.

FYI @Artem-B about the NVPTX usage (as this is the path I’m interested in)

1 Like

Of course the devil is in the details, but doing an important analysis like known bits eagerly instead of lazily, and doing a proper fixpoint iteration instead of the backwards recursive thing, seems like a no-brainer. I’d super love to see the effect this has on known bits precision and on compile time.

You could also put it in the Attributor as an AA and use it in AAAlign. It does not use known bits at all right now but only offset + modulo logic.
Benefit is that you get fixpoint logic and lots of other things for free, e.g., the ability to look through memory and work inter-procedurally. The result can also reused by all other AAs as we require it.

FWIW, we run the Attributor already for the AMD GPU backend and we could reasonably do it for all GPU backends if it’s worth it. Note that one does not need to run all AAs but can down select to what is useful and stable. An AA for address space deduction is under review right now too, so more GPU stuff is coming.

I looked into something similar a long time ago, though with a different motivation: The current code for inferring alignment in InstCombine is very slow, because it runs many, many times during the optimization pipeline. We have something like ten runs of InstCombine, and we’ll try to infer load/store alignment using known bits every time, possibly multiple times due to fixpoint iteration.

The end result is that we spend 1-2% of end-to-end compile-time on inferring alignment in InstCombine – alignments that nobody actually needs. Alignment is mostly only relevant for the backend, not middle-end optimization passes. The few passes that do rely on alignment will try to increase it themselves.

So the plan was to instead have a single run of an InferAlignment pass towards the end of the middle-end optimization pipeline, rather than doing this in InstCombine. The compile-time cost of such a pass is around ~0.1%, which is much lower than doing it in InstCombine. Turns out that calling computeKnownBits() on all load/store arguments is rather cheap if you don’t repeat it dozens of times during the optimization pipeline.

All this is to say, replacing alignment inferral in InstCombine with a dedicated pass, even if it does more powerful fixpoint iteration, is likely going to be a pretty big win in terms of compile-time. (I’m somewhat uncertain how much cost the fixpoint iteration adds, but I would guess “not that much”.)

It’s maybe worth pointing out that we also have an AlignmentFromAssumptions pass, which is SCEV based. I once tried to use that for more general SCEV-based alignment inferral (rather than only for assumptions), but that had large negative compile-time impact.

Thanks for all the very insightful responses. It sounds like there would be a benefit moving alignment analysis completely out of instcombine which is great. So I’ll keep digging and see if I can get to a solution that is both faster and more powerful.

I realize my current fix point solution is not fully correct in how it handles cycles so I’ll try out few more things.

I didn’t know about Attributor, I’ll look into it. What does the AMD GPU use it for?

That’s a good point, I’ll try out see if I can use SCEV to handle cycles. Do you know why what you tried had a negative compile time impact? Was it the case even if you ran it once close to the backend?

Moving alignment inference out of instcombine into a dedicated pass with a real dataflow analysis instead of the on-demand analysis seems like a good idea to me. The compile time savings of removing this from instcombine should give you enough budget to run the inference pass more than once. Perhaps both early and late, once to power the optimizer, and again at the end for the code generator.

./llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp uses it to deduce function properties interprocedurally.

./llvm/lib/Transforms/IPO/OpenMPOpt.cpp does interprocedural OpenMP-aware optimizations.

Attributor.h has the API and all AbstractAttributes (incl. AAAlign), AttributorAttributes.cpp has the implementation of these AAs. Check getKnownAlignForUse and AAAlignFloating::updateImpl in the AttributorAttributes.cpp. That’s where the base logic lives. The alignment of AssociatedValue is determined by following uses. The “genericValueTraversal” (which is about to be replaced but only with something “better”) will look through PHIs, selects, arguments, loads, etc. where possible. So the callback in updateImpl only sees “leaf” values. Feel free to ping me if you have questions.