Looking at scenarios where load widening in IR can improve the cost models for loop inlining/unrolling/vectorization. Currently we miss some of these opportunities.
I have come across other forum threads where this was discussed earlier (the load combine pass).
We may only want the simple forms where it is obviously profitable (perhaps by asking the backend if it should be done, only doing it for patterns that reduce the instruction count, not doing it aggressively, etc).
The incorrect cost model for loop unrolling/inlining/vectorization is the main thing we are targeting. There might be other ways to fix that, but combining may be the right thing to do.
Not sure what the question is?
FWIW, it’s unclear whether load widening is correct at the IR level. It can be done using vector types, but doing it with plain integers is a can of worms.
The question really is whether there is a way we can handle it in the IR, because not doing it has effects on various IR passes, particularly around cost models. So I’m wondering whether this could be handled via feedback from the backend indicating that the transformation is simple and cost-effective.
There was also a load combine pass earlier that could be enabled by a compiler flag. I’m not sure whether it did widening aggressively and affected other passes. Can we revisit that?
A small brain dump on load widening. In no particular order, all from memory, and it might be wrong in nasty ways.
We used to do some widening at IR and stopped. As a starting point for understanding, look at the reviews around D29394 ([DAGCombiner] Support non-zero offset in load combine) and D27861 ([DAGCombiner] Match load by bytes idiom and fold it into a single load. Attempt #2). See also D24096 (Do not widen load for different variable in GVN).
One major issue is that we cannot perform widening for atomics without losing information. Two adjacent 8-byte atomic loads can be merged into a single 16-byte atomic load, but that transform can’t be reversed. We’d need to extend the IR with some notion of “element-wise atomicity”.
If I remember correctly, load combine in IR was interfering with some other transform. I want to say PRE? But I don’t remember; you should be able to find details in the reviews and by digging through the commit history for the relevant places.
Note that load widening can cause bus errors on some architectures.
E.g., load widening is OK when reading/writing RAM, but causes a bus error when reading/writing device memory.
So some way to enable/disable this feature from C source code would be helpful.
To the last post: IO memory is volatile, and this transform is not legal for volatile accesses.
But in that case, shouldn’t it be marked with the “volatile” keyword?
Similarly on atomicity: if the 4 consecutive 1-byte loads are non-atomic, can I not transform them into a non-atomic 4-byte load?
So to be safe: if any of the original loads are atomic or volatile, we don’t do anything.
I agree this should be possible. I think there are two possible caveats in LLVM IR: (1) Alignment – perhaps you’re not allowed to introduce unaligned loads. (2) You have to use <4 x i8> as the type in the general case. If you use i32 and only one of the 1-byte loads was poison, then the entire i32 load becomes poison.
For non-volatile/non-atomic accesses, it should always be correct to create an unaligned load in IR, as long as the align attribute is set correctly. It might not be a profitable optimization in all cases, however, especially if the load will be split into byte-sized accesses during instruction selection on a particular architecture.
@rotateright @Thorsten @efriedma-quic @davemgreen
I have a patch that implements this in InstCombine, specifically for the scenario where the pattern indicates that the loads are being merged into a wider load, and the only use of the pattern is that wider load. It also makes sure we handle only the non-atomic/non-volatile cases.
It would be good to have a discussion on whether it covers the tricky situations and, as Sanjay has suggested, whether an AggressiveInstCombine implementation would be the better thing to do.
The first question that comes to my mind: is this pattern at all frequent? Did you find it in the wild? It seems quite specific.
The correctness question is important. You can only reuse loads if the whole value is used. For example, if you have %v = load i16, ptr %p, you can’t replace %v2 = load i8, ptr %p with %v2 = trunc i16 %v to i8, due to poison propagation rules.
So while this optimization reduces the number of loads in a specific case (reading a word byte-by-byte), it prevents further optimizations where loads from the same pointer are combined (e.g., in GVN).
So the optimization is not always a clear win.
I am finding this pattern in a few scenarios, such as DSP libraries, and also in some open-source benchmarks like snappy (see snappy/snappy-stubs-internal.h at main · google/snappy · GitHub).
Additionally, gcc implements this, as can be seen in a Compiler Explorer example.
Right, but will the poison propagation rule apply if each of these byte-by-byte loads has a single use only, and that use is to generate the wider load? We do not propose doing load combining generically for all scenarios, which could affect other passes, but only when the pattern indicates that a larger value is being read by repeated smaller loads AND the one and only use of those smaller loads is to form the larger load.
@preames @jcdutton @nhaehnle @jyknight @tschuett
I have not heard anything further. Perhaps a review of the patch, or a trial on a specific architecture (where such a scenario may have caused degradation in the past), could confirm whether this looks fine.
gcc is able to handle some of these cases, as shown in the examples, and we are looking at how to optimize this in llvm.
Hi @bipmis, it doesn’t seem like folks are set against this, but you haven’t addressed some of the major points that were raised. For example, I haven’t seen you address the poison-related concerns.
If you want to combine the loads in your @loadCombine_4consecutive test case, I believe the resulting IR would have to look something like:
%0 = load <4 x i8>, ptr %p
%1 = freeze <4 x i8> %0
%2 = bitcast <4 x i8> %1 to i32   ; does this actually work reliably?
                                  ; I expect at least potential endian issues
ret i32 %2
The point is that a <4 x i8> load doesn’t propagate poison between the bytes, unlike an i32 load.
@nhaehnle I agree with the requirement for poison-propagation handling if these byte loads feed subsequent computation that influences a result.
However, if the pattern indicates the byte loads are used only to derive a wider load, I don’t see why this should be a requirement. I am handling only these scenarios, not the general case of load widening as was done in GVN earlier.
Also showing the Alive evaluation of the same on Compiler Explorer.
Okay, thank you for clarifying, that addresses the poison point.
I’m not sure about the other points, e.g. InstCombine vs. AggressiveInstCombine. I hope somebody else can weigh in.