Load Widening in IR

Looking at scenarios where load widening in IR can improve the loop inlining/unrolling/vectorization cost model. Currently we miss some of these opportunities.

I have come across other forums where this was discussed earlier (the load combine pass).

We may only want the simple forms where it is obviously profitable (perhaps by asking the backend if it should be done, only doing it for patterns that reduce the instruction count, not doing it aggressively, etc).

The cost model being incorrect for loop unrolling/inlining/vectorization is the main thing we are targeting. There might be other ways to fix that, but maybe combining would be the right thing to do.

Not sure what the question is?

FWIW, it’s unclear whether load widening is correct at the IR level. It can be done using vector types, but with plain integers it is a can of worms.

The question really is whether there is a way we can handle it in the IR, because not doing it has effects on various IR passes, particularly around cost models. So I’m wondering whether this could be handled via feedback from the backend indicating that the transformation is simple and cost-effective.
There was also a load combine pass earlier which could be enabled by a compiler flag. I’m not sure if it was doing widening aggressively and affecting other passes. Can we revisit that?

A small brain dump on load widening. No particular order, all from memory, might be wrong in nasty ways.

We used to do some widening at IR and stopped. As a starting point for understanding, look at the reviews around D29394 [DAGCombiner] Support non-zero offset in load combine, and D27861 [DAGCombiner] Match load by bytes idiom and fold it into a single load. Attempt #2. See also D24096 Do not widen load for different variable in GVN.

One major issue is that we cannot perform widening for atomics without losing information. Two adjacent 8-byte atomic loads can be merged into a single 16-byte atomic load, but that transform can’t be reversed. We’d need to extend the IR with some notion of “element-wise atomicity”.
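
For illustration, a minimal IR sketch of that merge (hypothetical names, assuming a 16-byte-aligned %p): the widened form no longer records that each 8-byte half was independently atomic, so splitting it back apart is not valid in general.

```llvm
; before: two adjacent 8-byte atomic loads, each atomic on its own
%lo = load atomic i64, ptr %p monotonic, align 16
%hi.ptr = getelementptr i8, ptr %p, i64 8
%hi = load atomic i64, ptr %hi.ptr monotonic, align 8

; after: a single 16-byte atomic load; the element-wise atomicity of
; the two halves is no longer expressible, so this can't be undone
%wide = load atomic i128, ptr %p monotonic, align 16
```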

If I remember correctly, load combine in IR was interfering with some other transform. I want to say PRE? But I don’t remember, you should be able to find details in reviews and by digging through commit history for the relevant places.


Hi,

Note that load widening can cause bus errors on some architectures.
E.g., load widening is OK when reading/writing RAM, but it causes a bus error when reading/writing device memory.
So some way to enable/disable this feature from C source code would be helpful.

To the last post: IO memory is volatile, and this transform is not legal for volatile accesses.
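
A minimal IR sketch of that point (hypothetical %dev pointer): volatile accesses must be performed exactly as written, so adjacent volatile byte loads may not be merged into a wider load.

```llvm
; these must remain two separate 1-byte accesses; folding them into a
; wider load would change the observable behaviour on device memory
%b0 = load volatile i8, ptr %dev, align 1
%dev1 = getelementptr i8, ptr %dev, i64 1
%b1 = load volatile i8, ptr %dev1, align 1
```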

But in that case shouldn’t it be marked with the “volatile” keyword?

Similarly, on atomicity: if the 4 consecutive 1-byte loads are non-atomic, can I not transform them into a non-atomic 4-byte load?

So aren’t we safe if we simply don’t do anything when any of the original loads are atomic or volatile?
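
For concreteness, a minimal sketch of the transform being asked about (hypothetical names, little-endian byte order assumed, %p known to be 4-byte aligned):

```llvm
; before: four consecutive non-atomic, non-volatile 1-byte loads,
; later assembled into an i32 via zext/shl/or
%b0 = load i8, ptr %p, align 4
%p1 = getelementptr i8, ptr %p, i64 1
%b1 = load i8, ptr %p1, align 1
%p2 = getelementptr i8, ptr %p, i64 2
%b2 = load i8, ptr %p2, align 2
%p3 = getelementptr i8, ptr %p, i64 3
%b3 = load i8, ptr %p3, align 1

; after: a single 4-byte load
%w = load i32, ptr %p, align 4
```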

I agree this should be possible. I think there are two possible caveats in LLVM IR: (1) Alignment – perhaps you’re not allowed to introduce unaligned loads. (2) You have to use <4 x i8> as the type in the general case. If you use i32 and only one of the 1-byte loads was poison, then the entire i32 load becomes poison.
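
A small sketch of caveat (2), with hypothetical names, following the poison behaviour described above: the vector form keeps poison per lane, while the integer form poisons the whole value.

```llvm
; suppose only the byte at %p+1 holds poison
%v = load <4 x i8>, ptr %p, align 4   ; only lane 1 of %v is poison
%w = load i32, ptr %p, align 4        ; all of %w becomes poison
```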

For non-volatile/non-atomic accesses, it should always be correct to create an unaligned load in IR, as long as the align attribute is set correctly. It might not be a profitable optimization in all cases, however, especially if the load will be split into byte-sized accesses during instruction selection on a particular architecture.
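
For example (a minimal sketch with a hypothetical %p): an i32 load from a pointer that is only byte-aligned is valid IR as long as the align attribute states the true alignment, though the backend may lower it as several narrower accesses on some targets.

```llvm
; legal for non-volatile, non-atomic accesses: the alignment is stated
; honestly, and the target decides how to lower the unaligned access
%w = load i32, ptr %p, align 1
```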