Load combine pass

Hi,

I’m trying to optimize a pattern like this into a single i16 load:
%1 = bitcast i16* %pData to i8*
%2 = load i8, i8* %1, align 1
%3 = zext i8 %2 to i16
%4 = shl nuw i16 %3, 8
%5 = getelementptr inbounds i8, i8* %1, i16 1
%6 = load i8, i8* %5, align 1
%7 = zext i8 %6 to i16
%8 = shl nuw nsw i16 %7, 0
%9 = or i16 %8, %4
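
For reference, the single-load form I’m hoping to get is roughly the following (just a sketch; it assumes a little-endian target, where the byte order above makes the result a byte-swapped i16 load via the llvm.bswap intrinsic):

   %wide = load i16, i16* %pData, align 1
   %res = call i16 @llvm.bswap.i16(i16 %wide)

On a big-endian target the bswap would not be needed.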

I came across the load combine pass, which is motivated by virtually the same pattern. Load combine optimizes the pattern by combining adjacent loads into one load and lets the rest of the optimizer clean up the rest. From what I see in the initial review for load combine (https://reviews.llvm.org/D3580), it was not enabled by default because it caused some performance regressions. That’s not very surprising; I can see how this type of widening can obscure some facts for the rest of the optimizer.
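
As I understand it, on the pattern above load combine would first emit something like the following (a rough sketch; the value names are made up), with the remaining zext/shl/or chain left for later passes to fold away:

   %wide.ptr = bitcast i8* %1 to i16*
   %wide = load i16, i16* %wide.ptr, align 1
   %b0 = trunc i16 %wide to i8          ; replaces %2 (byte at offset 0; low byte on a little-endian target)
   %b1.shifted = lshr i16 %wide, 8
   %b1 = trunc i16 %b1.shifted to i8    ; replaces %6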

I can’t find any backstory for this pass. Why was it chosen to optimize the pattern in question in this way? What is the current status of this pass?

I have an alternative implementation for it locally. I implemented an instcombine rule similar to the one that recognizes the bswap/bitreverse idiom. It relies on collectBitParts (Local.cpp) to determine the origin of the bits in a given or value. If all the bits happen to be loaded from adjacent locations, it replaces the or with a single load, or a load plus a bswap.

If the alternative approach sounds reasonable I’ll post my patches for review.

Artur

There's a bit of additional context worth adding here...

Up until very recently, we had a form of widening implemented in GVN. We decided to remove this in https://reviews.llvm.org/D24096 precisely because its placement in the pass pipeline was inhibiting other optimizations. There's also a major problem with doing widening at the IR level, which is that widening a pair of atomic loads into a single wider atomic load cannot be undone. This creates a major pass ordering problem of its own.

At this point, my general view is that widening transformations of any kind should be done very late. Ideally, this is something the backend would do, but doing it as a CGP-like fixup pass over the IR is also reasonable.

With that in mind, I feel both the current placement of LoadCombine (within the inliner iteration) and the proposed InstCombine rule are undesirable.

Philip

I’m really glad to see that this is gone in GVN - it will reduce our diffs a lot when we do the next import. The GVN load widening is not sound in the presence of hardware-enforced spatial memory safety, so we ended up with the compiler inserting things that caused hardware bounds checks to fail and had to disable it in a few places.

If you’re reintroducing it, please can we have a backend option to specify whether it’s valid to widen loads beyond the extents (for example, for us it is not valid to widen an i16 load followed by an i8 load to an i32 load). Ideally, we’d also want the back end to supply some cost function: on MIPS, even when this is valid, the i32 load followed by the masks and shifts is a lot more expensive than the two loads (and increases register pressure, so has knock-on effects that mean that turning off this bit of GVN gave us *much* better code, even without any of the later optimisation opportunities being exploited).

David

One of the arguments for doing this earlier is the inline cost perception of the original pattern. Reading an i32/i64 by bytes looks much more expensive than it is and can prevent inlining of an interesting function.

The concern about inhibiting other optimizations can be addressed by careful selection of the pattern we’d like to match. I limit the transformation to the case when all the individual loads have no uses other than forming a wider load. In this case it’s less likely to lose information during this transformation.
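
For example (a sketch, with made-up value names), the rule would leave the following alone, because the first byte load has a use outside the chain that forms the wider value:

   %b0 = load i8, i8* %p, align 1
   %p1 = getelementptr inbounds i8, i8* %p, i64 1
   %b1 = load i8, i8* %p1, align 1
   %z0 = zext i8 %b0 to i16
   %z1 = zext i8 %b1 to i16
   %hi = shl i16 %z0, 8
   %or = or i16 %hi, %z1
   %extra = add i8 %b0, 1    ; extra use of %b0 disqualifies the pattern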

I didn’t think of the atomicity aspect though.

Artur

Hi Artur,

> One of the arguments for doing this earlier is the inline cost
> perception of the original pattern. Reading an i32/i64 by bytes looks
> much more expensive than it is and can prevent inlining of an
> interesting function.

I don't think this is just a perception issue -- if the loads have not
been widened then inlining the containing function _is_ expensive, and
the inliner cost analysis is doing the right thing.

> The concern about inhibiting other optimizations can be addressed by
> careful selection of the pattern we’d like to match. I limit the
> transformation to the case when all the individual loads have no uses
> other than forming a wider load. In this case it’s less likely to lose
> information during this transformation.

I agree -- I think widening loads in "obvious" cases like:

   i16 *a = ...
   i32 val = a[i] | (a[i + 1] << 16)

is more defensible than trying to widen the example that broke in
https://llvm.org/bugs/show_bug.cgi?id=29110.
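
In IR terms, the "obvious" case above is roughly the following (a sketch, assuming a little-endian target and naturally aligned i16 elements):

   %i1   = add i64 %i, 1
   %p0   = getelementptr inbounds i16, i16* %a, i64 %i
   %p1   = getelementptr inbounds i16, i16* %a, i64 %i1
   %lo16 = load i16, i16* %p0, align 2
   %hi16 = load i16, i16* %p1, align 2
   %lo   = zext i16 %lo16 to i32
   %hi.z = zext i16 %hi16 to i32
   %hi   = shl i32 %hi.z, 16
   %val  = or i32 %hi, %lo

==> ; note the widened load keeps the original 2-byte alignment

   %p0   = getelementptr inbounds i16, i16* %a, i64 %i
   %p32  = bitcast i16* %p0 to i32*
   %val  = load i32, i32* %p32, align 2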

Regarding atomicity, the only real optimization that we'll lose (that
I can think of) is PRE. Additionally, it may be more expensive to
lower wider atomic loads / stores, but that can be indicated by a
target hook. For instance, on x86, I don't think:

   load atomic i16, i16* %ptr unordered, align 2

is cheaper than

   load atomic i32, i32* %ptr.bitcast unordered, align 4

so from a lowering perspective there is no reason to prefer the former.

Given this, perhaps scheduling load widening after one pass of GVN/PRE
is fine?

-- Sanjoy


Hi David,

>> At this point, my general view is that widening transformations of any kind should be done very late. Ideally, this is something the backend would do, but doing it as a CGP-like fixup pass over the IR is also reasonable.
>
> I’m really glad to see that this is gone in GVN - it will reduce our
> diffs a lot when we do the next import. The GVN load widening is not
> sound in the presence of hardware-enforced spatial memory safety, so
> we ended up with the compiler inserting things that caused hardware
> bounds checks to fail and had to disable it in a few places.
>
> If you’re reintroducing it, please can we have a backend option to
> specify whether it’s valid to widen loads beyond the extents (for
> example, for us it is not valid to widen an i16 load followed by an i8
> load to an i32 load). Ideally, we’d also want the back end to supply

Don't you have to mark the functions you generate as
"Attribute::SanitizeAddress"? We should definitely make the
speculative form of this transform (i.e. "load i32, load i16" -->
"load i64") predicated on Attribute::SanitizeAddress.

-- Sanjoy

Just to chime in that we have been seeing a bunch of instances on aarch64 as well where the load widening prevented LICM and, probably worse, inserted additional shift instructions into the critical path. This was showing up especially after r246985 (see for example http://llvm.org/PR25321), which leads to additional widening possibilities.

So I’d also strongly suggest only optimizing obvious cases and staying away from variations that lead to additional shift instructions being inserted.

(I missed D24096, but will check whether that fixes our performance problems as well).

- Matthias

Nope, we’re not using the address sanitiser. Our architecture supports byte-granularity bounds checking in hardware.

Note that even without this, for pure MIPS code without our extensions, load widening generates significantly worse code than when it doesn’t happen. I’m actually finding it difficult to come up with a microarchitecture where a 16-bit load followed by an 8-bit load from the same cache line would give worse performance than a 32-bit load, a mask and a shift. In an in-order design, it’s more instructions to do the same work, and therefore slower. In an out-of-order design, the two loads within the cache line will likely be dispatched simultaneously and you’ll have less pressure on the register rename engine.

It seems like something that will only give wins if it exposes optimisation opportunities to other transforms and, as such, should probably be exposed as an analysis so that the later transforms can do the combine if it actually makes sense to do so.

David

Sanjoy Das wrote:

> Regarding atomicity, the only real optimization that we'll lose (that
> I can think of) is PRE. Additionally, it may be more expensive to
> lower wider atomic loads / stores, but that can be indicated by a
> target hook.

BTW, do we really need to emit an atomic load if all the individual components are bytes?

Artur

Hi David,

David Chisnall wrote:
> Nope, we’re not using the address sanitiser. Our architecture supports byte-granularity bounds checking in hardware.

I mentioned address sanitizer since (then) I thought your architecture
would have to prohibit the same kinds of transforms that address
sanitizer has to prohibit.

However, on second thought, I think I have a counter-example to my
statement above -- I suppose your architecture only checks bounds and
not that the location being loaded from is initialized?

> Note that even without this, for pure MIPS code without our
> extensions, load widening generates significantly worse code than when
> it doesn’t happen. I’m actually finding it difficult to come up with
> a microarchitecture where a 16-bit load followed by an 8-bit load from
> the same cache line would give worse performance than a 32-bit load, a
> mask and a shift. In an in-order design, it’s more instructions to do
> the same work, and therefore slower. In an out-of-order design, the
> two loads within the cache line will likely be dispatched
> simultaneously and you’ll have less pressure on the register rename
> engine.

That makes sense, but what do you think of Artur's suggestion of
catching only the obvious patterns? That is, catching only cases like

   i16* ptr = ...
   i32 val = ptr[0] | (ptr[1] << 16);

==> // subject to endianness

   i16* ptr = ...
   i32 val = *(i32*) ptr;

To me that seems like a win (or at least, not a loss) on any
architecture. However, I will admit that I've only ever worked on x86
so I have a lot of blind spots here.

Thanks!
-- Sanjoy

Hi Artur,

Artur Pilipenko wrote:

> BTW, do we really need to emit an atomic load if all the individual
> components are bytes?

Depends -- do you mean at the hardware level or at the IR
level?

If you mean at the IR level, then I think yes; since otherwise it is
legal to do transforms that break byte-wise atomicity in the IR, e.g.:

   i32* ptr = ...
   i32 val = *ptr

=> // Since no threads can be legally racing on *ptr

   i32* ptr = ...
   i32 val0 = *ptr
   i32 val1 = *ptr
   i32 val = (val0 & 1) | (val1 & ~1);

If you're talking about the hardware level, then I'm not sure; and my
guess is that the answer is almost certainly arch-dependent.
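
Concretely, at the IR level the widened load would need to be at least unordered to rule out that kind of duplication, e.g. (a sketch, assuming the pointer is naturally aligned):

   %w = load atomic i32, i32* %p.i32 unordered, align 4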

-- Sanjoy


Sanjoy Das wrote:

> If you mean at the IR level, then I think yes; since otherwise it is
> legal to do transforms that break byte-wise atomicity in the IR.

I meant the case when we have a load by bytes pattern like this:
i8* p = ...
i8 b0 = *p++;
i8 b1 = *p++;
i8 b2 = *p++;
i8 b3 = *p++;
i32 result = b0 << 24 | b1 << 16 | b2 << 8 | b3 << 0;

When we fold it to an i32 load, should this load be atomic?

Artur

Hi Artur,

Artur Pilipenko wrote:
>>
>> Hi Artur,
>>
>> Artur Pilipenko wrote:
>>
>>> BTW, do we really need to emit an atomic load if all the individual
>>> components are bytes?
>> Depends -- do you mean at the hardware level or at the IR
>> level?
>>
>> If you mean at the IR level, then I think yes; since otherwise it is
>> legal to do transforms that break byte-wise atomicity in the IR, e.g.:
>>
>> i32* ptr = ...
>> i32 val = *ptr
>>
>> => // Since no threads can be legally racing on *ptr
>>
>> i32* ptr = ...
>> i32 val0 = *ptr
>> i32 val1 = *ptr
>> i32 val = (val0 & 1) | (val1 & ~1);
>>
>> If you're talking about the hardware level, then I'm not sure; and my
>> guess is that the answer is almost certainly arch-dependent.
> I meant the case when we have a load by bytes pattern like this:
> i8* p = ...
> i8 b0 = *p++;
> i8 b1 = *p++;
> i8 b2 = *p++;
> i8 b3 = *p++;
> i32 result = b0 << 24 | b1 << 16 | b2 << 8 | b3 << 0;
>
> When we fold it to an i32 load, should this load be atomic?

If we do fold it to a non-atomic i32 load, then it would be legal for
LLVM to do the IR transform I mentioned above. That breaks the
byte-wise atomicity you had in the original program.

That is, in:

   i8* p = ...
   i8 b0 = *p++;
   i8 b1 = *p++;
   i8 b2 = *p++;
   i8 b3 = *p++;
   // Note: I changed this to be little endian, and I've assumed
   // that we're compiling for a little endian system
   i32 result = b3 << 24 | b2 << 16 | b1 << 8 | b0 << 0;

say all of p[0..3] are 0, and you have another thread racing to store
-1 to p[0]. Then result can either be 0 or 255.

However, say you first transform this to a non-atomic i32 load:

   i8* p = ...
   i32* p.i32 = (i32*)p
   i32 result = *p.i32

and we do the transform above

   i8* p = ...
   i32* p.i32 = (i32*)p
   i32 result0 = *p.i32
   i32 result1 = *p.i32
   i32 result = (result0 & 1) | (result1 & ~1);

then it is possible for result to be 254 (by result0 observing 0 and
result observing 255).

-- Sanjoy

Correcting obvious typo:

Sanjoy Das wrote:
[snip]

> i32 result0 = *p.i32
> i32 result1 = *p.i32
> i32 result = (result0 & 1) | (result1 & ~1);
>
> then it is possible for result to be 254 (by result0 observing 0 and
> result observing 255).

Should be: "by result0 observing 0 and result1 observing 255"

-- Sanjoy

Sanjoy Das wrote:

> If we do fold it to a non-atomic i32 load, then it would be legal for
> LLVM to do the IR transform I mentioned above. That breaks the
> byte-wise atomicity you had in the original program.

I see. For some reason I was assuming byte-wise atomicity for non-atomic loads.

So, if any of the components are atomic, the resulting load must be atomic as well.

Artur

We have seen cases where it is profitable to widen at a given point in compilation, but after optimizations such as inlining (with more surrounding code), later passes do a better job with the non-widened code [1].

A couple of concerns:

  1. Having instcombine rules for widening would need to take care of endianness and interactions with other optimizations, since instcombine is run multiple times. From the above discussion there are a couple of passes, such as PRE and LICM (probably more), that can be affected negatively by the widening. I’m not sure it’s a good idea to add this as an instcombine rule.

  2. I think identifying enough obvious cases to warrant a late fix-up pass, which would be an improvement or neutral for all architectures, may be difficult :) There’s also the question of maintainability - adding new rules to the pass. Widening rules are perhaps best suited as backend optimizations, in the presence of a profitability cost model specific to the architecture. For example, there is store widening in the Hexagon backend. I’m not sure if we have more of this in place.

[1] http://lists.llvm.org/pipermail/llvm-dev/2016-June/101789.html

Anna

We've talked about the need for element-wise atomic vectors in other contexts. This sounds like maybe we need an element-wise atomic notion on non-vectors as well. The "element type" is merely a byte. Alternatively, we could re-frame widening as producing a vector or struct type which is element-wise atomic, but that seems like a lot of complexity.

Philip

This is one of the cases that actually is a loss on many MIPS platforms, unless ptr[0] is 4-byte aligned. If it is 2-byte aligned (which is the only place I’ve seen this crop up in hot loops of benchmark code - the only place where I’ve looked seriously at why we’re generating bad code), we end up widening two 2-byte, 2-byte-aligned loads into a 4-byte, 2-byte-aligned load. The latter sequence is more expensive.

Even on x86, this holds dynamically: you’ll hit a slower microcode path on a lot of x86 implementations if you’re dispatching an unaligned 4-byte load than if you dispatch a pair of aligned 2-byte loads. If the two two-byte loads are adjacent and 4-byte aligned (and therefore within the same cache line), then you’ll often see micro-op fusion combine them back into a 4-byte load, depending on the surrounding instructions.

Even then, it’s only a win on x86 if val is the only use of the result. We were widening loads that had independent uses, so we ended up loading into one register and then splitting out portions of the result, which put a lot more pressure on the register rename unit[1]. This has a decidedly non-linear performance outcome: often it’s fine, but if it’s in a loop and it crosses a threshold then you see about a 50% performance decrease (on recent x86 microarchitectures).

David

[1] A few years ago, I would have said cache usage was the most important thing to optimise for as a first pass. With modern superscalar chips, register rename pressure is now even more important. This was really driven home to me a few months ago when I made a loop run twice as fast on the two most recent Intel microarchitectures by inserting an xor %rax, %rax at the top - explicitly killing the false dependency between loop iterations made more of a difference than all of the other optimisations that we’d done, combined.


To expand on David's point, alignment matters heavily on MIPS. An unaligned load
has to be split into two loads (load word left/ load word right). For MIPSR6 it's a single
instruction, but if the load crosses certain thresholds such as a cache line/page boundary
some implementations may generate an address exception which causes the operating
system to emulate the load which faulted. (MIPSR6 supports unaligned accesses for
user code, but it is implementation-dependent as to the combination of hardware/OS
software support).

For cases where the widened load is sufficiently aligned and the values have separate
uses, for MIPS64, the component values might require canonicalization. E.g. combining
two i32 loads into an i64 load. If those i32s are used arithmetically, bit 31 has to be
replicated into bits 32 to 63. This is a hard ISA requirement for MIPS64 as it doesn't do
subregister accesses. (32 bit arithmetic operations on MIPS64 expect a 32 bit value sign
extended to 64 bits, otherwise the result is undefined.) This is bad for code density
as the normal 32 bit loads will perform the sign extension anyway.
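
At the IR level this looks roughly like the following (a sketch, assuming a little-endian target and a suitably aligned pointer %p); the sign extensions that the two original 32-bit loads would perform implicitly now have to be materialized separately:

   %w     = load i64, i64* %p, align 8
   %lo    = trunc i64 %w to i32        ; first i32 (low half, little-endian)
   %hi.sh = lshr i64 %w, 32
   %hi    = trunc i64 %hi.sh to i32    ; second i32
   ; On MIPS64, feeding %lo/%hi into 32-bit arithmetic requires re-sign-extending
   ; them to 64 bits, which the original lw loads would have done for free.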

In summary, I think load combining would need some sort of target cost model /
target profitability query to determine if combining some set of loads for some
chain of operations is actually an optimization.

Thanks,
Simon

Philip and I talked about this in person. Given the fact that load widening in the presence of atomics is an irreversible transformation, we agreed that we don't want to do this early. For now it can be implemented as a peephole optimization over machine IR. MIR is preferred here because the X86 backend does GEP reassociation at the MIR level, which can make the information about addresses being adjacent available.

The inline cost of the original pattern is a valid concern and we might want to revisit our decision later. But in order to do widening earlier we need to have a way to undo this transformation.

I’m going to try implementing the MIR optimization, but not in the immediate future.

Artur