RFC: Missing canonicalization in LLVM

So, we’ve run into some test cases which are pretty alarming.

When inlining code in various different paths we can end up with this IR:

define void @f(float* %value, i8* %b) {
entry:
  %0 = load float* %value, align 4
  %1 = bitcast i8* %b to float*
  store float %0, float* %1, align 1
  ret void
}

define void @g(float* %value, i8* %b) {
entry:
  %0 = bitcast float* %value to i32*
  %1 = load i32* %0, align 4
  %2 = bitcast i8* %b to i32*
  store i32 %1, i32* %2, align 1
  ret void
}

Now, I don’t really care one way or the other about these two IR inputs, but it’s pretty concerning that we get these two equivalent bits of code and nothing canonicalizes to one or the other.

So, the naive first blush approach here would be to canonicalize on the first – it has fewer instructions after all – but I don’t think that’s the right approach for two reasons:

  1. It will be a very narrow canonicalization that only works with overly specific sets of casted pointers.
  2. It doesn’t effectively move us toward the optimizer treating IR with different pointee types for pointer types indistinguishably. Some day, I continue to think we should get rid of the pointee types entirely.

To see why #1 and #2 are problematic, assume another round of inlining took place and we suddenly had the following IR:

define void @f(i8* %value, i8* %b) {
entry:
  %0 = bitcast i8* %value to float*
  %1 = load float* %0, align 4
  %2 = bitcast i8* %b to float*
  store float %1, float* %2, align 1
  ret void
}

define void @g(i8* %value, i8* %b) {
entry:
  %0 = bitcast i8* %value to i32*
  %1 = load i32* %0, align 4
  %2 = bitcast i8* %b to i32*
  store i32 %1, i32* %2, align 1
  ret void
}

AFAICT, this is the same and we still don’t have a good canonicalization story.

What seems like the obvious important and missing canonicalization is that when we have a loaded value that is only used by storing it back into memory, we don’t canonicalize the type of that value (ignoring the pointer types) to a single value type.

So, the only really suitable type for this kind of stuff is ‘iN’ where N matches the number of bits loaded or stored.
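For illustration, a minimal sketch (hypothetical %p and %q) of what that canonicalization does to a float that is loaded and then only stored:

; before
%v = load float* %p, align 4
store float %v, float* %q, align 4

; after canonicalizing the loaded value to i32
%p.i = bitcast float* %p to i32*
%v.i = load i32* %p.i, align 4
%q.i = bitcast float* %q to i32*
store i32 %v.i, i32* %q.i, align 4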

I have this change implemented. It is trivial and unsurprising. However, the effects of this are impossible to predict so I wanted to make sure it made sense to others. Essentially, I expect random and hard to track down performance fluctuations across the board. Some things may get better, others may get worse, and they will probably all be bugs elsewhere in the stack.

So, thoughts?
-Chandler


The first thing that springs to mind is that I don’t trust the backend to get this right. I don’t think it will understand when an i32 load/store would have been preferable to a float one or vice versa. I have no evidence of this, but given how strongly typed tablegen is, I don’t think it can make a good choice here.

So I think we probably need to teach the backend how to undo whatever canonical form we choose if it has a reason to. And the best long term solution is for tablegen to have sized load/stores, not typed ones.

One (potentially expensive) way to choose the canonical form here is to look at the users of the load and see what type works best. If we load an i32, but bit cast and do an fp operation on it, then a float load was best. If we just load it then store, then in theory either type works.
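For instance (a hypothetical sketch):

%0 = bitcast float* %p to i32*
%1 = load i32* %0, align 4
%2 = bitcast i32 %1 to float
%3 = fadd float %2, %f

would have been better as a float load, since the only real use of the value is floating point:

%1 = load float* %p, align 4
%3 = fadd float %1, %f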

Pete

The first thing that springs to mind is that I don’t trust the backend to
get this right. I don’t think it will understand when an i32 load/store
would have been preferable to a float one or vice versa. I have no
evidence of this, but given how strongly typed tablegen is, I don’t think
it can make a good choice here.

I don't think tablegen is relevant to making a good choice here. This only
comes up when we have a load which is only ever stored. See below, I'll
come back to this after clarifying...

So I think we probably need to teach the backend how to undo whatever
canonical form we choose if it has a reason to. And the best long term
solution is for tablegen to have sized load/stores, not typed ones.

One (potentially expensive) way to choose the canonical form here is to
look at the users of the load and see what type works best. If we load an
i32, but bit cast and do an fp operation on it, then a float load was
best. If we just load it then store, then in theory either type works.

We actually already do this. =] I taught instcombine to do this recently.
The way this works is as follows:

If we find a load with a single bitcast of its value to some other type, we
try to load that type instead. We rely on the fact that if there is in fact
a single type that the load is used as, some other part of LLVM will fold
the redundant bitcasts. I can easily handle redundant bitcasts if you like.

If we find a store of a bitcasted value, we try to store the original value
instead.
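
A minimal sketch of both rewrites (illustrative values, not necessarily the exact output InstCombine produces):

; a load whose only use is a bitcast to float
%0 = load i32* %p, align 4
%1 = bitcast i32 %0 to float
; ... is rewritten to load the float directly
%0 = bitcast i32* %p to float*
%1 = load float* %0, align 4

; a store of a bitcasted value
%0 = bitcast float %v to i32
store i32 %0, i32* %q, align 4
; ... is rewritten to store the original float
%1 = bitcast i32* %q to float*
store float %v, float* %1, align 4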

This works really well *except* for the case when the only (transitive)
users of the loaded value are themselves stores. In that case, there is no
"truth" we can rely on from operational semantics. We need to just pick a
consistent answer IMO. Integers seem like the right consistent answer, and
this isn't the only place LLVM does this -- we also lower a small memcpy as
an integer load and store.
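For example, a 4-byte memcpy ends up as roughly the following shape (a sketch; the exact casts and alignment will vary):

call void @llvm.memcpy.p0i8.p0i8.i64(i8* %d, i8* %s, i64 4, i32 1, i1 false)
; becomes roughly
%s.i = bitcast i8* %s to i32*
%d.i = bitcast i8* %d to i32*
%v = load i32* %s.i, align 1
store i32 %v, i32* %d.i, align 1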

As for the backend, I agree I don't trust them to do anything clever here,
but I think you may be missing how clever they would need to be. =D The
only way this matters is that, for example, you have a store-to-load
forwarding unit in your CPU that only works within a register class, and
thus a stored integer will fail to get forwarded in the CPU to a load of a
floating point value at the same address. If LLVM is ever able to forward
this in the IR, it should re-write all the types to match whatever
operation we have.

I don't think any backend today (or in the foreseeable future) would have
such smarts. But if CPUs are actually impacted by this kind of thing (and I
have no evidence that they are), we might be getting lucky some of the time.

Personally, I don't think this is compelling enough to delay making a
change because I have test cases where we are getting *unlucky* today, and
picking a canonical form will either directly fix them or make it much
easier to fix them.

From: "Pete Cooper" <peter_cooper@apple.com>
To: "Chandler Carruth" <chandlerc@gmail.com>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Wednesday, January 21, 2015 4:43:47 PM
Subject: Re: [LLVMdev] RFC: Missing canonicalization in LLVM

So, we've run into some test cases which are pretty alarming.

When inlining code in various different paths we can end up with this
IR:

define void @f(float* %value, i8* %b) {

entry:
%0 = load float* %value, align 4
%1 = bitcast i8* %b to float*
store float %0, float* %1, align 1
ret void
}

define void @g(float* %value, i8* %b) {

entry:
%0 = bitcast float* %value to i32*
%1 = load i32* %0, align 4
%2 = bitcast i8* %b to i32*
store i32 %1, i32* %2, align 1
ret void
}

Now, I don't really care one way or the other about these two IR
inputs, but it's pretty concerning that we get these two equivalent
bits of code and nothing canonicalizes to one or the other.

So, the naive first blush approach here would be to canonicalize on
the first -- it has fewer instructions after all -- but I don't
think that's the right approach for two reasons:

1) It will be a *very* narrow canonicalization that only works with
overly specific sets of casted pointers.
2) It doesn't effectively move us toward the optimizer treating IR
with different pointee types for pointer types indistinguishably.
Some day, I continue to think we should get rid of the pointee types
entirely.

To see why #1 and #2 are problematic, assume another round of
inlining took place and we suddenly had the following IR:

And the missing IR example:

define void @f(i8* %value, i8* %b) {

entry:
%0 = bitcast i8* %value to float*
%1 = load float* %0, align 4
%2 = bitcast i8* %b to float*
store float %0, float* %1, align 1
ret void
}

define void @g(i8* %value, i8* %b) {

entry:
%0 = bitcast i8* %value to i32*
%1 = load i32* %0, align 4
%2 = bitcast i8* %b to i32*
store i32 %1, i32* %2, align 1
ret void
}

AFAICT, this is the same and we still don't have a good
canonicalization story.

What seems like the obvious important and missing canonicalization is
that when we have a loaded value that is *only* used by storing it
back into memory, we don't canonicalize the type of that *value*
(ignoring the pointer types) to a single value type.

So, the only really suitable type for this kind of stuff is 'iN'
where N matches the number of bits loaded or stored.

I have this change implemented. It is trivial and unsurprising.
However, the effects of this are impossible to predict so I wanted
to make sure it made sense to others. Essentially, I expect random
and hard to track down performance fluctuations across the board.
Some things may get better, others may get worse, and they will
probably all be bugs elsewhere in the stack.

So, thoughts? The first thing that springs to mind is that I don’t
trust the backend to get this right. I don’t think it will
understand when an i32 load/store would have been preferable to a
float one or vice versa. I have no evidence of this, but given how
strongly typed tablegen is, I don’t think it can make a good choice
here.

So I think we probably need to teach the backend how to undo whatever
canonical form we choose if it has a reason to. And the best long
term solution is for tablegen to have sized load/stores, not typed
ones.

You still need to make sure, in many cases, that you load into the correct register file to avoid expensive cross-domain moves. So *something* needs to look at the eventual users, either the IR-level optimizer or the backend. I'd prefer the IR-level optimizer if possible.

-Hal

define void @f(float* %value, i8* %b) {
entry:
  %0 = load float* %value, align 4
  %1 = bitcast i8* %b to float*
  store float %0, float* %1, align 1
  ret void
}

define void @g(float* %value, i8* %b) {
entry:
  %0 = bitcast float* %value to i32*
  %1 = load i32* %0, align 4
  %2 = bitcast i8* %b to i32*
  store i32 %1, i32* %2, align 1
  ret void
}

I don’t think these are equivalent representations. The one with the float loads and stores has the potential of FP exceptions, both during the load and during the store, depending on what exact instruction set is in use. For example, floating-point loads done with x87 aren’t even guaranteed to give back the same bit pattern, due to signaling NaNs potentially being converted into quiet NaNs by the fld instruction.

Thus, I don’t think this is an example of a missing canonicalization, nor even a legal one in all circumstances. But maybe there is a subtlety of LLVM IR semantics that I am unaware of.

Kevin B. Smith

Yeah, thinking about this more, integers do seem like the right answer. If a backend wanted to do something special then it’s up to them to handle it. For example, x86 might load-balance its issue ports by turning an i32 load/store into an SSE insert/extract, although I cannot imagine any time this would actually be a good idea.

Sounds good to me. Integers it is then.

Pete

FYI, I’m going to add this behind a flag, as the implementation really is trivial and this should make it quite a bit easier to discuss.

LLVM explicitly doesn't support FP exceptions on loads and stores. And it
would break the world to add it.

See my response to Pete, but in general this only applies in a case where
there is no use of the load other than a store. I already taught
instcombine to canonicalize loads which are *used* as floating point values
correctly.

OK.

Then I think the clear choice ought to be to canonicalize to the same-sized int representation, to account for architectures whose FP may not exactly match LLVM’s assumptions. Even when FP exceptions are masked, an x87 load changes the bit pattern of an SNaN to a QNaN, so it is going to be generally incorrect to convert an integer load to an FP load, as that won’t properly preserve all possible bit patterns of the integer, unless great pains are taken not to use the fld instruction to implement the FP load, which just seems wrong as well.
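
To make that concrete (a sketch, using an example bit pattern): if the integer happens to hold the bits of a single-precision signaling NaN, an FP round trip through x87 can change them:

; suppose %p holds the i32 bit pattern 0x7F800001 (a signaling NaN)
%0 = bitcast i32* %p to float*
%1 = load float* %0, align 4         ; fld may quiet the NaN
%2 = bitcast i32* %q to float*
store float %1, float* %2, align 4   ; may store 0x7FC00001 rather than 0x7F800001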

Kevin Smith

FYI, thanks, I'm just going to commit this then. It seems we're all in
essential agreement. We can revert it and take a more cautious approach if
something terrible happens. =]

No problem :) I don’t expect any issues, but then I don’t know the quirks of all of our targets, so it’ll be fun to see what happens.

Pointee? I'd think that eliminating multiple pointer types would be a better thing to "normalize". Having a tag on a load stating the loaded type would be useful for instruction selection. Unless I'm not reading this correctly...

-Krzysztof

So this change did indeed have an effect! :)

I’m seeing regressions in a number of benchmarks mainly due to a host of extra bitcasts that get introduced. Here’s the problem I’m seeing in a nutshell:

  1. There is a Phi with input type double

  2. Polly demotes the phi into a load/store of type double

  3. InstCombine canonicalizes the load/store to use i64 instead of double

  4. SROA removes the load/store & inserts a phi back in, using i64 as the type. Inserts bitcast to get to double.

  5. The bitcast sticks around and eventually gets translated into FMOVs (for AArch64 at least).

The function findCommonType() in SROA.cpp is used to obtain the type that should be used for the new alloca that SROA wants to create. Its decision process is essentially: if all loads/stores of the alloca use the same type, use that type; else use the corresponding integer type. This causes bitcasts to be inserted in a number of places, most of which stick around.

I’ve copied a reduced version of an instance of the problem below. I’m looking for comments on what others think is the right solution here. Make SROA more intelligent about picking the type?

The code is below with all unnecessary code removed for easy consumption.

Daniel

Before "Polly - Prepare code for polly" we have code that looks like:

while.cond473:                          ; preds = %while.cond473.outer78, %while.body475
  %p_j_x452.0 = phi double [ %105, %while.body475 ], [ %p_j_x452.0.ph82, %while.cond473.outer78 ]

while.body475:                          ; preds = %while.cond473
  %sub480 = fsub fast double %64, %p_j_x452.0
  %105 = load double* %x485, align 8, !tbaa !25

After "Polly - Prepare code for polly" we have:

while.cond473:                          ; preds = %while.cond473.outer78, %while.body475
  %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem

while.body475:                          ; preds = %while.cond473
  %sub480 = fsub fast double %64, %p_j_x452.0.reload
  %110 = load double* %x485, align 8, !tbaa !25
  store double %110, double* %p_j_x452.0.reg2mem

After "Combine redundant instructions":

while.cond473:                          ; preds = %while.cond473.outer78, %while.body475
  %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem, align 8

while.body475:                          ; preds = %while.cond473
  %sub480 = fsub fast double %74, %p_j_x452.0.reload
  %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0, i32 0
  %194 = bitcast double* %x485 to i64*
  %195 = load i64* %194, align 8, !tbaa !25
  %200 = bitcast double* %p_j_x452.0.reg2mem to i64*
  store i64 %195, i64* %200, align 8

After SROA:

while.cond473:                          ; preds = %while.cond473.outer78, %while.body475
  %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 = phi i64 [ %p_j_x452.0.ph73.reg2mem.sroa.0.0.load368, %while.cond473.outer78 ], [ %178, %while.body475 ]
  %173 = bitcast i64 %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 to double

while.body475:                          ; preds = %while.cond473
  %sub480 = fsub fast double %78, %173
  %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0, i32 0
  %177 = bitcast double* %x485 to i64*
  %178 = load i64* %177, align 8, !tbaa !25

Hi Daniel

Thanks for the excellent breakdown of what's going on here.

Earlier in the thread on this I made this comment:

"The first thing that springs to mind is that I don’t trust the backend to get this right. I don’t think it will understand when an i32 load/store would have been preferable to a float one or vice versa. I have no evidence of this, but given how strongly typed tablegen is, I don’t think it can make a good choice here.

So I think we probably need to teach the backend how to undo whatever canonical form we choose if it has a reason to”

Without seeing the machine instructions, it's hard to be 100% certain, but the case you’ve found may be simple enough that the backend can actually fix this. However, such a fixup would be quite target specific (such a target would need different register classes for integers and doubles in this case), and we’d need such a pass for all targets, which isn’t ideal.

So i wouldn’t rule out a backend solution, but i have a preference for your suggestion to improve SROA.

In this particular case, it makes sense for SROA to do effectively the same analysis InstCombine did here and work out when a load is just raw data vs when its data is used as a specific type. The relevant piece of InstCombine is this:

  // Try to canonicalize loads which are only ever stored to operate over
  // integers instead of any other type. We only do this when the loaded type
  // is sized and has a size exactly the same as its store size and the store
  // size is a legal integer type.
  if (!Ty->isIntegerTy() && Ty->isSized() &&
      DL.isLegalInteger(DL.getTypeStoreSizeInBits(Ty)) &&
      DL.getTypeStoreSizeInBits(Ty) == DL.getTypeSizeInBits(Ty)) {
    if (std::all_of(LI.user_begin(), LI.user_end(), [&LI](User *U) {
          auto *SI = dyn_cast<StoreInst>(U);
          return SI && SI->getPointerOperand() != &LI;
        })) {
...

After ignoring load/stores which satisfy something like the above code, you can always fall back to the current code of choosing an integer type, so in the common case there won’t be any behavior difference.

Cheers
Pete

Thanks for the reply Pete. Unfortunately, I don’t think it is going to be as simple as ignoring those loads which are only ever stored. In findCommonType(), only one alloca is passed in at a time. So, while you could find those cases where that alloca was loaded from and stored elsewhere, you can’t find those places that store to that alloca from somewhere else (at least not easily that I can see).

So in my particular example, I could catch the case of only load → store. However, there are other stores that use the alloca address, but it is not readily apparent if they come directly & only from a load.

Still trying to figure out the best way forward.

Daniel

I think I can solve this by adding bitcast checks to the loads, in addition to the stores, and also checking the stores to ensure they are fed by a load and that the load only feeds that store. I’ll test this solution some more and try to make a patch.

Daniel

FYI, on vacation and then at a conference but will actually look at this on Monday next week.

After trying various means of undoing the canonicalization at the SROA pass, I’m thinking an easier/better approach is to simply be much more conservative when doing the original canonicalization in the first place. My proposal is

if (std::all_of(LI.user_begin(), LI.user_end(), [&LI](User *U) {
      auto *SI = dyn_cast<StoreInst>(U);
      return SI && SI->getPointerOperand() != &LI &&
             SI->getPointerOperand()->hasOneUse();
    })) {

where the addition is checking that the StoreInst’s pointer operand is only used once. What I’ve found is that when it is used multiple times, it is usually due to the demotion of phis into loads/stores. There are multiple stores because several blocks are using the storage location essentially as a phi. I originally tried to ignore the cases where the loads are only stored during SROA, but that really doesn’t work unless you follow all the other places that also load from those same locations, accounting for all the inserted bitcasts. It got way too messy trying to catch all the cases, and it still wasn’t able to undo what needed to be undone.
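
A stripped-down sketch of the kind of pattern the extra check backs off from (hypothetical names):

; the canonicalization would fire on %v, since its only use is a store. But the
; destination %x.reg2mem is a demoted phi slot with several users (another store
; plus the reload), so with the hasOneUse() check we leave it alone.
%v = load double* %src, align 8
store double %v, double* %x.reg2mem
...
store double %other, double* %x.reg2mem
...
%x.reload = load double* %x.reg2mem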

I’ve found I can get back almost all of the regression I see by being more conservative at the canonicalization step.

Would this still meet the cases you originally wrote this canonicalization for, Chandler?

Daniel