Overzealous PromoteCastOfAllocation

Hi all,

I'm currently running into some problems with instcombine changing the type of
alloca instructions. In particular, PromoteCastOfAllocation looks at any
allocation instruction that is used by a bitcast.

It does a few checks, but basically tries to change the type of the alloca
instruction to the type pointed to by the bitcasted type. The current
heuristic for determining whether this is a good idea is "do it if the
alignment of the new type is larger than before" (note that this is not just
a safety check: when the alignments are equal, the alloca is not changed).
This heuristic seems to have been introduced to prevent infinite loops when
there are multiple uses of the alloca.
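To illustrate what I mean (a hand-written sketch, not actual instcombine
output), the transformation turns code like:

  %A = alloca i16                      ; i16 has ABI alignment 2
  %B = bitcast i16* %A to i32*
  store i32 0, i32* %B

into:

  %A = alloca i32                      ; i32 has ABI alignment 4, so it fires
  store i32 0, i32* %A

rewriting any remaining uses of the old alloca through newly inserted
bitcasts back to the original type.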

At first glance, changing the alloca type to remove a bitcast seems a sensible
idea. However, the transformation seems quite ignorant of other uses of the
alloca. It does properly update them by inserting extra bitcasts, but in a
lot of cases this means you remove a single bitcast (or perhaps a few) while
introducing a bunch of others. IMHO, a better policy is needed.

Another problem that occurs here is that this transformation can replace a
struct alloca with a single integer alloca. This, in turn, can confuse other
passes like scalarrepl and produce quite ugly code. I've assembled a small
example of this.

  declare void @bar(i16, i16)

  define internal void @empty(i32* %X) {
    ret void
  }

  define void @foo(i16 %A, i16 %B) {
  entry:
    ;; Alloc a struct
    %V = alloca {i16, i16}
    ;; Save our arguments into it
    %V1 = getelementptr {i16, i16}* %V, i32 0, i32 0
    %V2 = getelementptr {i16, i16}* %V, i32 0, i32 1
    store i16 %A, i16* %V1
    store i16 %B, i16* %V2

    ;; Load the values again
    %R1 = load i16* %V1
    %R2 = load i16* %V2
    ;; Pass them to an external function to make them used
    call void @bar(i16 %R1, i16 %R2)

    ;; Bitcast the alloca to i32*
    %V32 = bitcast {i16, i16}* %V to i32*
    ;; And make it appear used (at first glance; deadargelim can
    ;; remove this use later on)
    call void @empty(i32* %V32)
    ret void
  }

The above program does almost nothing, but contains an alloca of {i16, i16}
which is bitcasted to i32* somewhere. This i32* appears used, but can later be
removed (you'll see why).

Now, when I run instcombine over this code, I get the following (other
functions left out, as they are unchanged):
  define void @foo(i16 %A, i16 %B) {
  entry:
          %V = alloca i32 ; <i32*> [#uses=3]
          %tmpcast = bitcast i32* %V to { i16, i16 }* ; <{ i16, i16 }*> [#uses=1]
          %V1 = bitcast i32* %V to i16* ; <i16*> [#uses=2]
          %V2 = getelementptr { i16, i16 }* %tmpcast, i32 0, i32 1 ; <i16*> [#uses=2]
          store i16 %A, i16* %V1, align 4
          store i16 %B, i16* %V2
          %R1 = load i16* %V1, align 4 ; <i16> [#uses=1]
          %R2 = load i16* %V2 ; <i16> [#uses=1]
          call void @bar( i16 %R1, i16 %R2 )
          call void @empty( i32* %V )
          ret void
  }

Here, the alloca is replaced by an i32 alloca. This doesn't seem like an
improvement to me, since now the proper uses of the struct (geps) need a
bitcast first. Though this looks like a minor problem at worst, it confuses
scalarrepl.

When I run -deadargelim -die -scalarrepl over the instcombined code (the first
two to remove the i32* use and bitcast), I get the following ugly code:
  define void @foo(i16 %A, i16 %B) {
  entry:
          %A1 = zext i16 %A to i32 ; <i32> [#uses=1]
          %A12 = shl i32 %A1, 16 ; <i32> [#uses=1]
          %V.in.mask = and i32 undef, 65535 ; <i32> [#uses=1]
          %A12.ins = or i32 %V.in.mask, %A12 ; <i32> [#uses=1]
          %B9 = zext i16 %B to i32 ; <i32> [#uses=1]
          %V.in8.mask = and i32 %A12.ins, -65536 ; <i32> [#uses=1]
          %B9.ins = or i32 %V.in8.mask, %B9 ; <i32> [#uses=2]
          %R14 = lshr i32 %B9.ins, 16 ; <i32> [#uses=1]
          %R15 = trunc i32 %R14 to i16 ; <i16> [#uses=1]
          %R27 = trunc i32 %B9.ins to i16 ; <i16> [#uses=1]
          call void @bar( i16 %R15, i16 %R27 )
          call void @empty( )
          ret void
  }

However, when I run -deadargelim -die -scalarrepl directly on the original
code, I get the result I expect:
  define void @foo(i16 %A, i16 %B) {
  entry:
          call void @bar( i16 %A, i16 %B )
          call void @empty( )
          ret void
  }

Now, while I know that the shift/zext/trunc mess above is perhaps supposed to
be cleaned up by instcombine again (it currently isn't), I think that the
actual problem lies with PromoteCastOfAllocation in instcombine. IMHO, the
struct alloca shouldn't have been replaced with an i32 alloca in the first
place.

Also, in the code where I originally noticed this problem, the resulting code
is more complex and less likely to be simplified again. In particular, instead
of the @empty function I have a call to memmove, whose destination is passed
as an argument to another function. ArgumentPromotion is able to split up this
struct argument and remove the memmove, but by then the alloca is already
screwed up by instcombine and scalarrepl is no longer able to properly expand
it.

Any comments on this problem? Do you agree that the current replacement policy
of PromoteCastOfAllocation is a bit too wide?

I think that some other policy should be implemented in
PromoteCastOfAllocation, probably one that takes into account the number of
bitcasts that can be removed and the number that would be introduced, or
perhaps just refuse to change a struct alloca to an integer alloca, or...

Gr.

Matthijs

Hi Matthijs,

Changing PromoteCastOfAllocation to not replace aggregate allocas with
non-aggregate allocas if they have GEP users sounds reasonable to me.

Finding the maximum alignment is sometimes still useful though, so
it would be nice to update the alignment field of the alloca even if
its type is left unchanged.

Dan

Hi Dan,

Changing PromoteCastOfAllocation to not replace aggregate allocas with
non-aggregate allocas if they have GEP users sounds reasonable to me.

This sounds reasonable indeed, but still a bit arbitrary. Haven't figured out
anything better yet, though.

Finding the maximum alignment is sometimes still useful though, so
it would be nice to update the alignment field of the alloca even if
its type is left unchanged.

The maximizing of the alignment is done by looking only at the type's ABI
alignment; the actual alignment of the alloca instruction is not used.

But what you suggest is that if the alloca is casted to some type that has a
higher alignment, you want that higher alignment propagated to the alloca
instruction? I can see why this is useful, since bitcasting to a type with a
higher alignment can actually produce invalid code, I think? Or how does this
work exactly?

Gr.

Matthijs

I haven't studied the code you're talking about in detail. I was
just observing that the compiler allocates the storage for an alloca,
so it can use whatever alignment it wants, and it makes sense to
use the greatest value that any of the users would want.

Dan

Oh, ok. So code that takes an alloca, bitcasts the address to a higher
alignment, and then does a load or store at the higher alignment, is
invoking undefined behavior. The alignment attribute on a load or store
is an assertion about the actual alignment of the memory.

Dan

Hi Dan,

Oh, ok. So code that takes an alloca, bitcasts the address to a higher
alignment,

Since alignment is not a property of a type, I don't think you can "bitcast to
a higher alignment" as such. I still understand what you mean, though :-)

and then does a load or store at the higher alignment, is
invoking undefined behavior. The alignment attribute on a load or store
is an assertion about the actual alignment of the memory.
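If I understand correctly, a hand-written sketch of such code (hypothetical,
not taken from my test case) would be:

  %P = alloca i16, align 2
  %Q = bitcast i16* %P to i32*
  ;; This load asserts 4-byte alignment that the alloca need not provide:
  %X = load i32* %Q, align 4

where the load invokes undefined behaviour whenever %P happens to be only
2-byte aligned.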

Should this undefined behaviour be caught by the verifier, or is it ok for it
to exist?

In any case, since this behaviour is undefined, it's not really needed for
the instcombine pass to propagate the ABI alignment of a bitcasted-to type
upwards into the alloca instruction, as you suggested. The frontend
generating the IR should have done this already when needed (when a load with
that higher alignment uses the bitcasted value, or the bitcasted value is
passed to another function, which requires ABI alignment).
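In other words (a hypothetical illustration), a frontend that knows the
struct will also be accessed as an i32 should already have emitted:

  %V = alloca {i16, i16}, align 4

so that a later 4-byte-aligned load through a bitcast of %V is valid without
instcombine having to touch the alloca.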

I've now created a patch that makes sure that a struct-typed alloca is never
changed by instcombine, unless all its uses are bitcasts to the same type.
This seems like the most sane way to do things. This change did expose an
alignment-related bug in llvm-gcc, see PR2823. This means that committing
preserve-struct.diff will result in a test-suite failure (at least
SingleSource/UnitTests/Vector/divides, I haven't finished testing Multisource
yet), but this is caused by llvm-gcc, not instcombine.

Related to this issue, I've created two more patches. I'm including them here
for review, comments are welcome. I'm not sure if these patches actually make
a lot of difference in real world applications and they are not yet sufficient
to make scalarrepl do its work completely in our example code (lots of struct
passing and the resulting memmoves), but they will at least make the resulting
code more logical and make it possible to get there.

I've tested these patches against the dejagnu tests and the test-suite,
finding only two failures. The first is the one regarding PR2823 referenced
above. The second one concerns
MultiSource/Benchmarks/MiBench/security-blowfish: for some reason opt emits an
invalid bitcode file (the bitcode reader asserts). From playing with the
debugger, it seems a store instruction is asked to store an i8* to an
[8 x i8]*. However, since the verifier doesn't catch this before outputting
the bitcode, it's probably something more involved. I'll look into this issue
a bit closer later (currently I don't even know which of the three patches
causes it).

Anyway, review of these patches and opinions as to their usefulness would be
welcome!

Gr.

Matthijs

firstclass.diff (1.56 KB)

preserve-struct.diff (2.39 KB)

vector-gep.diff (4.17 KB)

and then does a load or store at the higher alignment, is
invoking undefined behavior. The alignment attribute on a load or store
is an assertion about the actual alignment of the memory.

Should this undefined behaviour be caught by the verifier, or is it ok for it
to exist?

I don't think so. It's not the Verifier's job to diagnose undefined
behavior. The IR in this case is valid, so the verifier should
accept it.

For better or worse, stores to null pointers are pretty common in
LLVM IR. The most common source of these is unreachable code, and
bugpoint, but those are important use cases :-).

However, it might be interesting to have a separate
undefined-behavior-detector pass. An LLVM pass wouldn't likely be
usable as a user-level bug finder; clang's static analysis is
much better positioned to do that kind of thing. But an LLVM pass
would be able to find bugs in front-ends and optimization passes.

In any case, since this behaviour is undefined, it's not really needed for
the instcombine pass to propagate the ABI alignment of a bitcasted-to type
upwards into the alloca instruction, as you suggested. The frontend
generating the IR should have done this already when needed (when a load with
that higher alignment uses the bitcasted value, or the bitcasted value is
passed to another function, which requires ABI alignment).

Your analysis sounds correct. I misunderstood the situation before.

I've now created a patch that makes sure that a struct-typed alloca is never
changed by instcombine, unless all its uses are bitcasts to the same type.
This seems like the most sane way to do things. This change did expose an
alignment-related bug in llvm-gcc, see PR2823. This means that committing
preserve-struct.diff will result in a test-suite failure (at least
SingleSource/UnitTests/Vector/divides, I haven't finished testing Multisource
yet), but this is caused by llvm-gcc, not instcombine.

Related to this issue, I've created two more patches. I'm including them here
for review, comments are welcome. I'm not sure if these patches actually make
a lot of difference in real world applications and they are not yet sufficient
to make scalarrepl do its work completely in our example code (lots of struct
passing and the resulting memmoves), but they will at least make the resulting
code more logical and make it possible to get there.

I've tested these patches against the dejagnu tests and the test-suite,
finding only two failures. The first is the one regarding PR2823 referenced
above. The second one concerns
MultiSource/Benchmarks/MiBench/security-blowfish: for some reason opt emits an
invalid bitcode file (the bitcode reader asserts). From playing with the
debugger, it seems a store instruction is asked to store an i8* to an
[8 x i8]*. However, since the verifier doesn't catch this before outputting
the bitcode, it's probably something more involved. I'll look into this issue
a bit closer later (currently I don't even know which of the three patches
causes it).

Anyway, review of these patches and opinions as to their usefulness would be
welcome!

I haven't read your patches yet, but it sounds like you're heading in a
positive direction here :-).

Dan

I haven't read your patches yet, but it sounds like you're heading in a
positive direction here :-).

Would you manage to have a look at them today? I'll be working again tomorrow,
so feedback would be appreciated (though I think that your today will be
Sunday, while my tomorrow will be Monday :-))

Anyway, I'll also send these patches llvm-commits, for others to review.

Gr.

Matthijs

and then does a load or store at the higher alignment, is
invoking undefined behavior. The alignment attribute on a load or
store
is an assertion about the actual alignment of the memory.

Should this undefined behaviour be caught by the verifier, or is it
ok for it
to exist?

I don't think so. It's not the Verifier's job to diagnose undefined
behavior. The IR in this case is valid, so the verifier should
accept it.

Right.

For better or worse, stores to null pointers are pretty common in
LLVM IR. The most common source of these is unreachable code, and
bugpoint, but those are important use cases :-).

An important thing to consider is that valid and well-defined C code can
expose cases of undefined behavior in unreachable code after optimizations.
It is fine to turn these cases into "unreachable" instructions (which are
undefined by definition!) but the compiler can't reject or crash on them.
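For instance (a hand-written illustration), after constant folding the
optimizer can be left holding IR like:

  define void @f() {
  entry:
    br i1 false, label %dead, label %done
  dead:                           ; provably unreachable, but still in the IR
    store i32 0, i32* null
    br label %done
  done:
    ret void
  }

which must remain acceptable input for every pass.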

However, it might be interesting to have a separate
undefined-behavior-detector pass. An LLVM pass wouldn't likely be
usable as a user-level bug finder; clang's static analysis is
much better positioned to do that kind of thing. But an LLVM pass
would be able to find bugs in front-ends and optimization passes.

Right. From a user perspective, such a tool wouldn't be able to report back
to the user with enough fidelity to give them any info on how to fix the
problem.

-Chris