Optimizing out redundant alloca involving byval params

Hello all,

I’m trying to find the pass that would convert from:

define void @main(%struct* byval %ptr) {
%val = load %struct* %ptr
%val.ptr = alloca %struct
store %struct %val, %struct* %val.ptr
call void @extern_func(%struct* byval %val.ptr)
ret void
}

to this:

define void @main(%struct* byval %ptr) {
call void @extern_func(%struct* byval %ptr)
ret void
}

First, am I missing something? Would this be a correct optimization?

Thank you,
Mircea.

I think lib/Transforms/Scalar/MemCpyOptimizer.cpp might be the right place for this, considering that most frontends will use memcpy for that copy anyway. It already has some logic for byval args.

Reid is right that this would go in memcpyopt, but there's an active discussion on the commit list which will solve this through a different mechanism. There's an active desire to avoid teaching GVN and related pieces (of which memcpyopt is one) about first-class aggregates. We don't have enough active users of the feature to justify maintaining the complexity.

If you haven't already seen it, this background may help: http://llvm.org/docs/Frontend/PerformanceTips.html#avoid-loads-and-stores-of-large-aggregate-type

The current proposal is to convert such aggregate loads and stores into their component pieces. If that happens, your example should come "for free", provided that the same example works when you break the FCA down into its component pieces. If it doesn't, please say so.

Philip

Thanks!

Philip, do you mean I should transform the original IR to something like this? (…which is what -expand-struct-regs can do, when applied to my original input)

define void @main(%struct* byval %ptr) {
%val.index = getelementptr %struct* %ptr, i32 0, i32 0
%val.field = load i32* %val.index
%val.index1 = getelementptr %struct* %ptr, i32 0, i32 1
%val.field2 = load i32* %val.index1
%val.ptr = alloca %struct
%val.ptr.index = getelementptr %struct* %val.ptr, i32 0, i32 0
store i32 %val.field, i32* %val.ptr.index
%val.ptr.index4 = getelementptr %struct* %val.ptr, i32 0, i32 1
store i32 %val.field2, i32* %val.ptr.index4
call void @extern_func(%struct* byval %val.ptr)
ret void
}

If so, would you mind pointing me to the phase that would reduce this? (I'm assuming that's what you meant by "for free": that there's an existing phase I could use.)

Thank you.
Mircea.

Yes. Sorry, what? -expand-struct-regs doesn't appear to be a pass in ToT. Are you using an older version of LLVM? If so, none of my comments will apply. I would expect GVN to get this. If you can run this through a full -O3 pass order and get the right result, isolating the pass in question should be easy.
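One way to isolate it (a sketch, assuming the snippet lives in input.ll, which is just a placeholder name):

  # -print-after-all dumps the IR to stderr after each pass, so the log
  # shows exactly which pass removes the redundant alloca
  opt -O3 -S -print-after-all input.ll -o /dev/null 2> passes.log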

Sorry, that phase is part of the PNaCl toolchain. I'm on LLVM 3.6; would your comments still apply?

I tried -O3 to no avail. I suppose I'll get LLVM 3.7, see if I can optimize the latest snippet there (the one avoiding aggregate load/store), and take it from there.

Thanks!

Erratum: I am on 3.6, full stop. I thought there was a 3.7 available, based on the title of http://llvm.org/docs/ ("LLVM 3.7 documentation"). I suppose the docs are ahead of the release schedule?

I dug a bit more. It appears the succession -memcpyopt -instcombine can convert this:

%struct.Str = type { i32, i32, i32, i32, i32, i32 }

define void @_Z4test3Str(%struct.Str* byval align 8 %s) {
entry:
  %agg.tmp = alloca %struct.Str, align 8
  %0 = bitcast %struct.Str* %agg.tmp to i8*
  %1 = bitcast %struct.Str* %s to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 24, i32 4, i1 false)
  call void @_Z6e_test3Str(%struct.Str* byval align 8 %agg.tmp)
  ret void
}

Into this:

define void @_Z4test3Str(%struct.Str* byval align 8 %s) {
entry:
  call void @_Z6e_test3Str(%struct.Str* byval align 8 %s)
  ret void
}
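(Run as, roughly:

  opt -memcpyopt -instcombine -S test.ll

with the snippet above in test.ll; the file name is just a placeholder.)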

Which is great. This isn't, however, happening with GEP- and load/store-based IR (a total of 6 sequences of GEP on %s, load, then GEP on %agg.tmp and store, like the one discussed earlier in this thread).

I see 2 options:

  1. convert the pass I’m working on to produce memcpy instead of load/store successions, which would allow the resulting IR to fit in the canonical patterns optimized today, or

  2. add support (probably to memcpyopt) for converting load/store successions into memcpy, then let the current optimizations reduce the resulting IR.

I’m looking for feedback as to which path to take. Are there known instances of successive load/store that would benefit from being replaced with memcpy (option 2)?

Thank you,
Mircea.

> 1) convert the pass I'm working on to produce memcpy instead of load/store
> successions, which would allow the resulting IR to fit in the canonical
> patterns optimized today, or

I'd say that if you are copying an object and it requires more than 2 loads
and stores, use memcpy. This is what Clang does for aggregate copies when
there is no copy ctor.
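For instance, judging by the mangled names, the IR earlier in the thread was presumably generated from C++ source along these lines:

struct Str { int a, b, c, d, e, f; };  // six i32 fields, 24 bytes, no copy ctor
void e_test(Str);                      // _Z6e_test3Str
void test(Str s) { e_test(s); }        // _Z4test3Str

Here Clang lowers the byval copy of s into the %agg.tmp alloca with a single llvm.memcpy rather than field-by-field loads and stores.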

> 2) add support (probably to memcpyopt) for converting load/store
> successions into memcpy, then let the current optimizations reduce the
> resulting IR.

We should do this as a separate pass (I thought we did?), but it's hard to do when there is interior padding in the struct: it's hard to know whether the interior padding of the destination needs to retain the data that was originally there.
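To illustrate, a minimal sketch (%padded is a hypothetical type; pre-3.7 syntax to match the rest of the thread):

%padded = type { i8, i32 }  ; 3 bytes of interior padding after the i8

define void @copy_fields(%padded* %dst, %padded* %src) {
  ; field-by-field copy: bytes 1-3 of %dst (the padding) are never written
  %s0 = getelementptr %padded* %src, i32 0, i32 0
  %v0 = load i8* %s0
  %d0 = getelementptr %padded* %dst, i32 0, i32 0
  store i8 %v0, i8* %d0
  %s1 = getelementptr %padded* %src, i32 0, i32 1
  %v1 = load i32* %s1
  %d1 = getelementptr %padded* %dst, i32 0, i32 1
  store i32 %v1, i32* %d1
  ; rewriting this as an 8-byte memcpy would also overwrite %dst's padding
  ; bytes, which is only safe if nothing relies on their previous contents
  ret void
}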