llvm.memcpy for struct copy

Hi all
I’m new here, and I have a question about the llvm.memcpy intrinsic.

Why does the llvm.memcpy intrinsic only support i8* for its first two arguments? And will clang also transform a struct copy into llvm.memcpy? What does the resulting IR look like?
Thanks!



The i8 type in the pointers doesn’t matter a whole lot. There’s a long-term plan to remove the pointee type from all pointers in LLVM IR.

Yes, clang will use memcpy for struct copies. You can see example IR here: https://godbolt.org/g/8gQ18m. You’ll see that the struct pointers are bitcast to i8* before the call.


Thanks!
So for this example:
void foo(X &src, X &dst) {
  dst = src;
}
and the IR:

define void @foo(X&, X&)(%struct.X* dereferenceable(8), %struct.X* dereferenceable(8)) #0 {
  %3 = alloca %struct.X*, align 8
  %4 = alloca %struct.X*, align 8
  store %struct.X* %0, %struct.X** %3, align 8
  store %struct.X* %1, %struct.X** %4, align 8
  %5 = load %struct.X*, %struct.X** %3, align 8
  %6 = load %struct.X*, %struct.X** %4, align 8
  %7 = bitcast %struct.X* %6 to i8*
  %8 = bitcast %struct.X* %5 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %7, i8* align 4 %8, i64 8, i1 false)
  ret void
}

how can I transform the llvm.memcpy into a data-move loop in the IR and eliminate the bitcast instructions?



The pointers must always be i8*. The alignment is independent and is controlled by the attributes on the arguments in the call to memcpy.

Hi Craig
Thank you very much !

Hi Ma,

how can I transform the llvm.memcpy into a data-move loop in the IR and eliminate the bitcast instructions?

I’m not sure why you are concerned about memcpy and bitcasts, but if you call MCpyInst->getSource() and MCpyInst->getDest(), they will look through casts and give you the ‘true’ source/destination.

If you want to get rid of memcpy altogether, you can take a look at this pass: https://github.com/seahorn/seahorn/blob/master/lib/Transforms/Scalar/PromoteMemcpy.cc .
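To illustrate the shape of the transformation being asked about, here is a hand-written before/after sketch (not output from any of the passes mentioned), assuming the 8-byte %struct.X = { i32, i32 } from the earlier example; the label and value names are made up:

```llvm
; Before: the struct copy via the intrinsic, with i8* bitcasts.
  %7 = bitcast %struct.X* %6 to i8*
  %8 = bitcast %struct.X* %5 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %7, i8* align 4 %8, i64 8, i1 false)

; After: the same copy as an explicit i32 move loop. The only casts left
; are to the element type being moved, not to i8*.
  %src = bitcast %struct.X* %5 to i32*
  %dst = bitcast %struct.X* %6 to i32*
  br label %copy.loop

copy.loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %copy.loop ]
  %s = getelementptr inbounds i32, i32* %src, i64 %i
  %d = getelementptr inbounds i32, i32* %dst, i64 %i
  %v = load i32, i32* %s, align 4
  store i32 %v, i32* %d, align 4
  %i.next = add nuw i64 %i, 1
  %done = icmp eq i64 %i.next, 2   ; 8 bytes copied as two 4-byte elements
  br i1 %done, label %copy.done, label %copy.loop

copy.done:
  ret void
```

A pass like PromoteMemcpy builds this kind of structure with IRBuilder, splitting the block containing the memcpy and inserting the loop blocks in its place.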



Hi Jakub
thanks, I took a look at the pass’s code.

There are at least four different places in LLVM where memcpy intrinsics are expanded to either sequences of instructions or calls:

- InstCombine does it for very small memcpys (with a broken heuristic).

- PromoteMemCpy does it mostly to expose other optimisation opportunities.

- SelectionDAG does it (though in a pretty terrible way, because it can’t create new basic blocks and so can’t emit small loops)

- Some back ends do it in cooperation with SelectionDAG to provide their own implementation.

Whether you want a memcpy intrinsic or a sequence of loads and stores depends a little bit on what optimisation you’re doing next - some work better treating individual fields separately, some prefer to have a blob of memory that they can treat as a single entity.

It’s also worth noting that LLVM’s handling of padding in structure fields is particularly bad. LLVM IR has two kinds of struct: packed and non-packed. The documentation doesn’t make it clear whether non-packed structs have padding at the end (and clang assumes that they don’t, some of the time). Non-packed structs do have padding in between fields for alignment.

When lowering from C (or a language needing to support a C ABI), you sometimes end up with explicit padding fields inserted by the front end. Optimisers have no way of distinguishing these fields from non-padding fields, so we only get rid of them if SROA extracts them and finds that they have no side-effect-free consumers. In contrast, the padding between fields in non-packed structs disappears as soon as SROA runs.

This can lead to violations of C semantics, where padding fields should not change (because C programs perform bitwise comparisons on structs using memcmp), and to subtly different behaviour in C code depending on the target ABI (we’ve seen cases where trailing padding is copied in one ABI but not in another, depending solely on pointer size).


Hi David
thanks a lot, that makes it much clearer!


The IR type of an alloca isn't supposed to affect the semantics; it's just a sizeof(type) block of bytes. We haven't always gotten this right in the past, but it should work correctly on trunk, as far as I know. If you have an IR testcase where this still doesn't work correctly, please file a bug.


It’s not an IR test case. We have a C struct that is {void*, int}. On a system with 8-byte pointers, this becomes an LLVM struct { i8*, i32 }. On a system with 16-byte pointers, clang lowers it to { i8*, i32, [12 x i8] }. From the perspective of SROA, the [12 x i8] is a real field. When a function is called with the struct, it is lowered to taking an explicit [12 x i8] argument, whereas the other version takes only the i8* and i32 in registers. This means that if the callee writes the data out to memory and then performs a memcmp, the 8-byte-pointer version may not have the same padding, whereas the 16-byte-pointer version will.

In the code that we were using (the Duktape JavaScript interpreter), the callee didn’t actually look at the padding bytes in either case, so we just ended up with less efficient code in the 16-byte-pointer case, but the same pattern could equally have generated incorrect code in the 8-byte-pointer case.


I wonder whether it is possible to explicitly mark the padding bytes, so that later optimizations know which bytes are padding and can optimize accordingly.