Why does clang do a memcpy? Is the cast not enough? (ABI function args)

I'm implementing function arguments and tested this code in C:

// clang -emit-llvm ll_struct_arg.c -S -o /dev/tty
typedef struct vpt_data {
    char a;
    int b;
    float c;
} vpt_data;

void vpt_test( vpt_data vd ) {
}

int main() {
    vpt_data v;
    vpt_test( v );
}

This emits odd LLVM IR that casts to the desired struct type, but also
memcpy's to a temporary structure. I'm unsure why the memcpy is done
instead of just casting directly.

define i32 @main() #0 {
  %v = alloca %struct.vpt_data, align 4
  %1 = alloca { i64, float }, align 4
  %2 = bitcast { i64, float }* %1 to i8*
  %3 = bitcast %struct.vpt_data* %v to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 12, i32 4, i1 false)
  %4 = getelementptr inbounds { i64, float }, { i64, float }* %1, i32 0, i32 0
  %5 = load i64, i64* %4, align 4
  %6 = getelementptr inbounds { i64, float }, { i64, float }* %1, i32 0, i32 1
  %7 = load float, float* %6, align 4
  call void @vpt_test(i64 %5, float %7)
  ret i32 0
}

Because you are passing the parameter by value? It *should* copy the
data. In this particular case the copy will probably be elided if you
turn on optimization, but it is usually better to pass structs via a
const reference or pointer.


I understand it's passing by value, that's what I'm testing here. The
question is why does it copy the data rather than just casting and
loading values from the original variable (%v) ? It seems like the
copying is unnecessary.

Not all structs result in the copy, only certain forms -- others are
just cast directly as I was expecting. I'm just not clear on what the
differences are, and whether I need to do the same thing.

It is a matter of the calling convention, which specifies which structs are passed in registers and which are passed on the stack.


Yes, I understand that as well (it's what I'm trying to recreate in my
language now).

I'm really wondering why it does the copy, since from what I can tell it
could just as easily cast the original value and do the load without the
memcpy operation.

That is, the question is about the memcpy and extra alloca -- I
understand what it's doing, just not why it's doing it this way.

This is the standard way of copying memory in the IR. Backends can expand the memcpy into loads/stores if they want.


Yes, but why is it even copying the memory? It already has a pointer
which it can cast and load from -- and does so in other scenarios.

I'm wondering whether this copying is somehow required and I'm missing
something, or it's just an artifact of the clang emitter. That is, could
it not omit the memcpy and cast the original variable?

It needs to LOAD the data. It is FASTER to do a memcpy (if the data is large enough) than to do a “load”. If you actually convince the compiler to do a load, it will produce enough 32- or 64-bit LOAD/STORE pairs to copy the data. Not only does this bloat the code, it is also likely slower than running memcpy as a loop.

For SMALL copies, memcpy gets replaced by simple load/store instructions anyway in the memcpy optimisation pass, so it is not an overhead.

I know this, because I had to implement a similar thing in my Pascal compiler to avoid it exploding when trying to use a "record" (Pascal's "struct") with an array of 16000 ints - it generated several thousand LOAD and STORE instructions for each function call. Which made the whole thing take almost forever, and the code generated was terrible. Calling memcpy instead solved the problem.

I believe the memcpy is there just as a consequence of Clang’s design - different parts of the compiler own different pieces of this, so in some sense one hand doesn’t see what the other is doing. Part of it is “create an argument” (memcpying the local variable into an unnamed value) and then the next part is “oh, but that argument gets passed in registers, so decompose it into registers again”.

Clang doesn’t need to produce perfectly optimal IR - because the optimization pipeline of LLVM will clean things up. So in many cases it’s just easier (& not a significant impediment to performance) to have some of these sort of redundancies/oddities in output, and just let the LLVM optimization pipeline clean them up later.

Thanks. That kind of makes sense.

I see that a lot in my code as well: poor IR structures that aren't
worth the effort to clean up since the LLVM passes do such a fine job of it.

Turns out I now have the same copying structure in my ABI support code,
though I use Store instead. :slight_smile: