struct copy

Given this program:

struct S { int x; };
void f() {
  struct S q1;
  struct S q2 = q1;
}

clang emits this:

    %struct.S = type <{ i32 }>

define void @f(...) nounwind {
entry:
    %q1 = alloca %struct.S, align 4 ; <%struct.S*> [#uses=1]
    %q2 = alloca %struct.S, align 4 ; <%struct.S*> [#uses=1]
    %tmp = bitcast %struct.S* %q2 to i8* ; <i8*> [#uses=1]
    %tmp1 = bitcast %struct.S* %q1 to i8* ; <i8*> [#uses=1]
    call void @llvm.memmove.i32(i8* %tmp, i8* %tmp1, i32 4, i32 4)
    br label %return

return: ; preds = %entry
    ret void
}

Would it be better if it produced this instead?

define void @f(...) nounwind {
entry:
    %q1 = alloca %struct.S, align 4 ; <%struct.S*> [#uses=1]
    %q2 = alloca %struct.S, align 4 ; <%struct.S*> [#uses=1]
    %tmp = load %struct.S* %q1
    store %struct.S %tmp, %struct.S* %q2
    br label %return

return: ; preds = %entry
    ret void
}

It's simpler and offers higher-level information for those who want to examine the bitcode.

Isn't it the backend's job to interpret
    %tmp = load %struct.S* %q1
    store %struct.S %tmp, %struct.S* %q2
in a way appropriate to the target, so that the frontend doesn't have to encode it explicitly as a low-level block-of-bytes copy?

-Argiris

Given this program:

struct S { int x; };
void f() {
  struct S q1;
  struct S q2 = q1;
}

clang emits this:

Would it be better if it produced this instead?

Yes, that would be better. llvm-gcc has logic to do this. It is somewhat tricky, because you only want to do it when the struct is small, and can't always do it when you have unions.

It's simpler and offers higher-level information for those who want to
examine the bitcode.

Isn't it the backend's job to interpret
   %tmp = load %struct.S* %q1
   store %struct.S %tmp, %struct.S* %q2
in a way appropriate to the target, so that the frontend doesn't have to
encode it explicitly as a low-level block-of-bytes copy?

Clang does a block copy right now because it is a) always correct and b) required for the large struct copy case. Improving this code to use the same heuristics as llvm-gcc would be useful :)

-Chris

Chris Lattner wrote:

Yes, that would be better. llvm-gcc has logic to do this. It is somewhat tricky, because you only want to do it when the struct is small, and can't always do it when you have unions.

Is it out of the question to let the backend target decide whether to do a block copy or not? Couldn't there be a Pass to do that kind of transformation?
To be more specific, the block copy is not appropriate for the MSIL target; there it's better to convert the load/store into direct "load local variable"/"store local variable" MSIL code instead of a byte-block copy.

-Argiris

Chris Lattner wrote:

Yes, that would be better. llvm-gcc has logic to do this. It is somewhat tricky, because you only want to do it when the struct is small, and can't always do it when you have unions.

Is it out of the question to let the backend target decide whether to do a block copy or not? Couldn't there be a Pass to do that kind of transformation?
To be more specific, the block copy is not appropriate for the MSIL target; there it's better to convert the load/store into direct "load local variable"/"store local variable" MSIL code instead of a byte-block copy.

In theory, it should be safe and fast for the compiler to always produce a memcpy and then let the backend lower it however it wants. It is not good to always lower it into field copies (for large structs) and is not always safe (for certain union cases).

In practice, using memcpy for small structs is bad, which is why llvm-gcc lowers it. The badness stems from two reasons: 1) current deficiencies in the optimizer, and 2) not being able to model the fact that a memcpy doesn't need to copy the "holes" in a struct.
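
As a rough C-level sketch of the difference being discussed (illustrative only; the function names are invented and this is not llvm-gcc's actual heuristic), a field-by-field copy never touches the padding hole between the members, while a block copy moves the hole along with the data:

#include <string.h>

/* On a typical 32-bit target there are 3 bytes of padding after 'c'. */
struct S2 { char c; int i; };

/* Field-by-field copy: the padding hole is never read or written. */
void copy_fields(struct S2 *dst, const struct S2 *src) {
  dst->c = src->c;
  dst->i = src->i;
}

/* Block copy: the padding bytes are copied along with the fields. */
void copy_block(struct S2 *dst, const struct S2 *src) {
  memcpy(dst, src, sizeof *src);
}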

-Chris

Chris Lattner <clattner@apple.com> writes:

In practice, using memcpy for small structs is bad, which is why
llvm-gcc lowers it. The badness stems from two reasons: 1) current
deficiencies in the optimizer, and 2) not being able to model the fact
that a memcpy doesn't need to copy the "holes" in a struct.

How would not copying the holes improve anything? They are there for
alignment, but moving entire machine words is no slower than
individual bytes.

However, if using the memcpy primitive means losing alignment
information and making pessimistic assumptions, then that is of course
an argument against memcpy.

This is true at the micro level, but false at the macro level. For example, if the caller of a function does a one-byte store into a struct field, and the callee does a memcpy (ending up with a 32-bit read), you get a store-forwarding speculation failure on most out-of-order processors.
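
A hypothetical C sketch of that scenario (the names are invented for illustration): the caller stores a single byte into a field, and the block copy in the callee immediately reads a wider word that overlaps it, which is the pattern that defeats store-to-load forwarding on many out-of-order cores.

#include <string.h>

struct Msg { char tag; char rest[3]; };

/* The block copy reads the whole 4-byte struct at once. */
void copy_msg(struct Msg *dst, const struct Msg *src) {
  memcpy(dst, src, sizeof *src);
}

void caller(struct Msg *s, struct Msg *out) {
  s->tag = 1;       /* 1-byte store...                                 */
  copy_msg(out, s); /* ...followed by a wider load that overlaps it:
                       a store-forwarding stall on many OoO cores      */
}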

-Chris

Chris Lattner <clattner@apple.com> writes:

This is true at the micro level, but false at the macro level. For
example, if the caller of a function does a one-byte store into a
struct field, and the callee does a memcpy (ending up with a 32-bit
read), you get a store-forwarding speculation failure on most
out-of-order processors.

Thank you, I didn't think of that. I wonder to what extent that effect
depends on the existence of holes in the struct layout. Should structs
always be copied member-by-member, even if they consist of many small
members?

struct S {
    char c[7];
    double d;
};

Would seven single-byte copies really be faster than one 64-bit word copy?
If there were eight bytes in the array, there would be no hole but
the problem remains - at least if the entire array was not written to
before the struct copy.

Chris Lattner wrote:

In theory, it should be safe and fast for the compiler to always produce a memcpy and then let the backend lower it however it wants.

What is wrong with having the compiler always produce a load/store?

Having:
    %tmp = load %struct.S* %q1
    store %struct.S %tmp, %struct.S* %q2
conveys the information "take a value of type %struct.S from here and store it there."

Having a memcpy:
     call void @llvm.memmove.i32(i8* %tmp, i8* %tmp1, i32 12, i32 4)
conveys the information "copy 12 bytes from here to there", which is a lowered form of load/store.

It's very easy to lower a load/store to a memcpy; going the other way around is not so easy.

-Argiris

Chris Lattner wrote:

In theory, it should be safe and fast for the compiler to always produce a memcpy and then let the backend lower it however it wants.

What is wrong with having the compiler always produce a load/store?

For small structs, almost nothing! It has the same semantics as an element-by-element copy. For large structs, you really don't want to do this.

The problem is that you need the same heuristic: you need to know that the struct is "small" and that the struct has no holes in the LLVM type that some other element of the C type (e.g. through a union) stores data in.

Mattias wrote:

Chris Lattner <clattner@apple.com> writes:

This is true at the micro level, but false at the macro level. For
example, if the caller of a function does a one-byte store into a
struct field, and the callee does a memcpy (ending up with a 32-bit
read), you get a store-forwarding speculation failure on most
out-of-order processors.

Thank you, I didn't think of that. I wonder to what extent that effect
depends on the existence of holes in the struct layout. Should structs
always be copied member-by-member, even if they consist of many small
members?

struct S {
   char c[7];
   double d;
};

Would seven single-byte copies really be faster than one 64-bit word copy?
If there were eight bytes in the array, there would be no hole but
the problem remains - at least if the entire array was not written to
before the struct copy.

There is no easy answer; it depends a lot on other environmental effects, like whether there is an access to c[3] or c[i] shortly after the store to the struct.

In the absence of such an access that is close enough to matter, you're right that it would be better to copy the 7 bytes with an 8-byte load and store instead of 7 one-byte loads/stores. The idea of the current heuristic is that the code generator should theoretically be able to merge neighboring load/stores into wider loads. The two caveats here are 1) the codegen doesn't do this yet, and 2) even when it does, it won't know when it is safe for it to load/store *more* data than is requested. In this case, it wouldn't know that it is safe to load/store 8 bytes, because only 7 are being accessed.
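
For concreteness, a hedged sketch of that widening question using the struct from the quoted message (the function names are illustrative): a single 8-byte access also touches the padding byte before d, which a code generator may only emit once it can prove that the extra byte carries no live data.

#include <string.h>
#include <stdint.h>

struct S7 { char c[7]; double d; };  /* one padding byte before 'd' */

/* Exactly what the source asks for: 7 bytes. */
void copy_c_exact(struct S7 *dst, const struct S7 *src) {
  memcpy(dst->c, src->c, 7);
}

/* A widened 8-byte copy also reads and writes the padding byte; this is
   only a valid transformation when that byte is known to hold no live
   data (plain struct padding rather than, say, overlapping union data). */
void copy_c_widened(struct S7 *dst, const struct S7 *src) {
  uint64_t w;
  memcpy(&w, src->c, 8);
  memcpy(dst->c, &w, 8);
}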

Most optimization work in LLVM is demand driven. If you find a real world testcase that this impacts, we can devise some solutions.

-Chris

Chris Lattner wrote:

What is wrong with having the compiler always produce a load/store?

For small structs, almost nothing! It has the same semantics as an element-by-element copy. For large structs, you really don't want to do this.

The problem is that you need the same heuristic: you need to know that the struct is "small" and that the struct has no holes in the LLVM type that some other element of the C type (e.g. through a union) stores data in.

Given these types:

union U { int u1; char u2; };
struct S {
   int x1;
   char x2;
   union U u;
};

Here's what llvm-gcc produces:
    %struct.S = type { i32, i8, %struct.U }
    %struct.U = type { i32 }

Isn't this enough information for the backend to know where the holes in the struct are, and also to deal with unions (with a union represented as the type of its largest field)?
Can't the backend take a load/store of a large struct and, based on type information like the above, lower it accordingly (memcpy, field copies, etc.)?

-Argiris

Sure, that case is fine. Try something like this:

struct a {
   char A;
   double D;
};

union c {
   struct a a;
   int I;
};

Which compiles to:
  %struct.a = type { i8, double }
  %struct.c = type { %struct.a }

Note that the llvm type for struct.c doesn't have members that fully overlap "int I".
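
A C sketch of why this layout is a problem for a member-wise lowering (the copy routine is invented for illustration): bytes 1..3 of I fall inside the padding hole of struct a, so a copy that only moves A and D silently drops part of I.

union c2 {
  struct a2 { char A; double D; } a;
  int I;
};

/* Field-by-field copy of the largest member: the padding hole between
   A and D is skipped, yet bytes 1..3 of I live exactly in that hole,
   so part of I is lost whenever the union currently holds the int. */
void copy_by_fields(union c2 *dst, const union c2 *src) {
  dst->a.A = src->a.A;
  dst->a.D = src->a.D;
}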

-Chris

Chris Lattner wrote:

Sure, that case is fine. Try something like this:

struct a {
  char A;
  double D;
};

union c {
  struct a a;
  int I;
};

Which compiles to:
    %struct.a = type { i8, double }
    %struct.c = type { %struct.a }

Note that the llvm type for struct.c doesn't have members that fully overlap "int I".

Ah ok, "unions as the biggest field" is bad. How about if we consider unions as arrays (as clang does):

    %struct.a = type { i8, double }
    %struct.c = type [12 x i8]

That would solve this problem, but also isn't a universally good thing to do. Doing things like this can break the (critical for some codes) Scalar Replacement of Aggregates optimization. In general, it is good for the front-end to preserve as much accurate type info in the LLVM IR as possible.
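
A hedged C illustration of the SROA point (a sketch, not tied to any particular pass implementation): when a local union is only ever used through one member, accurate type information lets the aggregate collapse to a plain scalar, whereas an [N x i8] representation first has to be seen through the casts.

union U1 { int i; float f; };

int only_int(void) {
  union U1 u;    /* local aggregate, a natural SROA candidate           */
  u.i = 41;      /* only the int member is ever touched, so with
                    accurate type info the whole union can be promoted
                    to a single i32 in a register; emitted as an i8
                    array, the optimizer must first look through the
                    casts to learn the same thing                        */
  return u.i + 1;
}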

-Chris

Wouldn't it make sense to have union types in the IR? It might be
useful with other source languages too. Seems to me like union types
would fit well within the structural equivalence type model described
in the documentation.

There is no specific reason to avoid them, other than additional complexity in the IR. However, representing unions directly doesn't provide any value to the optimizers over a type cast. They both directly say that the memory is being reinterpreted in a different way (when dereferenced). Since we have to have casts, there has (so far) been no reason to add unions.
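
A small C illustration of that point (a sketch of the general idea, not of clang's exact output): accessing a different union member is, at the IR level, just the same memory dereferenced at a different type, which a cast of the pointer already expresses.

union U2 { int i; float f; };

float as_float(union U2 *u, int bits) {
  u->i = bits;   /* store through one member                            */
  return u->f;   /* load through the other: in the IR this is the same
                    address accessed at a different type, i.e. exactly
                    what a pointer cast says                             */
}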

-Chris

Chris Lattner wrote:

Ah ok, "unions as the biggest field" is bad. How about if we consider unions as arrays (as clang does):

  %struct.a = type { i8, double }
  %struct.c = type [12 x i8]

That would solve this problem, but also isn't a universally good thing to do. Doing things like this can break the (critical for some codes) Scalar Replacement of Aggregates optimization. In general, it is good for the front-end to preserve as much accurate type info in the LLVM IR as possible.

This makes sense, but having something like this:

    %struct.a = type { i8, double }
    %struct.c = type { %struct.a }

is certainly bad, right? I mean, %struct.a is a type with holes; using it as the field of a union is fundamentally wrong.

-We could consider unions as arrays only when the biggest field is a struct.

-Or take the scalar field of the union and add padding:
    %struct.c = type <{ i32, [8 x i8] }>

-Or represent the union as the struct itself, filling the holes:
    %struct.a = type { i8, double }
    %struct.c = type <{ i8, [7 x i8], double }>

-Or have something like this:
    %struct.a = type { i8, double }
    %struct.a.pad = type <{ i8, [7 x i8], double }>
    %struct.c = type { %struct.a.pad }

In general, the LLVM IR type for a union should not have fields with holes; is this correct?
And to return to the original question: if we can assume that LLVM types for unions are "hole-less", what is wrong with having a load/store represent a copy of a large struct?

-Argiris

Hi Chris,

> Ah ok, "unions as the biggest field" is bad. How about if we
> consider unions as arrays (as clang does):
>
> %struct.a = type { i8, double }
> %struct.c = type [12 x i8]
> That would solve this problem, but also isn't a universally good thing
> to do. Doing things like this can break the (critical for some codes)
> Scalar Replacement of Aggregates optimization.

Can you give an actual example of this? By my reasoning, if a struct is really
part of a union, and the value is thus used in some other way, scalarrepl
wouldn't be able to split it up anyway, right?

If an alloca of a union is in fact only used as its struct type and never as
any of the other types, then instcombine will change the i8 array alloca to
a struct alloca and scalarrepl will still be able to do its work just fine.

> In general, it is good for the front-end to preserve as much accurate type
> info in the LLVM IR as possible.

I agree with you here. However, I'm not so sure that picking one of the
types a union can become over another is much more accurate than just
always picking an i8 array.

What does load/store currently mean exactly for a struct with gaps? The
langref doesn't seem to mention this explicitly, but I think that currently a
load or store of a struct type is not allowed to access the gaps?

One could consider saying that a load or store of a struct type is allowed
to access the gaps, but not required to. In this case, it makes sense to keep
the struct load/store around for as long as possible, so the code generator
can in the end decide how to lower them.

Are there use cases where a load or store of a struct with gaps must really not
access the gaps? I can imagine this is more likely for the padding at the end,
though. In any case, perhaps this could/should be an extra flag to load and
store or something?

Gr.

Matthijs

Hi Chris,

Ah ok, "unions as the biggest field" is bad. How about if we
consider unions as arrays (as clang does):

%struct.a = type { i8, double }
%struct.c = type [12 x i8]

That would solve this problem, but also isn't a universally good thing
to do. Doing things like this can break the (critical for some codes)
Scalar Replacement of Aggregates optimization.

Can you give an actual example of this? By my reasoning, if a struct is really
part of a union, and the value is thus used in some other way, scalarrepl
wouldn't be able to split it up anyway, right?

Right. By picking one of the types though, you do catch the case where the union is only used as that type.

If an alloca of a union is in fact only used as its struct type and never as
any of the other types, then instcombine will change the i8 array alloca to
a struct alloca and scalarrepl will still be able to do its work just fine.

Hrm... this is a big maybe. The instcombine xform has a lot of constraints on it.

In general, it is good for the front-end to preserve as much accurate type
info in the LLVM IR as possible.

I agree with you here. However, I'm not so sure that picking one of the
types a union can become over another is much more accurate than just
always picking an i8 array.

Why is there so much focus on this? Is there a specific problem you're trying to solve? There are lots of real problems to tackle too :)

-Chris

Also, holes can be large. In addition, there is (or can be) more aliasing information embodied in a structure copy than in a memcpy; this information can be used to move loads/stores across the structure copy.

Hi Chris,

If an alloca of a union is in fact only used as its struct type and never
as any of the other types, then instcombine will change the i8 array alloca
to a struct alloca and scalarrepl will still be able to do its work just
fine.

Hrm... this is a big maybe. The instcombine xform has a lot of constraints
on it.

I don't think there are a lot of problems here, but you might be right that
instcombine doesn't solve this problem well enough yet. In particular, I
suspect that having multiple uses of an alloca (even when they are all the
same bitcast), and having a part of the union that is smaller than the entire
union, might throw instcombine off here. However, it might be worthwhile to
improve this instcombine xform for this purpose, since clang uses this as well.

Why is there so much focus on this? Is there a specific problem you're
trying to solve? There are lots of real problems to tackle too :)

It's not the core problem for me, but I do think that the bitcode resulting
from unions is clearer and more readable the clang way.

But the problem we were trying to solve was how to model the copying of a
struct, i.e. whether to use a memmove or a load/store, etc. When unions are
involved, copying the gaps in a struct becomes essential for correct
functioning. However, in the normal struct case, the gaps can be omitted when
copying (which is useful for things like scalarrepl), though it does not hurt
to include them as well (which might be beneficial for performance).

So, in this light, perhaps you can comment on what I wrote earlier:

> What does load/store currently mean exactly for a struct with gaps? The
> langref doesn't seem to mention this explicitly, but I think that currently a
> load or store of a struct type is not allowed to access the gaps?
>
> One could consider saying that a load or store of a struct type is allowed
> to access the gaps, but not required to. In this case, it makes sense to keep
> the struct load/store around for as long as possible, so the code generator
> can in the end decide how to lower them.
>
> Are there use cases where a load or store of a struct with gaps must really not
> access the gaps? I can imagine this is more likely for the padding at the end,
> though. In any case, perhaps this could/should be an extra flag to load and
> store or something?

Gr.

Matthijs