llvm-gcc + ABI stuff

<moving this to llvmdev instead of commits>

Okay, well we already get many other x86-64 issues wrong, but
Evan is chipping away at them. How do you pass an array by value in C?
Example, please?

I find the x86-64 ABI hard to interpret, but it seems to say that
aggregates are classified recursively, so it looks like a struct
containing a small integer array should be passed in integer registers.

Right. For x86-64 in particular, this happens, but only if the struct is <= 128 bits.

Also, it is easy in Ada: the compiler passes small arrays by value.

Ok, we should make sure this works when we think x86-64 is "done" :)

Can you please clarify the roles of llvm-gcc and the code generators
in getting ABI compatibility?

Sure! This also ties into the discussion in PR1937. The basic problem we have is that ABI decisions can be completely arbitrary, and are defined in terms of the source-level type system. In our desire to preserve the source-language independence of LLVM, we can't just give the code generators an AST and tell them to figure out what to do.

While it would theoretically be useful to hand the target an LLVM type and tell it to figure things out, this also doesn't work. The most trivial example is that some ABIs require different handling for _Complex double and "struct {double,double}", both of which lower to the same LLVM type. This means that the LLVM type system isn't rich enough by itself to fully express how the target is supposed to handle something.

Right now, LLVM IR has two ways to express argument passing and return values:

1) pass/return a first-class value type, like i32, float, <4 x i32>, etc.
2) pass/return a pointer to the place to do things, using byval/sret.

It's useful to notice that the formulation of something in the IR doesn't force the code generator to do anything (e.g. x86-32 passes almost everything on the stack regardless of whether you use scalars or byval), but it does have an impact on the optimizers and compile time.

For example, consider a target where everything is passed on the stack. In this case, from a functionality perspective, it doesn't matter whether you use byval to pass the argument or pass it as scalar values. However, picking the right one *can* have code quality and QOI impact. For instance, when passing a 100K struct by value, it is much better (in terms of compile time and generated code) to use byval than to scalarize it and pass all the elements.

OTOH, passing an argument 'byval' on this target prevents it from being SROA'd on both the caller and callee side. If the argument is small (say, a 32-bit struct), this can cause significant performance degradation. As a QOI issue, it is better to pass a small aggregate like this as a scalar.

In practice, most targets have more complex ABIs than the theoretical one above. For example, x86-32 passes scalar vectors in registers up to a point. On that target, the code generator contract is that 'byval' arguments are always passed in memory, SSE-compatible vectors are passed in XMM registers (up to a point), and everything else is passed in memory.

This has somewhat interesting implications: it means that it is okay to pass a {i32} struct as i32, and it means passing a _Complex float as two floats is also fine (yay for SROA). However, it means that lowering a struct with two vectors in it into two vectors would actually break the ABI, because the codegen would pass them in XMM regs instead of memory. This is a funny dance, and it means that the front-end needs to be fully parameterized by the backend to do the lowering.

When generating IR for x86-64, llvm-gcc
sometimes chops by-value structs into pieces, and sometimes passes the
struct as a byval parameter. Since it chops up all-integer structs,
and this corresponds more or less to what the ABI says, I assumed this
was an attempt to get ABI correctness, especially as the code generators
don't seem to bother with following the details of the ABI (yet),
and just push byval parameters onto the stack.

X86-64 has a much more complex ABI than x86-32. The basic form of correctness is that the code generator:

1. Lowers byval arguments to memory.
2. Passes integer, FP, and vector arguments in GPRs and XMM registers where available.

This has an interesting impact on the C front-end. In particular, #1 is great for by-value aggregates > 128 bits. However, aggregates <= 128 bits have a variety of possible cases, including:

1. Some aggregates are passed in memory.
2. Others are treated as two 64-bit hunks, where each 64-bit hunk can be:
   2a. Passed in a GPR.
   2b. Passed in an XMM register.

If you consider a struct like {float,float,float,float}, the interesting thing about this ABI is that it says this struct is passed in two XMM regs, where each XMM register carries two of the floats in its low two elements. To lower this struct optimally, llvm-gcc should codegen it as two vector inserts into two XMM registers. Codegen'ing it as a byval struct would be incorrect, because that would pass it on the stack.

I moved a big digression to the end of the mail.

I'll be the first to admit that this solution is suboptimal, but it is much better than what we had before. Unresolved issues include what alignment to pass things on the stack with. Evan recently fought with some crazy cases on x86-64 which currently require looking at the LLVM type. I'm not thrilled with this, but it seems like an OK thing to do for now. If we find out it isn't, we'll have to extend the model somehow.

This is an optimization, not a correctness issue

I guess this means that the plan is to teach the code generators how to
pass any aggregate byval in an ABI-conformant way (not the case right now),
but still do some chopping up in the front-end to help the optimizers.

Right. Currently, x86-32 attempts to pass 32-bit and 64-bit structs "better" than just using byval, as an optimization for some common small cases. However, the problem is that it doesn't generate "nice" accesses into the struct: it just bitcasts the pointer and does a 32/64-bit load, which can often defeat SROA itself. This needs to be fixed to get really good code, but it is an optimization, not a correctness issue. Disabling it and passing these structs byval on x86-32 would generate equally correct code.

Of course, this chopping up needs to be done carefully, so that the final result
squirted out by the code generators (once they are ABI conformant) is the
same as if the chopping had not been done...

Right, and all this is target-specific, yuck. :)

Is this chopping really a
big win? Is it not possible to get an equivalent level of optimization
by enhancing alias analysis?

Nope, AA isn't involved here, because you can't know who called you in general. For example, consider this contrived example:

struct s { int x; };
int foo(struct s S) { return S.x; }

With byval, this turns into a load + return at the IR level. Without byval, it is just a return. There is no amount of alias analysis you can do here, because we don't know who calls the function. Without changing the prototype of the IR function to not be byval, you can't eliminate the explicit load.

The Digression:

Incidentally, on x86-64, we're currently lowering this to suboptimal (but correct) code that passes the struct as two doubles and goes through memory to get the values into floats instead of using vector extracts:

struct a { float w, x, y, z; };
float foo(struct a b) { return b.w+b.x+b.y+b.z; }

  %struct.a = type { float, float, float, float }

define float @foo(double %b.0, double %b.1) nounwind {
entry:
  %b_addr = alloca { double, double } ; <{ double, double }*> [#uses=4]
  %tmpcast = bitcast { double, double }* %b_addr to %struct.a* ; <%struct.a*> [#uses=3]
  %tmp1 = getelementptr { double, double }* %b_addr, i32 0, i32 0 ; <double*> [#uses=1]
  store double %b.0, double* %tmp1, align 8
  %tmp3 = getelementptr { double, double }* %b_addr, i32 0, i32 1 ; <double*> [#uses=1]
  store double %b.1, double* %tmp3, align 8
  %tmp5 = bitcast { double, double }* %b_addr to float* ; <float*> [#uses=1]
  %tmp6 = load float* %tmp5, align 8 ; <float> [#uses=1]
  %tmp7 = getelementptr %struct.a* %tmpcast, i32 0, i32 1 ; <float*> [#uses=1]
  %tmp8 = load float* %tmp7, align 4 ; <float> [#uses=1]
  %tmp9 = add float %tmp6, %tmp8 ; <float> [#uses=1]
  %tmp10 = getelementptr %struct.a* %tmpcast, i32 0, i32 2 ; <float*> [#uses=1]
  %tmp11 = load float* %tmp10, align 4 ; <float> [#uses=1]
  %tmp12 = add float %tmp9, %tmp11 ; <float> [#uses=1]
  %tmp13 = getelementptr %struct.a* %tmpcast, i32 0, i32 3 ; <float*> [#uses=1]
  %tmp14 = load float* %tmp13, align 4 ; <float> [#uses=1]
  %tmp15 = add float %tmp12, %tmp14 ; <float> [#uses=1]
  ret float %tmp15
}

This yields correct but suboptimal code:
_foo:
  subq $16, %rsp
  movsd %xmm0, (%rsp)
  movsd %xmm1, 8(%rsp)
  movss (%rsp), %xmm0
  addss 4(%rsp), %xmm0
  addss 8(%rsp), %xmm0
  addss 12(%rsp), %xmm0
  addq $16, %rsp
  ret

We really want:

_foo:
  movaps %xmm0, %xmm2
  shufps $1, %xmm2, %xmm2
  addss %xmm2, %xmm0
  addss %xmm1, %xmm0
  shufps $1, %xmm1, %xmm1
  addss %xmm1, %xmm0
  ret

-Chris

It's useful to notice that the formulation of something in the IR
doesn't force the code generator to do anything (e.g. x86-32 passes
almost everything on the stack regardless of whether you use scalars
or byval), but it does have an impact on the optimizers and compile
time.

I still have the crazy idea that it should be possible to represent
the stack frame explicitly in LLVM IL and have the FE handle 99% of
the ABI. But that will have to wait a bit :)

Cheers,

I still have the crazy idea that it should be possible to represent
the stack frame explicitly in LLVM IL and have the FE handle 99% of
the ABI. But that will have to wait a bit :)

People writing front-ends are going to just love that :)


People writing front-ends are going to just love that :)

It can surely be in a library, so that clang and llvm-gcc don't have to
duplicate it, and it would make it easier to implement a language with a
very non-C-like ABI on top of LLVM.

Ciao,

Duncan.
