Current preferred approach for handling 'byval' struct arguments

As many of you will know, handling of struct arguments in LLVM is
actually slightly more fiddly than you might expect. For targets where
aggregates are always passed on the stack it's easy enough: the Clang
ABI code marks these arguments as byval and the call lowering code in
LLVM will copy the object to the stack when needed. There are more
options when the target has more complex ABI rules, e.g. that as
much of a struct as will fit should be passed in argument registers.
One option is to let Clang (or the frontend of your choice) mark the
argument as byval, but then have the LLVM call lowering code assign
registers where possible. Alternatively, you can alter the frontend's
ABI handling so it decomposes the struct for you - the challenge being
ensuring that the appropriate packing/alignment is maintained. Because
of this, for some targets byval does mean "pass on the stack", but not
for others.
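
To make the two options concrete, here's roughly what they look like at
the IR level for a hypothetical two-word struct (illustrative names,
typed-pointer syntax):

```
; C: struct pair { int a; int b; };  void take(struct pair p);
%struct.pair = type { i32, i32 }

; Option 1: the frontend marks the argument byval and LLVM's call
; lowering decides whether it lands on the stack, in registers, or both.
declare void @take_byval(%struct.pair* byval align 4)

; Option 2: the frontend decomposes the struct itself, so the backend
; only ever sees ordinary scalar arguments.
declare void @take_decomposed(i32, i32)
```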

There's been some discussion of this issue previously:
* In the context of PowerPC, which will coerce structs below a certain
size to an integer array
<http://lists.llvm.org/pipermail/llvm-dev/2015-March/083554.html>
* Previous suggestions that we really want an "onstack" attribute
<http://lists.llvm.org/pipermail/llvm-dev/2012-May/049406.html>

I'm working with a calling convention
(https://github.com/riscv/riscv-elf-psabi-doc/blob/master/riscv-elf.md)
where structs of up to two words in length should be passed in
registers, but otherwise on the stack (EXCEPT in the case where there
is only one argument register left, in which case the struct is split
between that register and the stack).

Is there a consensus now on how this should be handled?

Thanks,

Alex

Today, the vast majority of targets in Clang coerce aggregates passed this way into appropriate word-sized types. They all use their own custom heuristics to compute the LLVM types used for the coercions. It’s terrible, but this is the current consensus.

I would like to improve the situation so that passing LLVM aggregates directly does the right thing when the LLVM struct type and C struct type are effectively the same, so that custom frontend lowering is only required for hard cases involving things like _Complex and unions.
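
To illustrate (names and the choice of coercion type are just for the
sake of example):

```
; C: struct pt { int x; int y; };  void g(struct pt);

; What most targets' ABI code in Clang produces today: the aggregate is
; coerced to a word-sized type chosen by target-specific heuristics.
declare void @g_coerced(i64)

; What should just work for the easy cases: passing the corresponding
; LLVM struct type directly.
declare void @g_direct({ i32, i32 })
```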

Thanks for the response, Reid. Looking more closely, it appears that
the relationship between the target's ABI, whether aggregates are
represented as byval structs, and whether these are coerced to
something else in Clang's ABI handling is more complex than I first
described.

For instance, looking at X86 code in clang/lib/CodeGen/TargetInfo.cpp
I see that small aggregates are coerced as long as all argument
registers are known to be used (i.e. we know the backend will place it
on the stack, as demanded by the calling convention). ARM will also
coerce structures below a certain size; however, the call lowering code
in ARMISelLowering still has logic to split a byval aggregate between
the stack and registers. I have to say, looking at AArch64ISelLowering
and the Clang code, it's not immediately obvious to me where aggregates
get split between the stack and registers (which is quite clear in
MipsTargetLowering::passByValArg). What am I missing here?

It seems to me there are a few possibilities for targets where the ABI
indicates aggregates may be passed in registers:
* Clang always passes aggregates as byval, and LLVM call lowering may
assign some or all of the aggregate to registers. Seemingly nobody
does this.
* Clang's ABI lowering code is aware of how many argument registers
have been used. If they have been exhausted, then leave the aggregate
as byval. If the aggregate will be partially in registers and partially
on the stack, then coerce to two arguments - one byval and one direct
(see the sketch after this list). Seemingly nobody does this.
* Split responsibilities between the Clang ABI lowering and the LLVM
backend lowering. If an aggregate is below a certain size, then coerce
it and pass it direct. Depending on the ABI, the LLVM backend may still
have to handle the case where a byval aggregate is passed partially in
registers and partially on the stack, and should do so appropriately.
This seems to be more common.
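
For what it's worth, here's the kind of IR I'd imagine the second
option producing for the split case (entirely hypothetical names,
assuming eight argument registers as on RISC-V and seven of them
already taken by scalar arguments):

```
; C: struct quad { int a; int b; int c; int d; };
;    void f(int x0, int x1, int x2, int x3, int x4, int x5, int x6,
;           struct quad q);
%struct.quad.rest = type { i32, i32, i32 }

; a0-a6 hold the seven ints, so only a7 is free: the frontend passes
; the aggregate's first word direct (ending up in a7) and the remaining
; three words as a byval argument on the stack.
declare void @f(i32, i32, i32, i32, i32, i32, i32,
                i32, %struct.quad.rest* byval align 4)
```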

Best,

Alex

There are also tradeoffs for passing large scalar values, which I
thought I'd share here in case it's useful for someone else (and
indeed, if anyone has extra input).

In the RISC-V calling convention, large scalars (larger than 2 GPRs)
are passed indirect, just like large aggregates. e.g. an i128 or a
long double on a 32-bit platform. It's tempting to let the frontend
emit i128 and fp128 arguments/return values; however, making the
argument indirect is somewhat easier in the frontend. This is because
by the time you get to the LLVM CC code the type has already been
legalised and so converted to a series of word-sized values. You can
detect that the arguments were formed by splitting a larger value and
fix it up so it all works properly (see CC_SystemZ_I128Indirect in
SystemZCallingConv.h) but this is more hassle than just having the
frontend pass it indirect in the first place. When return values have
the same rules, meaning an implicit parameter has to be generated, it's
even more complex (in fact it's not immediately obvious to me how to
do that in a tablegen-based calling convention implementation).
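
For reference, this is the kind of IR the frontend-indirect approach
would produce for an argument (a sketch only; names, types and
alignment are illustrative):

```
; C: void use(__int128 x);  with the argument made indirect by the
; frontend.
declare void @use(i128*)

define void @caller(i128 %x) {
  %mem = alloca i128, align 8
  store i128 %x, i128* %mem
  call void @use(i128* %mem)
  ret void
}

; A function *returning* __int128 would instead gain an implicit
; pointer parameter (sret-style) that the callee stores the result
; through.
```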

Best,

Alex

Sorry to respond to myself again so soon, but as usual I've spotted
another issue. Libcalls (e.g. for fp128 equality) will be emitted in
TargetLowering, which won't do the necessary pass-indirect conversion
for you. The following C would still generate something sensible:
`long double f_fp_scalar_3(long double x) { return x; }`. But if your
code tries to do anything with fp128 values (e.g. `fcmp une fp128 %1,
%2` gets generated), you're stuck. I think I'm now understanding why
the SystemZ backend made the choices it did.
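
To spell that out (the specific libcall is whatever compiler-rt/libgcc
provide for fp128 comparison, e.g. __netf2):

```
; No call is visible here, but legalisation turns the fcmp into a
; soft-float comparison libcall. That call is created in
; TargetLowering, after the point where the frontend could have made
; the fp128 operands indirect.
define i1 @cmp(fp128 %a, fp128 %b) {
  %r = fcmp une fp128 %a, %b
  ret i1 %r
}
```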

Best,

Alex

This is a great example of why the responsibilities are currently weirdly
split between the frontend and backend. The more ABI lowering you do in the
frontend, the more information is available for the mid-level optimizer to
hack on. If the backend were responsible for creating a temporary i128 value
in memory and taking its address, the mid-level would never have an
opportunity to optimize those memory loads and stores, or realize that the
callee never modifies its argument, making the copy unnecessary.

Of course, there are many drawbacks to the current split of
responsibilities, so it's definitely a tradeoff.

I agree that's how it is right now, but that tradeoff doesn't sound
intrinsic to the problem.

A transform from highish-level function ABI to low-level function ABI could
be done entirely within LLVM as an early target-specific pass, if that's
necessary for performance. It doesn't need to be pushed all the way up into
the frontend!