# [RFC] The future of the va_arg instruction

## Summary
LLVM IR currently defines a va_arg instruction, which can be used to access
a vararg. Few Clang targets make use of it, and it has a number of
limitations. This RFC hopes to promote discussion on its future - how 'smart'
should va_arg be? Should we be aiming to transition all targets and LLVM
frontends to using it?

## Background on va_arg
The va_arg instruction is described in the language reference here
<http://llvm.org/docs/LangRef.html#int-varargs> and here
<http://llvm.org/docs/LangRef.html#i-va-arg>. When it's possible to use
va_arg, it frees the frontend from worrying about manipulation of the
target-specific va_list struct. This also has the potential to make analysis
of the IR more straightforward. However, va_arg can't currently be used
with an aggregate type (such as a struct). The difficulty of adding support
for aggregates is discussed later in this email.
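
As a point of reference, here is a minimal sketch of va_arg in use. It
assumes a target whose va_list is a simple pointer (i8*); the details
differ on targets with a struct-based va_list.

```llvm
; Read a single i32 vararg with the va_arg instruction.
declare void @llvm.va_start(i8*)
declare void @llvm.va_end(i8*)

define i32 @first_vararg(i32 %count, ...) {
entry:
  %ap = alloca i8*
  %ap.i8 = bitcast i8** %ap to i8*
  call void @llvm.va_start(i8* %ap.i8)
  ; One instruction fetches the argument and advances %ap; the frontend
  ; never touches the target's va_list representation directly.
  %val = va_arg i8** %ap, i32
  call void @llvm.va_end(i8* %ap.i8)
  ret i32 %val
}
```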

Which Clang targets generate va_arg?
* PNaCl always uses va_arg, even for aggregates. Their ExpandVarArgs pass
replaces it with appropriate loads and stores.
* AArch64/Darwin generates va_arg if possible. When not possible, as for
aggregates or illegal vector types, it generates the usual va_list
manipulation code. va_arg is not used for other AArch64 platforms.
* A few other targets such as MSP430, Lanai and AVR seem to use it due to
DefaultABIInfo.

Which in-tree backends support va_arg?
* AArch64, ARM, Hexagon, Lanai, MSP430, Mips, PPC, Sparc, WebAssembly, X86,
XCore

It's worth noting there has been some relevant prior discussion; see these
messages from Will Dietz and Renato Golin:
<http://lists.llvm.org/pipermail/llvm-dev/2011-August/042505.html>
<http://lists.llvm.org/pipermail/llvm-dev/2011-August/042509.html>

## Options for the future of va_arg

Option 1: Discourage use of va_arg and aim to remove it in the future
  * For most targets, frontends have to directly manipulate va_list in at
  least some cases (a sketch of what this looks like follows below). One
  could argue we'd be better off having varargs handled in a uniform
  manner, even if va_list manipulation is more explicit and
  target-specific.
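
For contrast, here's the sort of explicit manipulation a frontend emits
today on a target whose va_list is a plain pointer, assuming (for this
sketch only) 4-byte argument slots:

```llvm
define i32 @first_vararg_explicit(i32 %count, ...) {
entry:
  %ap = alloca i8*
  %ap.i8 = bitcast i8** %ap to i8*
  call void @llvm.va_start(i8* %ap.i8)
  ; load the current pointer, bump it past one slot, store it back...
  %cur = load i8*, i8** %ap
  %next = getelementptr i8, i8* %cur, i32 4
  store i8* %next, i8** %ap
  ; ...then load the value from the old position
  %addr = bitcast i8* %cur to i32*
  %val = load i32, i32* %addr
  call void @llvm.va_end(i8* %ap.i8)
  ret i32 %val
}

declare void @llvm.va_start(i8*)
declare void @llvm.va_end(i8*)
```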

Option 2: Status quo
  * va_arg is there. Most backends can at least expand it, though it's not
  clear how heavily tested this is.
  * There's still a question of what the recommendation should be for
  frontends. If we keep va_arg as-is, would it be beneficial to
  modify Clang to use it when possible, while falling back to explicit
  manipulation when necessary, as on Darwin/AArch64? Alternatively, casting
  may allow va_arg to be used for a wider variety of types (sketched below).
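
One form the casting trick could take (a sketch, not what Clang emits
today): fetch a small struct such as {i32, i32} as a single i64 via
va_arg, then reload it at the struct type. This only works when the ABI
passes the struct exactly like an integer of the same size.

```llvm
%struct.pair = type { i32, i32 }

define i32 @first_field(i8** %ap) {
entry:
  ; read the whole 8-byte struct as one integer vararg
  %raw = va_arg i8** %ap, i64
  ; spill it and reload the first field at the struct type
  %tmp = alloca i64
  store i64 %raw, i64* %tmp
  %as.pair = bitcast i64* %tmp to %struct.pair*
  %f0.addr = getelementptr %struct.pair, %struct.pair* %as.pair, i32 0, i32 0
  %f0 = load i32, i32* %f0.addr
  ret i32 %f0
}
```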

Option 3: Teach va_arg to handle aggregates
  * In this option, va_arg might reasonably be expected to handle a struct,
  but would not be expected to have detailed ABI-specific knowledge. e.g. it
  won't automagically know whether a value of a certain size/type is passed
  indirectly or not. In a sense, this would put support for aggregates passed
  as varargs on par with aggregates passed in named arguments.
  * Casting would be necessary in the same cases where casting is required
for named args.
  * Support for aggregates could be implemented via a new module-level
pass, much like PNaCl's (a sketch of such an expansion follows this list).
  * Alternatively, the conversion from the va_arg instruction to
  SelectionDAG could be modified. It might be desirable to convert the
  va_arg instruction to a number of loads and a new node that is responsible
  only for manipulating the va_list struct.
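
Here's a sketch of what a PNaCl-style expansion pass might produce for
va_arg on a {i32, i32} struct, again assuming a pointer-style va_list;
the slot size and offsets are illustrative, not taken from any real ABI:

```llvm
%struct.pair = type { i32, i32 }

; Expansion of "%v = va_arg i8** %ap, %struct.pair", storing to %out.
define void @expanded_va_arg(i8** %ap, %struct.pair* %out) {
entry:
  %cur = load i8*, i8** %ap
  %src = bitcast i8* %cur to %struct.pair*
  ; load each element from the current vararg slot
  %p0 = getelementptr %struct.pair, %struct.pair* %src, i32 0, i32 0
  %v0 = load i32, i32* %p0
  %p1 = getelementptr %struct.pair, %struct.pair* %src, i32 0, i32 1
  %v1 = load i32, i32* %p1
  ; advance the va_list past the 8-byte slot
  %next = getelementptr i8, i8* %cur, i32 8
  store i8* %next, i8** %ap
  ; write the result out
  %o0 = getelementptr %struct.pair, %struct.pair* %out, i32 0, i32 0
  store i32 %v0, i32* %o0
  %o1 = getelementptr %struct.pair, %struct.pair* %out, i32 0, i32 1
  store i32 %v1, i32* %o1
  ret void
}
```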

Option 4: Expect va_arg to handle all ABI details
  * In this more extreme option, va_arg with any type would be expected to
  generate ABI-compliant code. e.g. a va_arg with i128 would "do the right
  thing" regardless of whether an i128 is passed indirectly or not for the
  given ABI (see the sketch below).
  * This would be nice, but probably only makes sense as part of a larger
  effort to reduce the ABI lowering burden on frontends. This sort of effort
  has been discussed many times, and is not a small project.
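
To make the indirect case concrete: on a hypothetical ABI that passes
i128 varargs indirectly, the frontend today has to emit the extra
pointer hop itself; under this option, a plain va_arg of i128 would do
it automatically. A sketch of the status quo:

```llvm
define i128 @get_i128(i8** %ap) {
entry:
  ; the vararg slot holds a pointer to the value, not the value itself
  %slot = va_arg i8** %ap, i8*
  %addr = bitcast i8* %slot to i128*
  %val = load i128, i128* %addr
  ret i128 %val
}
```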

## Next steps
I'd really appreciate any input on the issues here. Do people have strong
feelings about the future direction of va_arg? Will GlobalISel have any effect
on the relative difficulty or desirability of these options?

Thanks,

Alex

I don't feel strongly about it, though since it is really an ABI issue I think it lives at a higher level than LLVM IR (Front-End language semantics).

We don't use 'va_arg' in our TableGen descriptions, but we do have special handling for 'ISD::VAARG' during lowering to handle various vector lengths for which we don't have native register support, but which should still be extracted to and from a particular register class. For example, 'v2i8', which we map to the lower half of a 32-bit SIMD register, and 'v2i32', which we map to the lower half of a 128-bit SIMD register. If a generic, target-agnostic implementation were preferred, the TTI (or TRI perhaps) would need to be able to describe these special register interactions in another way to remove the need for custom handling of these optimisations.

We also have optimisations for vectors that are larger than our registers can handle, for which the default implementation does not provide an optimal solution.

I think the memory load/store handling could be made generic, but choosing the optimal destination/source register(s) is not so straightforward.

Curiously, I have a group of test failures to do with 'va_arg' and aggregates that I haven't solved. I'd always assumed they were my fault, but perhaps not, given what you describe below.

  MartinO

Option 3: Teach va_arg to handle aggregates
   * In this option, va_arg might reasonably be expected to handle a struct,
   but would not be expected to have detailed ABI-specific knowledge. e.g. it
   won't automagically know whether a value of a certain size/type is passed
   indirectly or not. In a sense, this would put support for aggregates passed
   as varargs on par with aggregates passed in named arguments.
   * Casting would be necessary in the same cases where casting is required
for named args.
   * Support for aggregates could be implemented via a new module-level
pass, much like PNaCl's.
   * Alternatively, the conversion from the va_arg instruction to
   SelectionDAG could be modified. It might be desirable to convert the
   va_arg instruction to a number of loads and a new node that is responsible
   only for manipulating the va_list struct.

We could automatically split va_arg on an LLVM struct type into a series of va_arg calls for each of the elements of the struct. Not sure that actually helps anyone much, though.

Anything more requires full type information, which isn't currently encoded into IR; for example, on x86-64, to properly lower va_arg on a struct, you need to figure out whether the struct would be passed in integer registers, floating-point registers, or memory.
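
For reference, the x86-64 SysV va_list is a struct (gp_offset, fp_offset,
overflow_arg_area, reg_save_area), and even the integer-class case needs a
branch. A simplified sketch of what lowering one integer argument entails:

```llvm
%struct.va_list = type { i32, i32, i8*, i8* }

define i32 @next_int(%struct.va_list* %ap) {
entry:
  %gp.addr = getelementptr %struct.va_list, %struct.va_list* %ap, i32 0, i32 0
  %gp = load i32, i32* %gp.addr
  ; six GP argument registers * 8 bytes each were spilled to the save area
  %in_regs = icmp ult i32 %gp, 48
  br i1 %in_regs, label %from_regs, label %from_mem

from_regs:
  %save.addr = getelementptr %struct.va_list, %struct.va_list* %ap, i32 0, i32 3
  %save = load i8*, i8** %save.addr
  %reg.p = getelementptr i8, i8* %save, i32 %gp
  %gp.next = add i32 %gp, 8
  store i32 %gp.next, i32* %gp.addr
  br label %done

from_mem:
  %ovf.addr = getelementptr %struct.va_list, %struct.va_list* %ap, i32 0, i32 2
  %ovf = load i8*, i8** %ovf.addr
  %ovf.next = getelementptr i8, i8* %ovf, i32 8
  store i8* %ovf.next, i8** %ovf.addr
  br label %done

done:
  %addr.i8 = phi i8* [ %reg.p, %from_regs ], [ %ovf, %from_mem ]
  %addr = bitcast i8* %addr.i8 to i32*
  %val = load i32, i32* %addr
  ret i32 %val
}
```

For a struct, deciding which of these paths applies requires the full
argument classification, which the IR type alone doesn't give you.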

## Next steps
I'd really appreciate any input on the issues here. Do people have strong
feelings about the future direction of va_arg? Will GlobalISel have any effect
on the relative difficulty or desirability of these options?

For GlobalISel, the important bit is the mostly orthogonal question of *when* we lower va_arg. If we do it sometime before isel, we save a bit of implementation work.

-Eli

If converting va_arg {i8, i8} to two va_arg i8 instructions, you'd ideally
ensure this results in loading the two i8 values from the same slot in the
vararg save area. Of course, when passing structs directly as named
arguments, we currently rely on the frontend coercing structs for
cases like this. As such, the naive conversion shouldn't be any worse
than the status quo for named arguments (see the sketch below).
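
In IR terms, the concern is that the naive split must not advance the
va_list twice when both fields live in one slot; the coerced form
sidesteps this (a sketch, mirroring what frontends already do for named
arguments):

```llvm
; Instead of two "va_arg i8" (two va_list increments for one slot),
; fetch both fields with a single coerced va_arg:
define i16 @coerced_pair(i8** %ap) {
entry:
  %both = va_arg i8** %ap, i16   ; one va_arg, one slot increment
  ret i16 %both
}
```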

Best,

Alex

I've been thinking more about this. Firstly, if anyone has insight into
any cases where the va_arg instruction actually provides better
optimisation opportunities, please do share. The va_arg IR instruction
has been supported in LLVM for over a decade, but Clang doesn't
generate it for the vast majority of the "top tier" targets. I'm
trying to determine if it just needs more love, or if perhaps it
wasn't really the right thing to express at the IR level. Is the main
motivation of va_arg to allow such argument access to be specified
concisely in IR, or is there a particular way it makes life easier for
optimisations or analysis (and if so, which ones and at which point in
compilation?).

va_arg really does three things:
* Calculates how to load a value of the given type
* Increments the appropriate fields in the va_list struct
* Loads a value of the given type

The problem I see is that it's fairly difficult to specialise its behaviour
depending on the target. In one of the many previous threads about ABI
lowering, I think someone commented that in LLVM it happens both too
early and too late (in the frontend, and on the SelectionDAG). That
seems to be the case here: to support targets with a more complex
va_list struct featuring separate save areas for GPRs and FPRs,
splitting a va_arg into multiple operations (one per element of an
aggregate) doesn't seem like it could work without heroic gymnastics
in the backend.

Converting the va_arg instruction to a new GETVAARG SelectionDAG node
plus a series of LOADs seems like it may provide a straightforward
path to supporting aggregates on targets that use a pointer for
va_list. Of course this ends up exposing loads plus offset generation
in the SelectionDAG, just hiding the va_list increment behind
GETVAARG. For such an approach to work, you must be able to load the
given type from a contiguous region of memory, which won't always be
true for targets with a more complex va_list struct.
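
To picture the proposal at the IR level (hypothetical: no such intrinsic
or node exists today), GETVAARG would only bump the va_list and yield the
slot address, with the loads exposed so they can be combined and
scheduled like ordinary loads:

```llvm
; Hypothetical function standing in for the GETVAARG node: takes the
; va_list and a slot size, advances the list, returns the slot address.
declare i8* @getvaarg_slot(i8**, i32)

define i32 @load_exposed(i8** %ap) {
entry:
  %slot = call i8* @getvaarg_slot(i8** %ap, i32 4)
  %addr = bitcast i8* %slot to i32*
  %val = load i32, i32* %addr   ; load now visible to the optimiser
  ret i32 %val
}
```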

Best,

Alex

Option 3: Teach va_arg to handle aggregates
    * In this option, va_arg might reasonably be expected to handle a
    struct, but would not be expected to have detailed ABI-specific
    knowledge. e.g. it won't automagically know whether a value of a
    certain size/type is passed indirectly or not. In a sense, this would
    put support for aggregates passed as varargs on par with aggregates
    passed in named arguments.
    * Casting would be necessary in the same cases where casting is required
for named args.
    * Support for aggregates could be implemented via a new module-level
pass, much like PNaCl's.
    * Alternatively, the conversion from the va_arg instruction to
    SelectionDAG could be modified. It might be desirable to convert the
    va_arg instruction to a number of loads and a new node that is
    responsible only for manipulating the va_list struct.

We could automatically split va_arg on an LLVM struct type into a series of
va_arg calls for each of the elements of the struct. Not sure that actually
helps anyone much, though.

Anything more requires full type information, which isn't currently encoded
into IR; for example, on x86-64, to properly lower va_arg on a struct, you
need to figure out whether the struct would be passed in integer registers,
floating-point registers, or memory.

I've been thinking more about this. Firstly, if anyone has insight into
any cases where the va_arg instruction actually provides better
optimisation opportunities, please do share. The va_arg IR instruction
has been supported in LLVM for over a decade, but Clang doesn't
generate it for the vast majority of the "top tier" targets. I'm
trying to determine if it just needs more love, or if perhaps it
wasn't really the right thing to express at the IR level. Is the main
motivation of va_arg to allow such argument access to be specified
concisely in IR, or is there a particular way it makes life easier for
optimisations or analysis (and if so, which ones and at which point in
compilation?).

We don't have any optimizations that touch va_arg, as far as I know. It's an instruction mostly because it got added when LLVM was first written, and nobody has bothered to try to get rid of it.

va_arg really does three things:
* Calculates how to load a value of the given type
* Increments the appropriate fields in the va_list struct
* Loads a value of the given type

The problem I see is that it's fairly difficult to specialise its behaviour
depending on the target. In one of the many previous threads about ABI
lowering, I think someone commented that in LLVM it happens both too
early and too late (in the frontend, and on the SelectionDAG). That
seems to be the case here: to support targets with a more complex
va_list struct featuring separate save areas for GPRs and FPRs,
splitting a va_arg into multiple operations (one per element of an
aggregate) doesn't seem like it could work without heroic gymnastics
in the backend.

Converting the va_arg instruction to a new GETVAARG SelectionDAG node
plus a series of LOADs seems like it may provide a straightforward
path to supporting aggregates on targets that use a pointer for
va_list. Of course this ends up exposing loads plus offset generation
in the SelectionDAG, just hiding the va_list increment behind
GETVAARG. For such an approach to work, you must be able to load the
given type from a contiguous region of memory, which won't always be
true for targets with a more complex va_list struct.

Really, IMO, we shouldn't have a va_arg instruction at all, but deprecating it is too much work to be worthwhile. :)

If we are going to keep it around, though, we should really do the lowering in IR, before we hit SelectionDAG. Like you explained, it's just a bunch of load and store operations, so there isn't any reason to wait, and transforming IR is much easier than lowering in SelectionDAG.

-Eli

We don't have any optimizations that touch va_arg, as far as I know. It's
an instruction mostly because it got added when LLVM was first written, and
nobody has bothered to try to get rid of it.

I couldn't find any optimisations that directly touch it either, and
it doesn't sound like people are rushing forward with examples where
generating IR with explicit va_list manipulation results in pessimised
codegen.

va_arg really does three things:
* Calculates how to load a value of the given type
* Increments the appropriate fields in the va_list struct
* Loads a value of the given type

The problem I see is that it's fairly difficult to specialise its behaviour
depending on the target. In one of the many previous threads about ABI
lowering, I think someone commented that in LLVM it happens both too
early and too late (in the frontend, and on the SelectionDAG). That
seems to be the case here: to support targets with a more complex
va_list struct featuring separate save areas for GPRs and FPRs,
splitting a va_arg into multiple operations (one per element of an
aggregate) doesn't seem like it could work without heroic gymnastics
in the backend.

Converting the va_arg instruction to a new GETVAARG SelectionDAG node
plus a series of LOADs seems like it may provide a straightforward
path to supporting aggregates on targets that use a pointer for
va_list. Of course this ends up exposing loads plus offset generation
in the SelectionDAG, just hiding the va_list increment behind
GETVAARG. For such an approach to work, you must be able to load the
given type from a contiguous region of memory, which won't always be
true for targets with a more complex va_list struct.

Really, IMO, we shouldn't have a va_arg instruction at all, but deprecating
it is too much work to be worthwhile. :)

If we are going to keep it around, though, we should really do the lowering
in IR, before we hit SelectionDAG. Like you explained, it's just a bunch of
load and store operations, so there isn't any reason to wait, and
transforming IR is much easier than lowering in SelectionDAG.

I agree. It seems there's an argument that va_arg could be much more
useful in the future, as part of an IR-level ABI lowering. Until that
exists, it's perhaps not a big deal either way. I'm CCing Tim Northover,
who committed the Clang AArch64/Darwin ABI lowering and perhaps has a
view on whether there's much value in using va_arg when possible.

va_list manipulation doesn't produce that much noise in the IR when
va_list is just a pointer. I suspect it's more noisy when va_list is a
struct, but there's not a clear path for expanding va_arg to handle
aggregates for those cases outside of an IR-level transform. I'm also
adding in Will Dietz, who has been involved in previous discussions
around this topic.

Best,

Alex