[RFC] Desugar variadics. Codegen for new targets, optimisation for existing

Variadic functions are an ABI concern handled between clang and target backends. They currently block various optimisations (e.g. inlining) and require target-specific lowering. They also interfere with some analyses, e.g. computing the stack size required by a call tree. I remember writing the backend call lowering for a target which only really wanted a functional printf, and am presently trying to avoid writing variadic call lowering in SDag and GlobalISel for amdgpu.

For targets which do not have an ABI locking them into an existing variadic calling convention, a reasonably simple target-independent pass can rewrite all variadic functions to take a void* instead of … and patch up the call sites to match. This could look like the work-in-progress D158246. With something equivalent to that in tree, targets that want working variadics and aren’t too worried about the details can cross them off their todo list in two lines of code. That would have brought me joy in a past role.

This RFC exists because an emergent feature of this transform is that it unblocks all the optimisations currently thwarted by the variadic type. I’m interested in feedback on the current patch (rewriting the function has more edge cases than I expected), in thoughts on changing how variadics are currently handled on other targets, and on whether this is indeed worth generalising as an optimisation for arbitrary targets.

Plan of record:

  • I fix remaining handling in D158246
  • Ship as a codegen pass for amdgpu, rewriting all functions in the module
  • Optionally ship likewise on nvptx (would be helpful for libc testing)
  • Escape analysis on va_list values to enable the same rewrite on some functions in other targets
  • Evaluate abi-preserving translation on call sites on other targets
  • See whether some existing backend lowering could be replaced with this sort of pass
  • Refactor pass based on which parts generalised poorly
  • When satisfied, move pass from codegen to transforms and enable it for all targets

In particular, I’d like the minimal wrapper functions around vprintf that are popular for logging to inline and then constant-fold, even on targets that use va_list representations other than void*.

The pass structure, which separates is-safe-to-transform from apply-transform, lends itself well to being called either from codegen to transform everything, or from opt to transform only functions which are safe (the va_list doesn’t escape) and profitable (likely to be inlined afterwards).

The general premise of replacing a variadic call with stashing the arguments on the caller stack and then branching to a modified callee which expects a void * is sound. Using it as the ABI on amdgpu unblocks libc and an internal ticket and gives a testing ground to get the details of the transform right before applying it more generally.

Thanks!


Thanks for working on this. We really want the functionality in the middle end, our current lack of optimizations for varargs is scary.

In general, the layout of the variadic argument buffer is target-specific, so converting the variadic argument list to a buffer needs to be target-specific. Given that, generalizing the transform is going to be complicated. But having the functionality would probably be useful.

I think there are two reasons for optimism even when the target-specific layout must be left unaltered.

One is that some functions can be rewritten anyway, because the va_list structure doesn’t escape. E.g. any leaf function.

Two is that lots of targets use a void* to an array of ints as the representation, and only vary in whether some int slots are skipped on alignment grounds.

Unfortunately x64 is one of the platforms I’m interested in and that structure is not simple. Still thinking about that one. A target specific hook that puts N values in a struct that is ABI compatible with a va_list might be reasonable, especially if the default works for most targets.

I don’t have line of sight on the optimal design choices yet. The space looks tractable though.

Update. As originally envisioned, this doesn’t work. Revised design does work and should be ready for review shortly.

The idea was to use a simple target agnostic lowering which happened to be the right thing for the GPUs, using escape analysis to apply that calling convention to arbitrary targets where valid. The lowering then unblocks inlining etc and reduces applicable variadic calls to zero cost.

That is mostly thwarted by the escape analysis, which ends up being the most complicated part of the pass. I didn’t anticipate people wrapping va_list in classes or generally copying them around. It’s also difficult to lower va_arg in clang and then choose a calling convention in the middle end.

What does work is converting foo(…) to foo(va_list), where the va_list is still target-dependent, and then setting up a call frame which looks enough like the native one for clang’s va_arg lowering to do the right thing.

It turns out the native variadic frame layout is excitingly target-dependent but also heavily constrained by the design space. It’s therefore possible to target amdgpu and x64 Linux with almost the same code, despite the differences in va_list. Arm and nvptx are likewise different and still OK.

As an optimisation pass on x64, this reduces my unit test file to ret 0: everything can be dissolved by existing passes after the transform. Marking functions noinline lets one check the execution is sound in the machine code too.

The insight is to build a function taking a va_list instead of …, fix up all the known call sites to call the new function, and leave behind a foo(…) that calls into the new function to preserve indirect calls and other escapes.

For amdgpu and nvptx, we can also RAUW the original before codegen, which means the backend never needs to implement vararg lowering.

Testing runs split-function, expand-calls, and rewrite-abi as separate steps in a pipeline, which helps tame the combinatorial explosion. In particular, split-function carries most of the complexity of creating an equivalent function and is wholly architecture-independent.

Hoping to run the remaining gpu hang weirdness to ground shortly. X64 is there, modulo working out how to distinguish the Microsoft and Linux calling conventions in the middle end.


First patch at [transforms] Inline simple variadic functions by JonChesterfield · Pull Request #81058 · llvm/llvm-project · GitHub is a refined version of the above.

A variadic function that happens to be a trivial wrapper around a va_list can be ‘inlined’ by building a target-specific va_list value and emitting a call to the va_list-taking function instead. This occurs in practice occasionally, e.g.

#include <stdarg.h>
#include <stdio.h>

int vfprintf(FILE *restrict f, const char *restrict fmt, va_list ap);
int fprintf(FILE *restrict f, const char *restrict fmt, ...)
{
  va_list ap;
  va_start(ap, fmt);
  int ret = vfprintf(f, fmt, ap);
  va_end(ap);
  return ret;
}

The vast majority of the target specific logic can be constrained to creating a va_list value which contains the values passed to the variadic function.

Most variadic functions don’t match this form. Those can be split into an internal function that does the same work as the original but takes a va_list, and a function which replaces the original and forwards to it via va_start. The va_start in the original variadic body can be replaced with a va_copy of the new trailing va_list argument. This is more complicated than anticipated but is almost target-independent.

Finally, the codegen equivalent additionally rewrites calls to unknown variadic functions and takes care with external symbols. That can be run on x64/aarch64 under tests, or as a whole-program optimisation in the statically linked case from lld.

There’s a tail of edge cases but I think the outlined strategy is right:

  1. convert arbitrary variadic into a trivial one
  2. target specific inlining of trivial variadic
  3. use the same transform for codegen

This has landed for some targets. AMDGPU and NVPTX have it live; WebAssembly has it implemented and used for testing the implementation, but disabled by default. X64 and AArch64 are on local branches that I need to revive and post for review.
