Experience with [[clang::musttail]]

Hi, I’m drafting a paper to propose standardization of [[clang::musttail]], since I find it really useful. It’s also implemented in GCC, and today I learned that an MSVC backend got it too.

Preliminary draft is here: D3939R0: tail recursion enforcement

Basically, I wonder if there are issues that you are aware of and I’m not, or if you would do it differently now.

Not sure about issues, or things that should be done differently, but I want to emphasize that this feature is incredibly useful for optimizing key routines. Some background of the motivating usages that led to this attribute:

The second of these has some specific reflections on the approach taken, and is likely relevant to how best to standardize this. Josh Haberman can likely provide more direct thoughts, I’ll point him at this thread.

But this technique has been adopted much more widely than even these mentions. For example, even Carbon’s lexer uses it: carbon-lang/toolchain/lex/lex.cpp at trunk · carbon-language/carbon-lang · GitHub

We didn’t adopt this without reason – there was nothing comparable to achieve the level of performance, small code size, and clear implementation. While this may seem “niche” or hard to motivate, I think this is a really, deeply important performance primitive to provide. And that is without any of the benefits it affords for compile time programming where it allows a categorically more powerful way to manage compile time evaluation stack depth.

FYI, the C committee is working on a TS proposing tail calls under the syntax return goto. I would prefer that C and C++ be consistent here: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3582.pdf

There are issues with putting a mandatory behavior behind an attribute, i.e., attributes may not be supported, which would defeat the intent of the feature.

@AaronBallman might be more aware of the C design and underlying discussions there - I’m not sure what the status of the TS is either.

I agree with Chandler that there is definitely a user need for something in that space.

+1; there’s precedent for an experiment to start out as a vendor attribute and then be standardized as a keyword ([[clang::require_constant_initialization]] turned into constinit for example). On the C side of things, the committee went with return goto.

My understanding from the Brno meeting is that we’re sending TS 25007 out for balloting. So it’s not yet published. I’ve not seen the ballot come out for it yet though, so uncertain of the timing of publication.

We’ve not discussed whether we intend to implement TS 25007 (whole or in part) in the language wg meeting yet, but I would be surprised if we didn’t at least do the constexpr function calls bit of the TS. And if we did that, I’d personally push to do return goto as well, if it maps sufficiently well to our existing facilities, just so we can get better feedback to WG14 on the design.

I wonder if guaranteed tail calls would allow std::task to use symmetric transfer

From P3801R0, section 3.1

A potential fix is adding symmetric transfer to ex::task with an operator co_await overload. However, while this would solve the example above, it would not solve the general problem of stack overflow when awaiting other senders. A thorough fix is non-trivial and requires support for guaranteed tail calls.

As I recall, the cases where we didn’t apply symmetric transfer were on hardware that didn’t support tail calls.

Hi, I implemented [[clang::musttail]] and wrote the blog posts Chandler referenced above.

Overall I love the idea of standardizing this feature. As Chandler mentioned, it’s been a crucial tool for optimizing a lot of performance-critical code.

In my view, the trickiest issue is that it is platform-dependent whether a given tail call can be optimized or not. For example, a tail call that works fine on x86-64 (which passes arguments in registers) might be impossible to tail call on x86 (which passes arguments on the stack).

The existing [[clang::musttail]] attribute tries to solve this problem by defining a set of rules about how the caller and callee function signatures must match. This set of rules tries to provide a “portable” guarantee:

The target function must have the same number of arguments as the caller. The types of the return value and all arguments must be similar according to C++ rules (differing only in cv qualifiers or array size), including the implicit “this” argument, if any. Any variables in scope, including all arguments to the function and the return value must be trivially destructible. The calling convention of the caller and callee must match, and they must not be variadic functions or have old style K&R C function declarations.

[…]

clang::musttail provides assurances that the tail call can be optimized on all targets, not just one.
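To make the quoted rules concrete, here is a minimal sketch of a call pattern that satisfies them: both functions have the same number of parameters, the same parameter and return types, and no non-trivially-destructible locals. The `MUSTTAIL` macro and the function names are illustrative assumptions, not taken from the Clang documentation or from any project mentioned in this thread. On compilers that do not report support for the attribute, the macro expands to nothing and the calls are ordinary calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Use the attribute only where the compiler reports support for it.
#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::musttail)
#    define MUSTTAIL [[clang::musttail]]
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL  // No guarantee here: the call may still consume stack.
#endif

// Two states of a tiny word counter. Both share one signature, so every
// tail call below has matching caller and callee prototypes.
std::size_t in_word(const char* p, const char* end, std::size_t words);

std::size_t between_words(const char* p, const char* end, std::size_t words) {
    if (p == end) return words;
    if (*p == ' ') {
        MUSTTAIL return between_words(p + 1, end, words);
    }
    // A non-space character starts a new word.
    MUSTTAIL return in_word(p + 1, end, words + 1);
}

std::size_t in_word(const char* p, const char* end, std::size_t words) {
    if (p == end) return words;
    if (*p == ' ') {
        MUSTTAIL return between_words(p + 1, end, words);
    }
    MUSTTAIL return in_word(p + 1, end, words);
}

// Entry point with a different signature; this is a normal call, not a
// musttail call.
std::size_t count_words(const char* text) {
    assert(text != nullptr);
    return between_words(text, text + std::strlen(text), 0);
}
```

For example, `count_words("tail calls  are useful")` evaluates to 4. Changing one of the two state functions to take an extra parameter would break the "same number of arguments" rule, and Clang would reject the attributed call.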

I would argue that this design has not worked well in practice. After this attribute landed in Clang, two main things happened:

  1. Some backends failed to optimize certain tail calls, even though they followed the “portable” rules. This manifested as compiler crashes in the backend (1, 2, 3, 4, etc).
  2. Users complained that the compiler rejected [[clang::musttail]] even on calls that were clearly possible to optimize on the current platform.

In other words, the current set of constraints is simultaneously too strict and not strict enough. It’s not strict enough to provide the guarantees it wants to provide, but it is also too strict compared to what users want.

I think the best solution to this conundrum is to do what the proposed C standard does: just make it completely architecture-dependent whether a given musttail is accepted or not. (Incidentally, I also like the return goto syntax proposed for C.)

In practice, this will mean that each project has to manually put #ifdef around any code that uses tail calls, and manually manage which set of platforms use the return goto path. This is what Protobuf (for example) does now.

This is not the most elegant solution, but it is simple and transparent. It gives the user the full capabilities of the current platform, while empowering them to write a fallback path that does not require tail calls.
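A rough sketch of what such an #ifdef split can look like is below. It is not Protobuf’s actual code: the `USE_TAIL_CALLS` macro, the opcode set, and the handler names are all made up for illustration. The tail-call path dispatches through a table of handlers with identical signatures, and the fallback path is a plain loop. The sketch assumes the program contains only valid opcodes and ends with `kHalt`.

```cpp
#include <cassert>
#include <cstdint>

// Project-level switch (hypothetical name): take the tail-call path only
// where the compiler reports support for the attribute.
#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::musttail)
#    define USE_TAIL_CALLS 1
#  endif
#endif
#ifndef USE_TAIL_CALLS
#  define USE_TAIL_CALLS 0
#endif

enum Op : std::uint8_t { kInc = 0, kDouble = 1, kHalt = 2 };

#if USE_TAIL_CALLS
// Every handler has the same prototype, so each dispatch is a legal musttail.
using Handler = std::int64_t (*)(const std::uint8_t* pc, std::int64_t acc);
std::int64_t op_inc(const std::uint8_t* pc, std::int64_t acc);
std::int64_t op_double(const std::uint8_t* pc, std::int64_t acc);
std::int64_t op_halt(const std::uint8_t* pc, std::int64_t acc);
constexpr Handler kTable[] = {op_inc, op_double, op_halt};

// pc always points at the next opcode to execute.
std::int64_t op_inc(const std::uint8_t* pc, std::int64_t acc) {
    [[clang::musttail]] return kTable[*pc](pc + 1, acc + 1);
}
std::int64_t op_double(const std::uint8_t* pc, std::int64_t acc) {
    [[clang::musttail]] return kTable[*pc](pc + 1, acc * 2);
}
std::int64_t op_halt(const std::uint8_t*, std::int64_t acc) { return acc; }
#endif

std::int64_t run(const std::uint8_t* program) {
    assert(program != nullptr);
#if USE_TAIL_CALLS
    return kTable[*program](program + 1, 0);
#else
    // Fallback path: a conventional switch loop with the same semantics.
    std::int64_t acc = 0;
    for (const std::uint8_t* pc = program;; ++pc) {
        switch (*pc) {
            case kInc: acc += 1; break;
            case kDouble: acc *= 2; break;
            default: return acc;  // kHalt
        }
    }
#endif
}
```

Running the program `{kInc, kInc, kDouble, kInc, kHalt}` yields 5 on either path. A real project would probably key `USE_TAIL_CALLS` on an explicit list of compilers and targets rather than on the attribute check alone, since the backend can still reject a tail call on a given platform.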

If we were trying to do the most elegant thing, we might wish for a constexpr function like constexpr bool std::can_tail_call<From, To>(), so that you could use if constexpr () or templates to precisely target the set of platforms where a given tail call will be possible. But this would require that target-specific information be fed into the Clang frontend for performing semantic analysis. This would introduce coupling between Clang and LLVM, and would cause divergence from C (which would still need to use #ifdef). Overall, I think the complexity of this would make it far more difficult to implement, and would make it less likely that compilers would support it even if it was standardized.
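For illustration only, this is roughly how user code could look if such a query existed. No such facility exists today: `sketch::can_tail_call` below is a hypothetical stand-in that merely checks for attribute support and identical signatures, which is far weaker than the target-aware answer described above. The `MUSTTAIL` and `HAS_MUSTTAIL` macros are likewise assumptions of this sketch.

```cpp
#include <cassert>
#include <type_traits>

#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::musttail)
#    define MUSTTAIL [[clang::musttail]]
#    define HAS_MUSTTAIL 1
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#  define HAS_MUSTTAIL 0
#endif

namespace sketch {
// HYPOTHETICAL stand-in for the imagined std::can_tail_call<From, To>().
// A real version would need target-specific knowledge from the backend;
// this one only approximates the "matching prototype" rule.
template <typename From, typename To>
constexpr bool can_tail_call() {
    return HAS_MUSTTAIL != 0 && std::is_same<From, To>::value;
}
}  // namespace sketch

using SumSig = long(const int*, const int*, long);

long sum_range(const int* first, const int* last, long acc) {
    assert(first <= last);  // Precondition: a valid half-open range.
    if (first == last) return acc;
    if constexpr (sketch::can_tail_call<SumSig, SumSig>()) {
        // Guaranteed tail call where the query says it is possible.
        MUSTTAIL return sum_range(first + 1, last, acc + *first);
    } else {
        // Fallback that needs no tail call at all.
        for (; first != last; ++first) acc += *first;
        return acc;
    }
}
```

The appeal is that the choice lives in the language rather than in the preprocessor. The cost, as noted above, is that the frontend would have to answer a question only the backend can really answer.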

So overall, I propose to relax the constraints of musttail, at both the Clang and LLVM level. The backend can issue an error diagnostic if it finds that the given tail call is not possible on the current platform.

I’m also a big fan of [[clang::musttail]], and if it gets standardized that’s even better. For my projects it doesn’t matter if it is going to be standardized in some completely different form.

However, I’ve encountered some issues with the current implementation of [[clang::musttail]]. E.g. it doesn’t work well with noexcept: [[musttail]] does not work in trivial noexcept functions · Issue #53087 · llvm/llvm-project · GitHub.

I think I’m responsible for coming up with the LLVM constraint that prototypes must match, but it might have been a group effort, it’s hard to remember. I pushed an unfinished RFC draft proposing that we relax the rules in LLVM, but it’s more like a blog post, and I’ve got too many open proposals on my plate to push it further.

I mostly agree with @haberman that we should relax the constraints, and that’s what I proposed in the draft RFC, but I paused after I realized that, without the verifier rules, it means that valid middle-end transforms can transform calls in ways that make them impossible to lower. This is what I wrote:

The next important consideration is that, if you want musttail calls to be
reliable, you need to ensure that they endure through mid-level transformations.
The prototype match verifier rule was a powerful tool for auditing LLVM for
transforms like argument promotion which can change the prototype in ways that
break the backend’s ability to emit the tail call. For example, argument
promotion is a kind of interprocedural scalar-replacement-of-aggregates (SROA),
and it can easily take a tail call that would only use registers, to one that
passes arguments in memory, and would cause the backend to fail to emit the tail
call that the user requested. If you dig through the logs, you can find many
instances of folks powering down IPO transforms in the presence of musttail
calls thanks to this verifier check.

And it is true, if you grep llvm/lib/Transforms, there are lots of IPO transforms that care about musttail. If we relax the langref rules, we can keep all of those checks, but they become unprincipled backend compatibility hacks, instead of IR invariant preservation rules. The “matching prototype” constraint makes it clear exactly what kinds of dead argument elimination transforms are OK, for example.

This was my unfinished conclusion:

So where should we go from here?

Before we do anything, we need to improve the backend diagnostic. LLVM has
facilities for reporting backend errors for stack frame size overflow and inline
assembly emission errors, and we should be using those APIs to emit readable
error messages instead of an internal error stack spew. This should reduce the
rate of bug reports over the long run since we’ll stop prompting users to file
bugs on impossible musttail calls.

Personally, my view is that we should compromise on portability and just give the performance tweakers what they want, and remove all of the frontend and verifier safety checks from the feature.

… but I’m not super confident in that final sentence. Read it as internal debate.

Could LLVM have a custom tail call calling convention that allows tail calls regardless of the exact signature?

On the LLVM side, we already have tailcc, which supports arbitrary tail calls. It would be straightforward to expose it as a clang function attribute. But then the user would need to mark all the relevant functions with the attribute.

Thank you so much, this is super valuable perspective and makes a lot of sense.

I would be very against any proposal that does not take this feedback into account.

How does tailcc compare to preserve_none? There’s been a fair bit of effort in the last few years to optimize preserve_none to work well for tail calls, and it is already exposed as a Clang attribute. I wrote some details in my blog entry: A Tail Calling Interpreter For Python (And Other Updates)

Many existing musttail users are using preserve_none today, including Protobuf and the experimental Python tail calling interpreter. I don’t know if it can tail call in as many cases as tailcc does, but if not I’d be happy to see it expanded.

Calling conventions can be a useful tool for making tail calls more optimal or more likely to succeed, but I’m not sure we’d want to require them for musttail, especially since the latter can hopefully be standardized but the calling conventions probably wouldn’t make sense to.

So I still tend to think that the semantics should be: tail call if possible on the current platform, and error out otherwise. Applying preserve_none (or tailcc) would increase the likelihood of success, but even that will never be a guarantee (e.g., WebAssembly 1.0 can never support them).
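As a minimal sketch of combining the two, the handlers below are marked with `preserve_none` where the compiler reports the attribute, and tail call each other with `musttail`. The macro names and the parsing example are assumptions for illustration, not code from Protobuf or the Python interpreter. Since Clang requires the caller and callee calling conventions to match, every function in the tail-call chain carries the same attribute, while the entry point uses the default convention and makes an ordinary call.

```cpp
#include <cassert>

#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::musttail)
#    define MUSTTAIL [[clang::musttail]]
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

// preserve_none is only available on some compilers and targets.
#if defined(__has_attribute)
#  if __has_attribute(preserve_none)
#    define PRESERVE_NONE __attribute__((preserve_none))
#  endif
#endif
#ifndef PRESERVE_NONE
#  define PRESERVE_NONE
#endif

// All functions in the chain share one signature and one calling convention.
PRESERVE_NONE long next(const char* p, long value);

PRESERVE_NONE long on_digit(const char* p, long value) {
    MUSTTAIL return next(p + 1, value * 10 + (*p - '0'));
}

PRESERVE_NONE long on_separator(const char* p, long value) {
    MUSTTAIL return next(p + 1, value);  // '_' is skipped, as in 1_000
}

PRESERVE_NONE long next(const char* p, long value) {
    if (*p >= '0' && *p <= '9') {
        MUSTTAIL return on_digit(p, value);
    }
    if (*p == '_') {
        MUSTTAIL return on_separator(p, value);
    }
    return value;  // Any other character, including '\0', ends the number.
}

// Default calling convention; this is a normal (non-tail) call into the chain.
long parse_number(const char* text) {
    assert(text != nullptr);
    return next(text, 0);
}
```

Here `parse_number("1_000")` evaluates to 1000. The calling convention changes which registers are saved and how arguments are passed, but it does not change the matching-signature requirement.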

I’m glad Reid jumped in though, because I don’t know what’s needed to actually implement this in LLVM.

The important property of tailcc is that it makes the function callee-pop. Each call is expected to clean up its arguments, in addition to whatever it allocates for itself. If the callee of a musttail call is tailcc, the amount of stack allocated for the caller’s argument list isn’t relevant, and we don’t need to enforce any restrictions based on the relationship between the caller signature and the callee signature.

Without something like tailcc, we need to impose a constraint that the argument lists of the caller and callee use the same amount of stack. And that’s weird to specify: the amount of stack the argument list of a call uses is extremely target-specific and hard to calculate. We don’t really want to say in the clang manual “please refer to your target’s System V Supplement to see if your specific combination of caller and callee signatures is legal”. And it’s not something we currently track at the LLVM IR level, which makes it hard to enforce consistently (for frontends, and for optimizations). Requiring that the signature exactly matches solves this in a way that’s easy to explain and enforce.
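One common way to live with the exact-match rule is to give every function in the chain the same prototype and move anything handler-specific into a context object passed by pointer. The sketch below assumes that pattern; the `Ctx` struct, the handler names, and the `MUSTTAIL` macro are hypothetical. `flush` would naturally take the finished value as an extra argument, but that would make its argument list differ from `scan`, so the value travels through `Ctx` instead.

```cpp
#include <cassert>

#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::musttail)
#    define MUSTTAIL [[clang::musttail]]
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

// Shared state. Trivially destructible, and only ever passed by pointer.
struct Ctx {
    long total = 0;    // Sum of all completed fields.
    long current = 0;  // Value of the field being parsed.
};

long scan(Ctx* ctx, const char* p);

// Ends the current field. Same prototype as scan, so both directions of
// the tail call use identical argument lists.
long flush(Ctx* ctx, const char* p) {
    ctx->total += ctx->current;
    ctx->current = 0;
    if (*p == '\0') return ctx->total;
    MUSTTAIL return scan(ctx, p + 1);  // Skip the separator character.
}

long scan(Ctx* ctx, const char* p) {
    if (*p >= '0' && *p <= '9') {
        ctx->current = ctx->current * 10 + (*p - '0');
        MUSTTAIL return scan(ctx, p + 1);
    }
    MUSTTAIL return flush(ctx, p);  // Any non-digit ends the field.
}

// Sums separator-delimited non-negative integers, e.g. "1,20,3".
long sum_fields(const char* text) {
    assert(text != nullptr);
    Ctx ctx;
    return scan(&ctx, text);  // Ordinary call; ctx outlives the whole chain.
}
```

With this layout `sum_fields("1,20,3")` evaluates to 24. The trade-off is that state which could have lived in argument registers now lives in memory, which is part of why the matching-signature rule feels restrictive to performance-focused users.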

preserve_none doesn’t really solve anything along that dimension; it’s a caller-pop convention, with the same issues as other caller-pop conventions.


Most of the current bugs with musttail under the “matching-signature” constraint are simply about missing code to manage arguments passed on the stack. There’s no fundamental reason it doesn’t work; it just requires a bit of target-specific code. I think the only target with an exotic ABI issue is PowerPC ([PowerPC] [LLD][Compiler] support tail call across modules for ELFv2 · Issue #98859 · llvm/llvm-project · GitHub).

Callee-pop sounds like a nice way to handle arguments that don’t fit in registers, but I’m still not seeing how we could standardize musttail in a way that requires the callee to have a special calling convention.

Unless you’re proposing to also standardize the tailcc calling convention, there would be no way of using the standard musttail (or return goto) in a portable program – it would always require a compiler-specific calling convention. It seems like we need to make it legal to use musttail with any calling convention (including the default convention).

For optimizations, could we just say that it’s not safe for the optimizer to rewrite function signatures annotated with musttail unless the callee is known to use a callee-pop convention? That doesn’t guarantee that the tail call will work, but it can at least guarantee that the optimizer didn’t break a tail call that would have worked.

I agree that this makes it platform-specific whether a given tail call will be possible or not, but there are various ways of dealing with this (including reaching for a callee-pop convention) if desired.

It’s not that much of a stretch to standardize something like tailcc. I mean, the C++ standard wouldn’t need to say very much about the actual mechanics of it, just “this function uses a calling convention that is suitable for ‘return goto’”. And it doesn’t need to do anything at all on targets that don’t support “return goto”.

LLVM optimizations currently impose a strong restriction on optimizing functions that contain musttail calls; we can continue to enforce the same restrictions, but it would be harder to catch bugs because the IR verifier can’t tell if a construct is legal.


I’m generally unhappy with standardizing “it’s implementation-defined whether tail calls work on Friday the 13th”.

I’m not opposed to standardizing a tail call convention, but I do think it’s important that we don’t let our quest for predictability compromise any of the efficiency gains that motivated musttail to begin with.

In particular, we want all the efficiency benefits that we are getting from preserve_none today, specifically:

  1. preserve_none has no callee-save registers. This is important because it allows a series of tail called functions to use all available registers without having to save/restore them. In practice, this removes a lot of the prologue/epilogue we would otherwise get. (Example: Compiler Explorer. Notice that preserve_none avoids having to push/pop r14 and rbx).
  2. preserve_none allocates far more registers for arguments, which allows us to keep more state in registers between functions (Example: Compiler Explorer).
  3. preserve_none allocates the registers that would normally be callee-save as its first arguments. This means that those arguments can stay in registers if we call a function using a normal calling convention. (Example: Compiler Explorer). [1]

Could tailcc be modified to support all of these patterns, if it doesn’t already? I couldn’t easily check for myself since it’s not exposed to Clang currently.


[1] I think preserve_none could still be improved on this point; see: Try to use non-volatile registers for `preserve_none` parameters by brandtbucher · Pull Request #88333 · llvm/llvm-project · GitHub

The current tailcc is more similar to the C calling convention in terms of registers saved, rather than preserve_none. But that’s just what’s currently implemented. There isn’t any fundamental connection between whether a function is callee-pop, and how many registers are callee-save.

You can experiment with __attribute__((swiftasynccall)) , which is similar.

Revisiting this after the holidays, I’ve changed my mind. I think Clang and LLVM should keep the draconian requirements for prototype match for caller-pop conventions. The main value of the rule is that it makes the semantics for what the middle-end is allowed to do super clear.

I agree with Ralf and others that we should promote tailcc and other callee-pop conventions as the practical solution for making heterogeneous tail call chains work. If the programmer is serious about experimenting with guaranteed tail calls, musttail, return goto, etc, they would greatly benefit from the flexibility of a callee-pop convention. If we wanted to be really helpful, we could even find a way to work it into the compiler error message (“; consider using a callee-pop convention such as tailcc”).

I don’t think we should try to move heaven and earth to feed SysV calling convention details into a constexpr value and TargetTransformInfo data that every IPO pass that touches prototypes needs to check.

From reviewing the C and C++ proposals, it sounds like this limitation conforms: an implementation is allowed to reject a tail call that it can’t lower, and that’s what we do today.

I would, however, really like to see us improve the backend diagnostic. report_fatal_error is a bad way to tell the user about backend limitations like this. We should use something more like an inline assembly register allocation failure diagnostic that points to the original call source location when available.

Strongly agreed.

Another idea is to look at how __attribute__((error)) is implemented; that’s a frontend attribute which generates a diagnostic from the backend and uses the actual diagnostics engine in Clang to do so.