[RFC] Replacing inalloca with llvm.call.setup and preallocated

Hello all,

A few years ago, I added the inalloca feature to LLVM IR so that Clang could be C++ ABI compatible with MSVC on 32-bit x86. The feature works, but there is room for improvement. I recently took the time to write up a design using token values that will hopefully be better named and easier to work with and around.

For the technical details of the proposal, I’ve written up the RFC in Markdown here:
https://github.com/rnk/llvm-project/blob/call-setup-docs/llvm/docs/CallSetup.md
I’ve pasted the text below if you want to quote and reply on the list.

The main question I have for the community is, given that it is infeasible to upgrade inalloca to llvm.call.setup, can we drop support for the old IR? So far as I am aware, Clang has been the only user of this attribute, and only for i*86-windows-msvc targets. The main goal of adding llvm.call.setup is to get rid of inalloca, so if all LLVM IR transforms have to keep knowing about it, there is no reason to add a new way to do the same thing.

Thanks,
Reid

The last RFC: http://lists.llvm.org/pipermail/llvm-dev/2013-July/064218.html
Current inalloca docs: https://llvm.org/docs/InAlloca.html
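For readers who haven't worked with inalloca, here is a rough before/after sketch of the shape of the IR. The intrinsic and attribute names are taken from the proposal; the exact signatures may differ from what finally lands:

```llvm
; Today: inalloca. The argument memory is a dynamic alloca that doubles
; as the outgoing argument area of the call.
%argmem = alloca inalloca <{ %struct.Foo }>
call void @f(<{ %struct.Foo }>* inalloca %argmem)

; Proposed: a token-based call setup region with a preallocated argument.
%cs = call token @llvm.call.preallocated.setup(i32 1)
%p = call i8* @llvm.call.preallocated.arg(token %cs, i32 0) preallocated(%struct.Foo)
; ... construct the argument in %p ...
call void @f(%struct.Foo* preallocated(%struct.Foo) %p) [ "preallocated"(token %cs) ]
```

The key difference is that the token explicitly ties the allocation to its one call site, rather than encoding the pairing implicitly in stack adjustment order.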

I assume by “drop support”, you mean reject it in the bitcode reader/IR parser? We can’t reasonably support a complex feature like inalloca if nobody is testing it. If we can’t reasonably upgrade it, and we don’t think there are any users other than clang targeting 32-bit Windows, probably dropping support is best.

More detailed comments on the proposal:

“llvm.call.setup must have exactly one corresponding call site”: Normal IR rules would allow cloning the call site (in jump threading), or erasing the call site (if there’s a noreturn call in an argument). What’s the benefit of enforcing this rule, as opposed to just saying all the call sites must have the same signature?

The proposal doesn’t address what happens if llvm.call.setup is called while there’s another llvm.call.setup still active. Is it legal to call llvm.call.setup in a loop? Or should nested llvm.call.setup calls have the parent callsetup token as an operand?

Is there some way we can allow optimizations if we can’t modify the callee, but we can prove nothing captures the address of the preallocated region before the call? I guess under the current proposal we could transform preallocated->byval, but that isn’t very exciting.

How does this interact with other dynamic stack allocations? Should we switch VLAs to use a similar mechanism? (The problems with dynamic alloca in general aren’t as terrible, but it might still benefit: for example, it’s much easier to transform a dynamic allocation into a static allocation.)

“If an exception is thrown and caught within the call setup region, the newly established SP must be saved into the EH record when a call is setup.” What makes this case special vs. what we currently implement? Is this currently broken? Or is it related to supporting frame pointer elimination?

-Eli

I assume by “drop support”, you mean reject it in the bitcode reader/IR parser? We can’t reasonably support a complex feature like inalloca if nobody is testing it. If we can’t reasonably upgrade it, and we don’t think there are any users other than clang targeting 32-bit Windows, probably dropping support is best.

That’s a good point. There are already enough lightly tested features in LLVM. There’s no reason to leave another one lying around like a trap for the first unsuspecting user to try it.

More detailed comments on the proposal:

“llvm.call.setup must have exactly one corresponding call site”: Normal IR rules would allow cloning the call site (in jump threading), or erasing the call site (if there’s a noreturn call in an argument). What’s the benefit of enforcing this rule, as opposed to just saying all the call sites must have the same signature?

I think we could cope with unreachable code elimination deleting a paired call site (zero or one), but code duplication creating a second call site could be problematic. The call setup doesn’t describe the prototype of the main call site, so if there were multiple call sites, the backend would have to pick one call site arbitrarily or compare the call sites when setting up the call. If there are zero call sites, the backend can create static allocas of the appropriate type to satisfy the allocations. Of course, an IR pass (instcombine?) should do this transform first if it sees it. Maybe we could have CGP take care of it, too.
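Concretely, the zero-call-site cleanup might look like this (a sketch using the intrinsic names from the proposal):

```llvm
; Before: the setup's paired call site was deleted as unreachable.
%cs = call token @llvm.call.preallocated.setup(i32 1)
%p = call i8* @llvm.call.preallocated.arg(token %cs, i32 0) preallocated(%struct.Foo)

; After: the setup is gone, and the preallocated argument slot becomes
; an ordinary static alloca of the type named by the attribute.
%mem = alloca %struct.Foo
%p2 = bitcast %struct.Foo* %mem to i8*
```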

The proposal doesn’t address what happens if llvm.call.setup is called while there’s another llvm.call.setup still active. Is it legal to call llvm.call.setup in a loop? Or should nested llvm.call.setup calls have the parent callsetup token as an operand?

Nested setup is OK, but the verifier rule that there must be a paired call site should make it impossible to do in a loop. I guess we should have some rule to reject the following:
%cs1 = llvm.call.setup()
%cs2 = llvm.call.setup()
call void @cs1() [ "callsetup"(token %cs1) ]
call void @cs2() [ "callsetup"(token %cs2) ]

Is there some way we can allow optimizations if we can’t modify the callee, but we can prove nothing captures the address of the preallocated region before the call? I guess under the current proposal we could transform preallocated->byval, but that isn’t very exciting.

I suppose we could say that the combo of byval+preallocated just means byval, and teach transforms that that’s OK.
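In IR terms, that transform might look like the following sketch (attribute spellings taken from the proposal; %cs is the setup token from earlier in the function):

```llvm
; Before: the argument is both preallocated and byval.
call void @g(%struct.Foo* preallocated(%struct.Foo) byval(%struct.Foo) %p)
    [ "preallocated"(token %cs) ]

; After: drop the bundle and the preallocated attribute. byval alone
; already gives the callee its own copy, so the allocation no longer
; needs to live in the callee's argument area.
call void @g(%struct.Foo* byval(%struct.Foo) %p)
```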

How does this interact with other dynamic stack allocations? Should we switch VLAs to use a similar mechanism? (The problems with dynamic alloca in general aren’t as terrible, but it might still benefit: for example, it’s much easier to transform a dynamic allocation into a static allocation.)

VLAs could use something like this, but they are generally of unknown size while call sites have a known fixed size. I think that makes them pretty different.

“If an exception is thrown and caught within the call setup region, the newly established SP must be saved into the EH record when a call is setup.” What makes this case special vs. what we currently implement? Is this currently broken? Or is it related to supporting frame pointer elimination?

I think of it as a special case because you can’t write this in standard C++. Today, I think we leak stack memory in this case. There’s no correctness issue because we copy SP into its own virtual register at the point of the alloca, and arguments are addressed relative to the vreg. What I had in mind for the new system is that we make some kind of fixed stack object that uses pre-computed SP offsets, assuming there are no dynamic allocas in the function. This would be a problem for a program that does:

setup call 1
store call 1 arg 0
try {
  setup call 2
  throw exception
  call 2
} catch (…) {}
; call 2’s frame is still on the stack
store call 1 arg 1 ; SP offset would be incorrect
call 1

Reply inline. (Sorry about the formatting; I can’t figure out how to avoid destroying it in Outlook.)

Sorry for the delay. Arthur Eubanks has started working on the design here:
https://reviews.llvm.org/D74651
I felt I should follow up here about that.

It doesn’t seem like multiple call sites should be a problem if they’re sufficiently similar? If the argument layout for each call site is the same, it doesn’t matter which call site the backend chooses to compute the layout.

It’s feasible, but it seems like asking for bugs. What should the backend do when it detects two preallocated calls with different layouts? Let’s imagine LTO is promoting one indirect call that uses call.setup into several direct calls. The IR would look like this:

%cs = call.setup
switch i32 %callee …
callee1:
call void @callee1(i32 %x, %struct.Foo* preallocated(%struct.Foo) %foo)
br label %rejoin

callee2:
call void @callee2(i32 %x, %struct.Foo* preallocated(%struct.Foo) %foo)
br label %rejoin

rejoin:

A logical next step would be to run DAE (dead argument elimination). Suppose one callee does not use i32 %x above. Now the prototypes disagree, and we can’t lower the call. We could teach DAE that all calls using callsetup tokens have to have the same prototype, but a simple verifier rule earlier (one call per call setup) seems easier to enforce.

Nested setup is OK, but the verifier rule that there must be a paired call site should make it impossible to do in a loop. I guess we should have some rule to reject the following:

%cs1 = llvm.call.setup()
%cs2 = llvm.call.setup()
call void @cs1() [ "callsetup"(token %cs1) ]
call void @cs2() [ "callsetup"(token %cs2) ]

I think in general, there can be arbitrary control flow between a token and its uses, as long as the definition dominates the use. So you could call llvm.call.setup repeatedly in a loop, then call some function using the callsetup token in a different loop, unless some rule specific to callsetup forbids it.

I agree that the IR would validate according to the current rules, but I would interpret that IR as meaning: setup N call sites in a loop, then destroy just the last call site N times in a loop. For example, if you replaced the call site token in this example with stacksave+alloca and teardown with stackrestore, the semantics would be: allocate lots of stack, then reset to the last saved stack pointer repeatedly.
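The stacksave analogy can be written out explicitly (a sketch; the loop conditions %c1/%c2 and size %n are invented for illustration):

```llvm
loop1:                                    ; runs N times
  %sp = call i8* @llvm.stacksave()
  %a = alloca i8, i32 %n
  br i1 %c1, label %loop1, label %loop2

loop2:                                    ; runs N times
  ; %sp names one instruction executed N times, so it holds only the
  ; save from the final iteration: every restore resets to the last
  ; saved stack pointer, and the earlier N-1 allocations leak.
  call void @llvm.stackrestore(i8* %sp)
  br i1 %c2, label %loop2, label %exit
```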

I think we can write some non-verifier rules that prohibit this kind of code pattern. Optimizations should already obey these rules since they don’t reorder things that modify inaccessible state. Things like:

  • It is UB to stacksave ; call.setup ; stackrestore ; call
  • It is UB to call.setup 1 ; call.setup 2 ; call 1 ; call 2
  • It is UB to call.setup ; alloca ; call
  • etc.

It would be nice to make the rules strong enough to ensure we can statically compute the size of the stack frame at any point (assuming no dynamic allocas). Code generated by clang would be statically well-nested, I think; not sure how hard it would be to ensure optimizations maintain that invariant.

I agree, that is definitely a goal of the redesign.

Connecting nested llvm.call.setups using tokens might make it easier for passes to reason about the nesting, since the region nest would be explicitly encoded.

I agree, that could be useful, it would replicate what we did for exception handling.

VLAs could use something like this, but they are generally of unknown size while call sites have a known fixed size. I think that makes them pretty different.

I don’t think we need to implement it at the same time, but the systems would interact, so it might be worth planning out.

I do recall that MSVC does some pretty squirrelly stuff when you insert alloca calls into a call sequence. In this example, arguments are evaluated in the order 2, 3, 1:

struct Foo {
  Foo();
  Foo(const Foo &o);
  ~Foo();
  int x;
};
void receive_ptr(Foo, void *, Foo);
extern "C" void *_alloca(size_t n);
extern int gv;
void f(int n) { receive_ptr(Foo(), _alloca((++gv, n)), Foo()); }

The standard says order of argument evaluation is unspecified, so this ordering is valid. However, I wouldn’t want to implement the same behavior in clang. We should probably implement some kind of warning, though. Compiled with clang, this code has undefined behavior.

Is there really any other option for us here, other than to document that using VLAs, alloca, and call.setup in the wrong order will result in UB? We can precisely nail down what the “right order” is, but we basically can’t make arbitrary stack adjustment orderings work.

Reply inline.

This would specifically be for cases where we try to rewrite the signature? I would assume we should forbid rewriting the signature of a call with an operand bundle. And once some optimization drops the bundle and preallocated marking, to allow such rewriting, the signature doesn’t need to match anymore.

Yes, I really would like to enable DAE and other signature rewriting IPO transforms. Maybe today DAE doesn’t run on calls with bundles, but this feature is designed to allow the non-preallocated arguments to be removed or expanded into multiple arguments without disturbing the preallocated argument numbering.

Ultimately, I think it’s worth some effort here to try to avoid blocking optimizations like jump threading. That said, if you want to make the calls “noduplicate”, it isn’t that terrible of an alternative.

Does “noduplicate” block inlining, though? Maybe we really want “convergent”? In any case, I’m OK with powering down jump threading to pick up IPO, which is incompatible with today’s inalloca.

Good. It might be a good idea to try to write out an algorithm for this to ensure this works out. In particular, I’m concerned about cases where two predecessors of a basic block appear to have a different stack size (an if-then-else, or a loop backedge). We need to make sure such cases are either invalid, or UB on entry to the block.

I spent a little time thinking, and I’m not sure what rules we need to make this work out. For example, should we forbid tail-merging multiple calls to abort()? If we should, how would we write a rule which restricts that?

This is actually a big open problem, and it came up again in the SEH discussion. It seems to be my fate to struggle against the LLVM IR design decision to not have scopes.

Without introducing new IR constructs, we could define a list of instructions that set the current region. We could write an algorithm for assigning regions to each block. The region is implicit in the IR. These are the things that could create regions:

  • call.preallocated.setup
  • catchpad
  • cleanuppad
  • lifetime.start? unclear

Passes are required to ensure that each BB belongs to exactly one region. Each region belongs to its parent, and ending a region returns to the parent of the ending region.

I don’t think this idea is ready to be added to LangRef, but it is a good future direction, perhaps with new supporting IR constructs.

I think for now we have to live with the possibility that the analysis which assigns SP adjustment levels to MBBs may fail to find a unique SP level, in which case we must use a frame pointer.

OTOH, we can easily establish the invariant at the MIR level. We should always be able to assign each MBB a unique most recently active call site and an SP adjustment level. We can easily teach BranchFolding to preserve this invariant. We already do it for funclets.

I’m not really concerned with funny usage of calls to alloca() in call arguments, or anything like that. I’m happy to pick whatever rule is easiest for us. I’m more concerned with ensuring nothing blows up if we inline a call to a function that contains a VLA, or something like that.

Sounds good. Inlining dynamic allocas and VLAs should already just work. The inliner places stacksave/stackrestore calls around the original call site, if dynamic allocas were present in the inlined code.
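For reference, the IR the inliner produces in that case looks roughly like this (sketch; @has_vla is a hypothetical callee containing a dynamic alloca):

```llvm
; Inlining 'call void @has_vla(i32 %n)': the inliner brackets the
; inlined body so the stack allocated by the VLA is released at the
; original call boundary.
%sp = call i8* @llvm.stacksave()
; ... inlined body of @has_vla, including 'alloca i8, i32 %n' ...
call void @llvm.stackrestore(i8* %sp)
```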

This would specifically be for cases where we try to rewrite the signature? I would assume we should forbid rewriting the signature of a call with an operand bundle. And once some optimization drops the bundle and preallocated marking, to allow such rewriting, the signature doesn’t need to match anymore.

Yes, I really would like to enable DAE and other signature rewriting IPO transforms. Maybe today DAE doesn't run on calls with bundles, but this feature is designed to allow the non-preallocated arguments to be removed or expanded into multiple arguments without disturbing the preallocated argument numbering.

This is sort of beside the point. I think we get the best of everything if we just say the bundle and preallocated attribute have to be dropped before any other IPO transforms. (If we control the function and all the callers, this transform should be relatively simple: we just remove the attribute and bundle from the function and all the callers, and insert an llvm.call.teardown after each call where we dropped a bundle.)
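That bundle-dropping transform might look like this at a call site (a sketch using the intrinsic names from the linked review):

```llvm
; Before: a preallocated call.
%cs = call token @llvm.call.preallocated.setup(i32 1)
%p = call i8* @llvm.call.preallocated.arg(token %cs, i32 0) preallocated(%struct.Foo)
call void @f(%struct.Foo* preallocated(%struct.Foo) %p) [ "preallocated"(token %cs) ]

; After: attribute and bundle removed, so signature-rewriting IPO can
; treat @f normally. An explicit teardown now frees the allocation that
; the call used to free implicitly.
call void @f(%struct.Foo* %p)
call void @llvm.call.preallocated.teardown(token %cs)
```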

Even if we can potentially rewrite signatures in some cases without dropping the preallocated attribute, there isn’t really any incentive to do that, as far as I know.
Good. It might be a good idea to try to write out an algorithm for this to ensure this works out. In particular, I’m concerned about cases where two predecessors of a basic block appear to have a different stack size (an if-then-else, or a loop backedge). We need to make sure such cases are either invalid, or UB on entry to the block.

I spent a little time thinking, and I’m not sure what rules we need to make this work out. For example, should we forbid tail-merging multiple calls to abort()? If we should, how would we write a rule which restricts that?

This is actually a big open problem, and it came up again in the SEH discussion. It seems to be my fate to struggle against the LLVM IR design decision to not have scopes.

Without introducing new IR constructs, we could define a list of instructions that set the current region. We could write an algorithm for assigning regions to each block. The region is implicit in the IR. These are the things that could create regions:
- call.preallocated.setup
- catchpad
- cleanuppad
- lifetime.start? unclear

Add stacksave/stackrestore to that list.

I don’t think there’s any reason to make lifetime intrinsics properly nest; nothing really cares, as far as I know. (The current lifetime intrinsics have problems, but I don’t think that’s one of them.)

Passes are required to ensure that each BB belongs to exactly one region. Each region belongs to its parent, and ending a region returns to the parent of the ending region.

I don't think this idea is ready to be added to LangRef, but it is a good future direction, perhaps with new supporting IR constructs.

I think for now we have to live with the possibility that the analysis which assigns SP adjustment levels to MBBs may fail to find a unique SP level, in which case we must use a frame pointer.

Okay. Can we at least list out the cases where we expect this might happen?

OTOH, we can easily establish the invariant at the MIR level. We should always be able to assign each MBB a unique most recently active call site and an SP adjustment level. We can easily teach BranchFolding to preserve this invariant. We already do it for funclets.

I’m not really concerned with funny usage of calls to alloca() in call arguments, or anything like that. I’m happy to pick whatever rule is easiest for us. I’m more concerned with ensuring nothing blows up if we inline a call to a function that contains a VLA, or something like that.

Sounds good. Inlining dynamic allocas and VLAs should already just work. The inliner places stacksave/stackrestore calls around the original call site, if dynamic allocas were present in the inlined code.

So stacksave/stackrestore regions properly nest with call.preallocated.setup regions? Sure, that makes sense.

-Eli

Bringing this back up for discussion on handling exceptions.

According to the inalloca design, there should be a stackrestore after an invoke in both the non-exceptional and exceptional case (that doesn’t seem to be happening in some cases as we’ve seen, but that’s beside the point).

Does it make sense to model a preallocated call as handling the cleanup of the stack in normal control flow (e.g. always for a normal call, and in the non-exceptional path for an invoke)? Then @llvm.call.preallocated.teardown is only necessary in the exceptional path to cleanup the stack.

It makes sense to say that the instruction with the preallocated bundle frees the allocation, I think, on both the normal and exceptional codepaths of an invoke. inalloca has to use a stackrestore to match the stacksave, but there isn’t any reason to match that here. And there isn’t any reason to distinguish between the normal and unwind paths of the invoke.

The llvm.call.preallocated.teardown, then, is only necessary if we manage to throw an exception before we reach the call with the preallocated bundle. (For example, if the constructor for a function argument throws an exception.)

The one part that’s a little tricky is I’m not sure how llvm.call.preallocated.teardown is supposed to work on the exceptional path. If you try to take stack memory allocated outside a funclet, and free it inside a funclet, we can’t actually free the memory at the point of the call, so there isn’t any simple lowering. And in general, the llvm.call.preallocated.setup doesn’t dominate the point where we exit the funclet. So I’m not sure what the best way to represent that is. I guess we could stick the teardown call in the funclet, say we “free” the memory at that point, and let the backend do whatever magic is necessary to ensure we actually adjust the stack pointer correctly when the funclet returns.
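Spelled out in IR, the tricky case might look like this sketch (intrinsic names from the linked review; @Foo_ctor and the funclet shape are invented for illustration):

```llvm
%cs = call token @llvm.call.preallocated.setup(i32 1)
%p = call i8* @llvm.call.preallocated.arg(token %cs, i32 0) preallocated(%struct.Foo)
; The argument constructor may throw before the preallocated call is reached.
invoke void @Foo_ctor(%struct.Foo* %p)
    to label %cont unwind label %cleanup

cont:
  call void @f(%struct.Foo* preallocated(%struct.Foo) %p) [ "preallocated"(token %cs) ]
  ret void

cleanup:
  ; The teardown "frees" the setup from inside the funclet, even though
  ; the SP adjustment can only actually happen when the funclet returns.
  %cp = cleanuppad within none []
  call void @llvm.call.preallocated.teardown(token %cs)
  cleanupret from %cp unwind to caller
```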

-Eli

It makes sense to say that the instruction with the preallocated bundle frees the allocation, I think, on both the normal and exceptional codepaths of an invoke. inalloca has to use a stackrestore to match the stacksave, but there isn’t any reason to match that here. And there isn’t any reason to distinguish between the normal and unwind paths of the invoke.

Ah, I see: the stack cleanup would be implicit after the preallocated call, on both the normal and exceptional paths. That makes sense. I assume that’s what invoke + byval must do.

The llvm.call.preallocated.teardown, then, is only necessary if we manage to throw an exception before we reach the call with the preallocated bundle. (For example, if the constructor for a function argument throws an exception.)

Does it make sense to say that in order to avoid stack leaks, exactly one of either the preallocated call or the teardown must be called? And if both are called it’s UB?

The one part that’s a little tricky is I’m not sure how llvm.call.preallocated.teardown is supposed to work on the exceptional path. If you try to take stack memory allocated outside a funclet, and free it inside a funclet, we can’t actually free the memory at the point of the call, so there isn’t any simple lowering.

I assume you mean the exceptional path of e.g. a previous constructor, not the preallocated call. I don’t understand how it would be any different from a typical stackrestore of an inalloca in a funclet.

The llvm.call.preallocated.teardown, then, is only necessary if we manage to throw an exception before we reach the call with the preallocated bundle. (For example, if the constructor for a function argument throws an exception.)
Does it make sense to say that in order to avoid stack leaks, exactly one of either the preallocated call or the teardown must be called? And if both are called it's UB?

Yes, exactly.

The one part that’s a little tricky is I’m not sure how llvm.call.preallocated.teardown is supposed to work on the exceptional path. If you try to take stack memory allocated outside a funclet, and free it inside a funclet, we can’t actually free the memory at the point of the call, so there isn’t any simple lowering.
I assume you mean the exceptional path of e.g. a previous constructor, not the preallocated call. I don't understand how it would be any different from a typical stackrestore of an inalloca in a funclet.

I don’t think inalloca actually works correctly here. So comparing it isn’t really helpful.

-Eli