Stackmaps: caller-save-registers passed as deopt args

This is a follow-up on a conversation some of us had at the hacker lab. I noticed that I will sometimes be told that a deopt value lives in a register that is not callee-save (caller-save, I suppose, though is there another term for this that is less easily confused with callee-save?). This surprised me quite a bit, since those registers get clobbered immediately by the call inside the patchpoint, so we were discussing whether this is the appropriate behavior or not.

[Another point about terminology: the stackmaps documentation refers to the non-function-call arguments as "live values", but we were calling them "deopt values" in person. I don't particularly like either term, since we also have "live outs", which deal with a separate problem, and for Pyston we pass some non-deopt values in this section. It might be worth nailing down a term for this; personally I call them "stackmap arguments" and the initial arguments "function arguments".]

In the cases I'm running into, the value lives both in a register and on the stack, and the stackmap code seems to report the register preferentially, even when it is a caller-save register. In our discussion, another possible situation came up: the value might live only in a caller-save register, if it is not used after that patchpoint. So the questions are: should we prefer a stack slot over caller-save registers, and should we prefer it enough to emit a spill if necessary? I think the answer to both should be yes. I think LLVM should emit the spill because otherwise the client's patchpoint-initialization logic will have to do it; that is workable, but it means one has to reserve nops for spilling in every patchpoint, whereas LLVM can emit the spill code only when necessary.
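For concreteness, here is the shape of patchpoint I mean, using the documented intrinsic signature (the id, byte count, and value names are made up for illustration):

  call void (i64, i32, i8*, i32, ...)
      @llvm.experimental.patchpoint.void(i64 77, i32 15, i8* %target,
                                         i32 2, i64 %a, i64 %b,
                                         i64 %x, i64 %y)

Here numArgs is 2, so %a and %b are function arguments lowered per the calling convention and passed to %target, while %x and %y are the stackmap/deopt arguments that only get recorded in the stackmap. The problem is that the stackmap record for %x can name a caller-save register, which the call inside the patchpoint immediately clobbers.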

There was some discussion about what the behavior should be if you use anyregcc, since in that case it seems reasonable to receive a caller-save register if you are interested in using it inside the patchpoint. My thoughts are that in that case you might be better off including the value as part of the function arguments instead of the stackmap/deopt arguments. I think Philip had some more thoughts about how the calling convention of the patchpoint should apply to the deopt args.

So concretely, I think we should change the lowering so that (anyregcc aside) the deopt args always get passed on the stack or in callee-save registers, even if that requires adding spills. I think some of the other people in the discussion had a better idea than I did about how to implement that.

kmod

Hi Kevin,

Thank you for starting this discussion!

I think the distinction is really between whether the live values are
"live on call" or "live on return". Ideally we should be able to mark
values to be of either kind (liveoncall vs. liveonreturn) in the call
to the patchpoint intrinsic. This will make the semantics of
patchpoint slightly odd though -- we'll end up with an SSA instruction
where some inputs are expected to be live *after* the instruction has
been retired. Are there other SSA instructions that have this sort of
semantics? Nevertheless, I think splitting the inputs to a patchpoint
into "function arguments", "values live on call" and "values live on
return" is a good idea.

There was some discussion about what the behavior should be if you use
anyregcc, since in that case it seems reasonable to receive a
caller-save register if you are interested in using it inside the
patchpoint. My thoughts are that in that case you might be better off
including the value as part of the function arguments instead of the
stackmap/deopt arguments.

I don't see why anything needs to be different for patchpoints running
with anyregcc. The call arguments always get lowered based on the
calling convention, the "live on call" values get arbitrary stack
slots / registers and the "live on return" values get assigned to CSRs
or stack slots. If I've read X86RegisterInfo::getCalleeSavedRegs
correctly, I think all registers are callee-saved for anyregcc so
whatever you call using the anyregcc convention will have to respect
that; but that's a different problem.

Thanks!
-- Sanjoy

Hi Kevin,

Thank you for starting this discussion!

Yes, sorry for being unresponsive for a few days. Sanjoy summarized the issues perfectly.

I think the distinction is really between whether the live values are
"live on call" or "live on return". Ideally we should be able to mark
values to be of either kind (liveoncall vs. liveonreturn) in the call
to the patchpoint intrinsic. This will make the semantics of
patchpoint slightly odd though -- we'll end up with an SSA instruction
where some inputs are expected to be live *after* the instruction has
been retired. Are there other SSA instructions that have this sort of
semantics?

No, it's a lowering issue. Machine instructions can declare that some registers are early-clobbered so that other machine operands can't use those registers. In this case, though, it's best just to think of it as a kind of calling convention.

Nevertheless, I think splitting the inputs to a patchpoint
into "function arguments", "values live on call" and "values live on
return" is a good idea.

I basically agree, but think we should have a new intrinsic that is even more general. We should be able to tag any number of arguments as following some convention/lowering rules. Those rules should not be baked into the intrinsic.

What Kevin is asking for is very reasonable and arguably a safer default, but we need the current functionality as well.
As a short term fix, until a new intrinsic is ready, Kevin can add a string attribute to the patchpoint call, say “live-on-return”.

Some work may be needed to support this. One possibility is to add a machine operand flag that is essentially the inverse of EarlyClobber, say LateUse; that's hard. An easier option would be to generate a second pseudo instruction immediately after the patchpoint that simply uses all the live-on-return values. A cop-out would be to give the call's register mask operand a different flag so that it is interpreted as a set of early-clobber registers. This last solution would not allow both live-on-return and live-on-call operands in the same call. Hopefully there's something more clever that I haven't thought of yet.

There was some discussion about what the behavior should be if you use
anyregcc, since in that case it seems reasonable to receive a
caller-save register if you are interested in using it inside the
patchpoint. My thoughts are that in that case you might be better off
including the value as part of the function arguments instead of the
stackmap/deopt arguments.

I don't see why anything needs to be different for patchpoints running
with anyregcc. The call arguments always get lowered based on the
calling convention, the "live on call" values get arbitrary stack
slots / registers and the "live on return" values get assigned to CSRs
or stack slots. If I've read X86RegisterInfo::getCalleeSavedRegs
correctly, I think all registers are callee-saved for anyregcc so
whatever you call using the anyregcc convention will have to respect
that; but that's a different problem.

Yep.

-Andy

An easier option would be to generate a second
pseudo instruction immediately after the patchpoint that simply uses
all the live-on-return values.

Do you mean some variant of

PATCHPOINT ... (live_values = %vreg0, %vreg1)
FAKE-USE %vreg1 ;; %vreg1 needs to be live on return

For this kind of a solution, we'll have to prevent the register
allocator (and other lowering components, I'm not sure) playing tricks
with inserting fills and remats. IOW, the above code should not
compile to

mov 16(%rsp), %r11
mov 24(%rsp), %r10
inc %r10

PATCHPOINT ... (live_values = %r11, %r10)
mov 24(%rsp), %r10
inc %r10
FAKE-USE %r10

Naively, both the register assignments to %vreg1, in PATCHPOINT and
FAKE-USE, are caller-saved and not live on return. It may be possible
to reverse-engineer the computation that will compute the value of
%r10, but that's non-trivial.

One simple solution is to force spilling of live-on-return values at
the SelectionDAG layer (we do this for statepoints currently); but
that prevents us from using callee-saved registers and may be
suboptimal.

A completely orthogonal idea: it should also be possible to do
deopt-on-return only using live-on-call values if you're willing to
use invokes instead of calls. Instead of

  call void @patchpoint(@callee_that_may_deopt, args, live_values)

emit

  invoke @callee_that_may_deopt(args)

  deopt_pad:
    ... landingpad ...
    call void @patchpoint(@deoptimize_me, ... live_values ...)

... live_values ... in deopt_pad now only needs to be live-on-call. It
can actually even be function arguments to @deoptimize_me -- the
register / stack slot shuffling due to a fixed calling convention will
no longer happen on the (presumably hot) call, but only when
deoptimizing. You'd have to deoptimize by "throwing" an exception in
this case.

-- Sanjoy

An easier option would be to generate a second
pseudo instruction immediately after the patchpoint that simply uses
all the live-on-return values.

Do you mean some variant of

PATCHPOINT ... (live_values = %vreg0, %vreg1)
FAKE-USE %vreg1 ;; %vreg1 needs to be live on return

For this kind of a solution, we'll have to prevent the register
allocator (and other lowering components, I'm not sure) playing tricks
with inserting fills and remats. IOW, the above code should not
compile to

mov 16(%rsp), %r11
mov 24(%rsp), %r10
inc %r10

PATCHPOINT ... (live_values = %r11, %r10)
mov 24(%rsp), %r10
inc %r10
FAKE-USE %r10

Naively, both the register assignments to %vreg1, in PATCHPOINT and
FAKE-USE, are caller-saved and not live on return. It may be possible
to reverse-engineer the computation that will compute the value of
%r10, but that's non-trivial.

I was thinking FAKE_USE would need special handling. It would be bundled before and after register allocation, but unbundled during register allocation so that the live intervals of the live values would span the call instruction. The problem is that the register allocator itself is free to do something like you've shown above, so it's not a very robust solution and I don't particularly like it. I thought it would be easier than adding a "LateUse" MachineOperand flag, but now I'm not so sure. Any ideas, Quentin?

One simple solution is to force spilling of live-on-return values at
the SelectionDAG layer (we do this for statepoints currently); but
that prevents us from using callee-saved registers and may be
suboptimal.

If you need to do that then the design is broken. But it is an option for Kevin to get things working.

A completely orthogonal idea: it should also be possible to do
deopt-on-return only using live-on-call values if you're willing to
use invokes instead of calls. Instead of

call void @patchpoint(@callee_that_may_deopt, args, live_values)

emit

invoke @callee_that_may_deopt(args)

deopt_pad:
   ... landingpad ...
   call void @patchpoint(@deoptimize_me, ... live_values ...)

... live_values ... in deopt_pad now only needs to be live-on-call. It
can actually even be function arguments to @deoptimize_me -- the
register / stack slot shuffling due to a fixed calling convention will
no longer happen on the (presumably hot) call, but only when
deoptimizing. You'd have to deoptimize by "throwing" an exception in
this case.

That’s an elegant approach to avoiding the backend problems. The IR representation makes perfect sense. I think your deopt mechanism would need to patch the return address as if an exception were thrown (I guess that’s what you mean by “throwing an exception”). Also, you will end up with exception tables that need to be parsed in addition to stackmaps, which seems fairly horrible. Agree?

I assume that statepoints have the same issue with both deopt values and GC roots once you stop spilling them. Do you have a preferred or tentative solution for it?

-Andy

That's an elegant approach to avoiding the backend problems. The IR
representation makes perfect sense. I think your deopt mechanism would
need to patch the return address as if an exception were thrown (I
guess that’s what you mean by “throwing an exception”).

Yes.

Also, you will
end up with exception tables that need to be parsed in addition to
stackmaps, which seems fairly horrible. Agree?

Agreed. We have to figure out how to parse exception tables anyway if
we wish to do zero-cost exception handling using something like the
C++ ABI.

The interesting thing is modeling deoptimization as an "exceptional
situation" may have positive performance implications. For example,
we can teach the optimizer to aggressively sink computation into these
"deoptimization landingpads", so values needed only in case of
deoptimization don't get computed on the normal path.
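For instance (a sketch in the same pseudo-IR as the example above; %sum and the add are made up for illustration):

  invoke @callee_that_may_deopt(args)

  deopt_pad:
    ... landingpad ...
    %sum = add i64 %a, %b  ;; needed only when deoptimizing, so it can
                           ;; be sunk here, off the normal path
    call void @patchpoint(@deoptimize_me, ... %sum ...)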

I assume that statepoints have the same issue with both deopt values
and GC roots once you stop spilling them. Do you have a preferred or
tentative solution for it?

Yes, statepoints are sound w.r.t deoptimization state only if we can
convince the backend to put deopt state in preserved locations (callee
saved registers, stack slots etc.).

However, we may prefer the "deoptimizing by exception throw" approach
if that helps performance. We haven't yet looked into this issue in
depth, but it seems like an interesting approach.

-- Sanjoy

Hi,

Sorry for the late reply, I missed this thread.

An easier option would be to generate a second
pseudo instruction immediately after the patchpoint that simply uses
all the live-on-return values.

Do you mean some variant of

PATCHPOINT … (live_values = %vreg0, %vreg1)
FAKE-USE %vreg1 ;; %vreg1 needs to be live on return

For this kind of a solution, we’ll have to prevent the register
allocator (and other lowering components, I’m not sure) playing tricks
with inserting fills and remats. IOW, the above code should not
compile to

mov 16(%rsp), %r11
mov 24(%rsp), %r10
inc %r10

PATCHPOINT … (live_values = %r11, %r10)
mov 24(%rsp), %r10
inc %r10
FAKE-USE %r10

Naively, both the register assignments to %vreg1, in PATCHPOINT and
FAKE-USE, are caller-saved and not live on return. It may be possible
to reverse-engineer the computation that will compute the value of
%r10, but that’s non-trivial.

I was thinking FAKE_USE would need special handling. It would be bundled before and after register allocation, but unbundled during register allocation so that the live intervals of the live values would span the call instruction. The problem is that the register allocator itself is free to do something like you've shown above, so it's not a very robust solution and I don't particularly like it. I thought it would be easier than adding a "LateUse" MachineOperand flag, but now I'm not so sure. Any ideas, Quentin?

We talked with Andy about that.
The late-use flag would work (though I do not know how involved that would be).

Another approach is to extend the current register class constraint mechanism to match the dynamic constraints of your ABI convention.
Right now, we can specify the register class of each operand of an instruction. The idea would be to do the same with the operands of the patchpoint/stackmap intrinsics.
The tricky part is that, unlike for regular instructions, those register classes are dynamically determined.

This approach is more general, i.e., we would be able to restrict the allocatable registers of every machine operand, but it is likely more involved.

Here is how it would work on an example:

Right now, we have something like this (where <> represents the register class):

vreg1 =
stackmap vreg1

The proposed approach would change that into:

vreg1 =
vreg2 = vreg1
stackmap vreg2

Then the register coalescer/register allocator would do all the work of propagating the constraints if at all possible, splitting, etc.
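Spelled out with made-up register classes (GR64 being X86's 64-bit general-purpose class, and GR64_CSR a hypothetical class containing only its callee-saved subset), the rewrite would look like:

  vreg1<GR64> = ...
  vreg2<GR64_CSR> = COPY vreg1
  stackmap vreg2<GR64_CSR>

The COPY gives the coalescer a chance to pick a register for vreg1 that satisfies both constraints and fold the copy away; failing that, the allocator materializes the copy (or a spill) just before the stackmap.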

Cheers,
-Quentin