Representing a safepoint as an instruction in the x86 backend?

I've got a pseudo instruction with some tricky semantics I need help figuring out how to encode properly. For those interested, this is to support fully relocating garbage collection. I'm going to try to express the requirements clearly so that you don't need to understand the use case in detail.

My end goal is to capture a list of registers and/or stack offsets for a list of virtual registers (known explicitly inside SelectionDAGBuilder) at the PC following a call instruction. In particular, I need to be able to update all physical copies of these virtual registers. I've decided to approach this by introducing a new psuedo instruction with these semantics, but if anyone has an alternate approach they'd recommend, I'm open to that too.

Here's the semantics of my 'instruction':
- It must immediately follow a call instruction, before any return value copies, or frame manipulation. Not every call has a following safepoint.
- It has a variable number of arguments (virtual registers). All can be both read and written.
- It can handle any combination of stack locations and registers. Ideally, it should not effect register allocation.

The approach I've taken to date is based in part on the work done for PATCHPOINT. Here's what I've done:
- Introduced a SAFEPOINT psuedo instruction
- Reverse engineered the CALLSEQ_* series of nodes to insert my node after the CALL node in both glue and chain sequences. (BTW, is there any documentation on the call sequence? I think I've reverse engineered it correctly, but I'm not completely sure.)
- Introduced folding logic in foldMemoryOperand (analogous to PATCHPOINT, but which marks both load and store) -- this is where my problem currently lies
- Inserted code during MCInstLower to record the statepoint

The problem with this is that a reload from a stack slot will sometimes be inserted between the CALL and the SAFEPOINT. This is problematic since we are no longer recording the list of locations at the site of the call itself. If the recorded information is used during the lifetime of the subroutine call, the wrong locations would be updated. That would be "bad".

The reason for this is that the folding logic only applies if there's a single use of the physical register. If there's more than one use, it's assumed to be cheaper to reload than to perform two folded operations against memory. (I don't know if this is true always, but more importantly for me, it breaks my intended semantics.)

Does anyone know of a way to avoid the fold step to begin with? I'd really like the register allocation to not give preference to register uses for this instruction. If a virtual register is already in the stack, it shouldn't attempt to reload before this instruction. I haven't been able to find the appropriate hook for this.

I can go ahead and hack the folding code to unconditionally fold into SAFEPOINTs and move the load after the SAFEPOINT, but that feels like an utter hack. Before going down the road, does anyone have a better suggestion?

I'm very open to suggestions here. If I'm taking the wrong approach or something sounds like it doesn't work the way I've described, please point it out. I will freely admit this is my first serious endeavour into the x86 backend and that I'm learning as I go.

Philip

Note: For the moment, this is all x86 specific. Most of it could be made architecture independent without too much effort.

Hi Philip,

wouldn’t it be easier to adapt the patchpoint intrinsic to allow it to take also a function address/global address as target argument. We are already tapping into the target-specific call lowering, so the call would be lowered exactly the same way. By doing this you don’t have to worry about anything getting inserted between your CALL and SAFEPOINT instruction. I know the GLUE flag is supposed to keep things together in the call sequence, but I think it is very fragile to rely on this.

The reason for this is that the folding logic only applies if there's a single use of the physical register. If there's more than one use, it's assumed to be cheaper to reload than to perform two folded operations against memory. (I don't know if this is true always, but more importantly for me, it breaks my intended semantics.)

Yes, that is true and the code would break if you remove that constraint, because it would delete the instruction without checking if there are still any uses remaining - although this should be easy to fix.

Does anyone know of a way to avoid the fold step to begin with? I'd really like the register allocation to not give preference to register uses for this instruction. If a virtual register is already in the stack, it shouldn't attempt to reload before this instruction. I haven't been able to find the appropriate hook for this.

I thought Andy did something about this for patchpoints and stackmaps. Didn't we handled that in the target-specific frame index elimination?

-Juergen

GLUE is only for keeping things together during ISEL because SelectionDAG does not represent dependencies on named resources like physical registers.

Juergen is right about using patchpoint. Part of the reason that we have two intrinsics, stackmap and patchpoint[1], is that patchpoint allows you to embed a call instruction. A pseudo instruction is the only way to “glue” operations through the entire code generator (let’s ignore MachineBundles for now).

We always meant to have patchpoint take a function address. I think there are just a few FIXMEs to take care of to make it work.

-Andy

[1] I would really prefer to have only one patchpoint intrinsic that can be used for stack maps too, but the semantics for reserving encoding space don’t quite match up.

I've got a pseudo instruction with some tricky semantics I need help figuring out how to encode properly. For those interested, this is to support fully relocating garbage collection. I'm going to try to express the requirements clearly so that you don't need to understand the use case in detail.

My end goal is to capture a list of registers and/or stack offsets for a list of virtual registers (known explicitly inside SelectionDAGBuilder) at the PC following a call instruction. In particular, I need to be able to update all physical copies of these virtual registers. I've decided to approach this by introducing a new psuedo instruction with these semantics, but if anyone has an alternate approach they'd recommend, I'm open to that too.

Here's the semantics of my 'instruction':
- It must immediately follow a call instruction, before any return value copies, or frame manipulation. Not every call has a following safepoint.
- It has a variable number of arguments (virtual registers). All can be both read and written.
- It can handle any combination of stack locations and registers. Ideally, it should not effect register allocation.

The approach I've taken to date is based in part on the work done for PATCHPOINT. Here's what I've done:
- Introduced a SAFEPOINT psuedo instruction
- Reverse engineered the CALLSEQ_* series of nodes to insert my node after the CALL node in both glue and chain sequences. (BTW, is there any documentation on the call sequence? I think I've reverse engineered it correctly, but I'm not completely sure.)
- Introduced folding logic in foldMemoryOperand (analogous to PATCHPOINT, but which marks both load and store) -- this is where my problem currently lies
- Inserted code during MCInstLower to record the statepoint

The problem with this is that a reload from a stack slot will sometimes be inserted between the CALL and the SAFEPOINT. This is problematic since we are no longer recording the list of locations at the site of the call itself. If the recorded information is used during the lifetime of the subroutine call, the wrong locations would be updated. That would be "bad".

The reason for this is that the folding logic only applies if there's a single use of the physical register. If there's more than one use, it's assumed to be cheaper to reload than to perform two folded operations against memory. (I don't know if this is true always, but more importantly for me, it breaks my intended semantics.)

Does anyone know of a way to avoid the fold step to begin with? I'd really like the register allocation to not give preference to register uses for this instruction. If a virtual register is already in the stack, it shouldn't attempt to reload before this instruction. I haven't been able to find the appropriate hook for this.

I can go ahead and hack the folding code to unconditionally fold into SAFEPOINTs and move the load after the SAFEPOINT, but that feels like an utter hack. Before going down the road, does anyone have a better suggestion?

Anything we do to avoid the reload is purely an optimization. There is no way to guarantee it. I think you really need your pseudo instruction to include the call, which is exactly what patchpoint was designed for.

That said, it would be nice to be more aggressive in folding reloads into stackmaps/patchpoints. It sounds like you are already making use of the foldPatchPoint logic that I added to avoid the reloads. We just need some higher level logic to handle more cases. I’m not sure how hard it will be. See InlineSpiller::foldMemoryOperand.

-Andy

Hi Philip,

wouldn’t it be easier to adapt the patchpoint intrinsic to allow it to take also a function address/global address as target argument. We are already tapping into the target-specific call lowering, so the call would be lowered exactly the same way. By doing this you don’t have to worry about anything getting inserted between your CALL and SAFEPOINT instruction. I know the GLUE flag is supposed to keep things together in the call sequence, but I think it is very fragile to rely on this.

Let me rephrase your suggestion slightly to make sure I understood it. You're suggesting that I replace the CALL node with a PATCHPOINT node which includes both the actual call arguments and the stack map arguments. This could be done by extending the existing IR level patchpoint intrinsic, but could also be done at the SelectionDAGBuilder level. This pseudo instruction would be lowered to the actual call in MCInstLower.

Thinking about it, I think this would work. Originally, I'd thought there was an issue with distinguishing true arguments vs stack map arguments, but it looks like the PATCHPOINT code in recordPatchPoint already knows how to do that. Was that added later? I don't remember seeing this.

The reason for this is that the folding logic only applies if there's a single use of the physical register. If there's more than one use, it's assumed to be cheaper to reload than to perform two folded operations against memory. (I don't know if this is true always, but more importantly for me, it breaks my intended semantics.)

Yes, that is true and the code would break if you remove that constraint, because it would delete the instruction without checking if there are still any uses remaining - although this should be easy to fix.

Agreed. On both parts.

Does anyone know of a way to avoid the fold step to begin with? I'd really like the register allocation to not give preference to register uses for this instruction. If a virtual register is already in the stack, it shouldn't attempt to reload before this instruction. I haven't been able to find the appropriate hook for this.

I thought Andy did something about this for patchpoints and stackmaps. Didn't we handled that in the target-specific frame index elimination?

It's incomplete. Or at least it was when used with the scheme I originally described. I'll have to see how it fits after moving to the CALL replacement scheme.

Philip

Does anyone know of a way to avoid the fold step to begin with? I'd really like the register allocation to not give preference to register uses for this instruction. If a virtual register is already in the stack, it shouldn't attempt to reload before this instruction. I haven't been able to find the appropriate hook for this.

I can go ahead and hack the folding code to unconditionally fold into SAFEPOINTs and move the load after the SAFEPOINT, but that feels like an utter hack. Before going down the road, does anyone have a better suggestion?

Anything we do to avoid the reload is purely an optimization. There is no way to guarantee it. I think you really need your pseudo instruction to include the call, which is exactly what patchpoint was designed for.

I'm probably going to move in that direction, but even with the patchpoint implementation this isn't strictly true.

If my stack map has more arguments than physical registers, it's mandatory that the patchpoint not require a load. Since writing my original email, I think I've figured out how to address this in the register allocator. I'll try to clean this up and send a patch.

Consider the following pseudo code:
(ptr1, ..., ptr200) = initialize_pointers();
call some_function() // safepoint needed here
use(ptr1, ..., ptr200) //more than one use to prevent folding

I'm not entirely sure the above applies to the patchpoint implementation. I know it definitely applied to my previous approach. I'll try to create a test case using only patchpoint and share it.

That said, it would be nice to be more aggressive in folding reloads into stackmaps/patchpoints. It sounds like you are already making use of the foldPatchPoint logic that I added to avoid the reloads. We just need some higher level logic to handle more cases. I’m not sure how hard it will be. See InlineSpiller::foldMemoryOperand.

If you have suggestions for specific missing cases, let me know.

p.s. I'll need to extend PATCHPOINT semantics in at least minor ways. Every argument is both readable and *writeable*. This is different than the existing intrinsics.

Philip

Does anyone know of a way to avoid the fold step to begin with? I'd really like the register allocation to not give preference to register uses for this instruction. If a virtual register is already in the stack, it shouldn't attempt to reload before this instruction. I haven't been able to find the appropriate hook for this.

I can go ahead and hack the folding code to unconditionally fold into SAFEPOINTs and move the load after the SAFEPOINT, but that feels like an utter hack. Before going down the road, does anyone have a better suggestion?

Anything we do to avoid the reload is purely an optimization. There is no way to guarantee it. I think you really need your pseudo instruction to include the call, which is exactly what patchpoint was designed for.

I'm probably going to move in that direction, but even with the patchpoint implementation this isn't strictly true.

If my stack map has more arguments than physical registers, it's mandatory that the patchpoint not require a load. Since writing my original email, I think I've figured out how to address this in the register allocator. I'll try to clean this up and send a patch.

Consider the following pseudo code:
(ptr1, ..., ptr200) = initialize_pointers();
call some_function() // safepoint needed here
use(ptr1, ..., ptr200) //more than one use to prevent folding

I'm not entirely sure the above applies to the patchpoint implementation. I know it definitely applied to my previous approach. I'll try to create a test case using only patchpoint and share it.

OK. That’s what foldPatchpoint was supposed to handle. We can handle stackmap/patchpoint with more than NumPhysRegs live values. Note that the call arguments are handled differently and must follow the calling convention that you choose—these are not foldable but may be passed on the stack.

That said, it would be nice to be more aggressive in folding reloads into stackmaps/patchpoints. It sounds like you are already making use of the foldPatchPoint logic that I added to avoid the reloads. We just need some higher level logic to handle more cases. I’m not sure how hard it will be. See InlineSpiller::foldMemoryOperand.

If you have suggestions for specific missing cases, let me know.

p.s. I'll need to extend PATCHPOINT semantics in at least minor ways. Every argument is both readable and *writeable*. This is different than the existing intrinsics.

I think you mean whichever register we happen to allocate for the live value (root ptr in your case) may be updated. This is the major difference between your safepoints and current patchpoints, and the main feature that enables fully moving GC.

You might need to have an implicit def for each live value use and mark it as “tied” to the use. I’m not sure we can handle tied implicit defs as opposed to tied normal defs. I don’t think it’s ever been done. It could require tinkering with some nasty areas of TwoAddress, RegisterCoalescer, and RegAlloc.

-Andy