[RFC] New StackMap format proposal (StackMap v2)

Hi @ll,

over the past year we gained more experience with the patchpoint/stackmap/statepoint intrinsics and it exposed limitations in the stackmap format.
The following proposal includes feedback and request from several interested parties and I would like to hear your feedback.

Missing correlation between functions and stackmap records:
Originally the client had to keep track of the ID to know which stackmap record belongs to which function, but this would stop working once functions are inlined.
The new format fixes that by adding a direct reference from the function to the stackmap records.

Call Size and Function Size:
These are additional information that are of interest and have been added to the format.
@Swaroop: Could you please provide a little more detailed explanation on the “Call Size” field and what exactly is there recorded. Is it just the call instruction
or also the materialization code for the address? For what is this used for?

Flat format:
We think moving to a flat form will make parsing easier, because every record has a fixed size and offsets can be calculated easily. The plan is to also
provide parsers for both stackmap versions (there is already one for the first format in tree) and a corresponding C-API to make it easier for clients to
adopt the new format. There is no plan to drop the original format and we will continue to support both formats. I will ask for feedback on the C API in a
separate RFC.

Another benefit we hope to achieve from this format is to optimize for size by uniquing entries - but that is independent optimization and not required.

More detailed frame record:
Clients require more information about the function frame, such as spilled registers, etc. The frame base register i.e. might change when dynamic stack
realignment is performed on X86.

If there is anything missing please let me know.

Thanks

Cheers,
Juergen

Header v2 {
uint8 : Stack Map Version (2)
uint8 : Reserved [3] (0)
uint32 : Constants Offset (bytes)
uint32 : Frame Records Offset (bytes)
uint32 : Frame Registers Offset (bytes)
uint32 : StackMap Records Offset (bytes)
uint32 : Locations Offset (bytes)
uint32 : LiveOuts Offset (bytes)
}

align to 8 bytes
Constants[] {
uint64 : LargeConstant
}

align to 8 bytes
FrameRecord[] {
uint64 : Function Address
uint32 : Function Size
uint32 : Stack Size
uint16 : Flags {
bool : HasFrame
bool : HasVariableSizeAlloca
bool : HasStackRealignment
bool : HasLiveOutInfo
bool : Reserved [12]
}
uint16 : Frame Base Register Dwarf RegNum
uint16 : Num Frame Registers
uint16 : Frame Register Index
uint16 : Num StackMap Records
uint16 : StackMap Record Index
}

align to 4 bytes
FrameRegister[] {
uint16 : Dwarf RegNum
int16 : Offset
uint8 : Size in Bytes
uint8 : Flags {
bool : IsSpilled
bool : Reserved [7]
}
}

align to 8 bytes
StackMapRecord[] {
uint64 : PatchPoint ID
uint32 : Instruction Offset
uint8 : Call size (bytes)
uint8 : Flags {
bool : HasLiveOutInfo
bool : Reserved [7]
}
uint16 : Num Locations
uint16 : Location Index
uint16 : Num LiveOuts
uint16 : LiveOut Index
}

align to 4 bytes
Location[] {
uint8 : Register | Direct | Indirect | Constant | ConstantIndex
uint8 : Reserved (location flags)
uint16 : Dwarf RegNum
int32 : Offset or SmallConstant
}

align to 2 bytes
LiveOuts[] {
uint16 : Dwarf RegNum
uint8 : Reserved
uint8 : Size in Bytes
}

Hi @ll,

over the past year we gained more experience with the patchpoint/stackmap/statepoint intrinsics and it exposed limitations in the stackmap format.
The following proposal includes feedback and request from several interested parties and I would like to hear your feedback.

Looks really nice. Do we care about field alignment though?

Missing correlation between functions and stackmap records:
Originally the client had to keep track of the ID to know which stackmap record belongs to which function, but this would stop working once functions are inlined.
The new format fixes that by adding a direct reference from the function to the stackmap records.

Call Size and Function Size:
These are additional information that are of interest and have been added to the format.
@Swaroop: Could you please provide a little more detailed explanation on the “Call Size” field and what exactly is there recorded. Is it just the call instruction
or also the materialization code for the address? For what is this used for?

I think the number of patchable bytes should be recorded (including the call itself), but I don’t know exactly how Swaroop is using it.

Andy

Hi @ll,

over the past year we gained more experience with the patchpoint/stackmap/statepoint intrinsics and it exposed limitations in the stackmap format.
The following proposal includes feedback and request from several interested parties and I would like to hear your feedback.

Looks really nice. Do we care about field alignment though?

Never mind. I just noticed that you took care of natural alignment by reserving byte arrays (Reserved [3]) and adding “align to” notes before each set of records.

Andy

Regarding Call-site size specification:

CoreCLR (https://github.com/dotnet/coreclr) requires the size of the Call-instruction to be reported in the GCInfo encoding.

The runtime performs querries for StackMap records using instruction offsets as follows:

  1. Offset at the end of the call instruction (offset of next instruction-1) if the call instruction occurs in code where GC can only take control at safe-points.

  2. Offset of the start of the call instruction if the call instruction occurs within a code-range that allows full interruption (that is, all instructions in a range are considered safe-points)

LLVM/statepoint GC does not support option (2), but the CoreCLR’s GC-table Encoding has an interface designed to suite both modes.

void DefineCallSites(UINT32* pCallSites, BYTE* pCallSiteSizes, UINT32 numCallSites)

Therefore, it is helpful to have Call-Site size specified in StackMapRecord.

I agree with Andy, that the call-site size should include all bytes between the start of the call instruction and the start of the next instruction.

Thanks,

Swaroop.

Regarding Call-site size specification:

CoreCLR (https://github.com/dotnet/coreclr) requires the size of the Call-instruction to be reported in the GCInfo encoding.

The runtime performs querries for StackMap records using instruction offsets as follows:

  1. Offset at the end of the call instruction (offset of next instruction-1) if the call instruction occurs in code where GC can only take control at safe-points.

From: "Juergen Ributzka" <juergen@apple.com>
To: "LLVM Dev" <llvmdev@cs.uiuc.edu>
Cc: "Lang Hames" <lhames@apple.com>
Sent: Thursday, July 9, 2015 4:04:20 PM
Subject: [LLVMdev] [RFC] New StackMap format proposal (StackMap v2)

Hi @ll,

over the past year we gained more experience with the
patchpoint/stackmap/statepoint intrinsics and it exposed limitations
in the stackmap format.
The following proposal includes feedback and request from several
interested parties and I would like to hear your feedback.

Missing correlation between functions and stackmap records:
Originally the client had to keep track of the ID to know which
stackmap record belongs to which function, but this would stop
working once functions are inlined.
The new format fixes that by adding a direct reference from the
function to the stackmap records.

Call Size and Function Size:
These are additional information that are of interest and have been
added to the format.
@Swaroop: Could you please provide a little more detailed explanation
on the "Call Size" field and what exactly is there recorded. Is it
just the call instruction
or also the materialization code for the address? For what is this
used for?

Flat format:
We think moving to a flat form will make parsing easier, because
every record has a fixed size and offsets can be calculated easily.
The plan is to also
provide parsers for both stackmap versions (there is already one for
the first format in tree) and a corresponding C-API to make it
easier for clients to
adopt the new format. There is no plan to drop the original format
and we will continue to support both formats. I will ask for
feedback on the C API in a
separate RFC.

Another benefit we hope to achieve from this format is to optimize
for size by uniquing entries - but that is independent optimization
and not required.

More detailed frame record:
Clients require more information about the function frame, such as
spilled registers, etc. The frame base register i.e. might change
when dynamic stack
realignment is performed on X86.

If there is anything missing please let me know.

Thanks

Cheers,
Juergen

Header v2 {
uint8 : Stack Map Version (2)
uint8 : Reserved [3] (0)
uint32 : Constants Offset (bytes)
uint32 : Frame Records Offset (bytes)
uint32 : Frame Registers Offset (bytes)
uint32 : StackMap Records Offset (bytes)
uint32 : Locations Offset (bytes)
uint32 : LiveOuts Offset (bytes)
}

align to 8 bytes
Constants[] {
uint64 : LargeConstant
}

align to 8 bytes
FrameRecord[] {
uint64 : Function Address
uint32 : Function Size
uint32 : Stack Size
uint16 : Flags {
bool : HasFrame
bool : HasVariableSizeAlloca
bool : HasStackRealignment
bool : HasLiveOutInfo
bool : Reserved [12]
}
uint16 : Frame Base Register Dwarf RegNum
uint16 : Num Frame Registers
uint16 : Frame Register Index
uint16 : Num StackMap Records
uint16 : StackMap Record Index
}

align to 4 bytes
FrameRegister[] {
uint16 : Dwarf RegNum
int16 : Offset
uint8 : Size in Bytes
uint8 : Flags {
bool : IsSpilled
bool : Reserved [7]
}
}

align to 8 bytes
StackMapRecord[] {
uint64 : PatchPoint ID
uint32 : Instruction Offset
uint8 : Call size (bytes)
uint8 : Flags {
bool : HasLiveOutInfo
bool : Reserved [7]
}
uint16 : Num Locations
uint16 : Location Index
uint16 : Num LiveOuts
uint16 : LiveOut Index
}

Do you guarantee that these will appear in order of increasing instruction offset?

-Hal

My comments are inline.

Swaroop.

I think for this to be reliable, we'd have to make the stackmap locations
be valid to the end of the patchpoint, instead of only to the beginning of
the patchpoint. If we don't, the caller still has to be ready to move the
call instruction to make room for any spills+restores of non-preserved
registers used by the stackmap.

Hi Hal,

no, as far as I can recall we don’t make that guarantee for StackMap Records. Although, since we record the StackMap Records in function order and in instruction order inside a function this has always been true, but that wasn’t intentional. This format shouldn’t change this, but it also isn’t something that we programmatically enforce.

Do you depend on that behavior?

-Juergen

From: "Juergen Ributzka" <juergen@apple.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Lang Hames" <lhames@apple.com>, "LLVM Dev" <llvmdev@cs.uiuc.edu>
Sent: Friday, July 10, 2015 11:34:48 AM
Subject: Re: [LLVMdev] [RFC] New StackMap format proposal (StackMap v2)

Hi Hal,

no, as far as I can recall we don’t make that guarantee for StackMap
Records. Although, since we record the StackMap Records in function
order and in instruction order inside a function this has always
been true, but that wasn’t intentional. This format shouldn’t change
this, but it also isn’t something that we programmatically enforce.

Do you depend on that behavior?

No, but I've noticed that it is true in practice, and so I think that we should say something about it one way or another. Especially since, in switching to a fixed-size record format, binary searching now becomes relatively easy/fast. Maybe it would be a useful guarantee?

Thanks again,
Hal

Yes, we talked about specifying the register save convention as part of the patchpoint. It’s a separate but related feature. The point is that LLVM should probably emit the call at the end for those cases when the JIT doesn’t have to generate its own call sequence.

Andy

Sounds good. I will add that to the StackMap documentation when I update it for v2.

—Juergen

Hi,

I submitted several weeks ago a patch D9176 to extend stackmaps to support symbolic constants.
I think this is a good time to clean up this patch by proposing to add another stackmap location type for symbolic constants to the new v2 StackMap format.

The idea is that this new type behaves like the ConstantIndex type with the only difference that the constant value specified at the index will be zero when stored on disk and will get filled in by the runtime linker with the actual value of the symbol (returned value from RTDyldMemoryManager::getSymbolAddress).

Please let my know what you think.

My use case is (this is a copy of reply I just send to the review request):

We (Pyston project) use patchpoints to implement inline caches and for deoptimization when using the LLVM tier.
For the deoptimization use case we add all variables to the patchpoint live args which we need too continue the execution in a lower generic tier (e.g. interpreter). A lot of our generated IR values were direct inttoptr casts because we often generate instances of our objects outside of LLVM. For example we may generate instances of a python objects when we setup the internal representation of a python function which we then share between the interpreter and LLVM tier. That’s why we had a lot of inttoptr casts in our generated IR, there are also additional args like pointers to the AST nodes which we will need for deopt.

Deopimizations should happen only very rarely that means that we don’t want to actually load all the constants we specified as live values inside the patchpoint into registers/stack slots. Currently LLVM will put all arguments which are constants inside the stackmap constant table in order to not have to generate code in front of the patchpoint to put all this constant values into register/stack slots. This is exactly how I would expect the behavior to be and how I need it.

But then I added a new feature: in order to speedup JITing time if we encounter the same function on the next application start I implemented an object cache for the LLVM generated code. This means I need to be able to relocate all this embedded pointers because the memory layout will not be the same. I choose to solve this by emitting special unique symbol name for all cases where I previously embedded the direct pointer value. This symbol names are deterministic, on the next startup when encountering the same function I can directly load it from the object cache and just have to return the real pointer values inside the RTDyldMemoryManager::getSymbolAddress() overloaded function.

The problem I encountered and this patch tries to solve is that LLVM will currently emit code which will load all this symbolic constants into registers before the patchpoint. With this patch we will stop emitting this machine instructions and instead emit constant table entries inside the stackmap.

Hope this helps understanding what I have done (even if my english isn’t good), I successfully use this solution now since several weeks and it gave us a huge speedup.

Yep, great timing. Thanks for the context. The explanation made this make a lot more sense to me than the original patch had. I’m generally in support of this feature being added subject to normal code review. If nothing else, we should think about the format requirements even if the implementation isn’t quite in yet. It’s definitely something which would be good to support at some point.

+1 sounds entirely reasonable.

This sounds reasonable. I can’t see an immediate reason why this would restrict future implementation flexibility. Just to be clear, the problem with (2) has nothing to do with StackMaps. It has to do with the fact our entire backend confuses pointers and non-pointers. To have option (2), I believe we’d need to effectively have a stackmap for every instruction. That’s a lot more work than anyone has serious proposed to date. :slight_smile: I would argue in favor of recording the number of bytes recorded for patching. I think that gives us everything we need in practice. Any non-destructive or concurrent patch mechanism is going to need extremely fine grained control anyways. Trying to do that partly within LLVM and partly without just seems like a mistake.

+1 +1 subject to details being settled I’m still not convinced this is actually a good idea, but I have no strong objection either. I strongly feel that trying to support multiple versions of the file format in tree is a bad idea. It’s a distraction, support burden, and provides extremely little value. What’s the value in supporting a parser for a format that ToT will no longer be able to emit? Anyone who is actually consuming the old format would have to be using the old version in process anyways. (Possible exception: Is this intended to simplify the object caching model? If so, have we ever given any guarantees about object caching in the JIT across versions? That seems risky at best…) I’m also opposed to the complexity of C API, but I’ll defer that until the separate RFC is actually posted. Is the format Actually, so long as we document that there might be unknown bytes between Header.end and Constants.begin we should be fine. It just needs to be clearly documented. As an example, if we need to add the SymbolConstants section mentioned down thread, that shouldn’t require an version change. Actually, wait… that’s entirely handled within the record format. What does tracking the Frame Base Register give us? I think I’m missing a use case here.

From: "Philip Reames" <listmail@philipreames.com>
To: "Juergen Ributzka" <juergen@apple.com>, "LLVM Dev" <llvmdev@cs.uiuc.edu>
Cc: "Lang Hames" <lhames@apple.com>
Sent: Wednesday, July 15, 2015 1:02:21 AM
Subject: Re: [LLVMdev] [RFC] New StackMap format proposal (StackMap v2)

Comments inline. I tried to reply to particular points of discussion
downthread, so this is probably best read last.

Hi @ll,

over the past year we gained more experience with the
patchpoint/stackmap/statepoint intrinsics and it exposed limitations
in the stackmap format.
The following proposal includes feedback and request from several
interested parties and I would like to hear your feedback.

Missing correlation between functions and stackmap records:
Originally the client had to keep track of the ID to know which
stackmap record belongs to which function, but this would stop
working once functions are inlined.
The new format fixes that by adding a direct reference from the
function to the stackmap records. +1

Call Size and Function Size:
These are additional information that are of interest and have been
added to the format.
@Swaroop: Could you please provide a little more detailed explanation
on the "Call Size" field and what exactly is there recorded. Is it
just the call instruction
or also the materialization code for the address? For what is this
used for? +1 subject to details being settled

Flat format:
We think moving to a flat form will make parsing easier, because
every record has a fixed size and offsets can be calculated easily.
I'm still not convinced this is actually a good idea, but I have no
strong objection either.

The plan is to also
provide parsers for both stackmap versions (there is already one for
the first format in tree) and a corresponding C-API to make it
easier for clients to
adopt the new format. There is no plan to drop the original format
and we will continue to support both formats. I will ask for
feedback on the C API in a
separate RFC.
I strongly feel that trying to support multiple versions of the file
format in tree is a bad idea. It's a distraction, support burden,
and provides extremely little value. What's the value in supporting
a parser for a format that ToT will no longer be able to emit?
Anyone who is actually consuming the old format would have to be
using the old version in process anyways.

I agree. There seems to be no reason to support the old format. We did label these 'experimental' after all.

-Hal

Please see inline.

I don’t think supporting multiple version will be that hard and I don’t plan to keep all current and future version in tree forever. But we should provide a reasonable long transition period that allows clients to transition from one version to another - basically one release cycle.

(Possible exception: Is this intended to simplify the object caching model? If so, have we ever given any guarantees about object caching in the JIT across versions? That seems risky at best…)

I’m also opposed to the complexity of C API, but I’ll defer that until the separate RFC is actually posted.

Another benefit we hope to achieve from this format is to optimize for size by uniquing entries - but that is independent optimization and not required.

More detailed frame record:
Clients require more information about the function frame, such as spilled registers, etc. The frame base register i.e. might change when dynamic stack
realignment is performed on X86.

If there is anything missing please let me know.

Thanks

Cheers,
Juergen

Is the format Little endian, bit endian? native endian?

I would say whatever the object file format is.

Header v2 {
uint8 : Stack Map Version (2)
uint8 : Reserved [3] (0)
uint32 : Constants Offset (bytes)
uint32 : Frame Records Offset (bytes)
uint32 : Frame Registers Offset (bytes)
uint32 : StackMap Records Offset (bytes)
uint32 : Locations Offset (bytes)
uint32 : LiveOuts Offset (bytes)

Let’s reserve at least two uint32s for future field groups so that we can extend in a backwards compatible way. Another option would be to steal one of the reserved bytes and require parsers to ignore sections they don’t understand.

Actually, so long as we document that there might be unknown bytes between Header.end and Constants.begin we should be fine. It just needs to be clearly documented.

That was the idea. That way we could add new sections and old parsers would still work. They would just ignore those sections. Although with the addition of official parser in trees I am not sure how important this is anymore.

As an example, if we need to add the SymbolConstants section mentioned down thread, that shouldn’t require an version change.

Once we release a LLVM version any change should require a version change.

}

align to 8 bytes
Constants[] {
uint64 : LargeConstant
}

align to 8 bytes
FrameRecord[] {
uint64 : Function Address
uint32 : Function Size
uint32 : Stack Size
uint16 : Flags {
bool : HasFrame

What does this mean?

If the function has a frame pointer.

bool : HasVariableSizeAlloca
bool : HasStackRealignment
bool : HasLiveOutInfo
bool : Reserved [12]

}
uint16 : Frame Base Register Dwarf RegNum

I don’t think this is actually sufficient for dynamic stack realignment. Don’t you end up with some things indexed off RSP and others indexed off RBP? Actually, wait… that’s entirely handled within the record format. What does tracking the Frame Base Register give us? I think I’m missing a use case here.

uint16 : Num Frame Registers
uint16 : Frame Register Index

Are these (begin, begin+length) offsets into the Frame Registers table? If so, should we swap them?

Sure

uint16 : Num StackMap Records
uint16 : StackMap Record Index
}

align to 4 bytes
FrameRegister[] {
uint16 : Dwarf RegNum
int16 : Offset
uint8 : Size in Bytes
uint8 : Flags {
bool : IsSpilled
bool : Reserved [7]
}
}

This seems tied specifically to the assumption that registers are spilled on entry. Given the recent work towards shrink wrapping, is that something we want to do?

p.s. What’s with the IsSpilled bit? Why would you have a record for an unspilled register?

This was done before shrink wrapping and pristine register support. The original problem was that the live-out set didn’t contain pristine registers and we have to add all callee-saved register to the live-out set too to make sure we didn’t miss anything. We could do better by recording which callee-save registers actually got spilled and which not.

align to 8 bytes
StackMapRecord[] {
uint64 : PatchPoint ID

“ID” (remove the patchpoint since statepoint now has one too)

sure

uint32 : Instruction Offset
uint8 : Call size (bytes)

“Patchable size” per down thread discussion

uint8 : Flags {
bool : HasLiveOutInfo
bool : Reserved [7]
}

Why do we need these flags? Isn’t this Num LiveOuts == 0?

A verification that the information was actually computed - we might have 0 live-outs too. We could also change this in the future to attach this information only to call-sites that have an attribute that request this information.