Proposal: function prefix data

Hi,

I would like to propose that we introduce a mechanism in IR to allow
arbitrary data to be stashed before a function body. The purpose of
this would be to allow additional data about a function to be looked
up via a function pointer. Two use cases come to mind:

1) We'd like to be able to use UBSan to check that the type of the
   function pointer of an indirect function call matches the type of
   the function being called. This can't really be done efficiently
   without storing type information near the function.

2) Allowing GHC's tables-next-to-code ABI [1] to be implemented.
   In general, I imagine this feature could be useful for the
   implementation of languages which require runtime metadata for
   each function.

The proposal is that an IR function definition acquires a constant
operand which contains the data to be emitted immediately before
the function body (known as the prefix data). To access the data
for a given function, a program may bitcast the function pointer to
a pointer to the constant's type. This implies that the IR symbol
points to the start of the prefix data.

To maintain the semantics of ordinary function calls, the prefix data
must have a particular format. Specifically, it must begin with a
sequence of bytes which decode to a sequence of machine instructions,
valid for the module's target, which transfer control to the point
immediately succeeding the prefix data, without performing any other
visible action. This allows the inliner and other passes to reason
about the semantics of the function definition without needing to
reason about the prefix data. Obviously this makes the format of the
prefix data highly target dependent.

This requirement could be relaxed when combined with my earlier symbol
offset proposal [2] as applied to functions. However, this is outside
the scope of the current proposal.

Example:

%0 = type <{ i32, i8* }>

define void @f() prefix %0 <{ i32 1413876459, i8* bitcast ({ i8*, i8* }* @_ZTIFvvE to i8*) }> {
  ret void
}

This is an example of something that UBSan might generate on an
x86_64 machine. It consists of a signature of 4 bytes followed by a
pointer to the RTTI data for the type 'void ()'. The signature when
laid out as a little endian 32-bit integer decodes to the instruction
'jmp .+0x0c' (which jumps to the instruction immediately succeeding
the 12-byte prefix) followed by the bytes 'F' and 'T' which identify
the prefix as a UBSan function type prefix.

A caller might check that a given function pointer has a valid signature
like this:

  %4 = bitcast void ()* @f to %0*
  %5 = getelementptr %0* %4, i32 0, i32 0
  %6 = load i32* %5
  %7 = icmp eq i32 %6, 1413876459

In the specific case above, where the function pointer is a constant,
optimisation passes such as globalopt could potentially be adapted
to recognise prefix data and hence replace %6 etc with a constant.
(This is one reason why I decided to represent prefix data in IR
rather than, say, using inline asm as proposed in the GHC thread [1].)

Thoughts?

Thanks,

Hi,

I would like to propose that we introduce a mechanism in IR to allow
arbitrary data to be stashed before a function body. The purpose of
this would be to allow additional data about a function to be looked
up via a function pointer. Two use cases come to mind:

1) We'd like to be able to use UBSan to check that the type of the
   function pointer of an indirect function call matches the type of
   the function being called. This can't really be done efficiently
   without storing type information near the function.

How efficient does it have to be? Have some alternatives already proven to
be "too slow"? (e.g. a binary search into a sorted table)

2) Allowing GHC's tables-next-to-code ABI [1] to be implemented.
   In general, I imagine this feature could be useful for the
   implementation of languages which require runtime metadata for
   each function.

The proposal is that an IR function definition acquires a constant
operand which contains the data to be emitted immediately before
the function body (known as the prefix data). To access the data
for a given function, a program may bitcast the function pointer to
a pointer to the constant's type. This implies that the IR symbol
points to the start of the prefix data.

To maintain the semantics of ordinary function calls, the prefix data
must have a particular format. Specifically, it must begin with a
sequence of bytes which decode to a sequence of machine instructions,
valid for the module's target, which transfer control to the point
immediately succeeding the prefix data, without performing any other
visible action. This allows the inliner and other passes to reason
about the semantics of the function definition without needing to
reason about the prefix data. Obviously this makes the format of the
prefix data highly target dependent.

I'm not sure that something this target dependent is the right choice. Your
example below suggests that the frontend would then need to know magic to
put "raw" in the instruction stream. Have you considered having the feature
expose just the intent "store this data attached to the function, to be
accessed very quickly", and then have an intrinsic
("llvm.getfuncdata.i{8,16,32,64}"?) which extracts the data in a
target-dependent way? Forcing clients to embed deep
target-specific-machine-code knowledge in their frontends seems like a step
in the wrong direction for LLVM.

This requirement could be relaxed when combined with my earlier symbol
offset proposal [2] as applied to functions. However, this is outside
the scope of the current proposal.

Example:

%0 = type <{ i32, i8* }>

define void @f() prefix %0 <{ i32 1413876459, i8* bitcast ({ i8*, i8* }*
@_ZTIFvvE to i8*) }> {
  ret void
}

This is an example of something that UBSan might generate on an
x86_64 machine. It consists of a signature of 4 bytes followed by a
pointer to the RTTI data for the type 'void ()'. The signature when
laid out as a little endian 32-bit integer decodes to the instruction
'jmp .+0x0c' (which jumps to the instruction immediately succeeding
the 12-byte prefix) followed by the bytes 'F' and 'T' which identify
the prefix as a UBSan function type prefix.

Do you know whether OoO CPU's will still attempt to decode the "garbage" in
the instruction stream, even if there is a jump over it? (IIRC they will
decode ahead of the PC and hiccup (but not fault) on garbage). Maybe it
would be better to steganographically encode the value inside the
instruction stream? On x86 you could use 48b8<imm64> which only has 2 bytes
overhead for an i64 (putting a move like that, which moves into a
caller-saved register on entry, would effectively be a noop). This is some
pretty gnarly target-dependent stuff which seems like it would best be
hidden in the backend (e.g. architectures that have "constant island"-like
passes might want to stash the data in there instead).

-- Sean Silva

To maintain the semantics of ordinary function calls, the prefix data
must have a particular format. Specifically, it must begin with a
sequence of bytes which decode to a sequence of machine instructions,
valid for the module's target, which transfer control to the point
immediately succeeding the prefix data, without performing any other
visible action. This allows the inliner and other passes to reason
about the semantics of the function definition without needing to
reason about the prefix data. Obviously this makes the format of the
prefix data highly target dependent.

What if the prefix data was stored before the start of the function
code? The function's symbol will point to the code just as before,
eliminating the need to have instructions that skip the prefix data.

It would look something like:

Prefix Data ... (variable length) | Prefix Data Length (fixed length

[32 bits?]) | Function code .... |

            ^ function symbol points here (function code)

I hope the simple ASCII art makes it through my mail client.

To access the data, you do

prefix_data = function_ptr - sizeof(prefix_length) - prefix_length

Cheers,
Jevin

> Hi,
>
> I would like to propose that we introduce a mechanism in IR to allow
> arbitrary data to be stashed before a function body. The purpose of
> this would be to allow additional data about a function to be looked
> up via a function pointer. Two use cases come to mind:
>
> 1) We'd like to be able to use UBSan to check that the type of the
> function pointer of an indirect function call matches the type of
> the function being called. This can't really be done efficiently
> without storing type information near the function.
>

How efficient does it have to be? Have some alternatives already proven to
be "too slow"? (e.g. a binary search into a sorted table)

This has admittedly not been measured. It depends on the rate at
which the program performs indirect function calls. But given the
other use cases for this feature we might as well use it in UBSan as
opposed to something which is going to be strictly slower.

> 2) Allowing GHC's tables-next-to-code ABI [1] to be implemented.
> In general, I imagine this feature could be useful for the
> implementation of languages which require runtime metadata for
> each function.
>
> The proposal is that an IR function definition acquires a constant
> operand which contains the data to be emitted immediately before
> the function body (known as the prefix data). To access the data
> for a given function, a program may bitcast the function pointer to
> a pointer to the constant's type. This implies that the IR symbol
> points to the start of the prefix data.
>
> To maintain the semantics of ordinary function calls, the prefix data
> must have a particular format. Specifically, it must begin with a
> sequence of bytes which decode to a sequence of machine instructions,
> valid for the module's target, which transfer control to the point
> immediately succeeding the prefix data, without performing any other
> visible action. This allows the inliner and other passes to reason
> about the semantics of the function definition without needing to
> reason about the prefix data. Obviously this makes the format of the
> prefix data highly target dependent.
>

I'm not sure that something this target dependent is the right choice. Your
example below suggests that the frontend would then need to know magic to
put "raw" in the instruction stream. Have you considered having the feature
expose just the intent "store this data attached to the function, to be
accessed very quickly", and then have an intrinsic
("llvm.getfuncdata.i{8,16,32,64}"?) which extracts the data in a
target-dependent way?

The problem is that things like UBSan need to be able to understand
the instruction stream anyway (to a certain extent). In UBSan's case,
determining at runtime whether a function has prefix data depends on
a specific signature of instructions at the start of the program.
There are a wide variety of signatures that can be used here and
I believe we shouldn't try to constrain the frontend author with a
signature (at least partly) of our own design.

I think that if someone wants a target-independent way of
embedding prefix data it should be done as a library on top of the
target-dependent facilities provided in IR. One could imagine a set
of routines like this:

/// Given some constant data, attach valid prefix data.
void attachPrefixData(Function *F, Constant *Data);

/// Returns an i1 indicating whether prefix data is present for FP.
Value *hasPrefixData(Value *FP);

/// Returns a pointer to the prefix data for FP.
Value *getPrefixDataPointer(Value *FP, Type *DataType);

Forcing clients to embed deep
target-specific-machine-code knowledge in their frontends seems like a step
in the wrong direction for LLVM.

Given a set of routines such as the ones described above, I think we
can give frontends a choice of whether to do this or not. Besides,
LLVM already contains plenty of target-specific information in its IR.
Varargs, inline asm, calling conventions, etc. I don't think making
all aspects of the IR target-independent should be a worthwhile goal
for LLVM.

> This requirement could be relaxed when combined with my earlier symbol
> offset proposal [2] as applied to functions. However, this is outside
> the scope of the current proposal.
>
> Example:
>
> %0 = type <{ i32, i8* }>
>
> define void @f() prefix %0 <{ i32 1413876459, i8* bitcast ({ i8*, i8* }*
> @_ZTIFvvE to i8*) }> {
> ret void
> }
>
> This is an example of something that UBSan might generate on an
> x86_64 machine. It consists of a signature of 4 bytes followed by a
> pointer to the RTTI data for the type 'void ()'. The signature when
> laid out as a little endian 32-bit integer decodes to the instruction
> 'jmp .+0x0c' (which jumps to the instruction immediately succeeding
> the 12-byte prefix) followed by the bytes 'F' and 'T' which identify
> the prefix as a UBSan function type prefix.
>

Do you know whether OoO CPU's will still attempt to decode the "garbage" in
the instruction stream, even if there is a jump over it? (IIRC they will
decode ahead of the PC and hiccup (but not fault) on garbage). Maybe it
would be better to steganographically encode the value inside the
instruction stream? On x86 you could use 48b8<imm64> which only has 2 bytes
overhead for an i64 (putting a move like that, which moves into a
caller-saved register on entry, would effectively be a noop).

On the contrary, I think this is a good argument for allowing
(not forcing) frontends to encode the prefix data as they please,
thus enabling this kind of creativity.

This is some
pretty gnarly target-dependent stuff which seems like it would best be
hidden in the backend (e.g. architectures that have "constant island"-like
passes might want to stash the data in there instead).

I think that adding support for things like constant islands is
something that can be added incrementally at a later stage. One could
consider for example an additional llvm::Function field which specifies
the number of bytes that the backend may use at the beginning of the
function such that the prefix data may be of any format. (Once this
is in place the aforementioned library routines could become relatively
trivial.) The backend could use this space to, say, insert a relative
branch that skips the prefix data and a first constant island.

Thanks,

>
> To maintain the semantics of ordinary function calls, the prefix data
> must have a particular format. Specifically, it must begin with a
> sequence of bytes which decode to a sequence of machine instructions,
> valid for the module's target, which transfer control to the point
> immediately succeeding the prefix data, without performing any other
> visible action. This allows the inliner and other passes to reason
> about the semantics of the function definition without needing to
> reason about the prefix data. Obviously this makes the format of the
> prefix data highly target dependent.

What if the prefix data was stored before the start of the function
code? The function's symbol will point to the code just as before,
eliminating the need to have instructions that skip the prefix data.

It would look something like:
> Prefix Data ... (variable length) | Prefix Data Length (fixed length
[32 bits?]) | Function code .... |

            ^ function symbol points here (function code)

I hope the simple ASCII art makes it through my mail client.

To access the data, you do

prefix_data = function_ptr - sizeof(prefix_length) - prefix_length

A similar scheme is described in the next paragraph of my email:

> This requirement could be relaxed when combined with my earlier symbol
> offset proposal [2] as applied to functions. However, this is outside
> the scope of the current proposal.

Unfortunately, this won't work for UBSan, as it needs to be able to
take an arbitrary function pointer and determine whether the prefix
data is present. If the function lives at the beginning of a segment
boundary and does not have prefix data a segfault may occur when
attempting to access the prefix data.

It should definitely work for GHC though (and is how I understand
the tables-next-to-code ABI to be implemented in its non-LLVM backend).

Thanks,

As much as I like this idea for it’s use in languages with type systems like Haskell and Scheme, this proposal would limit LLVM to non-Harvard architectures. That’s generally a really small minority of all processors, but it would mean there could never be a clang-avr.

An alternative you could use is, instead of using the function pointer as the variable where you are referring to a function, you could have the variable be a pointer to a static struct with the data and the actual function pointer. Basically, it’s like how static class variables as handled in C++.

I don’t know LLVM IR, so I’ll use C to explain.

Instead of this:

void func(void){}

int main(){
func();
return 0;
}

You could do this:

void func(void){}

/* You have to initialize this at compile time. /
struct {
char
data;
int len;
void (*ptr)(void) = func;
} func_data;

int main(){
func_data.ptr();
return 0;
}

As much as I like this idea for it's use in languages with type systems
like Haskell and Scheme, this proposal would limit LLVM to non-Harvard
architectures. That's generally a really small minority of all processors,
but it would mean there could never be a clang-avr.

Not really. It just would mean that the prefix data feature would not exist
(or at least would not be usable) on such architectures.

An alternative you could use is, instead of using the function pointer as
the variable where you are referring to a function, you could have the
variable be a pointer to a static struct with the data and the actual
function pointer. Basically, it's like how static class variables as
handled in C++.

I don't know LLVM IR, so I'll use C to explain.

Instead of this:

void func(void){}

int main(){
   func();
   return 0;
}

You could do this:

void func(void){}

/* You have to initialize this at compile time. */
struct {
char* data;
int len;
void (*ptr)(void) = func;
} func_data;

int main(){
   func_data.ptr();
   return 0;
}

You could certainly use something like this to implement runtime
function metadata on Harvard architectures (the existing LLVM GHC
backend must already do something like this), and (optionally, as an
optimisation) prefix data on all other architectures.

Thanks,

Are there any Harvard architectures that clang supports that doesn't
provide instructions to read program memory? AVR has the LPM
instruction.

Cheers,
Jevin

>
> > Hi,
> >
> > I would like to propose that we introduce a mechanism in IR to allow
> > arbitrary data to be stashed before a function body. The purpose of
> > this would be to allow additional data about a function to be looked
> > up via a function pointer. Two use cases come to mind:
> >
> > 1) We'd like to be able to use UBSan to check that the type of the
> > function pointer of an indirect function call matches the type of
> > the function being called. This can't really be done efficiently
> > without storing type information near the function.
> >
>
> How efficient does it have to be? Have some alternatives already proven
to
> be "too slow"? (e.g. a binary search into a sorted table)

This has admittedly not been measured. It depends on the rate at
which the program performs indirect function calls. But given the
other use cases for this feature

So far you only seem to have presented the GHC ABI use case (and see just
below about UBSan). Do you know of any other use cases?

we might as well use it in UBSan as
opposed to something which is going to be strictly slower.

Below you use UBSan's use case as a motivating example, which seems
incongruous to this "might as well use it in UBSan" attitude. Without
having evaluated alternatives as being "too slow" for UBSan, I don't think
that UBSan's use case should be used to drive this proposal.

> > 2) Allowing GHC's tables-next-to-code ABI [1] to be implemented.
> > In general, I imagine this feature could be useful for the
> > implementation of languages which require runtime metadata for
> > each function.
> >
> > The proposal is that an IR function definition acquires a constant
> > operand which contains the data to be emitted immediately before
> > the function body (known as the prefix data). To access the data
> > for a given function, a program may bitcast the function pointer to
> > a pointer to the constant's type. This implies that the IR symbol
> > points to the start of the prefix data.
> >
> > To maintain the semantics of ordinary function calls, the prefix data
> > must have a particular format. Specifically, it must begin with a
> > sequence of bytes which decode to a sequence of machine instructions,
> > valid for the module's target, which transfer control to the point
> > immediately succeeding the prefix data, without performing any other
> > visible action. This allows the inliner and other passes to reason
> > about the semantics of the function definition without needing to
> > reason about the prefix data. Obviously this makes the format of the
> > prefix data highly target dependent.
> >
>
> I'm not sure that something this target dependent is the right choice.
Your
> example below suggests that the frontend would then need to know magic to
> put "raw" in the instruction stream. Have you considered having the
feature
> expose just the intent "store this data attached to the function, to be
> accessed very quickly", and then have an intrinsic
> ("llvm.getfuncdata.i{8,16,32,64}"?) which extracts the data in a
> target-dependent way?

The problem is that things like UBSan need to be able to understand
the instruction stream anyway (to a certain extent). In UBSan's case,
determining at runtime whether a function has prefix data depends on
a specific signature of instructions at the start of the program.
There are a wide variety of signatures that can be used here and
I believe we shouldn't try to constrain the frontend author with a
signature (at least partly) of our own design.

I think that if someone wants a target-independent way of
embedding prefix data it should be done as a library on top of the
target-dependent facilities provided in IR. One could imagine a set
of routines like this:

/// Given some constant data, attach valid prefix data.
void attachPrefixData(Function *F, Constant *Data);

/// Returns an i1 indicating whether prefix data is present for FP.
Value *hasPrefixData(Value *FP);

/// Returns a pointer to the prefix data for FP.
Value *getPrefixDataPointer(Value *FP, Type *DataType);

> Forcing clients to embed deep
> target-specific-machine-code knowledge in their frontends seems like a
step
> in the wrong direction for LLVM.

Given a set of routines such as the ones described above, I think we
can give frontends a choice of whether to do this or not. Besides,
LLVM already contains plenty of target-specific information in its IR.
Varargs, inline asm, calling conventions, etc. I don't think making
all aspects of the IR target-independent should be a worthwhile goal
for LLVM.

I don't have an issue with target-dependent things per se; I just think
that they should be given a bit more thought and not added unless existing
mechanisms are insufficient. For example, could this be implemented as a
late IR pass that adds a piece of inline-asm to the beginning of the
function?

-- Sean Silva

So far you only seem to have presented the GHC ABI use case (and see just
below about UBSan). Do you know of any other use cases?

Not concretely. I was referring to other language implementations which
might use runtime function metadata.

> we might as well use it in UBSan as
> opposed to something which is going to be strictly slower.
>

Below you use UBSan's use case as a motivating example, which seems
incongruous to this "might as well use it in UBSan" attitude. Without
having evaluated alternatives as being "too slow" for UBSan, I don't think
that UBSan's use case should be used to drive this proposal.

OK. So let's approach this from the
GHC/runtime-function-metadata-based-language standpoint. I would
argue that the client still ought to have some control over where the
data appears relative to the function. This might be for the sake
of conformance with an existing ABI for that language. For example,
in the existing GHC tables-next-to-code ABI, the data appears right
before the function.

Given the ability to do this, is it too much of a stretch for the
client to be able to specify where the symbol should be located?
This is something that will be needed anyway, as explained in my
symbol offset proposal. And provided that the client behaves, it
shouldn't impose any additional burden on LLVM itself.

I don't have an issue with target-dependent things per se; I just think
that they should be given a bit more thought and not added unless existing
mechanisms are insufficient. For example, could this be implemented as a
late IR pass that adds a piece of inline-asm to the beginning of the
function?

I don't like it, for four reasons:

1) Inline asm is just as target dependent as prefix data (perhaps even
   more so, if you consider that different targets may have different
   flavours of asm).

2) It takes control of the specific encoding of the instructions out of
   your hands, which can be important if you use it as a signature.
   
3) It inhibits optimisation, as it becomes more difficult to
   optimise away loads through a known function pointer.

4) The backend will probably need to be taught to treat this particular
   piece of asm specially, i.e. by not emitting a function prelude
   until it is emitted. By contrast, the backend can be taught
   to emit prefix data trivially with two lines of code.

Thanks,

Hi,

What if the prefix data was stored before the start of the function
code? The function's symbol will point to the code just as before,
eliminating the need to have instructions that skip the prefix data.

how many platforms would this work on? Last time I tried something analogous
to this it fell through because the Darwin object code format didn't support
it. I'm not saying that what you are suggesting wouldn't work on Darwin - I
don't know. But given my past experience it would be wise to do a platform
survey before going too far.

Ciao, Duncan.

As much as I like this idea for it’s use in languages with type systems like Haskell and Scheme, this proposal would limit LLVM to non-Harvard architectures. That’s generally a really small minority of all processors, but it would mean there could never be a clang-avr.

LLVM already effectively assumes a unified address space. The pointer address space attributes and such come close to what would be required for good Harvard support, but it’s not enough. In particular there’s the fundamental assumption that all pointers are the same size. For embedded Harvard arches, that’s often not the case.

-JIm