[PATCH] Symbol offsets

Somehow this cover letter was dropped from my symbol offsets patch set:

  1. http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-May/073200.html
  2. http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-May/073201.html

Original message

I’m a little concerned we got prefix data wrong. We had the following motivating use cases:

  1. Function prologue sigils, where we emit a special nop slide, maybe with data in it. Peter implemented a ubsan feature using this.

  2. Function hotpatching, where we emit some data before the function and a special nop at the start of the function. Typically the nop is ‘mov edi, edi’ on x86 Windows, preceded by five bytes of padding for a long jump. Profilers can use this to turn instrumentation of a running binary on and off.

  3. Tables-before-code, where data is completely prior to the code. GHC needs this.

In all cases, any code inside the prologue had no meaning to LLVM. Inlining a function with a funky prologue is completely valid.
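
To make the hotpatching case (item 2) concrete, here is a minimal Python sketch of the byte layout and of how a patcher might rewrite it. It is a toy model, not anything LLVM emits: the function names, the `0xCC` filler value, and the one-byte `ret` body are assumptions for illustration. The `8B FF` encoding of `mov edi, edi` and the `E9`/`EB` jump opcodes are standard x86.

```python
import struct

PAD = b"\xcc" * 5    # five padding bytes ahead of the entry point (filler value is an assumption)
NOP2 = b"\x8b\xff"   # x86 encoding of `mov edi, edi`, a two-byte no-op
BODY = b"\xc3"       # `ret`, standing in for the real function body


def make_function() -> bytearray:
    """Lay out the padding, then the two-byte nop, then the body."""
    return bytearray(PAD + NOP2 + BODY)


def hotpatch(code: bytearray, base: int, target: int) -> None:
    """Redirect the function whose padding starts at address `base` to `target`.

    The padding becomes a long `jmp rel32`, and the two-byte nop becomes a
    short jump back onto that long jump.
    """
    entry = len(PAD)
    # rel32 is measured from the end of the 5-byte jmp, i.e. the entry point.
    rel32 = target - (base + entry)
    code[0:entry] = b"\xe9" + struct.pack("<i", rel32)
    # rel8 is measured from the end of the 2-byte jmp, so -7 lands on the long jump.
    code[entry:entry + 2] = b"\xeb" + struct.pack("<b", -(entry + 2))


code = make_function()
hotpatch(code, base=0x1000, target=0x2000)
```

After the call, the entry point holds `EB F9` and the padding holds the long jump; the body bytes are untouched, which is why LLVM can treat the prologue as opaque.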

I worry that symbol_offset combined with prefix is too low-level. What if we split this up into something like prefix data and “prologue” data? Prefix data would be an arbitrary LLVM constant, and prologue data would be a byte sequence of native executable code. Something like:

define void @foo() prefix [i8* x 2] { i8* @a, i8* @b } prologue [i8 x 4] c"\xde\xad\xbe\xef" { ret void }

I think the two forms are fundamentally equivalent to optimizations like global constant propagation, but it’d be nice to have an intuitive representation. One of the strengths of LLVM’s IL is that it’s comprehensible to mere mortal compiler engineers, and not just computer programs.

P.S. You could also represent this with aliases with a non-zero offset
from the beginning of the function. Rafael is implementing this, but I
don't think that's a very good representation.

I mean, I don't think it's a good representation for tables-before-code. I
think it's a perfectly reasonable way to represent MSVC-style vftables that
have RTTI data, which we plan to do.

I like this proposal. Now that I've thought about it more, I think it might
not be too important for global variables and functions to share a similar
representation for offsets. One comment though.

Before, when I was thinking about calls to functions with prefix data in
cases where the function entry point appears after the data (i.e. GHC's use
case), I was imagining that we could have two new properties for functions:
the symbol offset and the entry point offset. UBSan etc would set both to
zero, while GHC would set the former to zero and the latter to the size of
the prefix.

Provided that we need to cater to platforms where the function's metadata
cannot appear before the function's symbol (which I believe to be the case
on at least Darwin) we need some way of representing the distance between
the symbol and the entry point in external function declarations. Under your
proposal, we could probably do that by having a way of representing the type
of the prefix separately from its "initializer".
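
The two properties described above can be pictured with a small Python model. This is only a sketch: `FunctionLayout` and its field names are hypothetical and do not correspond to any LLVM API, and the addresses are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionLayout:
    """Toy model of one emitted function; every name here is hypothetical."""
    start: int          # address of the first emitted byte
    prefix_size: int    # bytes of data placed ahead of the entry point
    symbol_offset: int  # distance from `start` to where the symbol points
    entry_offset: int   # distance from `start` to the first executed instruction

    @property
    def symbol_address(self) -> int:
        return self.start + self.symbol_offset

    @property
    def entry_address(self) -> int:
        return self.start + self.entry_offset

    @property
    def call_displacement(self) -> int:
        """What a caller must add to the symbol address to reach the entry point."""
        return self.entry_offset - self.symbol_offset


# UBSan-style: both offsets are zero, so calling the symbol enters the function.
ubsan = FunctionLayout(start=0x1000, prefix_size=0, symbol_offset=0, entry_offset=0)
# GHC-style as described above: symbol at the start, entry point after the prefix.
ghc = FunctionLayout(start=0x2000, prefix_size=16, symbol_offset=0, entry_offset=16)
```

In this model `call_displacement` is the quantity an external declaration would need to carry on platforms where the metadata cannot sit before the symbol.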

Thanks,

Now that aliases can have any expressions, can't you use something like

@data = private global [2 x i32] [i32 42, i32 43]
@symbol = alias getelementptr ([2 x i32]* @data, i32 0, i32 1)

This produces

.Ldata:
        .long 42 # 0x2a
        .long 43 # 0x2b
...
        .globl symbol
symbol = .Ldata+4

That is, in the object file there is only one symbol (named symbol)
and it is at offset 4.
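
The offset arithmetic in this example can be checked with a short Python sketch. It only mimics the layout of the two `.long` values, assuming 4-byte little-endian integers as on x86; the helper name `element_offset` is made up for illustration.

```python
import struct

I32 = struct.Struct("<i")  # 4-byte little-endian integer, matching `.long` on x86

# Contents of @data / .Ldata: the two i32 values laid out back to back.
data = b"".join(I32.pack(v) for v in (42, 43))


def element_offset(index: int) -> int:
    """Byte offset of element `index` within an array of i32."""
    return index * I32.size


# getelementptr ([2 x i32]* @data, i32 0, i32 1) selects element 1.
symbol = element_offset(1)
(value,) = I32.unpack_from(data, symbol)
```

Here `symbol` comes out as 4, matching `symbol = .Ldata+4`, and the value read at that offset is the second element.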

Rafael Espíndola <rafael.espindola@gmail.com> writes:

> Now that aliases can have any expressions, can't you use something like
>
> @data = private global [2 x i32] [i32 42, i32 43]
> @symbol = alias getelementptr ([2 x i32]* @data, i32 0, i32 1)
>
> This produces
>
> .Ldata:
>         .long 42 # 0x2a
>         .long 43 # 0x2b
> ...
>         .globl symbol
> symbol = .Ldata+4
>
> That is, in the object file there is only one symbol (named symbol)
> and it is at offset 4.

I believe we could but I'll have to try implementing it to know for
certain. Thanks for the suggestion!

Cheers,

- Ben

Peter Collingbourne <peter@pcc.me.uk> writes:

> Rafael Espíndola <rafael.espindola@gmail.com> writes:
>
>> Now that aliases can have any expressions, can't you use something like
>>
>> @data = private global [2 x i32] [i32 42, i32 43]
>> @symbol = alias getelementptr ([2 x i32]* @data, i32 0, i32 1)
>>
>> This produces
>>
>> .Ldata:
>>         .long 42 # 0x2a
>>         .long 43 # 0x2b
>> ...
>>         .globl symbol
>> symbol = .Ldata+4
>>
>> That is, in the object file there is only one symbol (named symbol)
>> and it is at offset 4.

How would one define the body of the function `symbol` in this case?

Cheers,

- Ben

I don't think it would work. For LLVM generated "data" (like function
bodies) it seems something like the prefix feature is what is needed.

Cheers,
Rafael

Rafael Espíndola <rafael.espindola@gmail.com> writes:

I suspected this was the case. Is a rework of prefix support likely to
make it in for 3.5?

Unlikely. It has branched already and I don't know of anyone working on it.

Cheers,
Rafael

Rafael Espíndola <rafael.espindola@gmail.com> writes:

>> I suspected this was the case. Is a rework of prefix support likely to
>> make it in for 3.5?
>
> Unlikely. It has branched already and I don't know of anyone working on it.

Fair enough. If there is consensus around a reasonably concrete proposal
I would be happy to put together a patch (acknowledging that it probably
won't make it in for 3.5). Does Reid's proposal seem reasonable?

Cheers,

- Ben

Ben Gamari <bgamari.foss@gmail.com> writes:

Ping?

+the people I hashed this out with so many months ago

I think it’s a reasonable proposal, but obviously I floated it. :) Let’s try to get a second opinion.

Again, it’s a syntax something like:
define void @foo() prefix [i8* x 2] { i8* @a, i8* @b } prologue [i8 x 4] c"\xde\xad\xbe\xef" { ret void }

Reid Kleckner <rnk@google.com> writes:

> +the people I hashed this out with so many months ago
>
> I think it's a reasonable proposal, but obviously I floated it. :) Let's
> try to get a second opinion.
>
> Again, it's a syntax something like:
> define void @foo() prefix [i8* x 2] { i8* @a, i8* @b } prologue [i8 x 4]
> c"\xde\xad\xbe\xef" { ret void }

As I've said previously, this syntax looks great to me. I haven't yet
written any code against this interface but I see no reason why we
wouldn't be able to use this to implement tables-next-to-code.

Gabor's suggestion of prefix data on basic blocks is interesting but I'm
not sure how this would be realized syntactically.

Either way, would it be possible to see something like this in LLVM
3.6? For a variety of reasons the TNTC hack GHC's LLVM backend currently
uses is becoming rather burdensome. It would be great to have a more
principled approach soon.

Cheers,

- Ben

Ben Gamari <bgamari.foss@gmail.com> writes:

Do any of you have any thoughts on this? I would be willing to put
implementing this on my queue but I'd first want to know that this
proposal sounds somewhat acceptable.

Cheers,

- Ben