LLVM GHC Backend: Tables Next To Code

Hello everyone,

On behalf of GHC hackers, I would like to discuss the possibility of
having a proper implementation of the tables-next-to-code optimisation
in LLVM.

Currently, the object code produced by all three GHC backends follows
the convention that the table with the metadata of a closure is
located immediately before the code of the closure. This makes it
possible to get to both the code and the metadata after a single
pointer resolution operation. This, obviously, requires a certain
ordering of data and text in the object code. Since LLVM does not
make it possible to explicitly control the placement of data and code,
the necessary ordering is currently achieved by injecting GNU
Assembler subsections on platforms supported by GNU Assembler. Mac
assembler, however, does not support this feature, so the resulting
object code is post-processed directly.
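To make the layout concrete, the single-dereference access that TNTC enables can be sketched in a few lines of C. The `InfoTable` type and its field names here are illustrative only, not GHC's actual definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-word info table; the field names are illustrative,
   not GHC's actual layout. */
typedef struct {
    uint64_t layout;
    uint64_t closure_type;
} InfoTable;

/* With tables-next-to-code, a closure holds a single pointer to the
   entry code, and the info table sits immediately before that code,
   so one pointer reaches both. */
static const InfoTable *get_info(const void *entry_code) {
    return (const InfoTable *)entry_code - 1;
}
```

The same pointer is jumped to when entering the closure and offset negatively when the metadata is needed, which is exactly why the table must be laid out directly before the code.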

It would be desirable for LLVM to offer some control over the ordering
of text and data sections in the object code. In particular, David Terei
suggests one of the following [0]:

1. Add a special variable "llvm.order" of an array type. It will list
   function names and global pointers in the order in which they
   should appear in the resulting object code. This basically
   corresponds to the "-fno-toplevel-reorder" option in GCC.

2. Add support for attaching a global variable to a function. In this
   way one could specify that a global variable should appear before
   the corresponding function in the object code.

An example of a possible implementation of (2) has been recently
suggested by Gabor Greif. He proposes adding a "placebefore"
attribute to global variables (or, similarly, a "placeafter" attribute
for functions). The corresponding example is:

  @foo_D = common global %struct.Descr zeroinitializer, align 8,
placebefore @foo

  define i32 @foo() nounwind uwtable readnone {
     ret i32 undef
  }

The question is whether the LLVM developers are willing to accept
such modifications and, if yes, which approach to implementing such
modifications they would recommend.

Sergiu

[0] LLVM: Add support for TNTC to LLVM compiler suite (#4213) · Issues · Glasgow Haskell Compiler / GHC · GitLab

On behalf of GHC hackers, I would like to discuss the possibility of
having a proper implementation of the tables-next-to-code optimisation
in LLVM.

It would be great to have this. However, the design will be tricky. Is there anything that spells out how the TNTC optimization works at the actual machine instruction level? It seems that there should be a blog post somewhere that shows the code with and without the optimization, but I can't find it offhand.

This, obviously, requires a certain
ordering of data and text in the object code. Since LLVM does not
make it possible to explicitly control the placement of data and code,
the necessary ordering is currently achieved by injecting GNU
Assembler subsections on platforms supported by GNU Assembler. Mac
assembler, however, does not support this feature, so the resulting
object code is post-processed directly.

It's interesting that you bring this up. It turns out that the Mac toolchain (unless you disable .subsections_via_symbols, a gross hack) does not give you the ability to control the ordering of blobs of code separated by global labels (aka 'atoms' in the linker's terminology). This matters because the atom model enables link-time dead code elimination, profile-based code reordering, etc. My understanding is that ELF toolchains don't have something like this, but it would be unfortunate if TNTC fundamentally prevented something like this from working.

Beyond this, the proposed model has some other issues: code ordering only makes sense within a linker section, but modeling "the table" and "the code" as two different LLVM values (a global value and a function) would mean that the optimizer will be tempted to put them into different sections, do dead code elimination, etc.

He proposes adding a "placebefore"
attribute to global variables (or, similarly, a "placeafter" attribute
for functions). The corresponding example is:

This is a non-starter for a few reasons, but that doesn't mean that there aren't other reasonable options. I'd really like to see the codegen that you guys are after, to try to help come up with another suggestion that isn't a complete one-off hack for GHC. :)

One random question: have you considered placing the table *inside* of the function? If the prologue for the closure was effectively:

Closure:
  jmp .LAfterTable
  .word ...
  .word ...
.LAfterTable:
  pushq %rbp
  ...

then you can avoid a lot of problems. I realize that this is not going to be absolutely as fast as your current TNTC implementation, but processors are *really really* good at predicting unconditional branches, so the cost is probably minimal, and it is likely to be much much faster than not having TNTC at all.

Getting even this to work will not be fully straight-forward, but again I'd like to understand more of what you're looking for from codegen to understand what the constraints are.

-Chris

Hmm writing a blog post about TNTC is beyond the time I have right now.

Here is some high level documentation of the layout of Heap objects in GHC:

heap objects · Wiki · Glasgow Haskell Compiler / GHC · GitLab

With TNTC enabled we generate code for closures of this form:

.text
  .align 8
  .long Main_main1_srt-(Main_main1_info)+0
  .long 0
  .quad 4294967299
  .quad 0
  .quad 270582939663
.globl Main_main1_info
.type Main_main1_info, @object
Main_main1_info:
.Lc1Df:
  leaq -8(%rbp),%rax
  cmpq %r15,%rax
  jb .Lc1Dh

[...]

.data
.globl Main_main1_closure
.type Main_main1_closure, @object
Main_main1_closure:
  .quad Main_main1_info
  .quad 0

Without TNTC we instead generated code of this form:

.text
  .globl Main_main1_entry
  .type Main_main1_entry, @function
Main_main1_entry:
.LFB15:
  leaq -8(%rbp),%rax
  cmpq %r15,%rax
  jb .Lc1Dh

[...]

.data
  .globl Main_main1_info
  .align 8
  .type Main_main1_info, @object
  .size Main_main1_info, 40
Main_main1_info:
  .quad Main_main1_entry
  .quad 0
  .quad 270582939663
  .quad 4294967299
  .quad Main_main1_srt

  .globl Main_main1_closure
  .align 16
  .type Main_main1_closure, @object
  .size Main_main1_closure, 16
Main_main1_closure:
  .quad Main_main1_info
  .quad 0
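The practical difference between the two listings is the extra dependent load: without TNTC, reaching the entry code from a closure takes two dereferences instead of one. A minimal C sketch (types and field names are illustrative, not GHC's actual definitions):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative types only; not GHC's actual definitions. */
typedef struct {
    const void *entry;      /* the extra word: pointer to the entry code */
    uint64_t layout;
    uint64_t closure_type;
} InfoTableNoTNTC;

typedef struct {
    const InfoTableNoTNTC *info;
    uint64_t payload;
} Closure;

/* Without TNTC, entering a closure needs two dependent loads:
   closure -> info table -> entry code. */
static const void *entry_code(const Closure *c) {
    return c->info->entry;
}
```

This is also where the size overhead comes from: every info table carries the extra `entry` word.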

Hopefully that helps you understand what we are looking for if we
replicate the current scheme exactly. Below are some extracts from an
email conversation I had with Simon Marlow (principal GHC developer),
musing briefly about some other schemes for TNTC:

--- Simon Marlow:

"The runtime overhead is mostly from the extra indirect memory access.
For closure entry this is less of an issue now that we have pointer
tagging, but the extra memory access is still required for every
return to a stack frame.

The code size overhead comes from the extra word in every info table
(most info tables are 2 words, and increase to 3 without TNTC), and
also the small increase in code due to the extra memory access."

[...]

"So there are a few alternatives, all of which entail some kind of compromise.

The one you suggest, having both an info pointer and a code pointer in
every closure/stack frame, entails greater memory overheads. I
suspect this would be worse than the current double-indirection
scheme.

The info tables are accessed relatively rarely during mutation, so it
makes sense to look at schemes that prioritise the code pointer. For
example, you could store the info tables in a separate area of memory
entirely, and keep a hash table to map code addresses to info tables
(this is how the C ABI keeps track of metadata associated with stack
frames for C++ exceptions I think, but I don't know much about how it
works).

A slightly more cunning scheme would be to have each code fragment
start with a dummy instruction containing the address of the info
table, e.g.

  movl <infoptr>, %dummy

this should be faster than the double-indirection scheme, but is very
architecture-dependent and probably involves some horrible hacks in
the compiler (could it even be done with LLVM? Maybe with the
mangler)."
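The "separate metadata area" idea Simon mentions can be sketched as a map from code addresses to info tables. This is an illustrative sketch only; a real runtime would want a proper hash table rather than a linear scan:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of keeping info tables in a separate area of memory and
   mapping entry-code addresses to them, consulted only on the (rare)
   paths that need the metadata. A linear scan over a small array is
   enough to show the idea. */
typedef struct {
    uintptr_t code_addr;
    const void *info;
} InfoMapEntry;

static const void *lookup_info(const InfoMapEntry *map, size_t n,
                               uintptr_t code_addr) {
    for (size_t i = 0; i < n; i++)
        if (map[i].code_addr == code_addr)
            return map[i].info;
    return NULL;  /* unknown code address */
}
```

The trade-off is clear from the sketch: the fast path (jumping to the code) needs no table at all, but every metadata access pays for a lookup instead of a fixed negative offset.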

Hmm writing a blog post about TNTC is beyond the time I have right now.

Sure, understandable. I'm surprised someone else hasn't already :)

Here is some high level documentation of the layout of Heap objects in GHC:

heap objects · Wiki · Glasgow Haskell Compiler / GHC · GitLab

With TNTC enabled we generate code for closures of this form:

.text
  .align 8
  .long Main_main1_srt-(Main_main1_info)+0
  .long 0
  .quad 4294967299
  .quad 0
  .quad 270582939663
.globl Main_main1_info
.type Main_main1_info, @object
Main_main1_info:
.Lc1Df:
  leaq -8(%rbp),%rax
  cmpq %r15,%rax
  jb .Lc1Dh

Ok. I'd strongly recommend the approach of generating the table inside the prolog of the function. This means you'd get something like this:

.text
  .align 8
.globl Main_main1_info
.type Main_main1_info, @object
Main_main1_info:
.Lc1Df:
  jmp .Ltmp
  .long Main_main1_srt-(Main_main1_info)+0
  .long 0
  .quad 4294967299
  .quad 0
  .quad 270582939663
.Ltmp:
  leaq -8(%rbp),%rax
  cmpq %r15,%rax
  jb .Lc1Dh

Since the jmp is a fixed 2 bytes (0xEB, tablesize), all references to the table can still be done with trivial pc/RIP-relative addressing within the closure, and you just need one pointer for both the table and the closure data.

If you want to get extra special and tricky, you could be even more devious by storing "Main_main1_info + 2 + table size" as the canonical pointer. If you jump to *that* when dispatching to the closure, then you completely avoid the runtime overhead of the extra unconditional jump and get exactly the same code you're getting with GHC's native code generator.
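As a sanity check, the two tricks above can be expressed in a few lines of C: the short-jump form is exactly two bytes (0xEB, rel8) when the table fits in 127 bytes, and the biased "canonical" pointer is just the blob start plus 2 plus the table size. This is a sketch only, not proposed compiler code:

```c
#include <assert.h>
#include <stdint.h>

/* Encode an x86 short jump (EB rel8) that skips a table of
   `table_size` bytes placed immediately after the instruction.
   rel8 is relative to the end of the 2-byte jump, so the
   displacement is simply the table size. Returns 0 if the table is
   too big for rel8 (it would need the 5-byte E9 rel32 form). */
static int encode_skip_jmp(uint8_t out[2], unsigned table_size) {
    if (table_size > 127)
        return 0;
    out[0] = 0xEB;
    out[1] = (uint8_t)table_size;
    return 1;
}

/* The "biased pointer" trick: the canonical code pointer is the start
   of the blob plus the 2-byte jump plus the table, so dispatch can
   jump straight to the real entry code and skip the jmp entirely. */
static const uint8_t *canonical_entry(const uint8_t *blob_start,
                                      unsigned table_size) {
    return blob_start + 2 + table_size;
}
```

Note that the biasing only works if the consumer knows `table_size` at each use site, which is exactly the constraint discussed further down the thread.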

To access the table in LLVM IR, you'll be generating some truly special (i.e. horrible :)) IR along the lines of (e.g. to load the 4294967299 field):

load (gep (bitcast @Main_main1_info to i64*), 0, 2)

The code generator probably isn't smart enough to turn that into a rip-relative memory access, but adding that should be straight-forward.

The tricky bit will be figuring out how to ensure that the inline asm blob containing the table will come before the standard prolog. Perhaps this can be handled by the existing GHC calling convention, or through creative use of the naked attribute.

-Chris

Hmm writing a blog post about TNTC is beyond the time I have right now.

Sure, understandable. I'm surprised someone else hasn't already :)

Here is some high level documentation of the layout of Heap objects in GHC:

heap objects · Wiki · Glasgow Haskell Compiler / GHC · GitLab

With TNTC enabled we generate code for closures of this form:

.text
  .align 8
  .long Main_main1_srt-(Main_main1_info)+0
  .long 0
  .quad 4294967299
  .quad 0
  .quad 270582939663
.globl Main_main1_info
.type Main_main1_info, @object
Main_main1_info:
.Lc1Df:
  leaq -8(%rbp),%rax
  cmpq %r15,%rax
  jb .Lc1Dh

Ok. I'd strongly recommend the approach of generating the table inside the prolog of the function. This means you'd get something like this:

This is starting to look very similar to how ARM constant islands work, without the extra ugliness from how small the ARM immediate displacements are.

-Jim

This is starting to look very similar to how ARM constant islands work, without the extra ugliness from how small the ARM immediate displacements are.

-Jim

Would there be any reason that this couldn't be seen as an opportunity to move the constant islands pass out of the ARM backend and make the target-independent constant pools (which ARM bypasses completely) more generic?

This is starting to look very similar to how ARM constant islands work, without the extra ugliness from how small the ARM immediate displacements are.

-Jim

Would there be any reason that this couldn't be seen as an opportunity to move the constant islands pass out of the ARM backend and make the target-independent constant pools (which ARM bypasses completely) more generic?

The ARM pass does more than just constant islands (e.g., branch relaxation, TBB/TBH formation), so it's unfortunately not very generic. It might be possible to refactor it to use target hooks and such, but I have my doubts about how practical that will turn out to be. Add in that the pass is very, very fragile, and I get nervous about trying to generalize it too much. Keeping the generic pass pretty brain-dead and letting targets override it seems sane to me.

-Jim

Since the jmp is a fixed 2 bytes (0xEB, tablesize), all references to the table can still be done with trivial pc/RIP-relative addressing within the closure, and you just need one pointer for both the table and the closure data.

Only if the table size fits in 127 bytes. Otherwise it will use the
0xE9-based encoding, which is larger.

Since the jmp is a fixed 2 bytes (0xEB, tablesize), all references to the table can still be done with trivial pc/RIP-relative addressing within the closure, and you just need one pointer for both the table and the closure data.

Only if the table size fits in 127 bytes. Otherwise it will use the 0xE9-based encoding, which is larger.

The table size is variable - between 2 and 6 32- or 64-bit words, where 2-3 is the common case. This means Chris's trick of adjusting the pointer by a fixed amount to point directly to the code doesn't work unless we pad the table to the worst-case size all the time, which wastes a lot of space.

Well, perhaps we could split the table into two, with the branch instruction after the first 2 or 3 words.

Cheers,
  Simon

The table size is variable - between 2 and 6 32- or 64-bit words, where 2-3 is the common case. This means Chris's trick of adjusting the pointer by a fixed amount to point directly to the code doesn't work unless we pad the table to the worst-case size all the time, which wastes a lot of space.

Well... but you can surely emit the size of the table as an argument of
the jump instruction :)

The table size is variable - between 2 and 6 32- or 64-bit words, where 2-3 is the common case. This means Chris's trick of adjusting the pointer by a fixed amount to point directly to the code doesn't work unless we pad the table to the worst-case size all the time, which wastes a lot of space.

Well... but you can surely emit the size of the table as an argument of the jump instruction :)

Sure, the jump instruction is not the problem. But we were hoping to omit the jump instruction and have the compiler add the correct offset to the pointer when it generates code. In order to be able to do that, it has to know what the offset is: so either the offset is fixed, or we need some way to figure out what it is (we don't have that right now; in theory we could track that information but it might be painful to do so).

Cheers,
  Simon

Sure, the jump instruction is not the problem. But we were hoping to omit the jump instruction and have the compiler add the correct offset to the pointer when it generates code. In order to be able to do that, it has to know what the offset is:

label difference? Between start of function and start of the code?

Hi Chris,

One remaining question here is: if the GHC team tries some of these
alternative schemes and finds them unsatisfactory, what is the LLVM
community's feeling about extending LLVM IR to support implementing
TNTC directly? How do you envision this would look at the IR level,
how much work do you think it would be, and, most importantly, do you
feel LLVM would be willing to accept patches for it?

We simply post process the assembly right now to get the desired code
and it works very well but it means we can't use the integrated
assembler which annoys me.

Cheers,
David.

Hi Chris,

One remaining question here is: if the GHC team tries some of these
alternative schemes and finds them unsatisfactory, what is the LLVM
community's feeling about extending LLVM IR to support implementing
TNTC directly?

I'm strongly in favor of getting proper support for TNTC, because your workarounds (while very pragmatic!) give me the shivers :). I'd much rather have a proper solution that works well with the rest of the llvm toolchain.

That said, the design and implementation needs to fit in well with the rest of llvm. I'm not willing to add a crazy completely-special purpose bolt-on extension just to support this, which means that we need to find a way to design it that makes sense in the larger context of llvm.

How do you envision this would look at the IR level, how much work do
you think it would be, and, most importantly, do you feel LLVM would
be willing to accept patches for it?

I really like the idea of adding this as an inline asm blob at the start of a function, and biasing the actual address of the closure based on the size of the table. I'm not 100% confident that it will work (not being very familiar with TNTC) but it seems quite plausible and the impact on LLVM would be quite reasonable (some new calling convention work?)

-Chris

Chris said:

I really like the idea of adding this as an inline asm blob at the start of a function, and biasing the actual address of the closure based on the size of the table. I'm not 100% confident that it will work (not being very familiar with TNTC) but it seems quite plausible and the impact on LLVM would be quite reasonable (some new calling convention work?)

While reading this I had the idea that the LLVM code generator could
watch out for the specific combination of inline asm and calling
convention and strip off the fat. This way the code increase could be
dealt with. OTOH the inline asm would still need to be target
specific, which is very ugly. What about a new intrinsic which holds a
reference to the global and creates the right assembly in the backend?
Since the reference to the global (which is the table) would not be
used otherwise, the linker could drop it, so there would be no code
increase and no redundant global.

Does this make sense?

Cheers,

    Gabor

Hi Chris,

One remaining question here is: if the GHC team tries some of these
alternative schemes and finds them unsatisfactory, what is the LLVM
community's feeling about extending LLVM IR to support implementing
TNTC directly?

I'm strongly in favor of getting proper support for TNTC, because your workarounds (while very pragmatic!) give me the shivers :). I'd much rather have a proper solution that works well with the rest of the llvm toolchain.

That said, the design and implementation needs to fit in well with the rest of llvm. I'm not willing to add a crazy completely-special purpose bolt-on extension just to support this, which means that we need to find a way to design it that makes sense in the larger context of llvm.

OK, great! I have no idea when we on the GHC side (me) will get time
to tackle this, so don't hold your breath. Good to know where you guys
stand, though. We began this conversation because a GSoC student is
interested in tackling it, so we may progress down that path.

How do you envision this would look at the IR level, how much work do
you think it would be, and, most importantly, do you feel LLVM would
be willing to accept patches for it?

I really like the idea of adding this as an inline asm blob at the start of a function, and biasing the actual address of the closure based on the size of the table. I'm not 100% confident that it will work (not being very familiar with TNTC) but it seems quite plausible and the impact on LLVM would be quite reasonable (some new calling convention work?)

-Chris

I personally would prefer something platform-agnostic like Gabor
suggests, but I'm not fussed on the issue; whatever works (and isn't
a hack) is my goal.

Cheers,
David