Named register variables GNU-style, deux

Hello all,

Recently on this list (as of last month), Renato Golin of Linaro
posted a thread entitled "Named register variables, GNU-style"[1].
This thread concerned the implementation of the GNU Register variables
feature for LLVM. I'd like to give some input on this, as a developer
of the Glasgow Haskell Compiler, as we are a user of this feature.
Furthermore, our use case is atypical - it is efficiency oriented, not
hardware oriented (e.g. I believe the Linux APIC x86 subsystem uses
them for hardware, as well as MIPS Linux as mentioned). Bear with me
on the details.

I'll say up front our use case alone shouldn't sway major decisions,
nor am I screaming for the feature - I can sleep at night. But I found
there was a surprising lack of highlighted use cases, and perhaps in
the future if things change, these points can have some insight.

The summary is this: we use this feature in our garbage collector to
steal a register that is solely dedicated to a thread-local storage
for our multicore runtime system. This thread local data structure is
possibly the most performance sensitive variable in the entire
multicore system, to the point where we have spent significant time
optimizing every read or write, load or spill that could affect it.

Furthermore, the GC is tied to the threading system in several ways
and is parallel itself - a loss in performance here directly equates
to a large overall performance loss for every parallel, multicore
program.

The lack of this feature is now causing us significant problems,
particularly on Mac OS X, as it now uses Clang by default.

You would think that considering this variable is (p)thread local, we
could just use a __thread variable, or pthread_{get,set}specific, to
manage it. But on OS X, both of these equate to an absolutely huge
performance loss - upwards of 25%. That is realistically unacceptable,
but we've had to deal with it.

On Linux, the situation isn't so bad. The ABI allows a __thread
variable to just be stored at a direct offset to the %fs segment,
meaning that a read/write is still very fast. In fact, __thread is
preferable on i386 Linux: the pathetic number of registers means
stealing one is a loss, not a win.

The situation is not so good on x86_64 OS X. Generally we would steal
r13 on a 64-bit platform. But that's not allowed with Clang.
Furthermore, the __thread implementation on OS X is terrible compared
to Linux: while internally it uses %gs-relative slots for a set of
predefined keys, and those same slots back __thread and
pthread_{get,set}specific, a read or write to a __thread variable does
NOT translate to a direct read/write. It translates to an indirect
call through %rdi.

In other words, this code:

#include <stdio.h>
#include <stdlib.h>

__thread int foo;

int main(int ac, char* av[]) {
  if (ac < 2) foo = 10;
  else foo = atoi(av[1]);

  printf("foo = %d\n", foo);

  return 0;
}

Translates to this on x86_64 Linux with Clang:

(gdb) disassemble main
Dump of assembler code for function main:
   0x00000000004005b0 <+0>: push %rax
   0x00000000004005b1 <+1>: mov %rsi,%rax
   0x00000000004005b4 <+4>: cmp $0x2,%edi
   0x00000000004005b7 <+7>: mov $0xa,%esi
   0x00000000004005bc <+12>: jl 0x4005d1 <main+33>
   0x00000000004005be <+14>: mov 0x8(%rax),%rdi
   0x00000000004005c2 <+18>: xor %esi,%esi
   0x00000000004005c4 <+20>: mov $0xa,%edx
   0x00000000004005c9 <+25>: callq 0x4004b0 <strtol@plt>
   0x00000000004005ce <+30>: mov %rax,%rsi
   0x00000000004005d1 <+33>: mov %esi,%fs:0xfffffffffffffffc
   0x00000000004005d9 <+41>: mov $0x400694,%edi
   0x00000000004005de <+46>: xor %eax,%eax
   0x00000000004005e0 <+48>: callq 0x400480 <printf@plt>
   0x00000000004005e5 <+53>: xor %eax,%eax
   0x00000000004005e7 <+55>: pop %rdx
   0x00000000004005e8 <+56>: retq

It translates to this on x86_64 OS X with Clang:

(lldb) disassemble -m -n main
a.out`main
a.out[0x100000f20]: pushq %rbp
a.out[0x100000f21]: movq %rsp, %rbp
a.out[0x100000f24]: pushq %rbx
a.out[0x100000f25]: pushq %rax
a.out[0x100000f26]: movl $0xa, %ebx
a.out[0x100000f2b]: cmpl $0x2, %edi
a.out[0x100000f2e]: jl 0x100000f3b ; main + 27
a.out[0x100000f30]: movq 0x8(%rsi), %rdi
a.out[0x100000f34]: callq 0x100000f60 ; symbol stub for: atoi
a.out[0x100000f39]: movl %eax, %ebx
a.out[0x100000f3b]: leaq 0xde(%rip), %rdi ; foo
a.out[0x100000f42]: callq *(%rdi)
a.out[0x100000f44]: movl %ebx, (%rax)
a.out[0x100000f46]: leaq 0x43(%rip), %rdi ; "foo = %d\n"
a.out[0x100000f4d]: xorl %eax, %eax
a.out[0x100000f4f]: movl %ebx, %esi
a.out[0x100000f51]: callq 0x100000f66 ; symbol stub for: printf
a.out[0x100000f56]: xorl %eax, %eax
a.out[0x100000f58]: addq $0x8, %rsp
a.out[0x100000f5c]: popq %rbx
a.out[0x100000f5d]: popq %rbp
a.out[0x100000f5e]: ret

Note the indirect call through %rdi on OS X.

Again, the performance difference between these two snippets cannot be
overstated. And pthread_{get,set}specific do even worse because
they're not inlined at all (remember, we're talking a 25-30% loss for
all programs.)

There are details here on a bug of ours[2], where I have tracked and
examined this issue for the past year or so. We are getting desperate
to fix this for OS X users - to the point of inlining XNU internals to
either use 'predefined keys' (e.g. OS X has special 'fast TLS' keys
for WebKit on some versions) or inline the 'fast path' of
pthread_{get}specific to do a direct read/write.

We've tried many combinations of compiler settings and tweaks to try
and minimize these effects in the past, but still, a register variable
is essentially superior to all other solutions we've found, especially
on x86_64.

Even passing the thread-local variable around directly as an argument
to every single function is slower - because the function bodies are
so large, a spill will inevitably occur somewhere, causing loads (or
other spills) to interfere with a read/write later. Even combined with
manually lowering/lifting reads/writes, it still results in minor
losses and doesn't guarantee the compiler won't optimistically undo
that. It's not as bad as 30% though, more like 5-7% last I checked. But
that's still significant, still slower, and it's far uglier for us to
implement, and penalizes Linux unfairly unless it gets even uglier.

So, that's the long and short of it. Now we get to LLVM's implementation.

First, obviously: this need conflicts with Renato's proposal that
only non-allocatable registers be made available.[3] We absolutely
have to have GPRs available, and nothing else makes sense for our use
case.

Chandler was strongly against this sort of idea, and likely with good
reason (I don't know anything about parameterizing the LLVM register
set over the set of reserved registers from a user. I don't know
anything about the designs. Sounds like madness to me, too). I have no
input on logistics. But we do need it, otherwise this feature is
totally useless to us.

Also, in the last set of discussions, Joerg Sonnenberger proposed[4]
that these registers are reserved - possibly at the global
(translation unit) level or local (function body) level. We also
require this - temporarily spilling GPRs otherwise will almost
certainly result in the same sort of problem as using a function
argument - they will always collide in ways we cannot control or
predict. We *do* actually care about every single read, write, spill
and load.

Renato replied that the need for this is just a workaround for an
inefficient compiler - and he's right, it is. Otherwise, we wouldn't
do it. :) And based on our observations, I'm sorry to say I don't
think GCC or LLVM are going to magically eliminate that difference of
5-7% loss we saw *consistently* any time soon. It's a realistic
difference to eliminate with enough work - but those wins don't ever
come easy, I know, and our code base is large and complex. That's
going to be a lot of work (but I know you're all smart enough for it).

Again, to recap, GHC alone probably is not enough of a compelling use
case by itself to support these two points on the design - which seem
somewhat radical on review of the original threads. Our needs are
atypical for sure. But I hope they serve as a useful input while you
consider the design space.

And also, I apologize in advance if this is considered beating a dead horse.

Thanks.

[1] http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-March/071503.html
[2] GHC issue #7602, "Threaded RTS performing badly on recent OS X (10.8?)"
[3] http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-March/071561.html
[4] http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-March/071620.html

In practice, pthread_getspecific() on x86-64 on Mac OS X is just a very
simple assembly routine:

  movq %gs:_PTHREAD_TSD_OFFSET(,%rdi,8),%rax
  ret

For Native Client on Mac x86-64, we check that pthread_getspecific()
contains the code above, and we inline the %gs access into NaCl's runtime
code (reading the value of _PTHREAD_TSD_OFFSET from pthread_getspecific()'s
code).

You can find the code for doing that here:
https://src.chromium.org/viewvc/native_client/trunk/src/native_client/src/trusted/service_runtime/arch/x86_64/nacl_tls_64.c?revision=11149

NaCl's reason for doing this is that NaCl needs to be able to read a
thread-local variable in a context when there's no stack available for
calling pthread_getspecific(). (We could pre-allocate a pool of stacks and
then allocate a stack from this pool with an atomic operation, then call
pthread_getspecific() on that stack. But that's a lot more complicated,
and slower.)

This will of course break if OS X's implementation of pthread_getspecific()
changes (other than to change _PTHREAD_TSD_OFFSET). Hopefully, if that
ever happens, OS X will have already started providing better thread-local
variables that can be accessed without calling a function, like what
Linux/ELF and Windows provide. :)

This is hacky, but it should be completely reliable if
pthread_getspecific() matches the expected pattern, because it's not like
the code for pthread_getspecific() is going to change underneath you.

You could use the same trick, and fall back to calling
pthread_getspecific() if the code it contains doesn't match the pattern you
expect.

Cheers,
Mark

> Recently on this list (as of last month), Renato Golin of Linaro
> posted a thread entitled "Named register variables, GNU-style"[1].

Hi Austin,

FYI, this is the (now outdated) first proposal on the non-allocatable registers:

http://reviews.llvm.org/D3261

I read your email to the end and I understand why this is not good enough.

> Again, the performance difference between these two snippets cannot be
> understated. And pthread_{get,set}specific do even worse because
> they're not inlined at all (remember, we're talking 25-30% loss for
> all programs.)

It's for problems like these that the named GPRs feature exist (not
the stack register trick), but there are two issues that need to be
solved, and I'm solving one at a time.

> First, obviously, is that this need precludes Renato's proposal that
> only non-allocatable registers must be available.[3] We absolutely
> have to have GPRs available, and nothing else makes sense for our use
> case.

I believe we really should have GPRs in the named register scheme in
the future, but there are other problems that need to be dealt with
first, as Chandler exposed.

This is not flogging a dead horse. It's a feature that I believe is
important - not because it's heavily used by many people, but because
it's sparsely used by critical parts of very low-level software that
need the extra edge to give *all* dependent software a big
performance boost. People writing high-level software should not use
it (like inline asm, etc.), or they will suffer the consequences.

We need to do the following steps, in order:

1. create the representation in IR (the intrinsics, metadata, etc),
and to lower it on some back-ends without any special reservation. (my
current work)

2. make it possible to add GPRs to the reserved list of the allocator
on a module scope (from those metadata nodes) and create some tests
with edge cases (especially ABI-related registers) to make sure the
code generation won't go crazy or pervert the ABI, creating
error/warning messages for the cases where it does.

3. move the code in with a flag to enable it, and let it run for a few
months/releases.

4. when the dust settles, make it default on.

I don't think enabling this feature by default will have any impact on
current code, since if you don't use it, there's no difference. But
the worry is that code that used it (and will be now compiled with
Clang/LLVM) will perform badly/wrong. Since this is an experimental
feature, on very specific code, I think the problems we'll see will be
manageable.

I'll get to step 1 next week, and we should start thinking about the
GPRs issue right afterwards. Since that's not particularly important
for me right now (the kernel doesn't need it), I may slow down a bit,
so your help (and those that need it) will be highly appreciated.

I may be wrong, but from what I've seen of the reservation mechanism,
it shouldn't be too hard to do it dynamically on a module-level. But I
only want to start thinking about it when I finish step 1.

Does that make sense?

cheers,
--renato

Thanks for the excellent write-up. Just wanted to clarify...

Austin Seipp <aseipp@pobox.com> writes:

> Recently on this list (as of last month), Renato Golin of Linaro
> posted a thread entitled "Named register variables, GNU-style"[1].
> This thread concerned the implementation of the GNU Register variables
> feature for LLVM. I'd like to give some input on this, as a developer
> of the Glasgow Haskell Compiler, as we are a user of this feature.
> Furthermore, our use case is atypical - it is efficiency oriented, not
> hardware oriented (e.g. I believe the Linux APIC x86 subsystem uses
> them for hardware, as well as MIPS Linux as mentioned).

The MIPS case sounds pretty similar to yours: it sets aside a specific
GPR to hold thread-local information. The main difference is that MIPS
Linux was in the lucky position of being able to use a nonallocatable
GPR, since $gp ($28) is normally reserved for ABI features that Linux
doesn't need.

Thanks,
Richard