[RFC] lld: Dropping TLS relaxations in favor of TLSDESC

tl;dr: TLSDESC has solved most of the problems of the formerly inefficient TLS access models, so I think we can drop TLS relaxation support from lld.

lld’s code to handle relocations is a mess; the code consists of a lot of cascading "if"s and needs a lot of prior knowledge to understand what it is doing. Honestly it is head-scratching and needs serious refactoring. I’m trying to simplify it to make it manageable again, and I’m now focusing on the TLS relaxations.

Thread-local variables in ELF are complicated. The ELF TLS specification [1] defines four different access models: General Dynamic, Local Dynamic, Initial Exec and Local Exec.

I’m not going into the details of the spec here, but the reason we have so many different models for the same feature is that they differ in speed, and we have to use the (formerly) slow models when we know less about the variables' run-time memory layout at compile-time or link-time. So, there was a trade-off between generality and performance. For example, if you want to use thread-local variables in a dlopen(2)'able DSO, you need to choose the slowest model. If a linker knows at link-time that a more restricted access model is applicable (e.g. if it is linking a main executable, it knows for sure that it is not creating a DSO that will be used via dlopen), it is allowed to rewrite the instructions that load thread-local variables so that they use a faster access model.
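
For readers who have not looked at the models before, here is a minimal C sketch that pins one variable to each of the four models using the GCC/Clang tls_model attribute. The variable names and sum_tls are made up for illustration. Which instruction sequence each access actually becomes depends on the target, the compiler flags, and on whether the linker relaxes it.

```c
/* Each variable is pinned to one access model with the GCC/Clang
   tls_model attribute. Without the attribute the compiler picks a model
   from -fPIC, the symbol's visibility, and whether it sees the definition. */
__thread int gd_var __attribute__((tls_model("global-dynamic"))) = 1;
__thread int ld_var __attribute__((tls_model("local-dynamic")))  = 2;
__thread int ie_var __attribute__((tls_model("initial-exec")))   = 3;
__thread int le_var __attribute__((tls_model("local-exec")))     = 4;

/* Hypothetical helper: touches all four variables so that each access
   sequence shows up in the generated code. */
int sum_tls(void) {
    return gd_var + ld_var + ie_var + le_var;
}
```

Compiling this with -S and comparing the four accesses in sum_tls, then disassembling the linked executable, is one way to see which sequences the linker rewrote.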

What makes the situation more complicated is the presence of a new method of accessing thread-local variables. After the ELF TLS spec was defined, TLSDESC [2] was proposed and implemented. With that method, the General Dynamic and Local Dynamic models (which were pretty slow in the original spec) are as fast as the much faster Initial Exec model. TLSDESC doesn’t have a trade-off between dlopen’ability and access speed. According to [2], it also reduces the size of generated DSOs. So it seems like TLSDESC is strictly a better way of accessing thread-local variables than the old way, and the thread-local variable performance problem (which the ELF TLS spec was trying to address by defining four different access models and relaxations between them) doesn’t seem to be a real issue anymore.
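
To illustrate the idea, here is a toy C model of a TLS descriptor. It is only a sketch: every name in it is hypothetical, the "thread pointer" is a plain array, and it ignores the special calling convention that [2] defines for the resolver. The point is that the access sequence is identical whether the variable lives in the static TLS block or in a block allocated later for a dlopen'ed module; only the resolver stored in the descriptor differs.

```c
#include <stddef.h>

/* Toy model of a TLS descriptor: a resolver entry point plus one argument
   word. All names here are hypothetical, not taken from [2]. */
struct tls_desc {
    ptrdiff_t (*resolve)(const struct tls_desc *);
    size_t arg;
};

/* Stand-in for one thread's TLS memory; thread_mem plays the role of the
   thread pointer. Bytes 0..63 model the static TLS block, bytes 64..127 a
   block allocated later for a dlopen'ed module. */
static char thread_mem[128];
enum { DLOPENED_BLOCK_START = 64 };

/* Static case: the argument already is the offset from the thread pointer. */
ptrdiff_t resolve_static(const struct tls_desc *d) {
    return (ptrdiff_t)d->arg;
}

/* Dynamic case: find the variable inside the module's own block, then
   report where it is relative to the thread pointer. */
ptrdiff_t resolve_dynamic(const struct tls_desc *d) {
    char *addr = thread_mem + DLOPENED_BLOCK_START + d->arg;
    return addr - thread_mem;
}

/* The access sequence is the same in both cases: call through the
   descriptor and add the result to the thread pointer. */
void *tls_addr(const struct tls_desc *d) {
    return thread_mem + d->resolve(d);
}
```

In this model the dynamic loader would choose which resolver to store in each descriptor, which is why the decision can be made at load time and not at static link time.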

lld supports all TLS relaxations as defined by the ELF TLS spec. I accepted the patches to implement all these features without thinking hard enough about it, but on second thought, that was likely the wrong decision. Being a new linker, we don’t need to trace the history of the evolution of the ELF spec. Instead, we should have implemented whatever makes sense now.

So, I’d like to propose we drop TLS relaxations from lld, including Initial Exec → Local Exec. Dropping IE→LE is strictly speaking a degradation, but I don’t think that is important. We don’t have optimizations for much more frequent variable access patterns such as locally-accessed variables that have GOT slots (for which in theory we can skip the GOT access because the GOT slot values are known at link-time), so it is odd that we are only serious about TLS variables, which are usually much less important. Even if it turns out that we want it after implementing more important relaxations, I’d like to drop it for now and reimplement it in a different way later.

This should greatly simplify the code, because it not only reduces the complexity and amount of the existing code but also reduces the amount of knowledge you need to read the code, without sacrificing the performance of lld-generated files in practice.

Thoughts?

[1] https://www.akkadia.org/drepper/tls.pdf

[2] http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt

I don't think we can do it.

The main thing we have to keep in mind is that not everyone is using
TLSDESC. In fact, clang doesn't even support -mtls-dialect=gnu2.

If everyone switches to TLSDESC, then I am OK with dropping
optimizations for the old model.

But even with TLSDESC we still need linker relaxations. The TLSDESC idea
solves some of the GD -> IE cost in the case where the .so is not
dlopened, but that is it. Note that AArch64, which is TLSDESC-only, has
relaxations.

So I am strongly against removing either non-TLSDESC support or support
for the relaxations.

Cheers,
Rafael

> I don't think we can do it.
>
> The main thing we have to keep in mind is that not everyone is using
> TLSDESC. In fact, clang doesn't even support -mtls-dialect=gnu2.

Oh, okay, that is a surprise to me. There's no reason not to support that
and make it the default; I didn't even try that. We definitely should
support that.

> If everyone switches to TLSDESC, then I am OK with dropping
> optimizations for the old model.
>
> But even with TLSDESC we still need linker relaxations. The TLSDESC idea
> solves some of the GD -> IE cost in the case where the .so is not
> dlopened, but that is it. Note that AArch64, which is TLSDESC-only, has
> relaxations.
>
> So I am strongly against removing either non-TLSDESC support or support
> for the relaxations.

It's still pretty arguable. By default, compilers use General Dynamic model
with -fpic, and Initial Exec without -fpic. lld doesn't do any relaxation
if -shared is given. So, if you are creating a DSO, thread-local variables
in the DSO are accessed using Global Dynamic model. No relaxations are
involved.

If you are creating an executable and if your executable is not
position-independent, you're using Initial Exec model by default which is
as fast as variables accessed through GOT. If you really want to use Local
Exec model, you can pass -ftls-model=local-exec to compilers.

If you are creating a position-independent executable and you want to use
Initial Exec or Local Exec, you can do that by passing
-ftls-model={initial-exec,local-exec} to compilers.

So I don't see a strong reason to do complicated instruction rewriting in
the linker. I feel more like we should do whatever we are instructed to do
by the command line options and input object files. You are for example free
to pass the -fPIC option to create object files and still let the linker
create a non-PIC executable, even though that combination doesn't make
much sense and produces a slightly inefficient binary. If you don't like it,
you can fix the compiler options. Thread-local variables can be considered
in the same way, no?

Rui Ueyama <ruiu@google.com> writes:

>> So I am strongly against removing either non-TLSDESC support or support
>> for the relaxations.
>
> It's still pretty arguable. By default, compilers use General Dynamic model
> with -fpic, and Initial Exec without -fpic.

It is more complicated than that. You can get all 4 modes with clang:

-------------------------------
__thread int bar = 42;
int *foo(void) { return &bar; }
-------------------------------
without -fPIC: local exec.

-------------------------------
extern __thread int bar;
int *foo(void) { return &bar; }
-------------------------------
without -fPIC: initial exec.
with -fPIC: general dynamic

-------------------------------
__attribute__((visibility("hidden"))) extern __thread int bar;
int *foo(void) { return &bar; }
-------------------------------
with -fPIC: local dynamic.

The other case is

-------------------------------
__attribute__((visibility("hidden"))) extern __thread int bar;
int *foo(void) { return &bar; }
-------------------------------
without -fPIC, which chooses local exec.

> lld doesn't do any relaxation
> if -shared is given. So, if you are creating a DSO, thread-local variables
> in the DSO are accessed using Global Dynamic model. No relaxations are
> involved.

There are not a lot of opportunities there. If one patches one access at
a time, LD is as expensive as GD. The linker also doesn't know if the .so
will be used with dlopen or not, so it cannot relax to IE. I guess a
linker could have a command line option for the second part.

Now that I spell that out, it is easy to see TLSDESC's big
advantage. It can optimize the case the static linker cannot.

Because of this fact, DSOs that use thread-local variables such as libc are
already compiled with -ftls-model=initial-exec. So the authors of DSOs in
which the performance of thread-local variables matters are already aware
of the issue and how to work around it.

> If you are creating an executable and if your executable is not
> position-independent, you're using Initial Exec model by default which is
> as fast as variables accessed through GOT. If you really want to use Local
> Exec model, you can pass -ftls-model=local-exec to compilers.

But then all the used variables have to be defined in the same
executable. You can't have even one from a shared library (think errno).

Not really -- you can still use Local Exec on a per-variable basis using the
visibility attribute. I don't think that we can observe a noticeable
difference in performance between Initial Exec and Local Exec except in a
synthetic benchmark, though.

The nice thing about linker relaxations is that they are very user
friendly. The linker is the first point in the toolchain where some
useful fact is known, and it can optimize the result with no user
intervention.

I think I agree with this point. Automatic linker code relaxation is
convenient and if it makes a difference, we should implement that. But I
doubt that TLS relaxation is actually effective. George implemented them
because there's a spec defining how to relax them, and I accepted the
patches without thinking hard enough, but I didn't see a convincing
benchmark result (or even a non-convincing one) that shows that these
relaxations actually make real-world programs faster. Do you know of
any? It is funny that even the creator of TLSDESC found that their
optimization didn't actually make NPTL faster, as mentioned in the
"Conclusion" section of
http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt.

So I don't think I'm proposing we simplify code by degrading user's code.
It feels more like we are making too much effort on something that doesn't
produce any measurable difference in real life.

Date: Tue, 7 Nov 2017 18:27:37 -0800
From: Rui Ueyama via llvm-dev <llvm-dev@lists.llvm.org>

> tl;dr: TLSDESC has solved most problems in formerly inefficient TLS access
> models, so I think we can drop TLS relaxation support from lld.

Not sure what the impact of this would be. Does this mean that some
TLS relocations will no longer be supported? Or is it that they just
won't be optimized? How about static binaries? Don't they rely on
the local exec model?

Does this affect linking code generated by older compilers (say GCC
4.2.1) in any way?

I've skipped over the description and I have some difficulty sharing
this conclusion. I don't see how it makes any significant difference. I
also don't know if any system beyond glibc implements it.

Side note: position independent executables that are properly compiled
behave like non-position independent executables.

Side note 2: I strongly question the assertions about frequency of
dlopen vs direct linking from the TLSDESC paper. Quite a few hacks on
the dynamic linker side are a direct result of people wanting to dlopen
libGL from scripting languages like Python.

Joerg

I'm assuming it means the instruction sequences wouldn't be optimized,
I don't think it would be practical to remove support for the
relocations.

For Arm (and I think Mips is similar) there isn't any TLS relaxation
of instructions, as the TLS relocations act on data and not
instructions. There are some cases where dynamic relocations can be
omitted; for example, the module-id of an executable is defined to be 1,
so there is no need for the dynamic linker to fill this in. For static
linking the linker knows the module and the offsets of all the TLS symbols,
so it can resolve all the dynamic relocations. I don't know off the
top of my head whether this would apply to other architectures,
although I think the general principles should hold the same. The last
time I looked, relaxation was the technique used to support static
linking on targets other than Arm and Mips.
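
As a rough illustration of the point about the module-id, here is a toy C model of the (module, offset) pair that a General Dynamic access hands to __tls_get_addr. The struct layout, the table and every function name are assumptions made for this sketch and are not lld or libc code. It only shows that once the module is known to be 1 and the offset is known, the pair can be filled in at link time with nothing left for the dynamic linker to do.

```c
#include <stddef.h>

/* Hypothetical model of the argument a General Dynamic access passes to
   __tls_get_addr: a (module ID, offset) pair stored in the GOT. */
struct tls_index {
    size_t module;
    size_t offset;
};

enum { MAX_MODULES = 4, BLOCK_SIZE = 32 };

/* One TLS block per module for a single thread; index 0 is unused. */
static char tls_blocks[MAX_MODULES][BLOCK_SIZE];

/* Toy per-thread table mapping a module ID to that module's TLS block. */
static char *dtv[MAX_MODULES] = {
    NULL, tls_blocks[1], tls_blocks[2], tls_blocks[3]
};

/* What a __tls_get_addr-style lookup boils down to in this model. */
void *model_tls_get_addr(const struct tls_index *ti) {
    return dtv[ti->module] + ti->offset;
}

/* For an executable the module ID is defined to be 1, so a linker that
   also knows the symbol's offset can fill in both words itself. */
struct tls_index link_time_index(size_t offset_in_exe) {
    struct tls_index ti = { 1, offset_in_exe };
    return ti;
}
```

For a shared library the module ID is not known until load time, which is the case where the dynamic relocations have to stay.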

I have a vague memory of the OpenGL folk being sensitive to TLS
performance, particularly as the library is often shared. I think that
TLS relaxation isn't going to show up in many traditional benchmarking
suites, as much of the performance critical code is going to be in the
application, which is unlikely to have much TLS in it. I'm thinking
that it would need something like a real-world application that makes
heavy use of shared libraries with TLS (games, web browsers or perhaps
HPC?).

Given that getting convincing data either way about the impact of TLS
relaxation could be difficult we should err towards keeping it.

Peter

Hi Rui,

I don’t think our team can support dropping TLS relaxations. As Peter suspected, we have noticed significant run-time performance gains in games when using these relaxations over not using them, with our proprietary linker. Also, for our platform dynamic libraries are usually used via our equivalent of dlopen(), not pre-loaded as is the idea behind TLSDESC, from my understanding, so even if TLSDESC were supported on our platform, there would be no significant benefit gained for us. Finally, we cannot ask our customers to change their compilation settings to use better TLS models, because programs may use static libraries which are intended for use in both dynamic libraries and main executables, and so will lose out if they do not have access to TLS relaxations.

Regards,

James

Rui Ueyama <ruiu@google.com> writes:

>> > If you are creating an executable and if your executable is not
>> > position-independent, you're using Initial Exec model by default which is
>> > as fast as variables accessed through GOT. If you really want to use Local
>> > Exec model, you can pass -ftls-model=local-exec to compilers.
>>
>> But then all the used variables have to be defined in the same
>> executable. You can't have even one from a shared library (think errno).
>
> Not really -- you can still use Local Exec on a per-variable basis using the
> visibility attribute. I don't think that we can observe a noticeable
> difference in performance between Initial Exec and Local Exec except in a
> synthetic benchmark, though.

There is nothing that the linker can do that the compiler could not have
done in the first place. The point is that to switch to lld and keep
performance, users should not have to annotate all TLS variables with
tls-model.

>> The nice thing about linker relaxations is that they are very user
>> friendly. The linker is the first point in the toolchain where some
>> useful fact is known, and it can optimize the result with no user
>> intervention.
>
> I think I agree with this point. Automatic linker code relaxation is
> convenient and if it makes a difference, we should implement that. But I
> doubt that TLS relaxation is actually effective. George implemented them
> because there's a spec defining how to relax them, and I accepted the
> patches without thinking hard enough, but I didn't see a convincing
> benchmark result (or even a non-convincing one) that shows that these
> relaxations actually make real-world programs faster. Do you know of
> any? It is funny that even the creator of TLSDESC found that their
> optimization didn't actually make NPTL faster, as mentioned in the
> "Conclusion" section of
> http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt.
>
> So I don't think I'm proposing we simplify code by degrading user's code.
> It feels more like we are making too much effort on something that doesn't
> produce any measurable difference in real life.

*PLEASE* let us keep it. It is bad enough that we are regressing
performance in the name of having code that you find nicer. It would be
really annoying to see us drop a working feature just to reduce our
code a bit.

The code is working, please let it be!

At the very least we should keep it until we are in a position to
actually measure it. As it is, this is just guesswork. We would need
*much* bigger adoption before we could measure this.

Cheers,
Rafael

Please take it easy. :) I'm not saying that I'm going to remove it.
Instead, I'm bringing a (possibly crazy) idea to the table to discuss, and
that is IMO a good thing. Part of the reason why lld is successful is
its relatively radical design choices such as Windows-ish library
semantics, which might have seemed a somewhat crazy idea. So, I think "stop,
think and re-evaluate what has traditionally been done" is what we are good
at, regardless of the conclusion of the assessment. And as you know we
(including you) have been making reasonable decisions on technical design
choices.

> The code is working, please let it be!

So, it looks like there are programs where TLS relaxation actually matters. It is interesting that both examples mentioned in this thread are graphics-related (OpenGL and games). I wonder if it is a coincidence or if it is common practice to use thread-local variables heavily in graphics. I haven’t written any games before, so it is likely that I don’t know some basics in that area.

In the OpenGL case it is primarily an effect of retrofitting thread-safety
into existing APIs. Just like some systems retrofit many of the
non-reentrant libc functions by using thread-local storage for the
buffers.
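
A minimal sketch of that retrofitting pattern, using a made-up function (describe_code is not a real libc or OpenGL entry point): the API keeps returning a pointer to an internal buffer, but the buffer becomes thread-local, so concurrent callers no longer overwrite each other's result.

```c
#include <stdio.h>

/* Hypothetical legacy-style API that returns a pointer to an internal
   buffer. Declaring the buffer __thread gives every thread its own copy,
   which makes the function thread-safe without changing its signature. */
const char *describe_code(int code) {
    static __thread char buf[32];
    snprintf(buf, sizeof buf, "code %d", code);
    return buf;
}
```

Every call into such an API now involves a TLS access, which would be one reason a shared library built this way is sensitive to the cost of the access model.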

Joerg

I saw a very similar-looking bug the other night: the linker was trying to make the TLS reloc 32-bit, things were set up for regular shared, and we got ugly segfaults. The best result I got in that case was with -fPIE and -model=medium (undocumented in the Intel assembler, but it’s not doing anything very complicated). The 32-bit TLS reloc changed to an X86_64_64 reloc… and yes, it did appear to change it to a 64-bit reloc, probably designed for situations like this.