How can I find what decides whether TLS variable access is optimized into a register-cached address?

Problem

The problem I currently encounter is that with distributed ThinLTO, the compiler accesses TLS variables differently than with non-distributed ThinLTO or no LTO, resulting in some abnormal behavior. I suspect this has nothing to do with LTO itself; rather, some optimization is triggered when distributed ThinLTO is enabled. Where should I start to find out which optimization triggers this scenario? I could not find anything useful in the output of -mllvm -debug, so at the moment I don't even know which module is responsible.

Example

// file1.cc
void bar(int **);
void foo();
__thread int *wmh_tls_x = nullptr;

int main() {
     // wmh_tls_x = new int(1);
     bar(&wmh_tls_x);
     foo();
     int *wmh_tls_a = wmh_tls_x;
     bar(&wmh_tls_a);
     return 0;
}

Assembly

0000000000201820 <main>:
; main():
; file1.cc:5
   201820: 53 pushq %rbx
   201821: 48 83 ec 10 subq $16, %rsp
; file1.cc:7
   201825: 66 66 66 64 48 8b 04 25 00 00 00 00 movq %fs:0, %rax
   201831: 48 89 c3 movq %rax, %rbx
   201834: 48 8d b8 f8 ff ff ff leaq -8(%rax), %rdi
   20183b: e8 30 00 00 00 callq 0x201870 <bar(int**)>
; file1.cc:8
   201840: e8 6b 00 00 00 callq 0x2018b0 <foo()>
; file1.cc:9
   201845: 48 8b 83 f8 ff ff ff movq -8(%rbx), %rax
   20184c: 48 89 44 24 08 movq %rax, 8(%rsp)
   201851: 48 8d 7c 24 08 leaq 8(%rsp), %rdi

file1.cc:7 accesses the TLS variable through %fs:0, while file1.cc:9 accesses it through the address previously cached in the %rbx register. With non-distributed ThinLTO, the file1.cc:9 access still goes through %fs:0.

For the simple demo above:

  1. In no-LTO mode, file1.cc:9 accesses the TLS variable through the address cached in a register. (In the actual project the access goes through %fs:0, but I could not construct a demo that reproduces that.)
  2. In non-distributed ThinLTO mode, file1.cc:9 accesses the TLS variable through %fs:0.
  3. In distributed ThinLTO mode, file1.cc:9 accesses the TLS variable through %fs:0.

Reproduction script

main_str='void bar(int **);
void foo();
__thread int *wmh_tls_x = nullptr;

int main() {
     // wmh_tls_x = new int(1);
     bar(&wmh_tls_x);
     foo();
     int *wmh_tls_a = wmh_tls_x;
     bar(&wmh_tls_a);
     return 0;
}


'

lib_str='#include <stdio.h>

void bar(int **x_) {
     if ((*x_) == NULL)
         *x_ = new int(1);
     int *x = *x_;
     printf("bar %d\n", *x);
     ++(*x_);
}

extern __thread int *wmh_tls_x;

void foo() {
     printf("foo %d\n", *wmh_tls_x);
     ++(*wmh_tls_x);
}
'

echo "$main_str" > file1.cc
echo "$lib_str" > file2.cc
export LLVM_LTO_PASS_REMARKS=all
CLANG=clang++

set -x
cxxflags="-O1 -g -fPIC"
# file1.cc is built with ThinLTO; file2.cc is built without LTO
${CLANG} -c file1.cc -flto=thin ${cxxflags}
${CLANG} -c file2.cc ${cxxflags}


# base bin
${CLANG} -c file1.cc ${cxxflags} -o file1.nolto.o
${CLANG} -fuse-ld=lld file1.nolto.o file2.o -o no_lto
llvm-objdump -ldC file1.nolto.o > file1.nolto.s
llvm-objdump -ldC no_lto > no_lto.s


# no_distributed_lto_bin
${CLANG} -flto=thin -fuse-ld=lld file1.o file2.o -o no_distributed_lto_bin
llvm-objdump -ldC no_distributed_lto_bin > no_distributed_lto_bin.s



# distributed_lto_bin
${CLANG} -flto=thin -fuse-ld=lld -Wl,--thinlto-emit-imports-files,--thinlto-index-only file1.o file2.o
${CLANG} ${cxxflags} -o file1.native.o -x ir file1.o -c -fthinlto-index=./file1.o.thinlto.bc
llvm-objdump -ldC file1.native.o > file1.lto.s
${CLANG} -fuse-ld=lld file1.native.o file2.o -o distributed_lto_bin
llvm-objdump -ldC distributed_lto_bin > distributed_lto_bin.s

Version

os: CentOS Linux release 7.8.2003 (Core)
cpu: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
clang: clang version 11.1.0

You can get the pass manager to print IR before or after each pass by passing -mllvm -print-after-all. If you look into that flag’s implementation you’ll find additional mechanisms to isolate to a specific pass or to printing only when IR is modified.

(BTW, in your problem explanation, the list of cases under For the simple demo above makes it look like all the cases are equivalent: the TLS is accessed through fs:0. Am I missing something?)

Thank you for your answer, I will try -mllvm -print-after-all.

(BTW, in your problem explanation, the list of cases under For the simple demo above makes it look like all the cases are equivalent: the TLS is accessed through fs:0. Am I missing something?)

On my machine, when running the script, the file1.cc:9 access in the main function of distributed_lto_bin.s no longer uses %fs:0 to reach the TLS variable. Note that this is the second access; the first access, at file1.cc:7, does go through %fs:0. I have added machine and compiler version information.

You can get the pass manager to print IR before or after each pass by passing -mllvm -print-after-all. If you look into that flag’s implementation you’ll find additional mechanisms to isolate to a specific pass or to printing only when IR is modified.

It seems to have nothing to do with the optimization passes: even at -O0, the first IR dump is identical to the last one, and the result still reproduces.

To be sure, where did you pass -mllvm -print-after-all? For the non-distributed, you need to pass it as a linker option like this: -Wl,-mllvm,-print-after-all. Using your example:

${CLANG} -flto=thin -fuse-ld=lld file1.o file2.o -o no_distributed_lto_bin -Wl,-mllvm,-print-after-all 2>&1 | tee no_dist.log

For the distributed, you need to pass it to the backend clang invocation for file1, like so:

${CLANG} ${cxxflags} -o file1.native.o -x ir file1.o -c -fthinlto-index=./file1.o.thinlto.bc -mllvm -print-after-all 2>&1 | tee dist.log

For the non-distributed, you need to pass it as a linker option like this: -Wl,-mllvm,-print-after-all

Earlier I only compared the IR changes of distributed ThinLTO.
I found that the inconsistency starts at the "IR Dump After X86 DAG->DAG Instruction Selection" stage. I will try to dig deeper.

FWIW, in case it helps with your study: the difference isn’t there in clang-16, but is in clang-13. I didn’t do more of a bisection though. I remember vaguely @teresajohnson mentioning something about a local vs distributed thinlto fix, maybe she remembers what that was.

I guess that newer versions of clang use the -pie option by default when linking. During verification I found that both the pic and pie options affect the selection of the TLS variable access mode. In my demo, when I link with clang-11 and -pie, distributed ThinLTO and non-distributed ThinLTO behave consistently: on the second access to the TLS variable, %fs:0 is no longer used.

I found a pass, finalize-isel, that affects the selection of the demo's TLS variable access. Testing and debugging showed that in non-distributed ThinLTO -no-pie mode, the finalize-isel pass does not rewrite the TLS access, because the preceding x86-isel pass produces different results.

# no-pie log
# *** IR Dump After X86 DAG->DAG Instruction Selection ***:
...
TLS_base_addr64 $noreg, 1, $noreg, target-flags(x86-tlsld) ...
...

# pie log
# *** IR Dump After X86 DAG->DAG Instruction Selection ***:
...
%7:gr64 = MOV64rm $noreg, 1, $noreg, target-flags(x86-tpoff) @wmh_tls_x, $fs, ...
...

I guess this is most likely the reason why distributed and non-distributed ThinLTO behave inconsistently. Is it because non-distributed ThinLTO uses -no-pie when linking, while distributed ThinLTO carries the -fPIC from the compile stage into the backend optimization stage? Do these two options lead to different processing logic during ThinLTO optimization?
Debugging shows that in LowerGlobalTLSAddress, the TLSModel of non-distributed ThinLTO is InitialExec, while the TLSModel of distributed ThinLTO is LocalDynamic.

I remember vaguely @teresajohnson mentioning something about a local vs distributed thinlto fix, maybe she remembers what that was.

I didn’t manage to find the relevant commits. Do you still remember the keywords?

I located the reason for the difference in TLS variable access between distributed and non-distributed ThinLTO (with compiler version 11.1.0): their RelocModels differ.
For non-distributed ThinLTO, RelocModel is initialized by the steps below. Because clang-11 defaults to -no-pie, config->isPic is false, which ultimately yields RelocModel = Reloc::Static. For distributed ThinLTO, since I kept -fPIC from the compilation phase through the ThinLTO backend optimization phase, RelocModel = Reloc::PIC_.
In lld/ELF/Driver.cpp:setConfigs

...
config->isPic = config->pie || config->shared;
...

// lld/ELF/LTO.cpp
static lto::Config createConfig() {
...
   if (auto relocModel = getRelocModelFromCMModel())
     c.RelocModel = *relocModel;
   else if (config->relocatable)
     c.RelocModel = None;
   else if (config->isPic)
     c.RelocModel = Reloc::PIC_;
   else
     c.RelocModel = Reloc::Static;
...

And the TLSModel depends on the RelocModel:

TLSModel::Model TargetMachine::getTLSModel(const GlobalValue *GV) const {
   bool IsPIE = GV->getParent()->getPIELevel() != PIELevel::Default;
   Reloc::Model RM = getRelocationModel();
   bool IsSharedLibrary = RM == Reloc::PIC_ && !IsPIE;
   bool IsLocal = shouldAssumeDSOLocal(*GV->getParent(), GV);

   TLSModel::Model Model;
   if (IsSharedLibrary) {
     if (IsLocal)
       Model = TLSModel::LocalDynamic;
     else
       Model = TLSModel::GeneralDynamic;
   } else {
     if (IsLocal)
       Model = TLSModel::LocalExec;
     else
       Model = TLSModel::InitialExec;
   }
...
}

When x86-isel runs, in X86TargetLowering::LowerGlobalTLSAddress, the compiler's choice of instructions depends on the TLSModel. I read up briefly on the TLS access models: in InitialExec mode the compiler can access the variable directly through %fs:*, while in LocalDynamic mode additional operations such as calculating offsets are required. I guess that because LocalDynamic mode performs worse, the compiler then chooses to cache the base address of the TLS variable in a register.

zcfh wrote on May 9:
I found a pass, finalize-isel, which affects the TLS access choice in the demo. Testing and debugging showed that non-distributed ThinLTO does not run this pass in -no-pie mode. I think this is likely the reason for the inconsistent behavior between distributed and non-distributed ThinLTO.
One guess I have is that because access to TLS variables in Local Dynamic TLS mode is more complex, requiring offset calculations and other operations, the base address of the TLS variable is cached in a register. Because non-distributed ThinLTO uses the no-pie option, TLS variables are accessed through a different mode, so there is no need to run this pass or to cache the address in a register.

Is this still an issue in clang 16? I haven’t read through the whole thread in detail but it sounds like there was some change to option defaults that addresses this there?

I remember vaguely @teresajohnson mentioning something about a local vs distributed thinlto fix, maybe she remembers what that was.

I didn’t manage to find the relevant commits. Do you still remember the keywords?

Sorry, this doesn’t ring a bell. @Mircea Trofin was this something related to TLS?

Is this still an issue in clang 16? I haven’t read through the whole thread in detail but it sounds like there was some change to option defaults that addresses this there?

This is no longer an issue in clang 16.
Also, I feel there is a lack of official documentation on distributed ThinLTO. I'm currently integrating distributed ThinLTO into the build system I use, and I'm a bit worried about whether my usage is correct and whether distributed ThinLTO's optimized behavior is guaranteed to be consistent with local ThinLTO.