How to disable clang's expression elimination for thread_local variables

OSX: Darwin Kernel Version 22.1.0 (Apple M1 chip)
Clang: Apple clang version 14.0.0 (clang-1400.0.29.202)

PoC C++ Code:

thread_local int* tls = nullptr;
// Implemented with libcontext; switches execution to another stack.
void jump_stack();
void* test() {
    // Before jump_stack(), assume we are running on thread 1.
    int *cur_tls = tls;
    jump_stack();
    // After jump_stack(), we are running on another thread (thread 2),
    // so tls must be reloaded for the current thread.
    cur_tls = tls;
    return cur_tls;
}

Compile command:

clang++ -c test.cpp --std=c++11 -g -O0

Generated code:

; void* test() {
       0: ff c3 00 d1   sub     sp, sp, #48
       4: fd 7b 02 a9   stp     x29, x30, [sp, #32]
       8: fd 83 00 91   add     x29, sp, #32
       c: 00 00 00 90   adrp    x0, 0x0 <ltmp0+0xc>
      10: 00 00 40 f9   ldr     x0, [x0]
      14: 08 00 40 f9   ldr     x8, [x0]
      18: 00 01 3f d6   blr     x8
      1c: e0 07 00 f9   str     x0, [sp, #8]
;       int *cur_tls = tls;
      20: 08 00 40 f9   ldr     x8, [x0]
      24: e8 0b 00 f9   str     x8, [sp, #16]
;       jump_stack();
      28: 00 00 00 94   bl      0x28 <ltmp0+0x28>
      2c: e0 07 40 f9   ldr     x0, [sp, #8]
;       cur_tls = tls;
      30: 08 00 40 f9   ldr     x8, [x0]
      34: e8 0b 00 f9   str     x8, [sp, #16]
; }
      38: a0 83 5f f8   ldur    x0, [x29, #-8]
      3c: fd 7b 42 a9   ldp     x29, x30, [sp, #32]
      40: ff c3 00 91   add     sp, sp, #48
      44: c0 03 5f d6   ret

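Before jump_stack, the address of tls (the value returned in x0 by the call at 0x18) is cached in [sp, #8]. After jump_stack, that cached address is reloaded from [sp, #8] at 0x2c and dereferenced at 0x30, so cur_tls receives the tls that belongs to thread 1, not thread 2.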

Is there any clang option to disable this optimization, so that a thread_local variable is always reloaded for the thread that is currently running?
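For reference, here is a minimal sketch of a source-level workaround, assuming the GCC/Clang `__attribute__((noinline))` extension is acceptable. It is not a clang option and does not answer the question above. The helpers load_tls(), other_thread_sees_null() and demo() are hypothetical names that are not part of the PoC. The idea is that the TLS address lookup then happens inside the callee on every call, so it runs on whichever thread is executing at that moment. I have not confirmed that this holds at every optimization level.

```cpp
#include <cassert>
#include <thread>

// Same variable as in the PoC.
thread_local int* tls = nullptr;

// Hypothetical helper: because it is not inlined, the TLS lookup is performed
// inside this function each time it is called, instead of reusing an address
// that the caller computed earlier.
__attribute__((noinline)) int* load_tls() {
    return tls;
}

// Hypothetical helper: calls load_tls() on a fresh thread and reports whether
// that thread saw its own, still-null copy of tls.
bool other_thread_sees_null() {
    bool is_null = false;
    std::thread t([&] { is_null = (load_tls() == nullptr); });
    t.join();
    return is_null;
}

// Usage sketch: the calling thread sees the pointer it stored, while a second
// thread sees its own null copy.
void demo() {
    int value = 0;
    tls = &value;
    assert(load_tls() == &value);
    assert(other_thread_sees_null());
    tls = nullptr;
}
```

In the PoC, the second read would then become cur_tls = load_tls(); placed after jump_stack().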

On Linux (5.15.0-66-generic, x86_64), using Ubuntu clang version 14.0.6, the generated code is:

0000000000000000 <_Z4testv>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	48 83 ec 10          	sub    $0x10,%rsp
   8:	64 48 8b 04 25 00 00 	mov    %fs:0x0,%rax
   f:	00 00
  11:	48 89 45 f8          	mov    %rax,-0x8(%rbp)
  15:	e8 00 00 00 00       	callq  1a <_Z4testv+0x1a>
  1a:	64 48 8b 04 25 00 00 	mov    %fs:0x0,%rax
  21:	00 00
  23:	48 89 45 f8          	mov    %rax,-0x8(%rbp)
  27:	31 c0                	xor    %eax,%eax
  29:	48 83 c4 10          	add    $0x10,%rsp
  2d:	5d                   	pop    %rbp
  2e:	c3                   	retq

After jump_stack, tls is reloaded from %fs:0x0, so it works correctly.

See also: "Address thread identification problems with coroutine"
