question about xray tls data initialization

I'm learning the XRay library and checking whether it can be built on Windows. In
xray_fdr_logging_impl.h,

at line 152, the comment says
// Using pthread_once(...) to initialize the thread-local data structures

but at lines 175 and 183, the code reads

thread_local pthread_key_t key;

  // Ensure that we only actually ever do the pthread initialization once.
  thread_local bool UNUSED Unused = [] {
    new (&TLSBuffer) ThreadLocalData();
    auto result = pthread_key_create(&key, +[](void *) {
      auto &TLD = *reinterpret_cast<ThreadLocalData *>(&TLSBuffer);

I'm confused that key (the pthread_key_t) and Unused are both thread_local
variables. Doesn't that mean the lambda will run once for each
thread and create one pthread_key_t per thread's TLS data (instead of
a single pthread_key_t for all threads)? Also, what does the '+' before the
lambda expression mean? These may be stupid questions, but could somebody
kindly help?
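
For readers with the same two questions, here is a minimal standalone sketch. It is not the XRay code, and InitCount, touchThreadLocal, runOnTwoThreads and callThroughPointer are made-up names. It illustrates that the initializer of a thread_local variable runs once in every thread that reaches it, and that a unary '+' in front of a captureless lambda converts it to a plain function pointer, which is the kind of argument pthread_key_create takes for its destructor.

```cpp
#include <atomic>
#include <thread>
#include <type_traits>

// Counts how often the thread_local initializer below has run (illustrative).
static std::atomic<int> InitCount{0};

// Each thread that calls this runs the lambda once, on its first call.
void touchThreadLocal() {
  thread_local bool Unused = [] {
    ++InitCount;
    return true;
  }();
  (void)Unused;
}

// Runs touchThreadLocal on two fresh threads and returns the initializer count.
int runOnTwoThreads() {
  InitCount = 0;
  std::thread A(touchThreadLocal);
  std::thread B(touchThreadLocal);
  A.join();
  B.join();
  return InitCount.load();
}

// Unary '+' turns a captureless lambda into an ordinary function pointer.
int callThroughPointer() {
  auto Fn = +[](int X) { return X + 1; };
  static_assert(std::is_same<decltype(Fn), int (*)(int)>::value,
                "'+' yields a function pointer, not a closure object");
  return Fn(41);
}
```

A captureless lambda would also convert implicitly when passed to a function-pointer parameter; the '+' only makes that conversion explicit.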

Yeah, that comment is out-of-date (and the implementation is buggy) – which is a shame really. :confused:

But the good news is that I think we've fixed this now at top-of-trunk with https://reviews.llvm.org/D39526 and https://reviews.llvm.org/D40164.

Curiously though, how far did your exploration into getting XRay to build on Windows go?

Cheers

– Dean

With some dirty hacks, I've made the XRay runtime 'build' on Windows,
but unfortunately I don't know enough about the linker and the
runtime, and the final executable didn't run. I'd like to share
my changes here in the hope that somebody can help me make it run on Windows.
In AsmPrinter, I copy/pasted the XRay code for the COFF target:

InstMap = OutContext.getCOFFSection("xray_instr_map", 0,
SectionKind::getReadOnlyWithRel());
FnSledIndex = OutContext.getCOFFSection("xray_fn_idx",
0,SectionKind::getReadOnlyWithRel());

In XRayArgs, I allowed the Windows platform to use the XRay args. With this,
the generated code seems to have the sled and XRay parts.

In the XRay runtime,
bool atomic_compare_exchange_strong(volatile atomic_sint32_t *a,
                                           s32 *cmp,
                                           s32 xchg,
                                           memory_order mo)
is missing for MSVC; I reused the atomic_uint32_t implementation.

MSVC 14.1 treats BufferQueue::Buffer::Buffer as a constructor instead of a
data member, so I renamed it: Buf.Buffer => Buf.Data

For packing FunctionRecord: __attribute__((packed)) => #pragma
pack(push,1). MSVC also requires bitfields to be of the same type to pack
them together (all types => uint32_t).
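
A small sketch of what that packing change could look like. PackedRecord and roundTripFuncId are hypothetical, and the field names and widths are assumptions made for illustration rather than the real FunctionRecord definition.

```cpp
#include <cstdint>

// Hypothetical record; field names and widths are assumptions for this sketch.
#pragma pack(push, 1)
struct PackedRecord {
  // All three bitfields share one underlying type, so compilers that only
  // merge same-typed bitfields can place them in a single 32-bit unit.
  uint32_t Type : 1;
  uint32_t RecordKind : 3;
  uint32_t FuncId : 28;
  uint32_t TSCDelta;
};
#pragma pack(pop)

static_assert(sizeof(PackedRecord) == 8,
              "record should occupy exactly 8 bytes");

// Builds a record and reads a field back, to show the bitfields round-trip.
inline uint32_t roundTripFuncId(uint32_t FuncId) {
  PackedRecord R{};
  R.Type = 1;
  R.RecordKind = 5;
  R.FuncId = FuncId;
  R.TSCDelta = 0;
  return R.FuncId;
}
```

The static_assert is the useful part: it fails at compile time on any compiler that lays the record out differently.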

FD: int => HANDLE. Most of the code logic is still valid (-1 as the invalid value);
the read/write APIs are replaced with their Windows equivalents.

mprotect => VirtualProtect

readTSC in xray_x86_64.inc also works on Windows.

I replaced reading the TSC from proc with QueryPerformanceFrequency.

MSVC cannot compile code like this:
void setupNewBuffer(int (*wall_clock_reader)(clockid_t,
                                                    struct timespec *));

A typedef must be used first. XRay uses clock_gettime as the default
implementation, which is not Windows-friendly, so I created a fake one
based on chrono's system_clock (ignoring clockid_t).

For the TLS destructor part, I've just commented it out (but
https://www.codeproject.com/Articles/8113/Thread-Local-Storage-The-C-Way
describes a thread-exit callback approach for COFF).

The last thing, which I don't understand, is the weak symbols for
__start_xray_instr_map[]
__stop_xray_instr_map[]
__start_xray_fn_idx[]
__stop_xray_fn_idx[]

I replaced them with __declspec(selectany), but I'm not sure it
has the same meaning.

some random generated code:
    .text
    .intel_syntax noprefix
    .def call;
    .scl 2;
    .type 32;
    .endef
    .globl call # -- Begin function call
    .p2align 4, 0x90
call: # @call
.seh_proc call
# BB#0: # %entry
    .p2align 1, 0x90
.Lxray_sled_0:
    .ascii "\353\t"
    nop word ptr [rax + rax + 512]
    sub rsp, 16
    .seh_stackalloc 16
    .seh_endprologue
    mov dword ptr [rsp + 12], ecx
    mov dword ptr [rsp + 8], 0
    mov dword ptr [rsp + 4], 0
.LBB0_1: # %for.cond
                                        # =>This Inner Loop Header: Depth=1
    mov eax, dword ptr [rsp + 4]
    cmp eax, dword ptr [rsp + 12]
    jge .LBB0_4
# BB#2: # %for.body
                                        # in Loop: Header=BB0_1 Depth=1
    mov eax, dword ptr [rsp + 4]
    add eax, dword ptr [rsp + 8]
    mov dword ptr [rsp + 8], eax
# BB#3: # %for.inc
                                        # in Loop: Header=BB0_1 Depth=1
    mov eax, dword ptr [rsp + 4]
    add eax, 1
    mov dword ptr [rsp + 4], eax
    jmp .LBB0_1
.LBB0_4: # %for.end
    mov eax, dword ptr [rsp + 8]
    add rsp, 16
    .p2align 1, 0x90
.Lxray_sled_1:
    ret
    nop word ptr cs:[rax + rax + 512]
    .seh_handlerdata
    .text
    .seh_endproc
                                        # -- End function
    .section xray_instr_map,"y"
.Lxray_sleds_start0:
    .quad .Lxray_sled_0
    .quad call
    .byte 0x00
    .byte 0x00
    .byte 0x00
    .zero 13
    .quad .Lxray_sled_1
    .quad call
    .byte 0x01
    .byte 0x00
    .byte 0x00
    .zero 13
.Lxray_sleds_end0:
    .section xray_fn_idx,"y"
    .p2align 4, 0x90
    .quad .Lxray_sleds_start0
    .quad .Lxray_sleds_end0
    .text

and parts of obj dump:

SECTION HEADER #5
     /16 name (xray_instr_map)
       0 physical address
       0 virtual address
      40 size of raw data
     198 file pointer to raw data (00000198 to 000001D7)
     1D8 file pointer to relocation table
       0 file pointer to line numbers
       4 number of relocations
       0 number of line numbers
  100000 flags
         1 byte align

RAW DATA #5
  00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
  00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
  00000020: 56 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 V...............
  00000030: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

RELOCATIONS #5
                                                Symbol    Symbol
 Offset    Type              Applied To         Index     Name
 --------  ----------------  -----------------  --------  ------
 00000000  ADDR64            00000000 00000000         0  .text
 00000008  ADDR64            00000000 00000000         E  call
 00000020  ADDR64            00000000 00000056         0  .text
 00000028  ADDR64            00000000 00000000         E  call

SECTION HEADER #6
      /4 name (xray_fn_idx)
       0 physical address
       0 virtual address
      10 size of raw data
     200 file pointer to raw data (00000200 to 0000020F)
     210 file pointer to relocation table
       0 file pointer to line numbers
       2 number of relocations
       0 number of line numbers
  500000 flags
         16 byte align

RAW DATA #6
  00000000: 00 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00 ........@.......

RELOCATIONS #6
                                                Symbol    Symbol
 Offset    Type              Applied To         Index     Name
 --------  ----------------  -----------------  --------  ------
 00000000  ADDR64            00000000 00000000         8  xray_instr_map
 00000008  ADDR64            00000000 00000040         8  xray_instr_map

> With some dirty hacks, I've made the XRay runtime 'build' on Windows,

\o/

> but unfortunately I don't know enough about the linker and the
> runtime, and the final executable didn't run. I'd like to share
> my changes here in the hope that somebody can help me make it run on Windows.

Thanks for working on this!

If you’re alright with it, maybe you can send some patches to review, preferably through the LLVM Phabricator instance? You can have me or Reid (who knows more about COFF and the Windows stuff) as reviewers.

> In AsmPrinter, I copy/pasted the XRay code for the COFF target:
>
> InstMap = OutContext.getCOFFSection("xray_instr_map", 0,
> SectionKind::getReadOnlyWithRel());
> FnSledIndex = OutContext.getCOFFSection("xray_fn_idx",
> 0,SectionKind::getReadOnlyWithRel());
>
> In XRayArgs, I allowed the Windows platform to use the XRay args. With this,
> the generated code seems to have the sled and XRay parts.

Nice, I suspect we can make this change with tests as well, which we can build on incrementally.

> In the XRay runtime,
> bool atomic_compare_exchange_strong(volatile atomic_sint32_t *a,
>                                     s32 *cmp,
>                                     s32 xchg,
>                                     memory_order mo)
> is missing for MSVC; I reused the atomic_uint32_t implementation.

This is in compiler-rt/lib/sanitizer_common/… right?

> MSVC 14.1 treats BufferQueue::Buffer::Buffer as a constructor instead of a
> data member, so I renamed it: Buf.Buffer => Buf.Data

Interesting. That’s an easy patch to merge. :slight_smile:

> For packing FunctionRecord: __attribute__((packed)) => #pragma
> pack(push,1). MSVC also requires bitfields to be of the same type to pack
> them together (all types => uint32_t).

Are you able to test this on other platforms?

> FD: int => HANDLE. Most of the code logic is still valid (-1 as the invalid value);
> the read/write APIs are replaced with their Windows equivalents.
>
> mprotect => VirtualProtect
>
> readTSC in xray_x86_64.inc also works on Windows.
>
> I replaced reading the TSC from proc with QueryPerformanceFrequency.
>
> MSVC cannot compile code like this:
> void setupNewBuffer(int (*wall_clock_reader)(clockid_t,
>                                              struct timespec *));
>
> A typedef must be used first. XRay uses clock_gettime as the default
> implementation, which is not Windows-friendly, so I created a fake one
> based on chrono's system_clock (ignoring clockid_t).

This one is definitely something to do, even for potentially supporting XRay on Darwin where older versions of the SDK (10.11 and lower) don’t define clock_gettime. Probably can be split off as a thing that can be reviewed and merged regardless.
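
To make the idea concrete, here is a rough sketch of a clock_gettime-like reader built on std::chrono, with the function-pointer type named up front. ClockId, TimeSpec, WallClockReader, chronoClockGettime and readSeconds are all hypothetical stand-ins chosen so the snippet does not depend on POSIX headers; none of them are XRay's own names.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-ins for clockid_t and struct timespec.
using ClockId = int;
struct TimeSpec {
  int64_t tv_sec;
  int64_t tv_nsec;
};

// Naming the function-pointer type first keeps later declarations simple.
using WallClockReader = int (*)(ClockId, TimeSpec *);

// clock_gettime-like reader built on std::chrono::system_clock. The clock id
// is ignored, as in the workaround described in the thread.
inline int chronoClockGettime(ClockId, TimeSpec *TS) {
  using namespace std::chrono;
  const auto SinceEpoch = system_clock::now().time_since_epoch();
  const auto Secs = duration_cast<seconds>(SinceEpoch);
  TS->tv_sec = static_cast<int64_t>(Secs.count());
  TS->tv_nsec = static_cast<int64_t>(
      duration_cast<nanoseconds>(SinceEpoch - Secs).count());
  return 0;
}

// Takes the reader through the alias instead of an inline pointer declarator.
inline int64_t readSeconds(WallClockReader Reader) {
  TimeSpec TS{};
  return Reader(0, &TS) == 0 ? TS.tv_sec : -1;
}
```

Calling readSeconds(chronoClockGettime) exercises the same shape of interface as passing a wall_clock_reader to setupNewBuffer, just with the stand-in types.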

> For the TLS destructor part, I've just commented it out (but
> https://www.codeproject.com/Articles/8113/Thread-Local-Storage-The-C-Way
> describes a thread-exit callback approach for COFF).

Interesting, thanks! This one is something that could be abstracted away on a per-platform basis.
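
One possible shape for such an abstraction, sketched with standard C++ only: a thread_local object whose destructor does the per-thread cleanup when the thread exits. ThreadState, getThreadState, CleanupCount and runThreadAndCountCleanups are made-up names, and whether the XRay runtime could actually rely on C++ thread_local destructors is an assumption this sketch does not settle.

```cpp
#include <atomic>
#include <thread>

// Counts how many per-thread cleanups have run (illustrative only).
static std::atomic<int> CleanupCount{0};

// Hypothetical per-thread state; the destructor stands in for the cleanup
// that a pthread key destructor would otherwise perform.
struct ThreadState {
  bool Touched = false;
  ~ThreadState() { ++CleanupCount; }
};

// Lazily constructs the calling thread's state on first use.
inline ThreadState &getThreadState() {
  thread_local ThreadState State;
  return State;
}

// Starts a thread that touches its state, waits for it to exit, and returns
// how many cleanups have run so far.
inline int runThreadAndCountCleanups() {
  std::thread Worker([] { getThreadState().Touched = true; });
  Worker.join();
  return CleanupCount.load();
}
```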

> The last thing, which I don't understand, is the weak symbols for
> __start_xray_instr_map[]
> __stop_xray_instr_map[]
> __start_xray_fn_idx[]
> __stop_xray_fn_idx[]
>
> I replaced them with __declspec(selectany), but I'm not sure it
> has the same meaning.

The __{start,stop}_xray_{instr_map,fn_idx}[] arrays are usually generated by the linker on ELF and ELF-like platforms. I'm not aware of what the MSVC COFF linkers do; that's probably something others who know better can answer.
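
As a small illustration of the ELF behaviour described above (ELF-only, GCC/Clang attribute syntax; demo_map, DemoEntry and countDemoEntries are hypothetical names), the linker supplies __start_ and __stop_ symbols for a section whose name is a valid C identifier, and the program only declares them:

```cpp
#include <cstddef>

// Hypothetical entry type and section name, for illustration only.
struct DemoEntry {
  int Id;
};

// Two objects placed in a custom section whose name is a valid C identifier.
__attribute__((section("demo_map"), used)) static const DemoEntry First = {1};
__attribute__((section("demo_map"), used)) static const DemoEntry Second = {2};

// On ELF, GNU ld and lld define these bounds for such sections at link time.
// Declaring them weak lets the program still link if the section is absent.
extern "C" {
extern const DemoEntry __start_demo_map[] __attribute__((weak));
extern const DemoEntry __stop_demo_map[] __attribute__((weak));
}

// Number of entries the linker gathered into the section.
inline std::size_t countDemoEntries() {
  return static_cast<std::size_t>(__stop_demo_map - __start_demo_map);
}
```

As far as I know, __declspec(selectany) only tells the linker to keep one copy of a symbol that is defined in several objects, so it is not an equivalent of these linker-provided bounds. A commonly used COFF alternative is to put marker variables into sections named with a `$` suffix (for example name$A and name$Z) and rely on the linker sorting them around the real data; whether that fits XRay is an open question here.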

> some random generated code:

This looks like it’s actually worked, at least at CodeGen time.

Thanks again for sharing your experience, it’d be really great if you can have patches that we can review and land to potentially get XRay working on Windows!

Cheers

>> With some dirty hacks, I've made the XRay runtime 'build' on Windows,
>
> \o/

With more testing, I've found that the trampolines didn't get built for Windows :confused:
Currently CMake doesn't generate a build rule for the asm files, so they are silently
ignored (with the MSVC IDE, but not with Ninja).
We must have enable_language(ASM_MASM) to use MASM, and the trampolines
also need to be ported.

> If you're alright with it, maybe you can send some patches to review,
> preferably through the LLVM Phabricator instance? You can have me or Reid
> (who knows more about COFF and the Windows stuff) as reviewers.
>
>> In AsmPrinter, I copy/pasted the XRay code for the COFF target:
>>
>> InstMap = OutContext.getCOFFSection("xray_instr_map", 0,
>> SectionKind::getReadOnlyWithRel());
>> FnSledIndex = OutContext.getCOFFSection("xray_fn_idx",
>> 0,SectionKind::getReadOnlyWithRel());
>>
>> In XRayArgs, I allowed the Windows platform to use the XRay args. With this,
>> the generated code seems to have the sled and XRay parts.
>
> Nice, I suspect we can make this change with tests as well, which we can
> build on incrementally.

Where can I find some examples for testing this XRay part in LLVM?

>> In the XRay runtime,
>> bool atomic_compare_exchange_strong(volatile atomic_sint32_t *a,
>>                                     s32 *cmp,
>>                                     s32 xchg,
>>                                     memory_order mo)
>> is missing for MSVC; I reused the atomic_uint32_t implementation.
>
> This is in compiler-rt/lib/sanitizer_common/... right?

Yes, sanitizer_atomic_msvc.h doesn't provide this overload. According
to the MSDN documentation for InterlockedCompareExchange, the implementation for
atomic_uint32_t should also work for atomic_sint32_t. This is a
copy/paste, but I think it's short enough. Any better suggestions?
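
For illustration, the reuse described above might look roughly like this. It is a sketch only: SInt32Atomic and compareExchangeStrong are made-up stand-ins for the sanitizer_common types and function, the memory_order parameter is left out, and the non-MSVC branch exists purely so the snippet compiles with GCC or Clang as well.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Hypothetical stand-in for sanitizer_common's atomic_sint32_t.
struct SInt32Atomic {
  volatile int32_t Value;
};

// Strong compare-and-swap: stores Xchg if A holds *Cmp; otherwise reports the
// value actually observed through *Cmp and returns false.
inline bool compareExchangeStrong(SInt32Atomic *A, int32_t *Cmp,
                                  int32_t Xchg) {
  const int32_t Expected = *Cmp;
#if defined(_MSC_VER)
  // The intrinsic works on 32-bit 'long', so signed values pass straight through.
  const int32_t Prev = static_cast<int32_t>(_InterlockedCompareExchange(
      reinterpret_cast<volatile long *>(&A->Value), static_cast<long>(Xchg),
      static_cast<long>(Expected)));
#else
  const int32_t Prev = __sync_val_compare_and_swap(&A->Value, Expected, Xchg);
#endif
  if (Prev == Expected)
    return true;
  *Cmp = Prev;
  return false;
}
```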

>> For packing FunctionRecord: __attribute__((packed)) => #pragma
>> pack(push,1). MSVC also requires bitfields to be of the same type to pack
>> them together (all types => uint32_t).
>
> Are you able to test this on other platforms?

I've tested this on linux64 (with clang) and it passes check-xray, but
I don't have a Mac to test on. If changing all the attributes to pragmas is
desirable, I can submit a patch for that.

>>> With some dirty hacks, I've made the XRay runtime 'build' on Windows,
>>
>> \o/
>
> With more testing, I've found that the trampolines didn't get built for Windows :confused:
> Currently CMake doesn't generate a build rule for the asm files, so they are silently
> ignored (with the MSVC IDE, but not with Ninja).
> We must have enable_language(ASM_MASM) to use MASM, and the trampolines
> also need to be ported.

Right – this is similar to issues we’ve run into trying to make XRay work / get built for Darwin too.

>> If you're alright with it, maybe you can send some patches to review,
>> preferably through the LLVM Phabricator instance? You can have me or Reid
>> (who knows more about COFF and the Windows stuff) as reviewers.
>>
>>> In AsmPrinter, I copy/pasted the XRay code for the COFF target:
>>>
>>> InstMap = OutContext.getCOFFSection("xray_instr_map", 0,
>>> SectionKind::getReadOnlyWithRel());
>>> FnSledIndex = OutContext.getCOFFSection("xray_fn_idx",
>>> 0,SectionKind::getReadOnlyWithRel());
>>>
>>> In XRayArgs, I allowed the Windows platform to use the XRay args. With this,
>>> the generated code seems to have the sled and XRay parts.
>>
>> Nice, I suspect we can make this change with tests as well, which we can
>> build on incrementally.
>
> Where can I find some examples for testing this XRay part in LLVM?

Those are in the llvm/test/CodeGen/X86/… – in particular, searching for ‘xray_’ in the files there will be the best way of finding examples of what we’re looking for to verify.

>>> In the XRay runtime,
>>> bool atomic_compare_exchange_strong(volatile atomic_sint32_t *a,
>>>                                     s32 *cmp,
>>>                                     s32 xchg,
>>>                                     memory_order mo)
>>> is missing for MSVC; I reused the atomic_uint32_t implementation.
>>
>> This is in compiler-rt/lib/sanitizer_common/… right?
>
> Yes, sanitizer_atomic_msvc.h doesn't provide this overload. According
> to the MSDN documentation for InterlockedCompareExchange, the implementation for
> atomic_uint32_t should also work for atomic_sint32_t. This is a
> copy/paste, but I think it's short enough. Any better suggestions?

I’m sure adding an implementation for atomic_sint32_t will be nice to have across platforms. :slight_smile:

>>> For packing FunctionRecord: __attribute__((packed)) => #pragma
>>> pack(push,1). MSVC also requires bitfields to be of the same type to pack
>>> them together (all types => uint32_t).
>>
>> Are you able to test this on other platforms?
>
> I've tested this on linux64 (with clang) and it passes check-xray, but
> I don't have a Mac to test on. If changing all the attributes to pragmas is
> desirable, I can submit a patch for that.

We’re still working on getting XRay to build right and work on macOS so that shouldn’t be a barrier. :slight_smile:

A patch would be good there too.

Thanks again!

– Dean

I wonder if I can build XRay with clang/llvm-as on Windows; that seems
a little easier and requires fewer changes. But I haven't tried it yet.

> I wonder if I can build XRay with clang/llvm-as on Windows; that seems
> a little easier and requires fewer changes. But I haven't tried it yet.

That should be worth a try, I think. It's also not entirely a bad idea to still be able to build XRay with MSVC, but only if it's not too hard to get that done.