(I resend the message here since it was too large for the mailing list.)
I ran into a problem when using libunwind as the runtime library (LLVM 13.0.0 on CentOS 7 x86-64, kernel 3.10+). My setup is as follows:
|   EXE   |       | SO LIB  |
| ------- |       | ------- |
| | Uwd | | <---> | | Uwd | |
| ------- |       | ------- |
The application consists of an executable written in C++ and a shared library written in Rust. The compiler, the linker, and the runtime libraries all come from the LLVM toolchain: clang/lld/libunwind/compiler-rt/libc++/libc++abi. Both the executable and the shared library contain their own statically linked copy of the LLVM runtimes. The Rust compiler marks the symbols in the SO as hidden, so this should be safe.
However, when the CPU profiler is triggered, the application crashes with a segmentation fault from time to time. I managed to capture a stack trace in debug mode:
I don't think libunwind.rs in backtrace-rs (rust-lang/backtrace-rs, master) or profiler.rs in pprof-rs (tikv/pprof-rs, master) is doing anything wrong here. They simply invoke the libunwind API to walk the stack and collect all frames.
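For readers unfamiliar with that API, here is a minimal sketch of what "walk the stack and collect all frames" looks like from C++, using `_Unwind_Backtrace` and `_Unwind_GetIP` from `<unwind.h>`. The helper names `record_frame` and `collect_frames` are made up for illustration and are not taken from backtrace-rs or pprof-rs.

```cpp
#include <unwind.h>   // _Unwind_Backtrace, _Unwind_GetIP
#include <cstdint>
#include <vector>

// Callback invoked by the unwinder once per stack frame.
// Hypothetical helper: appends the frame's instruction pointer to a vector.
static _Unwind_Reason_Code record_frame(struct _Unwind_Context* ctx, void* arg) {
    auto* frames = static_cast<std::vector<std::uintptr_t>*>(arg);
    frames->push_back(static_cast<std::uintptr_t>(_Unwind_GetIP(ctx)));
    return _URC_NO_REASON;  // tell the unwinder to continue with the next frame
}

// Hypothetical helper: collect the instruction pointers of the current call stack.
// Note: std::vector allocates, so this is not safe inside a signal handler;
// a sampling profiler would write into a preallocated buffer instead.
std::vector<std::uintptr_t> collect_frames() {
    std::vector<std::uintptr_t> frames;
    _Unwind_Backtrace(record_frame, &frames);
    return frames;
}
```

Every step of this walk makes the unwinder look up and interpret the unwind information for the frame it is in, so damaged unwind data shows up as a crash inside the unwinder rather than in the caller.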
Any idea on what is really going on here?
Schrodinger ZHU Yifan
School of Data Science, CUHK(SZ)
Database Kernel Development Intern at PingCAP, Inc
What a story! I have to summarize what is happening here.
I was assigned to migrate our team's toolchain to LLVM. While doing so, I noticed that the project used statically linked runtime libraries that are duplicated in both the executable and the Rust cdylib. I was about to change this to dynamic linkage, but the original settings had been in use for a while, and I was not sure whether they would actually cause any problem in the future. Moreover, rustc keeps the symbols hidden, which is claimed to mitigate the ODR problem.
At that time, I had not encountered the bug described here. But, out of concern, I posted the question on Stack Overflow:
The answers supported the view that this setup is safe. For engineering reasons, I finally decided to keep the linkage settings as they were.
Then, last week, I ran some tests at random and spotted the problem here. It took a while before I noticed that the corruptions are all located in the STL frames. Fortunately, I eventually found that switching the runtime libs to shared libraries would address the problem.
It is still worth investigating why duplicated static libs make the DWARF parser unhappy. However, it will be a long-lasting lesson for me to remember the importance of ODR. ; )
DwarfParser.hpp just decodes .eh_frame and runs the bytecode. It's possible that the .eh_frame is corrupted, so a memory load in libunwind triggers a segfault. libunwind cannot protect against this without incurring a large overhead (e.g. using a syscall such as pipe|connect|rt_sigprocmask to let the kernel check whether an address is readable).
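To make the overhead concrete, here is a rough sketch of the pipe variant of that trick on Linux: the kernel is asked to copy one byte from the address in question, and `write` fails (with `EFAULT`) instead of the process faulting. The function name `is_readable` is hypothetical and this is not code from libunwind.

```cpp
#include <unistd.h>   // pipe, write, close

// Hypothetical helper: ask the kernel whether one byte at `addr` is readable.
// Costs a pipe() plus a write() plus two close() calls per probe, which is
// why doing this for every memory load during unwinding would be expensive.
bool is_readable(const void* addr) {
    int fds[2];
    if (pipe(fds) != 0) {
        return false;  // could not create the probe pipe
    }
    // The kernel copies from `addr` into the pipe buffer. If the address is
    // unmapped or unreadable, write() returns -1 rather than crashing us.
    ssize_t n = write(fds[1], addr, 1);
    close(fds[0]);
    close(fds[1]);
    return n == 1;
}
```

A real implementation would keep the pipe open and drain it between probes, but even then each probed load costs at least one syscall.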
Unfortunately, I just found that moving to shared linkage did not address all the issues. The ticket "Segfault in _Backtrace_Unwind" (Issue #47551, rust-lang/rust) is inspiring. But I think that when using clang with -unwindlib=libunwind, compiler-rt.crtend.o is already linked in.
Indeed, I do get __EH_FRAME_LIST_END__ in the shared library:
88946: 0000000000757dd8 4 OBJECT LOCAL HIDDEN 11 __EH_FRAME_LIST_END__
88947: 00000000009f4c5c 0 FUNC LOCAL DEFAULT 12 _init
88948: 00000000009f4c7c 0 FUNC LOCAL DEFAULT 13 _fini
88949: 0000000002db7450 0 NOTYPE LOCAL HIDDEN 24 _GLOBAL_OFFSET_TABLE_
88950: 0000000002dac9c0 0 NOTYPE LOCAL HIDDEN 22 _DYNAMIC
88951: 0000000002b1e5b0 15 FUNC LOCAL HIDDEN 14 fstat64
88952: 0000000002b1e550 28 FUNC LOCAL HIDDEN 14 pthread_atfork
88953: 0000000002b1e550 28 FUNC LOCAL HIDDEN 14 __pthread_atfork
88954: 0000000002b1e5d0 21 FUNC LOCAL HIDDEN 14 fstatat64
88955: 0000000002b1e5c0 16 FUNC LOCAL HIDDEN 14 lstat64
88956: 0000000002b1e5a0 16 FUNC LOCAL HIDDEN 14 stat64
But the problem persists. Any ideas?
I also got the following:
objdump --dwarf=frames libXXXXX.so | grep -i ZERO
0029be80 ZERO terminator
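For context on what a "ZERO terminator" is: .eh_frame is a sequence of length-prefixed CIE/FDE records, and a record whose 4-byte length field is 0 marks the end of the list. The sketch below walks a raw copy of the section and reports where the first terminator sits, assuming the bytes are in host byte order (little-endian on x86-64). The function name `first_terminator_offset` is made up, and whether the terminator reported above lies at the end of the section cannot be told from the objdump line alone.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sentinel returned when no terminator is found before the data runs out.
constexpr std::size_t kNoTerminator = static_cast<std::size_t>(-1);

// Hypothetical helper: walk length-prefixed records in a raw .eh_frame image
// and return the offset of the first zero-length (terminator) record.
std::size_t first_terminator_offset(const std::uint8_t* data, std::size_t size) {
    std::size_t pos = 0;
    while (size - pos >= 4) {
        std::uint32_t len32 = 0;
        std::memcpy(&len32, data + pos, 4);      // host byte order assumed
        if (len32 == 0) {
            return pos;                          // terminator: a linear walk stops here
        }
        std::size_t header = 4;
        std::uint64_t len = len32;
        if (len32 == 0xffffffffu) {              // extended 64-bit length follows
            if (size - pos < 12) break;
            std::memcpy(&len, data + pos + 4, 8);
            header = 12;
        }
        if (len > size - pos - header) break;    // truncated or corrupted record
        pos += header + static_cast<std::size_t>(len);
    }
    return kNoTerminator;
}
```

If the returned offset is smaller than the section size minus 4, any records after it are invisible to a parser that walks the section linearly.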