Segfault in libunwind during CPU profiling

(I am resending the message here since it was too large for the mailing list.)

Hi,

I ran into a problem when using libunwind as the runtime library (LLVM-13.0.0 on CentOS 7 x86-64, kernel 3.10+). My setup is as follows:

-----------           ---------------
|   EXE   |           |   SO LIB    |
| ------- |           |   -------   |
| | Uwd | |   <--->   |   | Uwd |   |
| ------- |           |   -------   |
-----------           ---------------

The application consists of an executable written in C++ and a shared library written in Rust. The compiler, the linker, and the runtime libs all come from the LLVM toolchain: clang/lld/libunwind/compiler-rt/libc++/libc++abi. Both the executable and the shared lib contain a statically linked copy of the LLVM runtimes. The Rust compiler marks the symbols in the SO as hidden, so it should be safe.

However, when the CPU profiler is triggered, the application crashes with a segmentation fault from time to time. I managed to capture a stack trace in debug mode:

I don’t think backtrace-rs (libunwind.rs in rust-lang/backtrace-rs) or pprof-rs (profiler.rs in tikv/pprof-rs) is doing anything wrong here. They simply invoke the libunwind API to walk the stack and collect all frames.
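For readers unfamiliar with that API, here is a minimal C++ sketch of the same kind of stack walk using `_Unwind_Backtrace` and `_Unwind_GetIP` from `<unwind.h>`. It is not the code from backtrace-rs or pprof-rs; `FrameBuffer`, `collect_frame`, and `capture_backtrace` are hypothetical names chosen for illustration, and the buffer size is arbitrary.

```cpp
#include <unwind.h>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: a fixed-size buffer of collected instruction pointers.
struct FrameBuffer {
    static constexpr std::size_t kMax = 64;  // arbitrary limit for this sketch
    std::uintptr_t ips[kMax];
    std::size_t count = 0;
};

// Callback invoked by the unwinder once per stack frame.
static _Unwind_Reason_Code collect_frame(struct _Unwind_Context* ctx, void* arg) {
    auto* buf = static_cast<FrameBuffer*>(arg);
    if (buf->count >= FrameBuffer::kMax) {
        return _URC_END_OF_STACK;  // stop the walk once the buffer is full
    }
    buf->ips[buf->count++] = static_cast<std::uintptr_t>(_Unwind_GetIP(ctx));
    return _URC_NO_REASON;  // continue with the next frame
}

// Walks the calling thread's stack and returns the number of frames recorded.
std::size_t capture_backtrace(FrameBuffer& buf) {
    buf.count = 0;
    _Unwind_Backtrace(collect_frame, &buf);
    return buf.count;
}
```

Every step of this walk makes the unwinder look up and interpret the unwind information of the module that owns the frame, which is where a bad .eh_frame entry would surface.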

Any idea what is really going on here?


Schrodinger ZHU Yifan

School of Data Science, CUHK(SZ)

Database Kernel Development Intern at PingCAP, Inc

Github: SchrodingerZhu

Twitter: ZhuSchrodinger

What a story! I have to summarize what is happening here.

So, I was assigned to migrate the team’s toolchain to LLVM. I then noticed that the project was using statically linked runtime libs that are duplicated in both the executable and the Rust cdylib. I was about to change this to dynamic linkage; however, the original settings had been in use for a while, so I was not sure whether they would actually cause any problem in the future. Moreover, rustc keeps the symbols hidden, which is claimed to mitigate the ODR problem.

At that time, I had not yet encountered this bug. But, out of caution, I posted the question on Stack Overflow:

The answers supported the view that the setup was safe. For engineering reasons, I finally decided to keep the linkage settings as they were.

Then, last week, I ran some tests and spotted this problem. It took a while before I noticed that the corruption was always located in the STL frames. Fortunately, I eventually found that switching the runtime libs to shared libraries would address the problem.

It is still worth investigating why duplicated static libs make the DWARF parser unhappy. In any case, it will be a lasting lesson for me on the importance of ODR. ; )

libunwind’s DwarfParser.hpp just decodes .eh_frame and runs the bytecode. It’s possible that the .eh_frame is corrupted, so that a memory load in libunwind triggers a segfault. libunwind cannot protect against this without incurring a large overhead (e.g. using pipe|connect|rt_sigprocmask to let the kernel check whether an address is readable).
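To illustrate the kind of check meant here, below is a minimal sketch of the pipe variant: the address is handed to `write()`, and the kernel reports `EFAULT` instead of the process faulting. This is not something libunwind does; `is_readable` and the 4096-byte cap are assumptions made for this example.

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Hypothetical helper: returns true if the kernel can read `len` bytes at `addr`.
// The kernel copies the bytes into a pipe; an unmapped address makes write()
// fail with EFAULT instead of raising SIGSEGV in the caller.
bool is_readable(const void* addr, std::size_t len) {
    // Assumed cap for this sketch: stay well below the pipe capacity so that
    // write() never blocks waiting for a reader.
    if (len > 4096) return false;

    int fds[2];
    if (pipe(fds) != 0) return false;

    ssize_t n;
    do {
        n = write(fds[1], addr, len);
    } while (n < 0 && errno == EINTR);  // retry if interrupted by a signal

    close(fds[0]);
    close(fds[1]);
    // A short write means only part of the range was readable.
    return n == static_cast<ssize_t>(len);
}
```

Each probe costs three system calls (`pipe`, `write`, `close` twice makes four with both ends), and an unwinder would need one before every untrusted load, which is the overhead referred to above.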

Unfortunately, I just found that moving to shared linkage did not address all the issues. The issue "Segfault in _Backtrace_Unwind" (#47551 in rust-lang/rust on GitHub) is inspiring. But I think that when using clang with -unwindlib=libunwind, compiler-rt's crtend.o is already linked in.

Indeed, I get __EH_FRAME_LIST_END__ in the shared library:

 88946: 0000000000757dd8     4 OBJECT  LOCAL  HIDDEN    11 __EH_FRAME_LIST_END__
 88947: 00000000009f4c5c     0 FUNC    LOCAL  DEFAULT   12 _init
 88948: 00000000009f4c7c     0 FUNC    LOCAL  DEFAULT   13 _fini
 88949: 0000000002db7450     0 NOTYPE  LOCAL  HIDDEN    24 _GLOBAL_OFFSET_TABLE_
 88950: 0000000002dac9c0     0 NOTYPE  LOCAL  HIDDEN    22 _DYNAMIC
 88951: 0000000002b1e5b0    15 FUNC    LOCAL  HIDDEN    14 fstat64
 88952: 0000000002b1e550    28 FUNC    LOCAL  HIDDEN    14 pthread_atfork
 88953: 0000000002b1e550    28 FUNC    LOCAL  HIDDEN    14 __pthread_atfork
 88954: 0000000002b1e5d0    21 FUNC    LOCAL  HIDDEN    14 fstatat64
 88955: 0000000002b1e5c0    16 FUNC    LOCAL  HIDDEN    14 lstat64
 88956: 0000000002b1e5a0    16 FUNC    LOCAL  HIDDEN    14 stat64

But the problem persists. Any ideas?

I also got the following:

objdump --dwarf=frames libXXXXX.so | grep -i ZERO
0029be80 ZERO terminator
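For context on why a ZERO terminator is worth looking at: in the .eh_frame format, every CIE/FDE record starts with a 4-byte length, and a length of zero marks the end of the section for a parser that scans it linearly. The sketch below shows such a scan over a raw byte buffer. It is a simplified illustration, not libunwind's code; `count_eh_frame_records` is a hypothetical name, and it assumes the buffer uses the host's byte order. Whether the terminator at 0029be80 sits in the middle of the section, and whether that is related to the crash, is not established here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: counts the CIE/FDE records in a raw .eh_frame image,
// stopping at the first zero-length record (the terminator), at a truncated
// record, or at the end of the buffer.
std::size_t count_eh_frame_records(const std::uint8_t* data, std::size_t size) {
    std::size_t offset = 0;
    std::size_t records = 0;
    while (size - offset >= 4) {
        std::uint32_t len32;
        std::memcpy(&len32, data + offset, 4);
        if (len32 == 0) break;  // zero terminator: a linear scan ends here

        std::uint64_t len = len32;
        std::size_t header = 4;
        if (len32 == 0xffffffffu) {
            // Extended form: the real length is in the following 8 bytes.
            if (size - offset < 12) break;
            std::memcpy(&len, data + offset + 4, 8);
            header = 12;
        }
        if (len > size - offset - header) break;  // record runs past the buffer

        offset += header + static_cast<std::size_t>(len);
        ++records;
    }
    return records;
}
```

Any record placed after a zero-length entry is invisible to a scan like this one, which is why the position of the terminator relative to the rest of the section matters.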