Tips for debugging ThinLTO crashes

thevinster · July 5, 2023, 9:00pm

I recently encountered a ThinLTO crash and found it pretty difficult to debug and isolate the root cause.

The first issue was figuring out which file caused the crash. I took the repro command for the link step and added --threads=1 --save-temps so that the link ran serially. I could then sort the temp files by timestamp, most recent first, to determine which step within the ThinLTO job failed. This worked for the crash I encountered, but the process felt very time-consuming (and hacky), since it assumes the crash can be reproduced serially. Are there better debugging tips for quickly figuring out which file(s) are causing the crash?
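For reference, the "sort by most recent timestamp" step can be scripted. This is only a minimal sketch: the helper name newest_temps is made up for illustration, and it assumes the temp files of interest end in .bc, which may not match what your linker version emits.

```python
from pathlib import Path


def newest_temps(temp_dir, count=5):
    """Return the `count` most recently modified .bc files, newest first.

    Assumption: the --save-temps outputs of interest end in ".bc".
    Adjust the glob if your linker names them differently.
    """
    files = [p for p in Path(temp_dir).rglob("*.bc") if p.is_file()]
    # Sort by modification time, most recent first.
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:count]
```

After a serialized link (--threads=1), the first entries returned should belong to the backend job that was running when the crash happened.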

The second issue is reliably reducing the failure so that a bug can be reported upstream. To save time reducing, I was lucky that opt -O3 reproduced the same crash (no other optimization level did). But I imagine this is not always the case, and that it is better to run the actual ThinLTO backend (which would take a lot longer). This process also felt like a hack to me. Is there anything I could have done differently?
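When reducing, it helps to have a small check that says "this input still hits the same crash". Below is a rough sketch of such a check. The helper name reproduces_crash, the file name module.bc, and the "Assertion" search string are placeholders, not details of the actual bug.

```python
import subprocess


def reproduces_crash(cmd, needle):
    """Return True if `cmd` fails and its stderr contains `needle`.

    Matching on the message as well as the exit status helps avoid
    "reducing" the input into a different, unrelated failure.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode != 0 and needle in result.stderr


# Hypothetical usage; module.bc and the message text are placeholders:
# reproduces_crash(["opt", "-O3", "-disable-output", "module.bc"], "Assertion")
```

The same check could wrap a full ThinLTO backend invocation instead of opt -O3 when the shortcut does not reproduce the crash.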

The fix for the crash has now landed upstream, but I wanted to poll the community to see if there are better ways to debug ThinLTO crashes.

cc: @smeenai

smeenai · July 5, 2023, 9:11pm

Just to clarify the wording a bit here: we wanted to figure out which ThinLTO backend job the crash was occurring in. The hack was to make the backend jobs run serially, so that we could figure out the last temp file produced before the assertion failure and infer the backend job from that.

The dream here would be an option that makes a ThinLTO backend crash generate a reproducer tarball with the specific llvm-lto2 command to run that backend job. I don’t know how tricky that might be though.

aeubanks · July 5, 2023, 9:17pm

I usually see the module name somewhere in the crash (can’t remember off the top of my head where it appears…).

Chromium has some docs on ThinLTO troubleshooting. But yes ideally there’d be a reproducer tarball.

Also, using explicit ThinLTO backend actions fixes this issue since every ThinLTO backend compile has its own build action/clang invocation, but that requires a lot of build system support.
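To illustrate what "its own clang invocation" looks like, here is a sketch that builds the command line for one backend compile. It assumes the thin-link step wrote a per-module index next to each bitcode object as <obj>.thinlto.bc. The helper name and the .native.o output suffix are made up for this example, and a real build would carry its own flags.

```python
def backend_command(bitcode_obj, clang="clang", opt_level="-O2"):
    """Build the argv for one explicit ThinLTO backend compile.

    Assumptions: the per-module index sits at <obj>.thinlto.bc, and the
    ".native.o" output name is a placeholder chosen for this sketch.
    """
    index = bitcode_obj + ".thinlto.bc"
    return [
        clang, opt_level,
        "-x", "ir", bitcode_obj,        # treat the input as LLVM IR/bitcode
        f"-fthinlto-index={index}",     # per-module summary index
        "-c", "-o", bitcode_obj + ".native.o",
    ]
```

Since each module gets a separate command like this, a crash points directly at one build action that can be rerun on its own.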

rnk · July 12, 2023, 8:55pm

The difficulty you have described has been an issue for a long time, and I have always advocated that we try to move to processes instead of threads for backend actions. With processes, it is easy to identify the failing backend action and capture a reproducer. However, there are downsides to processes: they have higher overhead, we would need to implement some kind of parallel process pool runner, and IIRC ThinLTO with threads benefits from some amount of shared memory.
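As a toy illustration of why processes make the failing action easy to identify, here is a sketch of a process pool runner. It is not how any linker implements this. The function name and the shape of the jobs mapping are assumptions made for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_backend_jobs(jobs, max_workers=4):
    """Run each job (name -> argv list) as its own child process.

    Returns a list of (name, argv, stderr) for every job that failed,
    so the failing command line can be rerun as a reproducer.
    """
    def run_one(item):
        name, argv = item
        result = subprocess.run(argv, capture_output=True, text=True)
        return name, argv, result

    failures = []
    # Threads here only wait on child processes; the work runs in the children.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for name, argv, result in pool.map(run_one, jobs.items()):
            if result.returncode != 0:
                failures.append((name, argv, result.stderr))
    return failures
```

Each failure comes back with the exact argv that crashed, which is the piece that is hard to recover when the backends are threads inside the linker.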

Another upside to a process-based backend action solution is that the backend actions could be distributed by allowing the user to replace the program responsible for running the backend action with a flag. This gives you a point to inject any old kind of compilation wrapper, like a distributed build wrapper such as goma. I believe ChromeOS takes this approach, but I think it was implemented by replacing the whole linker with a wrapper script instead of internalizing it within the linker.

pogo59 · July 13, 2023, 1:44pm

Sounds a lot like the direction @kromanova describes in [RFC] Integrated Distributed ThinLTO
