I recently encountered a ThinLTO crash, and I found it pretty difficult to debug and isolate the root cause.
The first issue was figuring out which file caused the crash. The way I went about it was to get the repro command for the linking step and add --threads=1 --save-temps to serialize the link, so that I could sort the temp files by most recent timestamp and determine which step within the ThinLTO job failed. While this worked for the crash I encountered, the process felt very time-consuming (and hacky), since it assumes the crash can be reproduced serially. Are there any better debugging tips for quickly figuring out which file(s) cause the crash?
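For reference, the "sort by most recent timestamp" step can be scripted. This is only a minimal sketch: the helper name newest_files is made up for illustration, and it does not assume anything about how the linker names its temp files. It just lists whatever files exist under a directory, newest first.

```python
from pathlib import Path


def newest_files(directory, limit=5):
    """Return up to `limit` regular files under `directory`, newest first.

    Hypothetical helper: it makes no assumption about the names the linker
    gives its --save-temps outputs and only compares modification times.
    """
    files = [p for p in Path(directory).rglob("*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]
```

Calling newest_files on the output directory right after the serialized link crashes should put the files written by the last backend job at the top of the list. This only helps when the link really ran serially.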
The second issue was reliably reducing the problem so that a bug could be reported upstream. To save time reducing, I was lucky that opt -O3 reproduced the same crash (but no other optimization level did). But I imagine this is not always the case, and that it is better to properly run the ThinLTO backend (which would take a lot longer). This process also felt like a hack to me. Is there anything that I could have done differently?
The fix for the crash has now landed upstream, but I wanted to poll the community to see if there are better ways to debug ThinLTO crashes.
Just to clarify the wording a bit here, we wanted to figure out which ThinLTO backend job the crash was occurring in. The hack was to make the backend jobs run serially, so that we could figure out the last temp file produced before the assertion failure and infer the backend job from that.
The dream here would be an option that makes a ThinLTO backend crash generate a reproducer tarball with the specific llvm-lto2 command to run that backend job. I don’t know how tricky that might be though.
I usually see the module name somewhere in the crash (can’t remember off the top of my head where it appears…).
Chromium has some docs on ThinLTO troubleshooting. But yes, ideally there’d be a reproducer tarball.
Also, using explicit ThinLTO backend actions fixes this issue since every ThinLTO backend compile has its own build action/clang invocation, but that requires a lot of build system support.
The difficulty you have described has been an issue for a long time, and I have always advocated that we try to move to processes instead of threads for backend actions. With processes, it is easy to identify the failing backend action and capture a reproducer. However, there are downsides to processes. They have higher overhead, we would need to implement some kind of parallel process pool runner, and IIRC ThinLTO with threads benefits from some amount of shared memory.
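To make the process idea concrete, here is a rough sketch of a parallel process pool runner. It is not how any linker implements this today; the function names and the shape of the jobs mapping (module name to argv list) are assumptions for illustration. The point is only that when each backend action is its own process, the failing module and its exact command line fall out for free.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_job(name, argv):
    """Run one backend job as its own process.

    Returns (name, argv, returncode, stderr). On POSIX a job killed by a
    signal (e.g. an assertion abort) reports a negative returncode.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    return name, argv, proc.returncode, proc.stderr


def run_backend_jobs(jobs, max_workers=4):
    """Run all jobs in parallel and return the results of those that failed.

    `jobs` is a hypothetical mapping from module name to the argv list of
    its backend command. Threads are used only to wait on child processes.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda item: run_job(*item), jobs.items()))
    return [r for r in results if r[2] != 0]
```

Each failed entry carries the module name and the argv that was run, which is exactly what a reproducer needs. The sketch ignores the process overhead and the loss of shared memory mentioned above.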
Another upside to a process-based backend action solution is that the backend actions could be distributed by allowing the user to replace the program responsible for running the backend action with a flag. This gives you a point to inject any old kind of compilation wrapper, like a distributed build wrapper such as goma. I believe ChromeOS takes this approach, but I think it was implemented by replacing the whole linker with a wrapper script instead of internalizing it within the linker.
Sounds a lot like the direction @kromanova describes in [RFC] Integrated Distributed ThinLTO