I got a 3% performance improvement by using ThinLTO, which I think is great. But it increases my link time from 10 seconds to 20 minutes when using lld, so I am hoping to find some way to speed ThinLTO up. From the official documentation I learned that you can use --thinlto-index-only to generate the indexes, optimize each bitcode (.o) file separately, and use a distributed build to speed the process up.
The problem I’m having is that, following the commands below, everything works fine as long as I link only plain .o files.
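Roughly, the distributed flow I am trying to follow, based on the clang/lld documentation, looks like this (a simplified sketch with placeholder file names, not my exact build commands):

```sh
# 1. Compile sources to ThinLTO bitcode objects (distributable via distcc/ccache).
clang++ -O3 -flto=thin -c foo.cc -o foo.o
clang++ -O3 -flto=thin -c bar.cc -o bar.o

# 2. Thin link: emits the per-object summary indexes and import lists only;
#    no backend optimization or codegen happens here.
clang++ -flto=thin -fuse-ld=lld foo.o bar.o -o app \
    -Wl,--thinlto-index-only -Wl,--thinlto-emit-imports-files

# 3. LTO backends: each object is optimized and codegen'd independently,
#    so these jobs can be distributed.
clang++ -O3 -x ir foo.o -fthinlto-index=foo.o.thinlto.bc -c -o foo.native.o
clang++ -O3 -x ir bar.o -fthinlto-index=bar.o.thinlto.bc -c -o bar.native.o

# 4. Final native link: fast, since no LTO work is left.
clang++ -fuse-ld=lld foo.native.o bar.native.o -o app
```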
An increase from 10s to 20m is a much bigger slowdown than I have seen before. How long does your compilation take (non-LTO, before invoking lld), and how many files are you LTO linking? Also, how many CPUs does your machine have, and what level of parallelism are you using?
Yes, distributed thinlto as currently implemented for build systems such as bazel does not interact well with archives, as you have seen.
However, how much using distributed ThinLTO will help depends on how much parallelism you can get by distributing the LTO backend invocations. If you are running on a single machine it won’t be any faster (will actually be slower) than the default in-process ThinLTO. Are you trying to integrate this into a specific distributed build system?
Thank you for your answer. Regarding the questions you asked, here is some additional information.
Using ld takes about 10+ minutes. Maybe it’s because our project uses -O3 and -g, so it’s slower.
The number of linked files is about 2700.
The machine has 64 virtual cores. I’ve observed that ThinLTO uses 32 threads by default. I’ve tried increasing the number of threads, but it doesn’t help much.
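For reference, this is roughly how I am raising the thread count (a sketch assuming lld’s ELF flag; the object list is a placeholder):

```sh
# Ask lld to run more ThinLTO backend threads than its default choice.
clang++ -flto=thin -fuse-ld=lld -Wl,--thinlto-jobs=64 *.o -o app
```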
Our build system uses make, with distcc as the distributed compilation tool, so I can’t use Bazel directly. Based on articles I found earlier, I have a general idea of how to adapt distributed ThinLTO to our system. One problem is that I’m not sure how much improvement distributed ThinLTO would bring, because I found that ThinLTO’s CPU utilization cannot stay at 3200% and drops after running for a while. Maybe the backend optimization of some files takes a long time? Or has it already entered the final link stage? Is there an option to output detailed timing information?
By the way, maybe it would be more efficient if I put the thinlto cache in a shared directory?
Hi @teresajohnson.
I ran a test on another project (its build is shorter, so it is easier to verify). With lld and no LTO the link takes 3 seconds; with ThinLTO it takes 120 seconds. I collected the CPU utilization during the link by running top once per second.
I found:
- At around the 50th second, CPU utilization dropped from 3200% to 2000%, and then kept decreasing.
- At around the 60th second, it dropped to 200%.
- Later it dropped further to 100%.
- In the last 2 seconds, it rose back to 2200%.
For the low utilization between the 60th and 118th second, are there one or two files with unusually high overhead? Or is the linker performing other steps? If it is the former, can I find such files and exclude them from ThinLTO to solve the slow-link problem?
Update: it turns out that time-trace also works for the link stage (clang/lld is really powerful), which confirms the earlier guess. There are indeed several proto-related files whose backend optimization is very slow and holds up the entire stage.
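For anyone who finds this later, this is roughly how I enabled it (a sketch; assumes the clang driver with lld):

```sh
# --time-trace makes lld emit a Chrome-trace JSON that covers the ThinLTO
# backend threads, so slow modules show up as long spans.
clang++ -flto=thin -fuse-ld=lld -O3 -g -Wl,--time-trace *.o -o app
# Open the generated time-trace JSON (written next to the output, e.g.
# app.time-trace) in chrome://tracing or https://ui.perfetto.dev.
```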
Can you clarify what you mean here by “ld takes…” - are you talking about ThinLTO with (gnu or gold) ld as the linker instead of lld? Or something else?
What I was wondering about is the end to end time to build the project from scratch for non-LTO. The reason is that part of the compilation time moves into the LTO phases invoked by the linker (e.g. all of the code gen). So it really makes the most sense to compare end to end times for the build - how do these compare for a clean build?
The results you saw make sense if the compilation has some large files. In a non-LTO compile presumably the build will be dominated by the time to compile those files to native code, but the (native) link will still be fairly short. With ThinLTO a lot of time is going to be spent in the linker’s LTO threads compiling those large modules down to native code. In a non-clean build with caching, those long ThinLTO backend threads won’t re-execute as long as the associated module is not affected by any of the code changes.
By the way, maybe it would be more efficient if I put the thinlto cache in a shared directory?
Where do you have it now? If you have the opportunity to share across multiple builds of the same project, all the better.
Sorry, I misunderstood. When I say "ld takes...", I mean the time it takes to link with "ld". Does "end to end" refer to the full compile time? A build from scratch takes about 20+ minutes, of which about 10 seconds are spent in the lld link.
Through time-trace, I think I have found the problem: the optimization of a few large files holds up the overall progress.
The problem I currently encounter is that after I add -fno-lto to these “large files” and build them together with the other ThinLTO files, some symbols are reported as undefined when linking. I should be able to solve it if I take a closer look at the dependencies.
I have one more question. I ran into something unexpected while using the cache. I use ccache as the compilation cache. After make clean, even though every compilation hits the ccache cache, ThinLTO does not seem to hit its cache. If I just run two consecutive links, ThinLTO does hit the cache. What factors could cause this?
Does “end to end” refer to the full compile time? A build from scratch takes about 20+ minutes, of which about 10 seconds are spent in the lld link.
Ok, the results do make more sense then. Yes, by end to end I meant the full build (compile + link) time. With LTO a significant part of the optimization pipeline (and all of the codegen pipeline) are performed via the link step.
The problem I currently encounter is that after I add -fno-lto to these “large files” and build them together with the other ThinLTO files, some symbols are reported as undefined when linking. I should be able to solve it if I take a closer look at the dependencies.
LTO linking non-LTO and LTO objects should work just fine. Is the undefined symbol from a non-LTO or LTO object? Note if you are simulating a distributed ThinLTO build, both link steps need to link in all objects, regardless of LTO or not.
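For example, something along these lines should link cleanly (a minimal sketch with placeholder file names):

```sh
# big.cc is built as a regular native object, foo.cc as ThinLTO bitcode.
clang++ -O3 -fno-lto   -c big.cc -o big.o
clang++ -O3 -flto=thin -c foo.cc -o foo.o
# lld accepts a mix of native and bitcode objects in one link; big.o simply
# bypasses the LTO pipeline. In a distributed (--thinlto-index-only) setup,
# big.o must still be passed to both the thin link and the final native link.
clang++ -O3 -flto=thin -fuse-ld=lld big.o foo.o -o app
```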
I have one more question. I ran into something unexpected while using the cache. I use ccache as the compilation cache. After make clean, even though every compilation hits the ccache cache, ThinLTO does not seem to hit its cache. If I just run two consecutive links, ThinLTO does hit the cache. What factors could cause this?
Do you mean the ThinLTO backend threads? The caching support for these is built in to LLVM, and I don’t think that uses ccache? I am not super familiar with this though as we don’t use in process ThinLTO nor ccache internally.
ThinLTO caching is built into the linker. You need to pass an option telling it which directory to use for the cache (it differs by platform: --thinlto-cache-dir for ELF, -cache_path_lto for Mach-O); otherwise there is no caching. ccache does not handle caching for the linker.
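For ELF/lld it looks roughly like this (a sketch; the cache policy line is optional and 10g is just an example limit):

```sh
# Point lld's in-process ThinLTO at a cache directory; later links reuse the
# backend results for modules whose inputs have not changed.
clang++ -flto=thin -fuse-ld=lld \
    -Wl,--thinlto-cache-dir=/path/to/thinlto-cache \
    -Wl,--thinlto-cache-policy=cache_size_bytes=10g \
    *.o -o app
```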
For the build time, it seems you just have 1-2 object files that take much longer than the others to codegen. That might be the case for a normal build too, but the build system can hide it better by overlapping other work. Or the slow codegen only manifests after LTO optimization, which is also possible.
@cachemeifyoucan @teresajohnson
Sorry, I didn’t describe it clearly.
Here, ccache is used as the cache for the compilation phase. Even when every compilation hits the ccache cache and --thinlto-cache-dir is set, the ThinLTO cache is not hit when linking (judging from the link time), which is not what I expected. It may be a usage problem; I will check again.
At present I have found one failing scenario: a remote disk is mounted as a shared directory, a new container is started for every build, and the mounted disk is used as the --thinlto-cache-dir. Could it be that the ThinLTO cache key also includes machine information, so the cache cannot be hit?
I found the cause of the cache miss. Python’s glob.glob is used to expand wildcards such as “*.cc”, and since glob.glob does not guarantee any ordering, the order can differ between containers (when tested on the same machine, the order is the same across runs). Because llvm-ar packs the .o files in the same order as the .cc list, the offsets of the .o members within the .a end up different. The counterintuitive part is that the contents of the .o files are identical and only their offsets within the .a have changed, yet this still affects cache hits.
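The fix on our side is simply to force a stable order before the objects reach llvm-ar, e.g. sorting in the build script (sorted(glob.glob("*.cc")) in Python), or as a shell sketch with placeholder names:

```sh
# Sort the member list so the archive layout (and therefore each member's
# offset inside the .a) is identical across containers; 'D' requests
# deterministic mode for timestamps/UIDs/GIDs.
rm -f libfoo.a
llvm-ar rcsD libfoo.a $(printf '%s\n' *.o | LC_ALL=C sort)
```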
Thank you for your replies.