Threading was enabled by default back in 2016:
It was preceded by some other work in the area, which also came with numbers:
There are 2 basic problems here:
- while taskset and other similar factors are taken into account when deciding how many threads to spawn, there is no heuristic to determine how many threads can even do any useful work. This is particularly painful when running on a ~100-thread box and compiling a bunch of small programs, as each of them ends up spawning tons of threads which it can't utilize (a sketch of the affinity query follows this list).
- thread usage does not scale
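For reference, here is a minimal Linux-only sketch of the affinity query mentioned above, i.e. counting the CPUs that taskset/cpusets actually leave to the process rather than the machine total. This is illustrative code, not lld's:

```cpp
#include <sched.h>
#include <cstdio>

// Linux-specific: count the CPUs this process is allowed to run on,
// i.e. what taskset/cpusets permit, rather than the machine total.
static int availableCpus() {
  cpu_set_t set;
  CPU_ZERO(&set);
  if (sched_getaffinity(0, sizeof(set), &set) != 0)
    return 1; // conservative fallback if the query fails
  return CPU_COUNT(&set);
}

int main() {
  std::printf("CPUs available to this process: %d\n", availableCpus());
}
```

Run it under `taskset -c 0-3` and it reports 4 no matter how many CPUs the box has; the point is that lld already has this information available, it just has no cap on top of it.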
Even the numbers in the second linked commit say as much:

| Run | Cycles | GHz | Elapsed (s) |
|---|---:|---:|---:|
| before | 21,339,542,522 (±1.49%) | 3.143 | 6.787631 (±1.86%) |
| after | 43,116,163,217 (±3.07%) | 2.920 | 4.621228 (±1.90%) |
That is over twice the computing power (43.1G vs 21.3G cycles) spent to cut total real time by ~32% (6.79 s down to 4.62 s).
I concede this may be a sensible tradeoff in certain settings, but it is actively harmful when the machine at hand has tons of other builds to do.
To get fresher numbers I checked lld 16 on FreeBSD linking the OS kernel: there is a minor win at 2 threads which continues to 4, but past that the extra cycles buy nothing. Since Linux is a better platform to validate the result, I repeated the test on Ubuntu 22, again linking the FreeBSD kernel.
The box is a 2-socket * 24-core * 2-thread Cascade Lake, so llvm by default spawns 96 threads. Results by thread count:
| Threads | Cycles | GHz | Elapsed (s) | User (s) | Sys (s) |
|---:|---:|---:|---:|---:|---:|
| 1 | 2,717,163,727 | 3.887 | 0.700959 | 0.500675 | 0.200270 |
| 2 | 3,104,761,204 | 3.799 | 0.505411 | 0.581056 | 0.240437 |
| 4 | 4,376,501,790 | 3.634 | 0.477517 | 0.667885 | 0.548620 |
| 8 | 5,949,968,952 | 3.531 | 0.445650 | 0.705256 | 1.001784 |
| 16 | 9,124,207,475 | 3.475 | 0.434846 | 0.877085 | 1.779471 |
| 32 | 16,170,389,972 | 3.503 | 0.435720 | 0.995980 | 3.657437 |
| 64 | 28,984,684,980 | 3.518 | 0.454471 | 1.163124 | 7.127417 |
| 96 | 44,226,562,065 | 3.535 | 0.469083 | 0.958124 | 11.595758 |
As you can see, barring measurement error, any wins disappear around 4 threads. The default of spawning 96 threads burns about 10x the CPU cycles of the 4-thread case (44.2G vs 4.4G) while delivering nothing for it, and time spent in the kernel skyrockets.
I would bench linking Chrome or something else of that sort, but I don't have sensible means to do it; maybe threading fares better there.
All that said, I see 2 action items here:
- bare minimum: put a hard limit on how many threads lld is willing to spawn on its own. Say it spawns the number specified by --threads, and if the option is not provided it defaults to min(4, whatever count is found in the taskset). A sketch follows this list.
- preferable in addition to the above: add a heuristic for thread count based on the input. While I don't have good suggestions here, there is no way a helloworld-sized program has any use for this many threads, and that much should be easy to determine.
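To make the first item concrete, here is a minimal sketch of the proposed default. This is hypothetical code, not lld's actual option handling; std::thread::hardware_concurrency() stands in for an affinity-aware query like the sched_getaffinity sketch above:

```cpp
#include <algorithm>
#include <cstdio>
#include <thread>

// Hypothetical sketch of the proposed default: an explicit --threads=N is
// honored as-is; otherwise the count is capped at min(4, available CPUs).
// Note std::thread::hardware_concurrency() may return 0 when the count is
// unknown, and on some platforms it ignores the affinity mask.
static unsigned defaultLinkerThreads(unsigned requestedThreads /* 0 = --threads not given */) {
  if (requestedThreads != 0)
    return requestedThreads; // an explicit request wins, no cap
  unsigned avail = std::thread::hardware_concurrency();
  if (avail == 0)
    avail = 1;
  return std::min(4u, avail);
}

int main() {
  std::printf("no --threads given: %u\n", defaultLinkerThreads(0));
  std::printf("--threads=96:       %u\n", defaultLinkerThreads(96));
}
```

The input-based heuristic from the second item would then only ever lower this cap, e.g. based on the number of input sections, though picking the thresholds would need profiling.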
To add some context: I'm playing around with package building (literally thousands of mostly small programs), and this avoidable threading is a completely unnecessary bottleneck. For now I damage-control it with --threads=1, but I should not need to.