Avoidable overhead from threading by default

Threading was enabled by default back in 2016:

It was preceded by some other work in the area, which also came with numbers:

There are 2 basic problems here:

  1. While taskset and other similar factors are taken into account when deciding how many threads to spawn, there is no heuristic to determine how many threads can even do any useful work. This is particularly painful when running on a ~100-thread box and compiling a bunch of small programs, as each of them ends up spawning tons of threads it cannot utilize (see the affinity example after this list).
  2. Thread usage does not scale: past a handful of threads, additional workers only burn cycles without reducing wall-clock time.
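
To illustrate the first point: since lld derives its default pool size from the affinity mask, one external knob, short of passing --threads, is to shrink that mask. A sketch (objects.rsp is a hypothetical response file standing in for the real input list):

    # restrict the process to 4 CPUs; lld's automatic thread count follows the mask
    taskset -c 0-3 ld.lld -o kernel.tmp @objects.rsp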

Even the numbers in the second linked commit say as much:

    21,339,542,522 cycles                    #    3.143 GHz                      ( +-  1.49% )
       6.787630744 seconds time elapsed                                          ( +-  1.86% )

vs

    43,116,163,217 cycles                    #    2.920 GHz                      ( +-  3.07% )
       4.621228056 seconds time elapsed                                          ( +-  1.90% )

That is over twice the computing power (43.1G vs. 21.3G cycles) spent to drop total real time by roughly 32% (6.79 s down to 4.62 s).

I concede this may be a sensible tradeoff in certain settings, but it is actively harmful when the machine at hand has tons of other builds to do.

To get fresher numbers I checked lld 16 on FreeBSD linking the OS kernel. There is a minor win going from 1 to 2 threads, and it continues up to 4; past that point, more cycles are spent without saving anything. Seeing as Linux would be a better platform to validate the result, I repeated the experiment on Ubuntu 22, once more linking the FreeBSD kernel.

The box is a 2-socket * 24-core * 2-thread Cascade Lake, thus llvm by default spawns 96 threads.
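
The sweep below was measured with perf stat; a reproduction would look roughly like the following sketch, where objects.rsp again stands in for the actual kernel object list:

    # hypothetical sweep over lld's thread count; output and inputs are placeholders
    for n in 1 2 4 8 16 32 64 96; do
        echo "--threads=$n"
        perf stat -e cycles ld.lld --threads=$n -o kernel.tmp @objects.rsp
    done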

| --threads | cycles | GHz | elapsed (s) | user (s) | sys (s) |
|----------:|---------------:|------:|------------:|---------:|--------:|
| 1 | 2,717,163,727 | 3.887 | 0.701 | 0.501 | 0.200 |
| 2 | 3,104,761,204 | 3.799 | 0.505 | 0.581 | 0.240 |
| 4 | 4,376,501,790 | 3.634 | 0.478 | 0.668 | 0.549 |
| 8 | 5,949,968,952 | 3.531 | 0.446 | 0.705 | 1.002 |
| 16 | 9,124,207,475 | 3.475 | 0.435 | 0.877 | 1.779 |
| 32 | 16,170,389,972 | 3.503 | 0.436 | 0.996 | 3.657 |
| 64 | 28,984,684,980 | 3.518 | 0.454 | 1.163 | 7.127 |
| 96 | 44,226,562,065 | 3.535 | 0.469 | 0.958 | 11.596 |

As you can see, barring measurement error, any wins disappear around 4 threads. The default case of spawning 96 threads spends 10x the CPU cycles of the 4-thread case (44.2G vs. 4.4G) while delivering nothing for it, with the time spent in the kernel skyrocketing.

I would benchmark linking Chrome or something else of that sort, but I don't have a sensible means to do it; perhaps threading does better in that case.

All that said, I see 2 action items here:

  1. Bare minimum: put a hard limit on how many threads lld is willing to spawn on its own. Say it spawns the number specified by --threads, and if the option is not provided it uses min(4, whatever count is found in the taskset); a sketch of the idea follows this list.
  2. Preferably, in addition to the above: add a heuristic for thread count based on the input. While I don't have good suggestions here, there is no way a helloworld-sized program will have any use for this many threads, and this much should be easy to determine.
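
To make item 1 concrete, here is a rough sketch of the capping policy, written as a shell wrapper purely for illustration; the real change belongs in lld's driver, and the input-based heuristic from item 2 would have to live there as well:

    #!/bin/sh
    # Illustrative only: cap lld's automatic thread count at
    # min(4, CPUs in the affinity mask) unless --threads was given.
    case "$*" in
    *--threads*) exec ld.lld "$@" ;;    # an explicit request always wins
    esac
    n=$(nproc)                          # nproc honors taskset/affinity
    [ "$n" -gt 4 ] && n=4
    exec ld.lld --threads="$n" "$@"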

To add some context, I'm playing around with package building (literally thousands of mostly small programs) and the avoidable threading is a completely unnecessary bottleneck. For now I damage-control it with --threads=1, but I should not need to.
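
For the record, the damage control amounts to something like this in the build environment (assuming the individual builds honor LDFLAGS; both clang and gcc forward -Wl, options to the linker):

    # force single-threaded lld for everything built in this environment
    export LDFLAGS="${LDFLAGS} -Wl,--threads=1"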

I believe you have to manage --threads yourself. There are similar issues with ninja: it has no notion of jobs being internally parallel. For your 48-core Cascade Lake, ninja will start ~48 jobs, and if many of them are lld, you will completely oversubscribe the machine.
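
For CMake-based projects there is at least a partial mitigation on the build-system side: Ninja job pools can cap how many link jobs run concurrently, independently of -j (a sketch, assuming a CMake + Ninja build):

    # allow at most 4 concurrent link jobs regardless of how many
    # compile jobs ninja runs in parallel
    cmake -G Ninja -DCMAKE_JOB_POOLS="link_pool=4" \
          -DCMAKE_JOB_POOL_LINK=link_pool path/to/source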

@MaskRay I was told you are the person to prod concerning the issue