Avoidable overhead from threading by default

Threading was enabled by default back in 2016:

It was preceded by some other work in the area, which also came with numbers:

There are 2 basic problems here:

  1. While taskset and similar factors are taken into account when deciding how many threads to spawn, there is no heuristic to determine how many threads can actually do useful work. This is particularly painful when running on a ~100-thread box and compiling a bunch of small programs, as each invocation ends up spawning tons of threads it cannot utilize.
  2. Thread usage does not scale.

Even the numbers in the second linked commit say as much:

    21,339,542,522 cycles                    #    3.143 GHz                      ( +-  1.49% )
       6.787630744 seconds time elapsed                                          ( +-  1.86% )

vs

    43,116,163,217 cycles                    #    2.920 GHz                      ( +-  3.07% )
       4.621228056 seconds time elapsed                                          ( +-  1.90% )

Over twice the computing power spent to cut total real time by ~30%.

I concede this may be a sensible tradeoff in certain settings, but it is actively harmful when the machine at hand has tons of other builds to do.

To get fresher numbers I checked lld 16 on FreeBSD linking the OS kernel. There is a minor win at 2 threads which continues up to 4; past that, more cycles are spent without saving anything. Since Linux would be a better platform to validate the result, I repeated the test on Ubuntu 22, once more linking the FreeBSD kernel.

The box is a 2 socket * 24 core * 2 thread Cascade Lake, thus LLVM by default spawns 96 threads.

    threads           cycles    GHz   elapsed (s)     user (s)      sys (s)
          1    2,717,163,727  3.887   0.700959349  0.500675000  0.200270000
          2    3,104,761,204  3.799   0.505410756  0.581056000  0.240437000
          4    4,376,501,790  3.634   0.477517145  0.667885000  0.548620000
          8    5,949,968,952  3.531   0.445650034  0.705256000  1.001784000
         16    9,124,207,475  3.475   0.434845947  0.877085000  1.779471000
         32   16,170,389,972  3.503   0.435720385  0.995980000  3.657437000
         64   28,984,684,980  3.518   0.454470824  1.163124000  7.127417000
         96   44,226,562,065  3.535   0.469082744  0.958124000 11.595758000

As you can see, barring measurement error, any wins disappear around 4 threads. The default case of spawning 96 threads spends 10x the CPU cycles of the 4-thread case while delivering nothing for it, with time spent in the kernel skyrocketing.

I would bench linking Chrome or something else of that sort, but I don’t have sensible means to do it; maybe lld does better in that case.

All that said, I see 2 action items here:

  1. Bare minimum: put a hard limit on how many threads lld is willing to spawn on its own. Say it spawns the number specified by --threads, and if the option is not provided, it uses min(4, whatever count is found in the taskset); see the sketch after this list.
  2. Preferably, in addition to the above: add a heuristic for thread count based on the input. While I don’t have good suggestions here, there is no way a helloworld-sized program has any use for this many threads, and that much should be easy to determine.
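
For the first item, here is a minimal sketch of what the default could look like, assuming LLVM’s llvm::parallel::strategy global and the hardware_concurrency() helper from llvm/Support/Threading.h; pickThreadCount and threadsArg are made-up names, and the cap of 4 is just the number floated above, not an actual patch:

    // Hypothetical sketch of the proposed default; not actual lld code.
    #include "llvm/Support/Parallel.h"
    #include "llvm/Support/Threading.h"
    #include <algorithm>

    static void pickThreadCount(unsigned threadsArg /* 0 = --threads not given */) {
      unsigned n;
      if (threadsArg) {
        n = threadsArg; // an explicit --threads=N always wins
      } else {
        // hardware_concurrency() already honors the taskset/affinity mask,
        // so this only caps the implicit default, here at 4.
        unsigned avail = llvm::hardware_concurrency().compute_thread_count();
        n = std::min(4u, avail);
      }
      llvm::parallel::strategy = llvm::hardware_concurrency(n);
    }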

To add some context, I’m playing around with package building (literally thousands of mostly small programs) and the avoidable threading is a completely unnecessary bottleneck. I do damage control it for now with --threads=1, but I should not need to.

I believe that you have to manage --threads yourself. There are similar issues with ninja: it has no notion of jobs that are themselves parallel. For your 48-core Cascade Lake, ninja will start ~48 jobs. If many of them are lld, you will completely oversubscribe the machine.

@MaskRay I was told you are the person to prod concerning this issue.

I find that on Linux glibc x86-64, --threads=8 often gives the peak performance. With more threads ld.lld may become slower, especially with the glibc memory allocator, but not that bad with mimalloc/jemalloc/tcmalloc/snmalloc/etc (lld with different malloc implementations · GitHub).

ld.lld respects sched_getaffinity (Linux) / cpuset_getaffinity (FreeBSD) / std::thread::hardware_concurrency (others), so if you confine the ld.lld process to a few cores, ld.lld will respect it.
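
For illustration, this is roughly what that query involves on Linux; a minimal sketch, not lld’s actual code:

    // Linux-only illustration of honoring the CPU affinity mask; a sketch,
    // not lld's actual code.
    #include <sched.h>
    #include <thread>

    static unsigned availableCpus() {
      cpu_set_t mask;
      CPU_ZERO(&mask);
      // Ask the kernel which CPUs this process may run on; taskset and
      // cpusets shrink this set, and the count below shrinks with it.
      if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
        return CPU_COUNT(&mask);
      // Portable fallback when no affinity query is available.
      return std::thread::hardware_concurrency();
    }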

Capping config->threadCount to a hard limit makes things too magical, so I don’t feel comfortable doing it. Guessing a good config->threadCount from the input is even more magical, and we definitely don’t want to do that.

I think it is better for a build system to do the scheduling work than to let lld be too smart. Spawned as job actions, lld processes don’t have the global information needed to make the best scheduling decisions.

For normal uses within or outside a build system, in the absence of more information the current strategy is not bad: let compile jobs use --threads=1 (compilation is embarrassingly parallel across jobs) and let link jobs use all available cores by default (there is typically just one link job, or very few, running concurrently). If some users don’t like the default, they can create an ld.lld shell script.

I find that on Linux glibc x86-64, --threads=8 often gives the peak performance.

Then do min(8, taskset count), see below :slight_smile:

With more threads ld.lld may become slower, especially with the glibc memory allocator, but not that bad with mimalloc/jemalloc/tcmalloc/snmalloc/etc

In my test above the primary bottleneck was the kernel (on both systems). In particular, Linux carries massive technical debt here: there is no dedicated process abstraction; everything is a task_struct whose linkage is protected by a global lock (tasklist_lock), and most of the time is spent contending on it (and other locks).

Your own response strengthens my position, though: if the program can’t make any sensible use of these threads, why even spawn them? They are actively detrimental to the stated goal of improving performance.

ld.lld respects sched_getaffinity (Linux) / cpuset_getaffinity (FreeBSD) / std::thread::hardware_concurrency (others), so if you confine the ld.lld process to a few cores, ld.lld will respect it.

I’m aware, I added the FreeBSD support.

When I was adding it, I noted that this only damage-controls part of the problem and most definitely does not solve it.

Consider a sample workload: building FreeBSD from scratch.

There are tons of libraries and binaries to link. No matter what taskset you set over the make invocation, nothing is solved: each lld keeps spawning excess threads up to said limit, and quite a few of them can be running at the same time. This is particularly nasty on a high-core box like mine (see the samples above). One has to damage-control by manually passing --threads=1, which should not be necessary.

You may also notice that lld’s willingness to thread is completely detached from the -j argument to make (or an equivalent).

I think it is better for a build system to do the scheduling work than to let lld be too smart. Spawned as job actions, lld processes don’t have the global information needed to make the best scheduling decisions.

But the current behavior is literally getting in the way of sensible scheduling.

Normally whatever build system you have assumes that the stuff it spawns is single-threaded or participates in a “job server” to regulate the workload.

lld spawning numerous threads comes out of left field, and it has no knowledge of job servers that could help here (not that I would recommend adding that, though).
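
For reference, the GNU make jobserver protocol itself is tiny: a client takes one token from a shared pipe per extra worker and puts it back when done. A sketch of a client under that assumption (POSIX pipe variant; parsing --jobserver-auth= out of MAKEFLAGS and EINTR handling omitted):

    // Sketch of a GNU make jobserver client; simplified, illustration only.
    #include <unistd.h>

    // Block until make grants a slot for one extra worker.
    static bool acquireToken(int readFd, char &token) {
      return read(readFd, &token, 1) == 1;
    }

    // Return the slot so other jobs can use it.
    static void releaseToken(int writeFd, char token) {
      ssize_t rc = write(writeFd, &token, 1); // hand the token back to make
      (void)rc;
    }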

If people run, say, make -j 20, they expect about 20 workers churning CPU at the same time, tops. By not explicitly using taskset et al., they let the kernel distribute the work as it sees fit. If the box has more than 20 hardware threads, lld really pulls a number on that expectation.

The very fact that even in your own tests there is a rather low value past which more threads stop providing any benefit is an argument for limiting them more aggressively than just honoring the taskset.

All in all the current behavior is not sane by any means.

If you think gauging how many threads to spawn is too “magic”, at least put in a limit which damage-controls the current state.

As suggested previously: min(8, count from taskset), or whatever is passed in by --threads, if used. If someone feels like spawning more than 8, it’s their explicit choice.

Yes, I agree completely. The default of spawning as many threads as there are cores is something from the past, which should no longer be done today. It might have seemed to scale in the dual or quad core times, but not anymore… :slight_smile:

I feel that min(N, taskset_count) is magical, but if it delivers a noticeable improvement I think using it is fine. It seems that everyone observes possibly negative statistics on x86-64 hosts when the used concurrency is larger than 8? Created ⚙ D147493 [ELF] Cap parallel::strategy to 8 cores when --threads= is unspecified for min(8, taskset_count).

I have tested an aarch64 host in the past. I saw negative statistics when the used concurrency was larger than 8 or 16. If it doesn’t take much effort, having more numbers would be nice and appreciated :)

Thanks for the patch.

I feel that min(N, taskset_count) is magical

cmon mate

The original work which I linked in the opening comment should already have introduced a cap to maintain basic sanity (as a bare-minimum solution), seeing how the scalability of the thing is not very good. Even if it were embarrassingly parallel, there is only so much work to feed to the threads, once more showing that rolling with whatever hardware thread count one finds is going to be wasteful past $some_count.
Again, the real solution would spawn roughly the number of threads which can do any work to begin with.

Perhaps instead of spawning threads upfront the code could react to the backlog, but then again this may be complicated to get right.
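
To make the “react to the backlog” idea concrete, here is a minimal sketch in plain C++, purely illustrative and not lld code: a pool that spawns a worker only when a task arrives while nobody is idle, up to a hard cap.

    // Illustrative only: a pool that grows in reaction to backlog instead
    // of spawning all workers upfront. Not lld code.
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class LazyPool {
      std::mutex m;
      std::condition_variable cv;
      std::queue<std::function<void()>> q;
      std::vector<std::thread> workers;
      unsigned idle = 0;
      bool done = false;
      const unsigned cap; // hard upper bound on worker count

      void run() {
        std::unique_lock<std::mutex> lk(m);
        for (;;) {
          ++idle;
          cv.wait(lk, [this] { return done || !q.empty(); });
          --idle;
          if (q.empty())
            return; // only reachable once done is set
          auto task = std::move(q.front());
          q.pop();
          lk.unlock();
          task(); // run the task outside the lock
          lk.lock();
        }
      }

    public:
      explicit LazyPool(unsigned cap) : cap(cap) {}

      void submit(std::function<void()> task) {
        std::lock_guard<std::mutex> lk(m);
        q.push(std::move(task));
        if (idle > 0 || workers.size() >= cap)
          cv.notify_one();                            // reuse an existing worker
        else
          workers.emplace_back(&LazyPool::run, this); // grow only under backlog
      }

      ~LazyPool() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
        for (auto &t : workers)
          t.join();
      }
    };

The point is that worker count tracks demand: a link with little parallel work never grows past a thread or two, while a big one climbs toward the cap.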

This reminds me a little bit of code from the 90s or earlier with ideas like “do a malloc sized to 10% of RAM”. Or more to the point, not everything is a laptop. :wink:

All in all, some degree of resource management is necessary, however magical it may or may not be. Note I never claimed introducing the cap is a clean solution, just that it damage-controls the original code.

With the cap in place interested parties can take their time implementing Something Better™. My interest in the area boiled down to making sure lld does not eat my CPU for no reason, which for my usecase is accomplished.

I also note, referring to your earlier comment:

I think it is better for a build system to do the scheduling work than to let lld be too smart. Spawned as job actions, lld processes don’t have the global information needed to make the best scheduling decisions.

The build system knowing better than lld how many threads lld can handle is some magic right there.

If, hypothetically, you were to patch lld to participate in the job server, spawning threads upfront would already have done the damage. Even more so if you were to eat job tokens to spend on spawning said threads.

To put it differently, threaded lld playing nicely with the build system demands informed threading.

I have tested an aarch64 host in the past. I saw negative statistics when the used concurrency was larger than 8 or 16. If it doesn’t take much effort, having more numbers would be nice and appreciated :)

As noted earlier, this all induces contention in the kernel, both on Linux and FreeBSD (and really on any other system). I don’t have lock profiling results handy, but avoiding the problem alleviates contention in the VM subsystem and within umtx (the futex equivalent) on FreeBSD, to give you one example.

If you want ready-to-use numbers just take what I pasted in the first comment. Really hard to argue with it. :wink:

Thought experiment: LLD2. It will not be a replacement for LLD. On the contrary, the sum of LLD and LLD2 will cover more use cases. If you have 64 cores and just want to link object files into a binary, then LLD2 will be the right linker. If you need linker scripts, then LLD will be the right linker. LLD2 will be tuned for maximum speed and parallelism with a reduced feature set. It will support the common case and (distributed) ThinLTO. Linux will ship both linkers!