I find that on Linux glibc x86-64, --threads=8 often gives the peak performance.
Then do min(8, taskset count), see below.
With more threads, ld.lld may become slower, especially with the glibc memory allocator; the slowdown is less severe with mimalloc/jemalloc/tcmalloc/snmalloc/etc.
In my test above, the primary bottleneck was the kernel (on both systems). In particular, Linux carries massive technical debt in that there is no dedicated process abstraction: everything is a task_struct, with linkage protected by a global lock (tasklist_lock), and most of the time is spent contending on it (and on other locks).
Your own response, though, strengthens my position: if the program can’t make any sensible use of these threads, why even spawn them? They are actively detrimental to the stated goal of improving performance.
ld.lld respects sched_getaffinity (Linux) / cpuset_getaffinity (FreeBSD) / std::thread::hardware_concurrency (others), so if you confine the ld.lld process to a few cores, ld.lld will respect it.
I’m aware; I added the FreeBSD support.
When I was adding it, I noted that this only limits some of the damage and most definitely does not solve the problem.
Consider a sample workload: building FreeBSD from scratch.
There are tons of libraries and binaries to link. No matter what taskset you set over the make invocation, nothing is solved here: excess threads keep getting spawned up to said limit by each lld, and you can have quite a few running at the same time. This is particularly nasty if you have a high-core-count box like I do (see samples above). One has to damage-control by manually passing --threads=1, which should not be necessary.
You may also notice that lld’s willingness to thread is completely detached from the -j argument to make (or an equivalent).
I think it is better for a build system to do the scheduling work than to let lld be too smart. Spawned as a job action, lld processes don’t have the global information needed to make the best scheduling decisions.
But the current behavior is literally getting in the way of sensible scheduling.
Normally, whatever build system you have assumes that the stuff it spawns is either single-threaded or participating in a “job server” to regulate the workload.
lld spawning numerous threads comes out of left field, and its lack of any knowledge of job servers does not help here, not that I would recommend adding it.
If people run, say, make -j 20, they expect about 20 workers churning CPU at the same time, tops. By not explicitly using tasksets et al., they let the kernel distribute the work as it sees fit. If the box has more than 20 hw threads, lld really does a number on that expectation.
The very fact that, even in your own tests, there is a rather low value at which more threads stop providing any benefit is an argument for limiting them more aggressively than the taskset alone does.
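To put rough numbers on that expectation, here is a back-of-the-envelope sketch. It assumes the worst case where every make job slot happens to be running a link step at once and each lld spawns as many threads as it is allowed; the function name and the 64-thread box are hypothetical, picked only for illustration.

```python
def worst_case_runnable_threads(make_jobs: int, threads_per_lld: int) -> int:
    """Upper bound on runnable threads if every job slot is an lld
    process that spawns threads_per_lld threads (worst-case assumption)."""
    if make_jobs < 1 or threads_per_lld < 1:
        raise ValueError("both arguments must be at least 1")
    return make_jobs * threads_per_lld


# Hypothetical box with 64 hw threads and `make -j 20`:
uncapped = worst_case_runnable_threads(20, 64)  # each lld uses all 64
capped = worst_case_runnable_threads(20, 8)     # each lld capped at 8
single = worst_case_runnable_threads(20, 1)     # --threads=1 passed by hand
print(uncapped, capped, single)
```

Only the last case matches what someone typing make -j 20 is asking for; the capped case is still over the requested 20, but far closer than the uncapped one.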
All in all the current behavior is not sane by any means.
If you think gauging how many threads to spawn is too “magic”, at least put in a limit which damage-controls the current state.
As suggested previously: min(8, count from taskset), or whatever is passed in by --threads, if used. If someone feels like spawning more than 8, it’s their explicit choice.
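A minimal sketch of the policy I mean, written in Python rather than lld's C++ purely for brevity. The names pick_thread_count, available_cpus and DEFAULT_CAP are made up for this post and are not lld code; the value 8 is just the number from the measurements quoted above.

```python
import os

DEFAULT_CAP = 8  # suggested cap from this thread; not an existing lld constant


def available_cpus() -> int:
    """CPUs this process may run on: the affinity mask where the platform
    exposes it (e.g. Linux), otherwise the total CPU count."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def pick_thread_count(explicit_threads=None, available=None, cap=DEFAULT_CAP) -> int:
    """Return the number of threads to spawn.

    explicit_threads models a user-supplied --threads=N and always wins.
    Otherwise the count is min(cap, CPUs allowed by the taskset).
    """
    if explicit_threads is not None:
        if explicit_threads < 1:
            raise ValueError("--threads must be at least 1")
        return explicit_threads  # going above the cap is the user's explicit choice
    if available is None:
        available = available_cpus()
    return max(1, min(cap, available))


print(pick_thread_count(available=64))                       # default on a big box
print(pick_thread_count(available=4))                        # small taskset
print(pick_thread_count(explicit_threads=32, available=64))  # explicit override
```

The point of the design is that the default is bounded, while an explicit --threads is never second-guessed.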