[RFC] Adding GNU Make Jobserver Support to LLVM for Coordinated Parallelism

1. Introduction and Motivation

LLVM and its tools, such as Clang and LLD, have increasingly adopted internal parallelism to speed up complex tasks. For example, Clang can use multiple threads for device offload compilation (--offload-jobs=N), and LLD can parallelize Link-Time Optimization (--thinlto-jobs=N). These features are controlled by LLVM’s thread pool implementation (StdThreadPool) and the llvm::parallel library, which typically scale based on the number of hardware cores available.

This creates a significant performance problem when LLVM tools are run by a parallel build system like make -jN or ninja -jN. The build system, unaware of the tool’s internal parallelism, dispatches N independent LLVM processes. Each of these processes may in turn spawn M of its own threads. This leads to a “thread explosion” of N * M total threads, which can severely overload the system, increase CPU and memory contention, and ultimately make the entire build slower than a more constrained approach.

To add a concrete example of this pain point, in our own out-of-tree development, we use a --parallel-jobs=N flag that suffers from this lack of coordination. In our Continuous Integration (CI) environment, we are constantly forced to make a difficult trade-off: setting N to a high value risks overloading the system and causing build timeouts, while setting it to a low value leads to inefficient resource utilization and slower builds. This highlights the practical need for a robust coordination mechanism.

This RFC proposes a solution: to make LLVM’s parallelism primitives “cooperative” by integrating support for the GNU Make jobserver protocol.

2. Background: What is a Jobserver?

For context, it might be useful to provide a brief background on the jobserver protocol, as it’s a specialized feature of build systems.

When you run make -jN, make needs a way to ensure that no more than N recipes are running at once. It also needs to solve a deeper problem: what if one of its child processes (e.g., a shell script or another tool) wants to run its own parallel sub-tasks? Without coordination, the system would once again become overloaded.

The jobserver is GNU Make’s solution to this. It’s a communication mechanism passed from a parent process (make) to its children. In summary:

  • A Pool of “Job Slots”: Before starting, make creates a pool of N job slots, or “tokens”.
  • Communication Channel: On Unix, this is typically a pipe. make writes N-1 single-character tokens into the pipe. On Windows, a named semaphore is used. The details are passed to child processes via the MAKEFLAGS environment variable.
  • The Implicit Slot: Every child process is automatically granted one job slot just for being invoked. This is its “implicit” slot.
  • Acquiring More Slots: If a child process wants to use more than one core for its own tasks, it must read additional tokens from the jobserver pipe. If the pipe is empty, the read will block until another process finishes and returns a token.
  • Releasing Slots: When a child process completes a unit of parallel work, it must write its acquired token back to the pipe for others to use.

This simple but effective protocol ensures that the total number of active threads across the parent make process and all its children never exceeds the user-specified limit N. This protocol has become a de-facto standard for coordination, supported not only by GNU Make but also recently by Ninja. This broad adoption makes it a modern and widely applicable solution.
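To make the handshake concrete, here is a minimal sketch of the MAKEFLAGS parsing a client performs, assuming the classic fd-pair form. The `parseJobserverFds` helper is hypothetical and deliberately ignores the `fifo:PATH` form that GNU Make 4.4 added:

```cpp
#include <cstdio>
#include <cstring>
#include <optional>
#include <string>
#include <utility>

// Illustrative only: extract the jobserver pipe descriptors from a MAKEFLAGS
// value. GNU Make advertises the channel as "--jobserver-auth=R,W" (older
// releases used "--jobserver-fds=R,W"; Make 4.4+ may instead pass
// "--jobserver-auth=fifo:PATH", which this sketch does not handle).
std::optional<std::pair<int, int>>
parseJobserverFds(const std::string &MakeFlags) {
  for (const char *Key : {"--jobserver-auth=", "--jobserver-fds="}) {
    size_t Pos = MakeFlags.find(Key);
    if (Pos == std::string::npos)
      continue;
    int ReadFd = 0, WriteFd = 0;
    if (std::sscanf(MakeFlags.c_str() + Pos + std::strlen(Key), "%d,%d",
                    &ReadFd, &WriteFd) == 2)
      return std::make_pair(ReadFd, WriteFd);
  }
  return std::nullopt;
}
```

A real client then blocks on `read(ReadFd, ...)` to acquire a token and writes the same byte back to `WriteFd` to release it.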

3. The Problem in LLVM

LLVM’s current parallelism, found in our StdThreadPool implementation and the llvm::parallel library (which provides parallelFor, parallelSort, etc.), is completely unaware of the jobserver. It queries the system for hardware concurrency and acts independently. This is the root cause of the thread explosion issue seen in key areas:

  1. Device Offloading: As my initial patch demonstrates, running make -j16 on a project that uses --offload-jobs=4 is a recipe for system overload.
  2. Link-Time Optimization (LTO): A parallel LTO link using lld --thinlto-jobs=8 inside a make -j16 build exhibits the exact same problem.

This is a generic limitation of LLVM’s parallel support. It lacks a mechanism to coordinate with the larger build environment.

4. Proposed Solution

I propose to add a native, platform-agnostic jobserver client to LLVM and integrate it into our threading libraries. The implementation is already available for review in PR #145131: https://github.com/llvm/llvm-project/pull/145131

The high-level design is as follows:

  1. New Library (llvm/Support/Jobserver): A new, lightweight library provides a JobserverClient. It handles parsing MAKEFLAGS and contains platform-specific backends for Unix (pipe/FIFO) and Windows (semaphores).

  2. New Concurrency Strategy: A new jobserver_concurrency() strategy is added to llvm/Support/Threading.h.

  3. Integration with LLVM’s Parallelism Libraries: The new strategy is integrated into both of LLVM’s main parallel execution mechanisms:

    • StdThreadPool: The thread pool’s worker loop is modified to acquire a job slot before executing a task.
    • ThreadPoolExecutor: The executor backing the llvm::parallel library is also updated. This ensures that high-level parallel algorithms like parallelFor, parallelForEach, parallelSort, and parallelTransformReduce will all respect the jobserver limit automatically when the strategy is enabled.
  4. User-Facing Control: Tools like Clang and LLD can then expose an option (e.g., --offload-jobs=jobserver or --thinlto-jobs=jobserver) to enable this cooperative behavior. When specified, the number of threads will be governed by the jobserver instead of hardware_concurrency().
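The acquire/release discipline in step 3 can be modeled with a small counting proxy. The `JobserverProxy` name and its in-memory token count below are illustrative stand-ins for the PR's pipe-backed client, not its actual API:

```cpp
// Illustrative stand-in for a jobserver client: an in-memory token count
// instead of a real pipe. With `make -jN`, N-1 tokens sit in the pipe and
// every child holds one implicit slot, so a single process sees N slots.
class JobserverProxy {
  int Slots; // implicit slot + explicit tokens still available
public:
  explicit JobserverProxy(int N) : Slots(N) {}
  // The real client would block on a pipe read; here we just fail fast.
  bool tryAcquire() {
    if (Slots == 0)
      return false;
    --Slots;
    return true;
  }
  // The real client writes the token byte back into the pipe.
  void release() { ++Slots; }
};

// Sketch of the worker-loop change: run a task only while holding a slot.
template <typename Fn> void runWithSlot(JobserverProxy &JS, Fn Task) {
  while (!JS.tryAcquire()) {
    // A real worker would block or yield here until a token is returned.
  }
  Task();
  JS.release();
}
```

With this shape, the total number of tasks executing concurrently across all cooperating processes stays bounded by the user's `-jN`.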

5. Addressing Potential Concerns

Comments on the initial PR raised valid questions about feature scope.

  • Concern: “Is build job management the compiler’s job?”
    This proposal is not about making LLVM a build system. It’s about making LLVM a “good citizen” that can coordinate with existing build systems. The jobserver protocol is the standard, established way for tools to do this. The alternative of LLVM’s parallel tools remaining ignorant of their environment is the direct cause of the performance issues we’re seeing.

  • Concern: “Alternative: Decompose the work for the build system.”
    One could imagine a system where Clang or LLD outputs a dependency graph (e.g., a JSON file) of its sub-tasks and lets the build system schedule them. While feasible for some scenarios, this would be a massive undertaking for tightly integrated tasks like LTO or the new offload driver. It would require significant changes to our tools and deep, complex integration with each build system. The jobserver approach, by contrast, is a widely supported and comparatively simple solution that directly solves the resource contention problem.

6. Feedback and Discussion

I believe this feature would be a valuable addition to LLVM, making our tools perform better and more predictably in standard parallel build environments. I’m opening this RFC to gather feedback on the idea and the proposed direction.

Some topics that might be useful to discuss include:

  • The significance of the “thread explosion” problem in use cases like parallel LTO and offloading.
  • The suitability of the jobserver protocol as a coordination mechanism for LLVM tools.
  • The proposed introduction of a new llvm/Support/Jobserver library.
  • General thoughts on the proposed implementation and its integration with LLVM’s parallelism libraries.

Thank you for your time and feedback.


Hi,

I was not familiar with the jobserver before this post. It does sound like it solves a real problem. You say it’s already supported in make, ninja, and msbuild? Which compilers support this? Is it already implemented in gcc and ld? Is there any practical difference between your suggested implementation and what gcc does in that case?

Thanks for the RFC!

I can’t really offer useful input on the details of the implementation plan, but having jobserver support at least in some key places (like ThinLTO parallelism) would be quite welcome.

The current “state of the art” is to work around this by setting the number of parallel link jobs to 1 via ninja pools, but this both yields suboptimal build performance (if the build contains many heterogeneous links) and can still result in resource exhaustion (because it does not prevent other compilations in parallel, it only prevents parallel LTO links). Properly managing the parallelism with jobserver integration would fix this for good.

cc @MaskRay for the lld perspective.

I only found that Ninja supports this protocol as a client (i.e., if you launch ninja under GNU make, ninja will honor MAKEFLAGS). Do you know if Ninja can act as a jobserver itself?

I share @Artem-B’s concern about feature creep, as compilers are tasked with build system responsibilities… That said, I agree this solution addresses the issue for make-based build systems.

Some prior art:

I know that GCC supports jobserver in its gcc -flto=jobserver option for LTO. I haven’t checked what options GCC passes to GNU ld. If we ever decide to integrate the feature for clang -fuse-ld=lld, I think we should make Clang pass --threads to LLD instead of making LLD read the MAKEFLAGS environment variable.

rust or cargo supports jobserver: Jobserver - The rustc book @ojeda

POSIX Jobserver (GNU make) has some guidance on how a tool should implement the jobserver feature.

Hi, thanks for your comment and the questions.

First, I appreciate you giving me the chance to correct a point from the original post. You are right to question the breadth of support, and I was mistaken about MSVC. The jobserver protocol is not supported by MSBuild or the MSVC compiler itself. My confusion arose from the fact that GNU Make can use its jobserver on Windows. For Windows users, the primary way to leverage this protocol today is through build systems like Ninja.

To answer your questions directly, here is a summary of the current support landscape for the jobserver protocol:

Build System Support:

  • GNU Make: The originator of the protocol. (link)
  • Ninja: Added support recently, which is a strong signal of its modern relevance. (link)
  • Cargo: The Rust build system and package manager uses it to coordinate rustc invocations. (link)

Compiler and Toolchain Support:

  • GCC: Yes, GCC has supported this for years, specifically for its Link-Time Optimization (LTO) parallelism. It auto-detects the jobserver from MAKEFLAGS to constrain its LTO threads. Users can also explicitly opt-in with the -flto=jobserver flag. (link)
  • GNU Linkers (ld, gold): No, ld.bfd and gold do not implement the jobserver protocol. While they have multithreading capabilities (e.g., gold’s --threads flag), their parallelism is not coordinated with the build system.
  • Rust (rustc): Yes, the Rust compiler is another great example. It automatically detects and uses the jobserver to limit its parallel code generation, making it a “good citizen” in parallel builds. (link)

Regarding the practical difference between our proposal and GCC’s implementation:

  • GCC’s support is specific to its LTO implementation and is realized by launching a sub-make (link)
  • Our proposal is to build this support into LLVM’s core parallelism libraries (StdThreadPool and ThreadPoolExecutor).

By doing this, we provide a generic solution for any parallel task within the LLVM ecosystem that uses llvm::parallel (e.g., parallelFor, parallelSort) or StdThreadPool directly. This includes LTO in lld and Clang’s offloading compilation, but it also future-proofs other tools. Any new parallel feature in opt, llc, or other LLVM tools could instantly and correctly coordinate with the build system.

Thanks again for the feedback!

I only found Ninja supports this protocol as a client

ninja had a PR to implement jobserver server (“master”) support, but it looks like it’s been dropped:

Ninja’s issue for adding jobserver client support is one impressive 9-year long thread: Add GNU make jobserver client support · Issue #1139 · ninja-build/ninja · GitHub
The issues discussed there have a fair amount of overlap with the situation we’re discussing here.

I think that cooperation with the job server, if it’s available, is a reasonable trade-off with a somewhat limited scope of impact on clang (the jobserver protocol has been around for ages and is unlikely to change, I think). It does provide a way to mitigate a real issue, at least in some cases. Even though ninja itself does not provide the jobserver server implementation out of the box, it would not be too hard to provide a stand-alone jobserver tool that could provide job orchestration for ninja builds for those who need it. E.g. ninja/misc/jobserver_pool.py at master · digit-google/ninja · GitHub

I think in this particular case the benefits do outweigh (though just barely) my general concerns about the feature creep. It would be a stronger case if ninja did provide jobserver functionality, but even without it it gives us a reasonable way to mitigate the issue.

I think you are right. ninja only implemented the jobserver client protocol. It still needs a parent make to be able to coordinate the commands it launches. However, it could be done with a simple wrapper Makefile that lets make pass the jobserver env vars to ninja, e.g.


# The '.PHONY' target prevents 'make' from getting confused if a file
# named 'all' or 'clean' ever exists.
.PHONY: all clean

# The default target that runs when you just type 'make'.
# The '+' prefix is crucial. It tells 'make' to continue passing the
# jobserver information to the child process (ninja) even if it's not
# another 'make' instance.
all:
	+@ninja

# A target to pass through the 'clean' command to ninja.
clean:
	+@ninja clean

# This allows you to forward any other ninja target to the ninja command.
# For example, `make my_target` will run `ninja my_target`.
%:
	+@ninja $@

Basically, instead of running ninja -j16 target, we use make -j16 target. make is responsible for opening the pipe and reading/writing the tokens, whereas ninja launches the actual commands under the control of the jobserver environment variables.

I don’t see how this proposal would help this. The primary reason people need to limit links to 1 is because ld.bfd uses such an excessive amount of memory that the machine will run out of RAM. Adding use of the jobserver protocol won’t prevent this, since it’s geared around ensuring the CPU parallelism is limited, not RAM usage.

I’m referring to ThinLTO builds here. The problem is that the machine may have enough RAM to support an all-thread ThinLTO link, but it may not have enough RAM to support both an all-thread ThinLTO link and an all-thread (minus one) compilation at the same time, if the scheduling is unlucky and the compilation jobs that run in parallel are memory hungry.

For example, if we build LLVM we make sure that things like libLLVM, libclang etc all get built individually first to ensure no parallel compilation during linking. Otherwise we’ll run into random OOMs.

So even if limiting to one link job may still be necessary if the linker itself uses a lot of memory (independent of LTO), proper jobserver integration at least avoids OOM issues related to the combination of internal linker parallelism and external compiler parallelism.

(Though seeing the above notes that ninja only supports the jobserver protocol as a client, I guess this is not actually going to help in practice…)

FWIW, from reading those reviews, I think the server support was split out from the initial PR, to reduce the scope as much as possible to get something mergeable. At least from reading earlier stages of the PR, the intent was to pick up submitting the jobserver server support as a separate PR later. (The client support has only been merged for a couple of weeks now.)

I’m not entirely sure how well this would work in practice, for something like thread pools and how at least the linker uses them.

For cases where we have a number of discrete, not-miniscule tasks (like compiling one source file), we can blockingly request tokens for each task, when we get a token we start a compilation job.

For thread pools, as used e.g. within LLD for regular parallelism in the linking (not specifically LTO compilation) - a pool of N threads is started on startup. Then within certain steps of the linking process, small tasks are executed across the thread pool, e.g. for each object file or each section chunk. Between these steps, the threads in the pool are idle.

Translating that to use a job server doesn’t seem entirely straightforward to me, but I may be overlooking something.

We can’t block startup of the thread pool on actually getting tokens for as many threads as we may want. We could request more tokens non-blockingly, and when we get tokens, we’d add a new thread to the pool. Depending on the pool size, we’d more or less end up grabbing as many tokens as are available. This may be fine for a short, intensive linking process, but for a process that takes a longer time, with some threads being idle, we’d want to return tokens to the jobserver when the parallel step has finished (and reduce the size of the thread pool at the same time).

An alternative approach would be to request a token from the jobserver for each individual parallel piece of work, and start working on each piece when we get a token. This would make it ideal in resource usage (and would work like e.g. ninja/make work themselves) - but as the pieces of work (e.g. object file or section chunk) are miniscule and very numerous, I’m afraid that would slow down these otherwise very tight loops notably.

For doing the LTO compilation, scheduling the individual compilation tasks through a jobserver may be quite a good fit though (but I’m not sure how that’s implemented right now, whether that also uses a thread pool or something else).

So I’m not entirely sure how this would work for a generic thread pool - but I’d be interested in hearing if you have any concrete ideas on how the thread pools would interact with the jobserver!

Two comments:

  1. Is it really appropriate to use this for the parallel algorithms? It might be most of the time, but threads that are usually not runnable shouldn’t be taking a token.
  2. It’s a problem if the compiler forgets to release a token back to the jobserver. You can get around this by always allowing at least one job to run, but then you’re not fixing the whole problem. I’m not sure what GCC does in this case.

Still, I think it’s a good idea. FWIW, MSVC has parallel LTO (actually parallel non-LTO as well) and manages things using a named semaphore. It always allows a few jobs to run though, which can degrade performance.

Oh, and make also uses a named semaphore on Windows; you probably want to preserve that, even though Windows does have named pipes.

I like jobserver support in general for LLVM tools.

This is pretty much the only other option, and is what Xcode does for parallel module builds now, but it requires deep build system integration. It’s not really viable for CMake driven builds.

I believe you just put this in the slow path: if the work-stealing loop cannot find new work within a few milliseconds, it returns its token and goes to sleep. If new work arrives, it can then wake up and try to get a new token. That keeps it all on the slow path, so performance should work out pretty well in both regimes. (This goes along quite well with my very recent RFC to make a work-stealing scheduler available to LLVM, so I’ve put some thought into this lately.)
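A single-threaded model may help make that slow path concrete. The `step` function below, and the spin-count limit standing in for the millisecond grace period, are purely illustrative:

```cpp
#include <deque>

struct Worker {
  bool HoldsToken = false;
  int IdleSpins = 0;
};

// One scheduling step for a worker against a shared task queue and a shared
// token pool. Returns true if a task was executed this step. After more than
// SpinLimit consecutive idle steps (standing in for the millisecond timeout),
// the worker returns its token and goes back to sleep.
bool step(Worker &W, std::deque<int> &Queue, int &Tokens, int SpinLimit) {
  if (!W.HoldsToken) {
    if (Tokens == 0)
      return false;                // stay asleep until a token is free
    --Tokens;
    W.HoldsToken = true;
  }
  if (!Queue.empty()) {
    Queue.pop_front();             // "execute" one task
    W.IdleSpins = 0;
    return true;
  }
  if (++W.IdleSpins > SpinLimit) { // slow path: give the token back
    ++Tokens;
    W.HoldsToken = false;
    W.IdleSpins = 0;
  }
  return false;
}
```

The fast path (queue non-empty) never touches the token pool, so the tight per-chunk loops stay cheap; tokens only move when a worker runs dry.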


The first job token is implicit; you only need to acquire/release tokens for the second, third, … jobs spawned. From experience with a package manager, I know that it rarely happens that tokens are not released – the top-level make that creates the jobserver warns about it on exit.

GCC does not deal with acquiring/releasing job tokens itself; it generates a makefile and makes it make’s responsibility. This obviously has downsides and doesn’t work for LLVM’s use cases. But it does have some advantages beyond a simplified implementation: the sub-make also respects flags set by the user when invoking the top-level make, such as --load-average=M, which prevents spawning jobs when the system is busy (complementary to -j N), and flags like --output-sync=... that keep build logs orderly.


Yeah, both do.