Parallel/thread-safe algorithms, allocators, and containers

This post documents what I know about the lack of parallel/thread-safe facilities, which restricts our choices in parallel algorithms. I hope it spurs discussion on improving the status quo.

Parallel algorithms

MLIR and lld are the main users.

llvm/Support/Parallel.h has straightforward implementations of some parallel algorithms. It would be good to know how they compare with mature external libraries.

We don’t provide a counterpart to [algorithms.parallel_for_each.feeder], which lets a loop body dynamically add more work while the loop runs.
A feeder could be used to parallelize section-based garbage collection in the various lld ports.
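
For illustration, here is a minimal sequential sketch of the feeder contract (Feeder and forEachWithFeeder are hypothetical names, not existing LLVM APIs; a real parallel version would also need a synchronized work list and worker threads):

    #include <deque>
    #include <utility>

    // Sketch of the feeder contract only, not a parallel implementation.
    template <typename T> class Feeder {
      std::deque<T> &Queue;

    public:
      explicit Feeder(std::deque<T> &Q) : Queue(Q) {}
      // Called from a loop body to enqueue newly discovered work.
      void add(T Item) { Queue.push_back(std::move(Item)); }
    };

    template <typename T, typename Fn>
    void forEachWithFeeder(std::deque<T> Work, Fn Body) {
      Feeder<T> F(Work);
      while (!Work.empty()) {
        T Item = std::move(Work.front());
        Work.pop_front();
        Body(Item, F); // Body may call F.add() to feed more items.
      }
    }

In section-based GC, Body would mark a section live and feed the sections referenced by its relocations, so newly discovered live sections flow through the same loop.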

Our parallelForEachN does not support nested usage. The outermost loop is parallel while the inner ones are sequential.
See ⚙ D61115 Parallel: only allow the first TaskGroup to run tasks parallelly for the limitation.
In lld/ELF, writeSections may benefit from nested support (Why isn’t ld.lld faster? | MaskRay).
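
For illustration, only the outer loop below actually runs in parallel with the current implementation (a sketch using llvm::parallelForEach; the nested call degrades to a sequential loop):

    #include "llvm/Support/Parallel.h"
    #include <vector>

    void processGroups(std::vector<std::vector<int>> &Groups) {
      // The outer loop is dispatched to worker threads.
      llvm::parallelForEach(Groups, [](std::vector<int> &G) {
        // Only the first TaskGroup runs tasks in parallel, so this inner
        // call executes sequentially on the current thread.
        llvm::parallelForEach(G, [](int &X) { X += 1; });
      });
    }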

Our parallelForEachN uses a coarse TaskSize estimate:

    enum { MaxTasksPerGroup = 1024 };
    ...
    auto TaskSize = NumItems / parallel::detail::MaxTasksPerGroup;

For a loop with a trip count below 1024, TaskSize == 1, and a given thread is usually assigned strided indices, losing locality. E.g. with 4 threads, thread x may be assigned indices i*4+x, assuming every thread spends exactly the same time finishing one task. An optimal assignment would be contiguous: thread 0 gets [0,N/4), thread 1 gets [N/4,2N/4), and so on.
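
A locality-preserving alternative would hand each thread one contiguous chunk. A minimal sketch (splitRange is a hypothetical helper, not the actual Parallel.h logic):

    #include <cstddef>
    #include <utility>
    #include <vector>

    using std::size_t;

    // Divide [0, NumItems) into NumThreads contiguous chunks so each thread
    // touches a consecutive index range instead of strided indices.
    std::vector<std::pair<size_t, size_t>> splitRange(size_t NumItems,
                                                      size_t NumThreads) {
      std::vector<std::pair<size_t, size_t>> Chunks;
      size_t Base = NumItems / NumThreads, Rem = NumItems % NumThreads;
      for (size_t I = 0, Begin = 0; I != NumThreads; ++I) {
        size_t End = Begin + Base + (I < Rem ? 1 : 0); // spread remainder
        Chunks.push_back({Begin, End});
        Begin = End;
      }
      return Chunks;
    }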

BumpPtrAllocator

llvm/Support/Allocator.h defines BumpPtrAllocator, whose Allocate function cannot be called concurrently.
In various lld ports, section and symbol initialization may benefit from multi-threading.

Personally, I lean toward not needing a thread-safe BumpPtrAllocator. We just need thread-local BumpPtrAllocator in lld/Common/Memory.cpp.
Many modern allocators focus on NUMA-aware allocation and per-thread memory reference locality (as evidenced by ⚙ D119909 [ELF] Parallelize initializeLocalSymbols).
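
A minimal sketch of the thread-local direction (makeLocal and ThreadAlloc are hypothetical names, not the current lld/Common/Memory.cpp API):

    #include "llvm/Support/Allocator.h"
    #include <new>
    #include <utility>

    // Each thread bump-allocates from its own pool, so the allocation path
    // needs no locking; other threads may still read the objects once
    // published. Caveat: the pool is freed at thread exit, and
    // BumpPtrAllocator does not run destructors.
    static thread_local llvm::BumpPtrAllocator ThreadAlloc;

    template <typename T, typename... Args> T *makeLocal(Args &&...A) {
      return new (ThreadAlloc.Allocate(sizeof(T), alignof(T)))
          T(std::forward<Args>(A)...);
    }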

Note: in lld, we heavily use make<X>(...), which instantiates a SpecificBumpPtrAllocator<T>. One instance appears to cost several kilobytes. See ⚙ D108850 [LLD] Remove global state in lldCommon (lld/Common/Memory.cpp) for the additional complexity in making this free of global state.

malloc implementation

For clang and chrome, I have measured that linking against libmimalloc.a makes ld.lld 1.12x as fast (jemalloc or an older tcmalloc from gperftools: ~1.10x).
(lld 14 ELF changes | MaskRay)

If I maintained the lld package for a distro, I’d suggest something like -DCMAKE_EXE_LINKER_FLAGS=-Wl,--push-state,$HOME/Dev/mimalloc/out/release/libmimalloc.a,--pop-state for both clang and lld.
The ~100KiB size increase pays off.

That said, I am unsure what to do with ⚙ D101427 Linux support for mimalloc as a custom allocator, which adds CMake support.
Perhaps distributors can weigh in on what they expect from CMake support.

Thread-safe containers

That is, the container performs internal synchronization, so that multiple threads can concurrently call member functions without having to manage the synchronization themselves.

llvm-project lacks thread-safe counterparts to vector and unordered_map.
This restricts our choices in parallel algorithms.
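
For illustration, the common workaround is a sharded map with one lock per shard (a minimal sketch, not an existing LLVM container):

    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <unordered_map>
    #include <utility>

    // Keys hash to one of NumShards sub-maps, each guarded by its own
    // mutex, so threads contend only when they hit the same shard.
    template <typename K, typename V, std::size_t NumShards = 16>
    class ShardedMap {
      struct Shard {
        std::mutex M;
        std::unordered_map<K, V> Map;
      };
      Shard Shards[NumShards];
      Shard &shardFor(const K &Key) {
        return Shards[std::hash<K>{}(Key) % NumShards];
      }

    public:
      void insert(const K &Key, V Val) {
        Shard &S = shardFor(Key);
        std::lock_guard<std::mutex> Lock(S.M);
        S.Map.emplace(Key, std::move(Val));
      }
      bool lookup(const K &Key, V &Out) {
        Shard &S = shardFor(Key);
        std::lock_guard<std::mutex> Lock(S.M);
        auto It = S.Map.find(Key);
        if (It == S.Map.end())
          return false;
        Out = It->second;
        return true;
      }
    };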

Third-party software

It is an open question whether we can outsource some work to a piece of third-party software.

@aganea mentioned on Parallel input file parsing - #7 by aganea:

I feel those two topics point out again the lack in LLVM of thread-safe, lock-free containers & memory allocator. This boils down to the question of whether we should reference or vendor external packages from LLVM. Right now it seems that only “optional” external libs are referenced by LLVM, and any “required” external library has to be duplicated/vendored in the monorepo (and inherently fit the LLVM license).

Writing a cross-platform, lock-free allocator isn’t trivial (although we have Scudo), and neither is rewriting all the common containers (map, vector, deque, etc.) in a thread-safe, lock-free manner.
If we could rely on Intel TBB, that would solve both raised issues (since Intel TBB has tbbmalloc). Adding mimalloc or rpmalloc would further improve the situation.

Would it be an option to let LLD rely on external submodules?

Aside from the usual performance and maintenance considerations, the question also involves licensing and portability (whether the software is at least as portable as llvm-project itself: OS support, architecture support).

It’s useful to know our gap to a mature implementation. Say, if the in-tree implementation is no more than 10% slower for the particular algorithm/container we need, perhaps we shouldn’t bother with outsourcing.


@smeenai @int3 @oontvoo @tobiashieta @River707 @aganea


First, I am a bit biased here since we already embed rpmalloc in our toolchain distribution. But my preference would be to go with an existing implementation, since allocation is a well-understood area and getting it “just right” is tricky.

rpmalloc supports a lot of platforms already, has simple and easily extensible code, and is in the public domain. And the maintainer is very reasonable and we can probably get some help if needed.

Another option is something like tcmalloc, which is developed by Google, so there is a lot of prior art for upstreaming from Google to LLVM. But I don’t know whether it’s a better or worse fit than rpmalloc.

That’s my two cents at least.


We just need thread-local BumpPtrAllocator in lld/Common/Memory.cpp.

I’m wondering how that would work with parallel input file parsing. Presumably the thread that parses a given file has to allocate memory, and that memory may be accessed from a different thread later on…

rpmalloc supports a lot of platforms already, has simple and easily extensible code, and is in the public domain.

The question is whether it’ll perform better than a bump allocator for our use case (where we basically never free anything). I guess we can run some benchmarks to see how it performs even before parallelization.

Our parallelForEachN does not support nested usage. The outermost loop is parallel while the inner ones are sequential.

Another (minor) issue is that if parallelForEach is run in multiple different threads, there is no global coordination to limit the number of threads. (See e.g. ⚙ D115416 [lld-macho] Make writing map file asynchronous; cc @thevinster.)


Thanks @MaskRay for bringing this topic up! This has been a huge pain point in MLIR for years, and really holds back the multi-threaded usability of the project.

I should clarify here that we (MLIR) dropped all uses of Parallel.h. Lots of problems with performance/usability (e.g. quite a few issues with MSVC that we never quite solved), the way the thread pool is set up, etc., prompted us to drop it. We now use a single ThreadPool held by the MLIRContext, but ThreadPool has its own host of usability issues and annoyances (e.g. it also doesn’t support nested parallelism).

I don’t think I’ve considered this one much in the past, but this is potentially something that MLIR could take advantage of. We currently use a hand-rolled sharded dense map for uniquing attributes/types/etc. that keeps an Allocator per shard. I’m not sure how much of a problem access to the allocator itself is though, given the bigger issue of the DenseMap needing a lock. It would be something that I’d only really want to consider after having a good concurrent hashmap.

This is quite important. MLIR has a few, I dare say, “creative” solutions to reduce lock contention, and we would only really be able to get rid of those if we have something usable on all of the platforms supported.

– River


To +1 this thread, we need a solution to this in the LLVM world. I encountered this most recently in the CIRCT community (e.g. see the talk I gave at the LLVM Dev Meeting complaining about this), which cares a lot about parallel compilation and not-strictly-nested parallel for loops.

I am currently working on a library (currently internally named “LLCL”, the LLVM concurrency library) containing a lot of this, including efficient lock-free data structures, low-dependency abstractions, flexible platform abstractions, etc. It is still a bit early to share, but we can probably do so in a month if there is interest. This does NOT include a malloc.

-Chris


Swift has some work in this area:
https://github.com/apple/swift/pull/33487

Inspiration? Cooperation?

Can you hoist the thread pool out of the parallelFor calls and re-use it?

I recently looked into LLVM multi-threading, and as mentioned above there seem to be two similar multi-threading strategies available (Parallel.h and ThreadPool.h). Here are some (hopefully accurate) notes I jotted down that may be relevant to this discussion:

  1. One approach is the ThreadPool API, which is defined in ThreadPool.cpp and ThreadPool.h. This API is used in the following:
    • clang/tools
    • clang-tools-extra
    • llvm/tools
    • dwarf linker
    • (and MLIR according to @River707)

    When a ThreadPool is initialized, a certain number of llvm::threads (a thin wrapper around std::thread) are created, and they then execute tasks added to a queue of packaged tasks (see the usage sketch after this list).

    The ThreadPool destructor, which is called when the ThreadPool falls out of scope, waits for all of the threads to join.

  2. The second is the ThreadPoolExecutor API, which is defined in Parallel.cpp and Parallel.h. This API is used almost exclusively by LLD.

    The ThreadPoolExecutor API is similar to the ThreadPool in that the executor maintains a vector of std::threads and a stack of tasks that can be assigned to the threads, but there are two main differences:

    1. The thread pool is further abstracted into some higher-level functions:
      • parallelForEach
      • parallelTransformReduce
      • parallelSort
    2. The ThreadPoolExecutor is a static class member so it is initialized at the start of the process instead of whenever an instance of the thread pool is needed. The threads are not destroyed until the process terminates. These persistent threads were causing issues for us, which is why I investigated this in the first place.
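
For reference, basic usage of the two APIs looks roughly like this (a sketch; exact signatures vary across LLVM versions):

    #include "llvm/Support/Parallel.h"
    #include "llvm/Support/ThreadPool.h"
    #include "llvm/Support/Threading.h"
    #include <vector>

    void example(std::vector<int> &Data) {
      // 1. ThreadPool: an explicit pool; tasks are queued with async().
      llvm::ThreadPool Pool(llvm::hardware_concurrency(4));
      for (int I = 0; I != 8; ++I)
        Pool.async([I] { /* work for task I */ });
      Pool.wait(); // block until all queued tasks finish
      // The destructor joins the workers when Pool goes out of scope.

      // 2. Parallel.h: helpers running on the static ThreadPoolExecutor.
      llvm::parallelForEach(Data, [](int &X) { X *= 2; });
    }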

    I’m looking forward to learning more about LLCL, sounds like it may address some of the issues we were having with the current multi-threading approaches!

Thanks for a great summary, @MaskRay.

Another limitation I found was that llvm::outs()/errs() is not thread-safe. This was a bit of a painful realization, as I had to disallow all ‘developer’ prints when running with multiple threads, instead of getting interleaved but still mostly useful output.
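
A common workaround is to format each message into a private buffer and write it under a lock (a minimal sketch; the mutex and logLine are our own, not an LLVM facility):

    #include "llvm/Support/raw_ostream.h"
    #include <mutex>
    #include <string>

    static std::mutex OutsMutex; // our own lock; LLVM does not provide one

    void logLine(const std::string &Tag, int Value) {
      std::string Buf;
      llvm::raw_string_ostream OS(Buf);
      OS << Tag << ": " << Value << '\n'; // format into a private buffer
      std::lock_guard<std::mutex> Lock(OutsMutex);
      llvm::outs() << OS.str(); // one locked write avoids interleaving
    }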

Note that MLIR associates a ThreadPool instance with an MLIRContext (equivalent to the LLVMContext) and also exposes higher-level functions which don’t depend on a global thread pool instance but operate on a specific context: https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/IR/Threading.h#L116-L149
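
Roughly, usage looks like the following (a sketch; see Threading.h at the link above for the exact signatures):

    #include "mlir/IR/Threading.h"
    #include <vector>

    void doubleAll(mlir::MLIRContext *Ctx, std::vector<int> &Values) {
      // Runs over the context's thread pool, or falls back to a sequential
      // loop if multithreading is disabled on the context.
      mlir::parallelForEach(Ctx, Values, [](int &V) { V *= 2; });
    }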