This post documents what I know about the lack of parallel/thread-safe facilities in llvm-project, which restricts our choices of parallel algorithms. I hope it spurs discussion on improving the status quo.
Parallel algorithms
MLIR and lld are the main users. llvm/Support/Parallel.h provides straightforward implementations of some parallel algorithms. It would be good to know how they compare with mature external libraries.
We don’t provide a counterpart to [algorithms.parallel_for_each.feeder], oneTBB’s mechanism for adding new work items from within a running parallel_for_each. Such a facility could be used to parallelize section-based garbage collection in various ports of lld.
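For illustration, here is a minimal sketch of a feeder-driven mark phase using oneTBB; the Section type and markLive function are hypothetical stand-ins, not lld's real classes:

```cpp
#include <atomic>
#include <vector>
#include <tbb/parallel_for_each.h>

// Hypothetical stand-in for an input section with outgoing references.
struct Section {
  std::atomic<bool> live{false};
  std::vector<Section *> successors; // sections referenced by relocations
};

// Mark phase of section-based GC: the feeder lets the loop body enqueue
// newly discovered work items while the parallel loop is still running.
void markLive(std::vector<Section *> &roots) {
  tbb::parallel_for_each(
      roots.begin(), roots.end(),
      [](Section *sec, tbb::feeder<Section *> &feeder) {
        if (sec->live.exchange(true))
          return; // already marked by another worker
        for (Section *succ : sec->successors)
          feeder.add(succ); // add newly reachable sections as work items
      });
}
```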
Our parallelForEachN does not support nested usage: the outermost loop is parallel while the inner ones run sequentially.
See ⚙ D61115 Parallel: only allow the first TaskGroup to run tasks parallelly for the limitation.
In lld/ELF, writeSections may benefit from nested support (Why isn't ld.lld faster? | MaskRay).
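To make the limitation concrete, here is a simplified sketch with stand-in OutputSection/Chunk types (not lld's actual classes); the inner parallelForEach finds a TaskGroup already active and degenerates to a sequential loop:

```cpp
#include "llvm/Support/Parallel.h"
#include <vector>

// Hypothetical stand-ins for lld's output data structures.
struct Chunk {
  void write() {} // write the chunk's contents to the output buffer
};
struct OutputSection {
  std::vector<Chunk *> chunks;
};

void writeAllSections(std::vector<OutputSection *> &sections) {
  // The outer loop runs in parallel.
  llvm::parallelForEach(sections, [](OutputSection *osec) {
    // Since only the first TaskGroup may run tasks in parallel (D61115),
    // this nested loop executes sequentially, even for a huge section.
    llvm::parallelForEach(osec->chunks, [](Chunk *c) { c->write(); });
  });
}
```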
Our parallelForEachN uses a coarse TaskSize estimate:
enum { MaxTasksPerGroup = 1024 };
...
auto TaskSize = NumItems / parallel::detail::MaxTasksPerGroup;
For a loop with a trip count less than 1024, TaskSize==1, and each thread is typically assigned strided indices, losing locality. E.g. with 4 threads, thread x may get indices i*4+x (assuming every thread spends exactly the same time on each task). A better assignment would give each thread a contiguous block: thread 0 gets [0,N/4), thread 1 gets [N/4,2N/4), and so on.
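A hedged sketch of the contiguous-block assignment, assuming uniform per-item cost and written with plain std::thread rather than our TaskGroup machinery:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Give each of numThreads workers one contiguous [begin, end) block,
// preserving memory locality, instead of strided indices i*numThreads+x.
template <class Fn>
void blockParallelFor(size_t n, size_t numThreads, Fn fn) {
  std::vector<std::thread> workers;
  workers.reserve(numThreads);
  for (size_t t = 0; t < numThreads; ++t) {
    size_t begin = n * t / numThreads;     // thread t owns [begin, end)
    size_t end = n * (t + 1) / numThreads;
    workers.emplace_back([=] {
      for (size_t i = begin; i < end; ++i)
        fn(i);
    });
  }
  for (std::thread &w : workers)
    w.join();
}
```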
BumpPtrAllocator
llvm/Support/Allocator.h defines BumpPtrAllocator, whose Allocate function cannot be called concurrently.
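To see why, here is a simplified model of the bump-pointer fast path (modeled on, not copied from, Allocator.h): the pointer bump is an unsynchronized read-modify-write, so two racing threads can receive overlapping memory.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified single-slab bump allocator; alignment must be a power of two.
struct BumpRegion {
  char *CurPtr;
  char *End;

  void *allocate(size_t size, size_t alignment) {
    uintptr_t p = reinterpret_cast<uintptr_t>(CurPtr);
    char *aligned =
        reinterpret_cast<char *>((p + alignment - 1) & ~(alignment - 1));
    if (aligned + size > End)
      return nullptr; // the real allocator would start a new slab here
    CurPtr = aligned + size; // data race: neither atomic nor locked
    return aligned;
  }
};
```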
In various ports of lld, section and symbol initialization may benefit from multi-threading.
Personally, I lean toward the view that we don’t need a thread-safe BumpPtrAllocator; thread-local BumpPtrAllocator instances in lld/Common/Memory.cpp should suffice.
Many modern allocators focus on NUMA-aware allocation and per-thread memory reference locality (evidenced by ⚙ D119909 [ELF] Parallelize initializeLocalSymbols).
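A minimal sketch of the thread-local approach; getThreadLocalAllocator and makeLocal are hypothetical names, not lld's existing API, and allocations must not outlive the owning thread (or the allocators must be handed off to a long-lived list):

```cpp
#include "llvm/Support/Allocator.h"
#include <new>
#include <utility>

// Each worker thread owns its own allocator, so the hot path needs no
// synchronization at all.
llvm::BumpPtrAllocator &getThreadLocalAllocator() {
  static thread_local llvm::BumpPtrAllocator alloc;
  return alloc;
}

// Thread-local analogue of lld's make<T>(...).
template <typename T, typename... Args>
T *makeLocal(Args &&...args) {
  void *mem = getThreadLocalAllocator().Allocate(sizeof(T), alignof(T));
  return new (mem) T(std::forward<Args>(args)...);
}
```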
Note: in lld, we use many make<X>(...) calls, which instantiate SpecificBumpPtrAllocator<T>. One instance appears to cost several kilobytes. See ⚙ D108850 [LLD] Remove global state in lldCommon for the additional complexity in making lld/Common/Memory.cpp free of global state.
malloc implementation
For clang and chrome, I have measured that linking against libmimalloc.a makes ld.lld 1.12x as fast (jemalloc or an older tcmalloc from gperftools: ~1.10x). (lld 14 ELF changes | MaskRay)
If I maintained the lld package for a distro, I’d suggest building both clang and lld with something like -DCMAKE_EXE_LINKER_FLAGS=-Wl,--push-state,$HOME/Dev/mimalloc/out/release/libmimalloc.a,--pop-state. The ~100KiB size increase pays off.
That said, I am unsure what to do with ⚙ D101427 Linux support for mimalloc as a custom allocator, which adds CMake support. Perhaps distributors can weigh in on what they expect from CMake support.
Thread-safe containers
That is, the container performs internal synchronization so that multiple threads can call member functions concurrently without managing the synchronization themselves.
llvm-project lacks thread-safe counterparts to vector and unordered_map. This restricts our choices of parallel algorithms.
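For concreteness, here is a mutex-based (lock-striped, not lock-free) sketch of the kind of concurrent map we lack; it only illustrates the interface, not a production design:

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_map>

// A sharded hash map: keys are hashed to one of NumShards independently
// locked std::unordered_maps, so writers on different shards don't contend.
template <typename K, typename V, std::size_t NumShards = 16>
class ConcurrentMap {
  struct Shard {
    std::mutex mu;
    std::unordered_map<K, V> map;
  };
  Shard shards[NumShards];

  Shard &shardFor(const K &key) {
    return shards[std::hash<K>{}(key) % NumShards];
  }

public:
  // Insert (key, value) if the key is absent; return the stored value.
  V insertOrGet(const K &key, const V &value) {
    Shard &s = shardFor(key);
    std::lock_guard<std::mutex> lock(s.mu);
    return s.map.try_emplace(key, value).first->second;
  }
};
```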
Third-party software
It is an open question whether we can outsource some of this work to third-party software.
@aganea mentioned on Parallel input file parsing - #7 by aganea:
I feel those two topics point out again the lack in LLVM of thread-safe, lock-free containers & memory allocator. This boils down to the question of whether we should reference or vendor external packages from LLVM. Right now it seems that only “optional” external libs are referenced by LLVM, and any “required” external library has to be duplicated/vendored in the monorepo (and inherently fit the LLVM licence).
Writing a cross-platform, lock-free allocator isn’t trivial (although we have SCUDO), and nor is re-writing all the common containers (map, vector, deque, etc) in a thread-safe, lock-free manner.
If we could rely on IntelTBB, that would solve both raised issues (since IntelTBB has tbbmalloc). Adding in mimalloc or rpmalloc would further improve the situation. Would it be an option to let LLD rely on external submodules?
The question, aside from the usual performance and maintenance considerations, also involves licensing and portability (whether the software is at least as portable as llvm-project itself: OS support, architecture support).
It’s useful to know our gap to a mature implementation. Say, if the in-tree implementation is within 10% of a mature library for the particular algorithm/container we need, perhaps we needn’t bother with outsourcing.