Concurrent Hashmap?

I'm looking into parallelizing LLD, and one of the things that would
probably help is a concurrent hashmap. I was unable to find an
existing implementation under ADT/, which was somewhat surprising.
Should I contribute an implementation?

Jez

lld's ELF implementation at least already has some parallelism - I
think Rui already experimented pretty broadly with various concurrent
map/set implementations to do string deduplication in parallel, so you might see
how that's been implemented (maybe it didn't end up being done in
parallel because it couldn't be made efficient)

Probably worth prototyping your improvements with such a data
structure - but I think it'd be worth having data on how useful
it is before adding it to ADT.

One obvious place where a concurrent unordered map can help: the global symbol table
concurrently shared by object files/shared objects/archive files/bitcode files.
I estimate this at a low single-digit percentage of the whole link time
(input file scanning as a whole takes 10+% of the time).

mold uses Intel TBB for its concurrent data structures: a bunch of
concurrent vectors, two concurrent hashmaps (one for the symtab, the other
for comdat group selection), and two concurrent unordered maps just for ICF.
(As I mentioned in
https://lists.llvm.org/pipermail/llvm-dev/2021-March/149363.html ,
ELF/ICF.cpp is not a good reference if you want fast ICF:( )

Parallel symbol resolution is tricky. Some challenges:

1. a.so b.a and b.a a.so may have different semantics. I believe mold
   currently cannot handle this correctly.
2. How to ensure determinism when features like --trace-symbol and diagnostic
   reporting are plugged into parallel symbol resolution.
3. weak references

These are things I can think of off the top of my head.
Probably none is insurmountable, but in the end the complexity may well outweigh the advantage :)

If we have a concurrent vector in llvm-project, there may be multiple
places we can improve things...

When you say you'd like to parallelize LLD, which driver do you mean? COFF, ELF, wasm...? They all have separate codebases.

There's already a specialized lock-free hashtable for Debug Types merging in the COFF driver: https://github.com/llvm/llvm-project/blob/e7a371f9fd0076c187f4cd1a9c7546867faeb19b/lld/COFF/DebugTypes.cpp#L992

I'd be really interested to hear what kind of design you had in mind for the concurrent hashmap?

If you contribute any concurrent container into ADT, I'd like to see an application along with it (that is, a patch that uses the container in LLD, for example). If the container is used in a tight loop, it needs to be lock-free if we want it to scale on many-core machines. And in that case we're pretty much limited to a 64-bit key/value pair if we don't want to make things complicated. We could also have a sharded container that would fit more cases, but tuning it really depends on its usage.

-----Original Message-----

Just because it wasn’t totally clear to me - is this about having a single LLD process have better parallelism?

What we'd love is to be able to use LLD as a library (to save on process launch cost), multithreaded, but last I checked there were still lots of globals used all over the place.

Cheers,
-Neil.

Neil:

I think this is a separate discussion from what Jez suggests, but I had a demo patch that allowed the LLD/COFF driver to be thread-safe when used “as-a-library”: https://reviews.llvm.org/D86353 - with this we can link several programs in parallel in the same process; it makes the lld::safeLldLink() API fully thread-safe. The only limitation was that ThreadPools ran single-threaded on the current thread, because there was no way (yet) to schedule tasks in common across LLD instances.

Making all global state thread_local in D86353 was simply a cheap-and-dirty way to get visibility into what needs to change. I think a more flexible solution would be to move all global (LLD) state onto the stack, into a “context” structure (or a collection of “context” structures, if that makes more sense).
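To make that concrete, here's a rough sketch of the shape I have in mind; the names and fields below are hypothetical, for illustration only, and not actual LLD types:

#include <string>
#include <vector>

namespace lld {
namespace coff {

// Every piece of state that is currently a file-scope global in the driver
// moves into one object that is created per link and passed through the code.
struct LinkContext {
  std::vector<std::string> searchPaths;
  std::vector<std::string> inputFiles;
  bool demangle = true;
  // ... symbol table, driver options, diagnostics sink, etc.
};

// Entry points take the context explicitly instead of touching globals, so
// two links can run on two threads without sharing any mutable state.
bool link(LinkContext &ctx, const std::vector<std::string> &args);

} // namespace coff
} // namespace lld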

I’d like to do that change soon, and I’d like to hear what folks think about that? (sorry for hijacking the thread)

We'd definitely use this functionality if it were available, so it's a +1 from me. Ideally it would work on all three main desktop platforms (so the ELF/Mach-O code could also run as a thread-safe library on multiple threads), but COFF alone would be a huge win for us for starters.

(sorry for hijacking too!)

Alexandre Ganea <alexandre.ganea@ubisoft.com> 9:05 AM to me, llvm-dev@lists.llvm.org

When you say you'd like to parallelize LLD, which driver do you mean?

I meant the Mach-O driver; sorry for omitting context.

There's already a specialized lock-free hashtable for Debug Types merging in the COFF driver

Oh, that's cool! Let me see if I can build upon it...

I'd be really interested to hear what kind of design you had in mind for the concurrent hashmap?

If I were going to hand-roll one I would probably go for a simple
array of hashmaps, one lock per map, sharded by key. Lock-free is nice
but harder to write (but having a starting point helps!)
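Roughly something like this untested sketch (the shard count and API are just illustrative):

#include <array>
#include <mutex>
#include <unordered_map>

// A sharded map: an array of plain maps, each guarded by its own mutex, with
// the shard chosen from the key's hash. Not lock-free, but contention drops
// as the shard count grows.
template <typename K, typename V, size_t NumShards = 64>
class ShardedMap {
  struct Shard {
    std::mutex mu;
    std::unordered_map<K, V> map;
  };
  std::array<Shard, NumShards> shards;

  Shard &shardFor(const K &key) {
    return shards[std::hash<K>{}(key) % NumShards];
  }

public:
  // Insert if absent; returns the value currently stored for the key.
  V insertOrGet(const K &key, const V &value) {
    Shard &s = shardFor(key);
    std::lock_guard<std::mutex> lock(s.mu);
    return s.map.try_emplace(key, value).first->second;
  }

  // Returns true and copies the value out if the key is present.
  bool lookup(const K &key, V &out) {
    Shard &s = shardFor(key);
    std::lock_guard<std::mutex> lock(s.mu);
    auto it = s.map.find(key);
    if (it == s.map.end())
      return false;
    out = it->second;
    return true;
  }
};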

David Blaikie <dblaikie@gmail.com> Apr 7, 2021, 5:39 PM to Fangrui, me, llvm-dev

I think Rui already experimented pretty broadly with various concurrent
map/set implementations to do string deduplication in parallel, so you might see
how that's been implemented (maybe it didn't end up being done in
parallel because it couldn't be made efficient)

It looks like LLD-ELF's string table uses a simple DenseMap, so I
guess parallelization didn't end up being committed.

For starters, I wanted to parallelize section/symbol ordering (using a
hashmap to connect symbols with their corresponding entries in the
order file), which would be fairly straightforward if I had a
concurrent map. It is a relatively small part of the overall link time
though. Parallelizing the loading of inputs would be a much bigger
win, but as MaskRay pointed out, doing symbol resolution right is
hard...

I may also try parallelizing the sorting of sections. Parallel.h lacks
a stable sort implementation, but I suppose it isn't too hard to write
a parallel mergesort.
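Something like this naive, untested sketch (a real version would go through the existing thread pool and cap the task depth rather than spawning with std::async):

#include <algorithm>
#include <future>
#include <iterator>

// Parallel stable merge sort: recursively split, sort the two halves on
// separate tasks, then merge with std::inplace_merge (which is stable).
// Assumes random-access iterators.
template <typename Iter, typename Compare>
void parallelStableSort(Iter begin, Iter end, Compare comp, int depth = 3) {
  auto size = std::distance(begin, end);
  if (depth <= 0 || size < 4096) {
    std::stable_sort(begin, end, comp);
    return;
  }
  Iter mid = begin + size / 2;
  auto left = std::async(std::launch::async, [=] {
    parallelStableSort(begin, mid, comp, depth - 1);
  });
  parallelStableSort(mid, end, comp, depth - 1);
  left.wait();
  std::inplace_merge(begin, mid, end, comp); // stable merge of the two halves
}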

Fangrui Song <maskray@google.com> Apr 7, 2021, 11:48 PM to me, David, llvm-dev

(As I mentioned in [llvm-dev] [RFC] Annotating global functions and variables to prevent ICF during linking, ELF/ICF.cpp is not a good reference if you want fast ICF:( )

We haven't implemented ICF yet, but we'll keep that in mind :)

Neil Henning <neil.henning@unity3d.com> 9:43 AM to Alexandre, me, llvm-dev@lists.llvm.org

Just because it wasn't totally clear to me - is this about having a single LLD process have better parallelism?

Yes. I believe library-fying LLD has been a contested issue, but I
have no dog in this fight, so please use another thread to hash it out :)

Jez

If you are attempting to parallelize the symbol resolution phase, I can try to share some thoughts on how to generalize that table over to parallel symbol resolution.

Right now, symbol resolution in COFF, and presumably the other linkers, is interleaved with input loading, since referencing a symbol can pull in stuff from an archive. The first thing that needs to be done is to split that up into phases, where the linker merges all the symbols of all of the inputs it knows about, and then successively loads more inputs and symbols. Some of this work is already done, but at least for COFF, more restructuring needs to be done.

Once the restructuring is done, the merging algorithm should be pretty simple. As inputs, there are:

  • Symbol hash table from previous round of symbol resolution
  • New object files to load, each with a symbol table

The output should be:

  • The hash table, composed of 64-bit cells indicating which symbol prevailed, each identified by a pair of 32-bit numbers: object file index and symbol table index

I recommend splitting up symbol resolution into a fully data parallel phase and a “shuffle” phase that does the concurrent hash table insertion. This makes rehashing very simple: if the table gets too full, simply abort the concurrent insertions, sync up the worker threads, and retry with a bigger table. The map phase should read the object file and pre-calculate a good hash for every external symbol. After all symbols have been hashed, the shuffle phase should attempt to concurrently insert into the table, and CAS into the table if the current symbol has a higher precedence than the symbol in the table. The algorithm for deciding which symbol prevails is format-specific and implements the rules for things like weak symbols and link line order. It should be deterministic.
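In code, the CAS step of the shuffle phase could look roughly like this (the open-addressing probe to find the cell for a given name, the abort-and-retry path for a full table, and the real precedence rule are elided or placeholders):

#include <atomic>
#include <cstdint>

using Cell = uint64_t;        // hi 32 bits: object file index, lo 32 bits: symbol index
constexpr Cell Empty = ~0ULL; // sentinel for an unclaimed cell

static Cell makeCell(uint32_t objIdx, uint32_t symIdx) {
  return (uint64_t(objIdx) << 32) | symIdx;
}

// Placeholder for the format-specific rule (strong vs. weak, link line order,
// archive semantics). It must be deterministic so the winning symbol does not
// depend on thread interleaving.
static bool hasHigherPrecedence(Cell a, Cell b) { return a < b; }

// CAS the candidate into the cell if it beats whatever is already there.
static void resolveInto(std::atomic<Cell> &entry, Cell candidate) {
  Cell current = entry.load(std::memory_order_relaxed);
  while (current == Empty || hasHigherPrecedence(candidate, current)) {
    if (entry.compare_exchange_weak(current, candidate,
                                    std::memory_order_relaxed))
      return; // installed; a higher-precedence symbol may still replace it later
    // CAS failed: `current` was reloaded; re-check precedence and try again.
  }
}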

That sketch leaves a lot to be desired, but that’s the project I’ve been planning for COFF for a while now.

  • AtomicHashMap –

Could you do a dynamic hash instead? In other words grow the hash instead of starting over - what's going to be the cost of retrying as opposed to growing?

Cheers,
Wol

https://github.com/facebook/folly/blob/master/folly/AtomicHashMap.h

* AtomicHashMap --
*

Dynamic / Linear Hashing

* A high-performance concurrent hash map with int32_t or int64_t keys.

Fixed-size keys - pick 32-bit, 64-bit, whatever.

* Supports insert, find(key), findAt(index), erase(key), size, and more.

ditto

* Memory cannot be freed or reclaimed by erase. Can grow to a maximum of about
* 18 times the initial capacity, but performance degrades linearly with growth.

Bucket based, so you could allocate and free buckets as required. Performance is pretty much O(1) if I've got that right :)

* Can also be used as an object store with unique 32-bit references directly
* into the internal storage (retrieved with iterator::getIndex()).
*
* Advantages:
* - High-performance (~2-4x tbb::concurrent_hash_map in heavily
* multi-threaded environments).
* - Efficient memory usage if initial capacity is not over estimated
* (especially for small keys and values).
* - Good fragmentation properties (only allocates in large slabs which can
* be reused with clear() and never move).
* - Can generate unique, long-lived 32-bit references for efficient lookup
* (see findAt()).
*
* Disadvantages:
* - Keys must be native int32_t or int64_t, or explicitly converted.
* - Must be able to specify unique empty, locked, and erased keys
* - Performance degrades linearly as size grows beyond initialization
* capacity.
* - Max size limit of ~18x initial size (dependent on max load factor).
* - Memory is not freed or reclaimed by erase.


Neil Nelson

Read and find are very fast - basically scan one bucket.

Write, update and delete obviously require locking a bucket, but as the table grows the chances of a collision drop.

And as the table fills, the cost of growing the table remains constant, namely splitting just one bucket.

Downsides: you need to avoid a pathological hash algorithm.
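To make the growth cost concrete, here's a rough single-threaded sketch of the linear-hashing split step (untested; a concurrent version would put a lock on each bucket, and growth only ever touches the one bucket being split):

#include <cstdint>
#include <functional>
#include <vector>

class LinearHashTable {
  std::vector<std::vector<uint64_t>> buckets;
  size_t level = 0;   // completed doubling rounds
  size_t next = 0;    // next bucket to split in the current round
  size_t initial;     // bucket count at level 0
  size_t numKeys = 0;
  static constexpr size_t MaxPerBucket = 4;

  size_t bucketFor(uint64_t key) const {
    size_t h = std::hash<uint64_t>{}(key);
    size_t b = h % (initial << level);
    if (b < next)                       // this bucket was already split this round
      b = h % (initial << (level + 1));
    return b;
  }

  void splitOne() {
    // Grow by exactly one bucket and redistribute only the entries of the
    // bucket being split; every other bucket is untouched.
    buckets.emplace_back();
    std::vector<uint64_t> old = std::move(buckets[next]);
    buckets[next].clear();
    ++next;
    if (next == (initial << level)) {   // finished a full doubling round
      ++level;
      next = 0;
    }
    for (uint64_t key : old)
      buckets[bucketFor(key)].push_back(key);
  }

public:
  explicit LinearHashTable(size_t initialBuckets = 8)
      : buckets(initialBuckets), initial(initialBuckets) {}

  void insert(uint64_t key) {
    buckets[bucketFor(key)].push_back(key);
    ++numKeys;
    if (numKeys > MaxPerBucket * buckets.size())
      splitOne();
  }

  bool contains(uint64_t key) const {
    for (uint64_t k : buckets[bucketFor(key)])
      if (k == key)
        return true;
    return false;
  }
};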

Cheers,
Wol

At a really high level, I believe starting the insertion over has the same algorithmic complexity as rehashing with the partially filled table, it’s all O(n). Eventually you build a table of size n, and along the way you have to build a table of size n/2, n/4, n/8…, and if you sum up all the work to build those tables, you get 2n, or O(n).

At a lower level, for common inputs, rehashing could be faster by a constant factor. Suppose there are n input symbols and m unique output symbols, and m is significantly smaller than n. Rehashing only takes O(m) time, but reinserting takes O(n) time. The overall work is still O(n). However, rehashing requires a much fancier concurrent data structure, probably with higher constant factor overhead. You may gain as much as you lose. And you open yourself up to concurrency bugs. Even if you take a tested concurrent data structure off the shelf, they can be tricky to use correctly.

So, at least for PDB global type hashing, I felt it was better to implement a less general data structure that only does bulk inserts followed by lookups. I also knew how many insertions I was going to do, so I could also conservatively overallocate a big hash table and sidestep rehashing. We can’t quite do the same for symbols because symbol resolution can discover new inputs.
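Writing the work out as the sum of the table sizes built along the way:

n + n/2 + n/4 + n/8 + ... <= 2n, i.e. O(n) total.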

    Could you do a dynamic hash instead? In other words grow the hash
    instead of starting over - what's going to be the cost of retrying as
    opposed to growing?

At a really high level, I believe starting the insertion over has the
same algorithmic complexity as rehashing with the partially filled
table, it's all O(n). Eventually you build a table of size n, and along
the way you have to build a table of size n/2, n/4, n/8..., and if you
sum up all the work to build those tables, you get 2n, or O(n).

Hmmm ... looking at it like that, dynamic hashing, although the maths is
completely different, comes to the same result:

cost( 1->2 ) == cost( 2->3 ) == cost( 3->4 )

and going direct cost ( 1->4 ) is the same as summing the previous three
(because in practice, it IS the previous 3).

At a lower level, for common inputs, rehashing could be faster by a
constant factor. Suppose there are n input symbols and m unique output
symbols, and m is significantly smaller than n. Rehashing only takes
O(m) time, but reinserting takes O(n) time. The overall work is still
O(n). However, rehashing requires a much fancier concurrent data
structure, probably with higher constant factor overhead. You may gain
as much as you lose. And you open yourself up to concurrency bugs. Even
if you take a tested concurrent data structure off the shelf, they can
be tricky to use correctly.

Well, provided we're happy with a colliding read/write returning either
the pre or post state indeterminately, dynamic hashing scores here - I
think I can easily spec an N-readers/one-writer (NR&1W) implementation.
I might need more help implementing it, seeing as I'm a FORTRAN/C guy,
not a C++ guy, but I pick things up quickly.

So, at least for PDB global type hashing, I felt it was better to
implement a less general data structure that only does bulk inserts
followed by lookups. I also knew how many insertions I was going to do,
so I could also conservatively overallocate a big hash table and
sidestep rehashing. We can't quite do the same for symbols because
symbol resolution can discover new inputs.

Well, that's the thing, dynamic hashing can grow with minimal cost.
About the only thing I can think of that might require a global lock
(and we might well be able to get round that) is expanding the bucket
table (which would effectively be "just" a re-alloc).

Anything else, the table could grow in the background, locking at most
one bucket at a time, allowing reads to proceed completely unhindered.

Cheers,
Wol

Hi,

For parallelizing LLD, another approach could be something similar to
Clojure's collections instead of a concurrent hashmap.
It might change LLD's processing design pattern a bit, but it could be
very fast.

Kind Regards

James