[compiler-rt] Improve atomic locking?


I am wondering if there wouldn't be more room for improving the
locking of a pointer when an atomic operation is being made since I've
noticed that one could increase the SPINLOCK_COUNT in
lib/builtins/atomic.c to (1 << 13) which is a 8x increase of available
locks if we also change the type of the atomic lock which currently is
uintptr_t to a single byte (uint8_t) which I think is more appropriate
since the lock can only have two states for other systems than
freebsd, macos.

Am I missing something? This seems like choosing uintptr_t was intentional.

~ Paul

It’s mostly about false sharing and depends a bit on how the atomics are implemented for any given architecture and microarchitecture. In many implementations, the atomicity is guaranteed by the cache coherency protocol, so the granularity is one cache line. With 64 byte cache lines and an 8-byte lock, you have 8 locks per cache line, so with 2^10 locks there’s a 1/128 chance of false sharing causing contention on the locks even when two addresses map to different locks. With int8_t on the same system you now have 64 locks per cache line, so a 1/16 chance of false sharing.

Additionally, on many architectures there are no atomic instructions for atomics that work on quantities smaller than a register (or a 32-bit value on 64-bit systems). For example, on MIPS you have 32- and 64-bit load-linked and store-conditional. If you want to do an 8-bit compare and swap, then you will encounter contention with other operations on the same 32-bit value, even on a microarchitecture that supports word-granularity atomics (using something finer grained than the cache coherency protocol). Even on systems where the ISA supports 8-bit atomics, they’re often microcoded as shift and mask sequences.

Note: On Linux, we should be using futexes, but no one who uses Linux appears to care enough to implement this code path.

Do you have an example in which lock contention here is a bottleneck? I’d be interested in seeing it. If you do encounter contention here, then you’d likely see more of a speedup simply bumping SPINLOCK_COUNT without changing the underlying type. There’s no reason on most systems not to allocate 64KB for the spinlock array: memory is cheap, most of it won’t be in the cache, and a more sparse structure will reduce contention.