Atomic LL/SC loops in LLVM

So, taking PR25526 as context, and after reading the internet, it seems to me that creating an atomic LL/SC loop at the IR level – as is the modern way to do it in LLVM – is effectively impossible to do correctly. Nor, for that matter, was it correct to create such a loop at isel time, as the implementation prior to r205525 did (and as some targets, not yet having been updated to use IR expansion, still do).

In summary (see below for more details): to guarantee forward progress in an LL/SC loop, you typically must use only a subset of the instruction-set (excluding loads, stores, floating-point, branches), the entire loop must fit within some amount of contiguous memory, and it must only execute a small number of instructions.

I see no way that those properties can be guaranteed in LLVM, without emitting the entire loop as a fixed blob.

For ARM, there’s the newly-added ARMExpandPseudo::ExpandCMP_SWAP code, which basically does that. But, it seems kinda ugly, and a bit horrible to have to write all that code to generate a couple instructions – and that’s only handling cmpxchg, not atomicrmw, which would need similar treatment. And, even with all that, it’s still creating multiple basic blocks, prior to basic-block placement, so I think there’s no guarantee that the blocks get placed contiguously. (It kinda seems like it’d be easier at this point to just expand to inline-asm…)
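For a sense of what that would mean, here is roughly the shape of the whole loop written by hand as GCC-style inline asm for ARMv7 – purely an illustrative sketch (the function name and operand constraints are mine), not what ExpandCMP_SWAP emits:

```cpp
// Sketch only: a hand-written ARMv7 cmpxchg as a single asm blob, so no spill,
// reload, or unrelated branch can be scheduled between the LDREX and the STREX.
// (Relaxed ordering; barriers omitted for brevity.)
static inline int cmpxchg_blob(int *p, int expected, int desired) {
  int old, status;
  __asm__ __volatile__(
      "1: ldrex   %0, [%2]\n"      // old = *p, and set the exclusive reservation
      "   cmp     %0, %3\n"
      "   bne     2f\n"            // value mismatch: fail, return old
      "   strex   %1, %4, [%2]\n"  // try the store; status = 0 on success
      "   cmp     %1, #0\n"
      "   bne     1b\n"            // reservation was lost: retry
      "2:\n"
      : "=&r"(old), "=&r"(status)
      : "r"(p), "r"(expected), "r"(desired)
      : "cc", "memory");
  return old;                      // caller checks old == expected for success
}
```

Everything from the ldrex to the strex is then a fixed, contiguous handful of instructions that nothing can be spilled or scheduled into – which is exactly the property that is hard to guarantee once the loop is visible to the register allocator and block placement.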

Plus, that ExpandCMP_SWAP path is used only at -O0 at the moment. So, it generates worse code (which allowed the implementation to be simpler). It can only return a success/fail value, instead of branching directly to success/fail successors. It doesn’t handle optimizations like skipping the barrier on early cmpxchg failure. Probably more such issues, too.

Anyone got any other ideas for how to do this, without needing to introduce a bunch of per-target code? AtomicExpandPass was nice in that the target only needed to provide a couple of intrinsics, and it generated the rest.

It’s always a pain when the abstractions are leaky…

The typical constraints on an LL/SC loop – details vary between architectures – are that between the load-linked and store-conditional instructions you must:

  1. Not store to the region of memory (“granule”) containing the address you’re trying to atomically-modify. The size of the granule might be only the memory addresses being operated on, or the containing cache-line (most common), or maybe even all of memory, or anything in between. Any such store clears the load-linked reservation – sometimes even if from the same CPU.

  2. Not cause data or instruction cache filling/eviction activity. E.g. with a direct-mapped cache, any load, or a branch to a sufficiently far-away instruction might cause a guaranteed cache-line conflict – and that might kill the reservation on every loop iteration. With a more typical n-way cache, you’d have more leeway of course.

  3. Not cause any other architectural exceptions/traps that clear the reservation state. So, that typically excludes floating point, memory barriers, etc.

  4. Not take any branches. (This seems the rarest of the constraints; it’s possible it’s only relevant to Alpha and RISC-V, although maybe others just forgot to mention it as being potentially problematic.)

  5. Execute only a small number of instructions within the loop.

That last restriction seems most odd as a hard constraint, as opposed to just a performance win. It is apparently because a common implementation is for the LL instruction to pull the cache-line local, and invalidate it from all remote caches (as if the LL were actually a write). The subsequent SC will then succeed only if the cacheline has not been invalidated since the LL. With a naive implementation, LL operations on two CPUs executing the same loop might just ping-pong the cacheline back and forth between them forever, without either CPU being able to successfully execute the SC – by the time each got to it, the cacheline will have already been invalidated by the other CPU. Delaying the remote invalidation request for a little while in this situation is sufficient to prevent that problem. (But how long is a little while?)

If you fail to adhere to the requirements, there are two kinds of issues it might cause. Firstly, it could cause the loop to deterministically never terminate, even when nobody else is contending on the atomic. E.g. ll; spill to stack; sc. On some CPUs, the spill to a nearby stack address will clear the reservation. This issue is what motivated the recent ARM changes for -O0. Or, apparently on some Alpha CPUs, taking a branch deterministically kills the reservation.

More insidiously, with that same example, you can introduce the possibility of live-lock between LL/SC loops of two processors. A store from another processor necessarily clears the reservation of other CPUs (unlike an LL, which does so in some implementations, but not all). So, the problem with indefinite cacheline ping-pong can be introduced by the extra spill in the loop – even when it appears that it doesn’t cause a problem on your processor.

Here’s a relevant set of emails from Linus, also discussing why it’s not possible to reliably expose LL/SC intrinsics to C code – the same problems make it impossible to reliably expose them to LLVM IR.

The MIPS IV manual is nicely specific in its documentation of what you can depend upon:

“An LL on one processor must not take action that, by itself, would cause an SC for the same block on another processor to fail.” (which avoids the potential of live-lock between two ll/sc loops where one CPU’s LL invalidates the other CPU’s reservation before the other can get to the SC operation.)

Forward progress can then be guaranteed, unless:

  - “A load, store, or prefetch is executed on the processor executing the LL/SC.”, or
  - “The instructions executed starting with the LL and ending with the SC do not lie in a 2048-byte contiguous region of virtual memory.”

Also: “Exceptions between the LL and SC cause SC to fail, so persistent exceptions must be avoided. Some examples of these are arithmetic operations that trap, system calls, floating-point operations that trap or require software emulation assistance.”

Similarly, the RISC-V manual – which I mention only because it is so clear and concise about its architectural requirements – says:
“We mandate that LR/SC sequences of bounded length (16 consecutive static instructions) will eventually succeed, provided they contain only base ISA instructions other than loads, stores, and taken branches.” (their “base ISA” excludes floating point or multiply/divide instructions.)
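Just to make that concrete, a loop that stays within those constraints might look something like the following hand-written sketch (GCC-style inline asm; the function and register constraints are my own illustration, not anything the compiler guarantees to produce):

```cpp
// An LR/SC fetch-and-add that fits the quoted RISC-V rules: only base-ISA
// instructions, no loads/stores other than the LR/SC themselves, no taken
// branch between the lr.w and the sc.w (the retry branch sits after the sc.w),
// and far fewer than 16 static instructions.
static inline int lr_sc_fetch_add(int *p, int v) {
  int old, tmp;
  __asm__ __volatile__(
      "1: lr.w   %0, (%2)\n"     // old = *p, acquire the reservation
      "   add    %1, %0, %3\n"   // tmp = old + v
      "   sc.w   %1, %1, (%2)\n" // try to store tmp; %1 becomes 0 on success
      "   bnez   %1, 1b\n"       // reservation lost: retry
      : "=&r"(old), "=&r"(tmp)
      : "r"(p), "r"(v)
      : "memory");
  return old;
}
```

That leaves very little headroom: add a spill, a call, or a branch to distant code inside the loop and the quoted guarantee no longer applies.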

Unfortunately, neither ARM nor PPC appears to document precisely the architectural constraints under which forward progress must be guaranteed by the implementation. They certainly have the same underlying implementation issues that give rise to the above rules – that much seems documented – they just don’t state explicitly what you must do to be guaranteed success. ARM does “recommend” that LL/SC loops fit within 128 bytes, though.

Hi James,

Thanks for the very detailed summary; there are some really interesting
details in there that I wasn't aware of.

> Anyone got any other ideas for how to do this, without needing to introduce
> a bunch of per-target code?

My only vaguely plausible idea to actually fix it so far was some kind
of do-not-touch region for the register allocator to avoid spills
(and, in light of your explanations, other things). Maybe some kind of
MBB level no-spill/volatile flag would even be enough.

IME most of the optimization benefits come at ISel time (choosing
better branch sequences, INC vs ADD etc), so I'm not *too* worried
about hobbling later machine passes.

For completeness, I've also been tempted to simply check for spills
and report_fatal_error given how often it's actually come up in the
last few years. That's really unfriendly though, so probably not a
good idea.

> That last restriction seems most odd as a hard constraint, as opposed to
> just a performance win. It is apparently because a common implementation is
> for the LL instruction to pull the cache-line local, and invalidate it from
> all remote caches (as if the LL were actually a write). The subsequent SC
> will then succeed only if the cacheline has not been invalidated since the
> LL. With a naive implementation, LL operations on two CPUs executing the
> same loop might just ping-pong the cacheline back and forth between them
> forever, without either CPU being able to successfully execute the SC -- by
> the time each got to it, the cacheline will have already been invalidated by
> the other CPU. Delaying the remote invalidation request for a little while
> in this situation is sufficient to prevent that problem. (But how long is a
> little while?)

I have a suspicion this kind of unpredictable starvation (or something
like it) is why ARM added such CISCy instructions to handle things in
v8.1.

Cheers.

Tim.

One possibility would be to present the sequence as an instruction bundle. That would avoid any instructions/spills/reloads getting moved into the middle of it. Which, on the other hand, doesn't sound that different from just expanding a pseudo instruction...

- Matthias

I wondered about that, but can bundles span multiple basic blocks (or,
I suppose the real question is, contain terminators)? It might also
conflict with the targets' own use of bundles. But certainly worth
thinking about more.

Tim.

Thanks for the writeup, that is indeed pretty ugly. A simple asm("" ::: "memory") isn’t sufficient either, since the regalloc can decide to spill :frowning:
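To make that concrete, the shape that can’t be made reliable is roughly this (a sketch using Clang’s ARM exclusive-access builtins – the specific builtins are just for illustration; the problem is the pattern, not the particular intrinsics):

```cpp
// Expose the LL and SC as separate operations and hope the code between them
// stays clean.  Nothing here stops the register allocator from spilling and
// reloading between the two builtins, and on some cores that stack store
// clears the reservation on every iteration, so the loop never terminates.
int unreliable_fetch_add(int *p, int v) {
  int old, sum;
  do {
    old = __builtin_arm_ldrex(p);        // load-linked
    sum = old + v;                       // a spill/reload may be inserted here...
    asm volatile("" ::: "memory");       // ...and a compiler barrier doesn't forbid it
  } while (__builtin_arm_strex(sum, p)); // store-conditional, returns 0 on success
  return old;
}
```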

Replied too early… Below:

Yeah, I found that info in the ARM ARM; however, it provides no information about forward progress guarantees. There’s no subset of instructions, maximum loop length, or other such info about what sorts of LL/SC loops can be guaranteed not to live-lock. You can infer from erratum and FAQs that at least some of the ARM implementations likely do have some kind of cache control delay within which your loop should fit, but that’s about all I can find.

The lack of documented rules is perhaps only a theoretical concern: if you follow all of the rules collected from other architectures, it’s a pretty sure bet you’ll be as safe as possible on ARM and PPC too. The only question is whether it’s safe to be more lax, e.g. allowing different branch sequences or not. (And of course that’d only be interesting if there were an actual performance advantage to doing so.)

But, it is certainly unfortunate that there appears to be no actual specification which a user can rely on, as MIPS has.

BTW, on another note, the “no taken branches” restriction from Alpha is explained like this:

“Branch instructions between the LDx_L/STx_C pair may be mispredicted, introducing load and store instructions that evict the locked cache block. To prevent that from happening, there is a bit in the instruction fetcher that is set for a LDx_L reference and cleared on any other memory reference. When this bit is set, the branch predictor predicts all branches to fall through.”

That seems like an interesting detail. I now wonder if other architectures have a similar issue with a mis-predicted branch in the LL/SC loop, and how they deal with it.

> Yeah, I found that info in the ARM ARM; however, it provides no
> information about forward progress guarantees. There's no subset of
> instructions, maximum loop length, or other such info about what sorts
> of LL/SC loops can be guaranteed not to live-lock. You can infer from
> erratum
> <http://lists.infradead.org/pipermail/linux-arm-kernel/2014-May/254392.html>
> and FAQs
> <Documentation – Arm Developer> that
> at least some of the ARM implementations likely do have some kind of cache
> control delay within which your loop should fit, but that's about all I can
> find.
>
> The lack of documented rules is perhaps only a theoretical concern: if
> you follow all of the rules collected from other architectures, it's a
> pretty sure bet you'll be as safe as possible on ARM and PPC too. The only
> question is whether it's safe to be more lax, e.g. allowing different branch
> sequences or not. (And of course that'd only be interesting if there were
> an actual performance advantage to doing so.)
>
> But, it is certainly unfortunate that there appears to be no actual
> specification which a user can rely on, as MIPS has.

True, we should probably get that clarified.

> BTW, on another note, the "no taken branches" restriction from Alpha is
> explained like this:
>
> "Branch instructions between the LDx_L/STx_C pair may be mispredicted,
> introducing load and store instructions that evict the locked cache block.
> To prevent that from happening, there is a bit in the instruction fetcher
> that is set for a LDx_L reference and cleared on any other memory
> reference. When this bit is set, the branch predictor predicts all branches
> to fall through."
>
> That seems like an interesting detail. I now wonder if other architectures
> have a similar issue with a mis-predicted branch in the LL/SC loop, and how
> they deal with it.

I'm not sure we care about Alpha; the memory model sure doesn't :wink: