compare and swap

I was working on compare and swap and ran into the following problem.
Several architectures implement this with a load-locked,
store-conditional sequence. This is good: for those archs I can write
generic code to legalize a compare and swap (and most other atomic
ops) to load-locked/store-conditional sequences (then the arch only
has to provide the instructions for ldl/stc to support all atomic ops;
this applies to mips, arm, ppc, and alpha). However, I have to split the
basic block at the CAS instruction and create two more basic blocks.

This isn't currently possible during legalize, nor during the initial
SelectionDAG formation (the tricks switch lowering uses only work for
terminator instructions).

Anyone have an idea? The patch as it stands is attached below. The X86
version is a pseudo instruction because the necessary instructions and
prefixes aren't in the code gen yet, but I would imagine they will be
(so ignore that ugliness). The true ugliness can be seen in the alpha
impl, which open codes it, including a couple of relative branches.
The code sequence for alpha is identical to ppc, mips, and arm, so it
would be nice to lower these to the correct sequences before code gen
rather than splitting (or hiding, as I did here) basic blocks after
code gen.

Andrew

lcs.patch (14.8 KB)
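For readers following the thread, here is a minimal sketch of the semantics under discussion. The function name is hypothetical and the body is *not* atomic as written; it only spells out what the hardware ll/sc loop (or x86's cmpxchg) implements, and why the expansion needs a compare and a conditional branch, hence the extra basic blocks:

```c
#include <stdint.h>

/* Hypothetical illustration only -- NOT atomic as written. Atomically:
 * if *ptr == cmp, store swap; either way, return the value that was
 * observed at *ptr. On mips/arm/ppc/alpha each step below maps onto
 * the load-locked / compare / branch / store-conditional sequence,
 * which is why lowering it needs new basic blocks. */
uint32_t cas_semantics(uint32_t *ptr, uint32_t cmp, uint32_t swap) {
    uint32_t old = *ptr;   /* load-locked */
    if (old == cmp)        /* compare + conditional branch */
        *ptr = swap;       /* store-conditional (retried on failure) */
    return old;            /* the old value is the result */
}
```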

The current *hack* solution is to mark your pseudo instruction with usesCustomDAGSchedInserter = 1. That allows the target to expand it at scheduling time by providing an EmitInstrWithCustomInserter() hook. You can create new basic blocks then.

Evan

I guess that can work in the short term. It just seems wasteful for
each target that uses ldl/stc sequences to have to implement it
separately. But if that is what we can do right now, I'll give that a shot.

Thanks,

Andrew

I agree with Evan. This solution is not wonderful but it is the best we have right now.

-chris

Andrew,

why is the intrinsic name not CAS? And having another version that returns
just a bool might be better in some cases ( 1. does CAS return the value on
all architectures? 2. you can just jump based on a flag and don't need to
compare it again). Just my 2 cents though ...

torvald

I was going from Chandler's docs, but it could be renamed trivially
(and I almost did so at several points).

1) Yes, but on some architectures a bool version may be easier than on others.
2.a) To get the bool, the x86 backend (and some others) would have to
generate the compare instruction anyway, so you don't save anything by
having a bool version.
2.b) In the case of a load-locked/store-conditional based backend, the
bool version would save a compare if the store conditional has the
typical return-success-or-failure semantics.

So, yes, a CAS that returned bool could be useful. However, it is
pretty easy to pattern match
CAS -> Compare
in those backends that can save the compare by testing the result of
the store conditional.

Andrew
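Andrew's point can be sketched with C11 atomics standing in for the intrinsic (the helper names are made up for illustration): the bool form is just the value-returning form followed by a compare, which a backend with a free success flag can pattern-match away.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Value-returning CAS (a modern stand-in for the intrinsic being
 * discussed): always returns the value observed at *ptr. */
static int val_cas(_Atomic int *ptr, int cmp, int swap) {
    int expected = cmp;
    atomic_compare_exchange_strong(ptr, &expected, swap);
    return expected;  /* == cmp on success, observed value on failure */
}

/* The bool form is just CAS followed by a compare; backends where the
 * store conditional (or x86's ZF) already yields success/failure can
 * pattern-match this comparison away. */
static bool bool_cas(_Atomic int *ptr, int cmp, int swap) {
    return val_cas(ptr, cmp, swap) == cmp;
}
```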

Torvald Riegel wrote:

>> Anyone have an idea? The patch as it stands is attached below. X86
>> is a pseudo instruction because the necessary ones and prefixes aren't
>> in the code gen yet, but I would imagine they will be (so ignore that
>> ugliness). The true ugliness can be seen in the alpha impl which open
>> codes it, including a couple relative branches. The code sequence for
>> alpha is identical to ppc, mips, and arm, so it would be nice to lower
>> these to the correct sequences before code gen rather than splitting
>> (or hiding as I did here) basic blocks after code gen.

> Andrew,
>
> why is the intrinsic name not CAS?

Because, fundamentally, it loads, compares, and conditionally stores. There is no concept of a "swap" in SSA, so removing that aspect of the atomic primitives makes the *LLVM* representation easier to understand.

> And having another version that returns just a bool might be better in some cases ( 1. does CAS return the value on all architectures?

Check the page (http://chandlerc.net/llvm_atomics.html -- the implementation info is still current, even though the docs are not) for how this gets implemented. As Andrew has already pointed out, on x86, the LLVM behavior maps to the underlying architecture. Other architectures which might avoid a compare can easily do so by pattern matching in the codegen. I'm not saying this is a 100% correct mapping for all architectures, but it seems very clean and does not appear to introduce performance issues on any.

> 2. you can just jump based on a flag and don't need to compare it again). Just my 2 cents though ...

Again, pattern matching can enable the architectures which don't need to compare again to in fact not do so, but some architectures will *need* to compare again in order to determine the bool value.

My strongest feeling is that "swap" has no place in an SSA IR, and the idea of atomically loading, comparing, and storing is far more in keeping. In fact, I thought the "swap" intrinsic had even been renamed to "ls" (for load-store) at some point this summer. Do you have those changes, Andrew? In any event, those are the reasons I had for moving away from "swap" in the design process, and as Andrew said, he was primarily basing the implementation on that work.

-Chandler

Torvald Riegel wrote:
>> Anyone have an idea? The patch as it stands is attached below. X86
>> is a pseudo instruction because the necessary ones and prefixes aren't
>> in the code gen yet, but I would imagine they will be (so ignore that
>> ugliness). The true ugliness can be seen in the alpha impl which open
>> codes it, including a couple relative branches. The code sequence for
>> alpha is identical to ppc, mips, and arm, so it would be nice to lower
>> these to the correct sequences before code gen rather than splitting
>> (or hiding as I did here) basic blocks after code gen.
>
> Andrew,
>
> why is the intrinsic name not CAS?

> Because, fundamentally, it loads, compares, and conditionally stores.
> There is no concept of a "swap" in SSA, so removing that aspect of the
> atomic primitives makes the *LLVM* representation easier to understand.

CAS is compare-and-SET.

> And having another version that returns
> just a bool might be better in some cases ( 1. does CAS return the value
> on all architectures?

> Check the page (http://chandlerc.net/llvm_atomics.html -- the
> implementation info is still current, even though the docs are not) for
> how this gets implemented. As Andrew has already pointed out, on x86,
> the LLVM behavior maps to the underlying architecture. Other

X86 provides the flag.

> architectures which might avoid a compare can easily do so by pattern
> matching in the codegen. I'm not saying this is 100% correct mapping of
> all architectures, but it seems very clean, and not to introduce
> performance issues on any.

If the codegen can really do it when necessary, then this is fine. Having
two intrinsics is not perfect, but it might give programmers more
confidence in the generated code.

Torvald

Food for thought, on x86 it is typical to have a lock-free algorithm like so:

int cmp, newcmp = *ptr;
do { cmp = newcmp; }
while((newcmp = cas(ptr, exch, cmp)) != cmp);

Which translates (optimized) into:

mov eax, [ptr]
loop:
lock cmpxchg [ptr], exch
jnz loop

cmpxchg fills eax with the old [ptr] value and sets ZF on success, so
it can be used without extra load and compare instructions. I am not
sure if LLVM has any concept of volatile, but [ptr] will probably be
treated as such in any algo and a CAS that returns a bool will force
you to use an extra mov (that can't be optimized out, due to volatile)
to get a new compare value. So having an intrinsic that returns the
old value is important.

Of course, I am quite ignorant of LLVM's capabilities, so I could be
completely off base here.
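To make the loop above concrete, here is a runnable sketch (C11 atomics standing in for the cas() above, with fetch-and-add as a simple example). Note how a failed compare-exchange itself delivers the fresh value, so the retry loop needs no extra load:

```c
#include <stdatomic.h>

/* Fetch-and-add built from a CAS retry loop, in the cmp/newcmp style
 * above: atomic_compare_exchange_strong writes the observed value back
 * into 'old' on failure, so no separate reload is needed. */
static int fetch_add_via_cas(_Atomic int *ptr, int delta) {
    int old = atomic_load(ptr);
    while (!atomic_compare_exchange_strong(ptr, &old, old + delta))
        ;  /* 'old' was refreshed by the failed CAS; just retry */
    return old;  /* value before the add */
}
```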

> Food for thought, on x86 it is typical to have a lock-free algorithm like
> so:
>
> int cmp, newcmp = *ptr;
> do { cmp = newcmp; }
> while((newcmp = cas(ptr, exch, cmp)) != cmp);

Yes, that's lists etc. If you want to use CAS for locking, though (there is
no TAS intrinsic), you would do while(!cas(&lock, 0, myID)); or have a
load loop if you do test-and-test-and-set things.
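The locking idiom mentioned here, sketched the same way (hypothetical lock layout: 0 means free, nonzero is the owner's ID; the inner load loop is the test-and-test-and-set refinement):

```c
#include <stdatomic.h>

/* Spinlock acquire via CAS: spin on a plain load until the lock looks
 * free, then attempt the CAS; retry if another thread won the race. */
static void lock_acquire(_Atomic int *lock, int myID) {
    int expected;
    do {
        while (atomic_load(lock) != 0)  /* test: wait until free */
            ;
        expected = 0;                   /* and-set: CAS 0 -> myID */
    } while (!atomic_compare_exchange_strong(lock, &expected, myID));
}

static void lock_release(_Atomic int *lock) {
    atomic_store(lock, 0);  /* mark free */
}
```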

> Which translates (optimized) into:
>
> mov eax, [ptr]
> loop:
> lock cmpxchg [ptr], exch
> jnz loop
>
> cmpxchg fills eax with the old [ptr] value and sets ZF on success, so
> it can be used without extra load and compare instructions. I am not
> sure if LLVM has any concept of volatile, but [ptr] will probably be
> treated as such in any algo and a CAS that returns a bool will force
> you to use an extra mov (that can't be optimized out, due to volatile)
> to get a new compare value. So having an intrinsic that returns the
> old value is important.

Agreed. I guess it really depends how much we want to have. If minimal is
sufficient, then load+store and some form of CAS are all we need.

Torvald

Swap seemed much more logical than ls. Memory isn't in SSA form, and
swapping a value in memory makes perfect sense in LLVM. LS isn't bad,
and for operations where something happens in between (add or compare)
it makes sense. I guess the question is which is the better naming
convention:
L Op S
or the terms the ops are normally known by (swap, cas, etc.).

I don't think we need the other intrinsics (esp lss), so I haven't
added them yet. If there is a compelling case why the other ones you
have are necessary, then we could add them.

Andrew

Andrew Lenharth wrote:

> Swap seemed much more logical than ls. Memory isn't in SSA form, and
> swapping a value in memory makes perfect sense in LLVM.

I guess I always tried to make the intrinsics fit SSA form rather than memory representations. Within the backend it certainly makes more sense to tie them more firmly to the underlying hardware representations, but I felt the load [op] store model fits SSA very well while lowering cleanly to the hardware, even when [op] is omitted. Either way though, it's just a matter of syntax and fluff, not substance. =D

-Chandler

> LS isn't bad

> and for operations where something happens in between (add or compare)
> it makes sense. I guess the question is it a better naming
> convention:
> L Op S
> or the terms the ops are normally known by (swap, cas, etc).
>
> I don't think we need the other intrinsics (esp lss), so I haven't
> added them yet. If there is a compelling case why the other ones you
> have are necessary, then we could add them.

The only other one was "lss", I thought? The only reason I know of for that one is that x86 happens to support it directly (lock xadd with a negated operand, iirc...), but if that is even a common use case, it could likely be extracted by the backend rather than represented in the IR... I think?

-Chandler