cmpxchg on floats

Hi LLVM-dev,

when working on MLIR-to-LLVM-IR conversion, I noticed that it is possible to programmatically construct a cmpxchg instruction operating on floats (or actually any type since there is no assertion on pointer element type here https://github.com/llvm/llvm-project/blob/9c2e708f0dc547d386ea528450a33ef4bd2a750b/llvm/lib/IR/Instructions.cpp#L1501), but LangRef specifies that only integers and pointers are accepted (https://llvm.org/docs/LangRef.html#cmpxchg-instruction). Does somebody rely on other types being accepted in cmpxchg or should we add an assertion for the element type to match the LangRef?

Does the code generator support this? If not, we should add an assertion to the verifier. If so, we can consider relaxing langref or making it more specific about floats.

-Chris

We've relaxed `atomicrmw xchg` to support floating point types but not
cmpxchg -- the cmpxchg comparison behavior is not a floating point
comparison, so that would be potentially misleading. I'd say adding
the assertion is a good idea.

Cheers,
Nicolai

We (C, C++, and LLVM) are generally moving towards supporting FP as a first-class thing with all atomic operations †, including cmpxchg. It’s indeed usually specified as a bitwise comparison, not a floating-point one, although IIRC AMD has an FP cmpxchg. Similarly, some of the operations are allowed to have separate FP state (say, atomic add won’t necessarily affect the scalar FP execution’s exception state, might have a different rounding mode, etc).

I’m OK with MLIR matching the current state of LLVM, but I want to point out that the docs might be incorrect, or are expected to change in the future.

† See http://wg21.link/p0020 as well as r360421 D58251 D50491 D26266 D26266 r264845 D15471

We’ve relaxed atomicrmw xchg to support floating point types but not
cmpxchg – the cmpxchg comparison behavior is not a floating point
comparison, so that would be potentially misleading. I’d say adding
the assertion is a good idea.

Cheers,
Nicolai

Does the code generator support this? If not, we should add an assertion to the verifier. If so, we can consider relaxing langref or making it more specific about floats.

I agree with this. Generally it’s a bit tricky because, some of these are tied to the atomic expand pass, where a backend says what it supports.

We (C, C++, and LLVM) are generally moving towards supporting FP as a first-class thing with all atomic operations †, including cmpxchg. It’s indeed *usually* specified as a bitwise comparison, not a floating-point one, although IIRC AMD has an FP cmpxchg.

We do, but this should arguably be a different instruction because it
behaves differently from bitwise compare.

Cheers,
Nicolai

We (C, C++, and LLVM) are generally moving towards supporting FP as a first-class thing with all atomic operations †, including cmpxchg. It’s indeed *usually* specified as a bitwise comparison, not a floating-point one, although IIRC AMD has an FP cmpxchg.

We do, but this should arguably be a different instruction because it
behaves differently from bitwise compare.

I wholeheartedly agree to separating them :slight_smile:

We don't really FP cmpxchg in hardware to implement it, do we? It can be
lowered as load, FP compare, if not equal cmpxchg load?

Joerg

> We (C, C++, and LLVM) are generally moving towards supporting FP as a
> first-class thing with all atomic operations †, including cmpxchg. It’s
> indeed *usually* specified as a bitwise comparison, not a floating-point
> one, although IIRC AMD has an FP cmpxchg. Similarly, some of the
> operations are allowed to have separate FP state (say, atomic add won’t
> necessarily affect the scalar FP execution’s exception state, might
> have a different rounding mode, etc).

We don't really FP cmpxchg in hardware to implement it, do we? It can be
lowered as load, FP compare, if not equal cmpxchg load?

Two points here:

1. Hardware with native fcmpxchg already exists.
2. It's incorrect even if I replace your "if not equal" by "if equal"
(which I assume is what you meant).

On the latter, assume your float in memory is initially -0.0, thread 1
does cmpxchg(-0.0, +0.0) and thread 2 does fcmpxchg(+0.0, 1.0). The
memory location is guaranteed to be 1.0 after both threads have run,
but this is no longer true with your replacement, because the
following ordering of operations is possible:

- Thread 2 loads -0.0, compares to +0.0 => comparison is equal
- Thread 1 does cmpxchg, memory value is now changed to +0.0
- Thread 2 does cmpxchg(-0.0, 1.0) now, testing whether the memory
location is unchanged --> this fails, so the memory location stays
+0.0

Cheers,
Nicolai

Right, I agree. I think this argues for this being a separate ‘fcmpxchg’ instruction, because the condition code is different.

-Chris

Thread 2 does the cmpxchg with the loaded value, not the value it is
tested for. So thread 2 would be using +0.0 as well.

Joerg

> > > We (C, C++, and LLVM) are generally moving towards supporting FP as a
> > > first-class thing with all atomic operations †, including cmpxchg. It’s
> > > indeed *usually* specified as a bitwise comparison, not a floating-point
> > > one, although IIRC AMD has an FP cmpxchg. Similarly, some of the
> > > operations are allowed to have separate FP state (say, atomic add won’t
> > > necessarily affect the scalar FP execution’s exception state, might
> > > have a different rounding mode, etc).
> >
> > We don't really FP cmpxchg in hardware to implement it, do we? It can be
> > lowered as load, FP compare, if not equal cmpxchg load?
>
> Two points here:
>
> 1. Hardware with native fcmpxchg already exists.
> 2. It's incorrect even if I replace your "if not equal" by "if equal"
> (which I assume is what you meant).
>
> On the latter, assume your float in memory is initially -0.0, thread 1
> does cmpxchg(-0.0, +0.0) and thread 2 does fcmpxchg(+0.0, 1.0). The
> memory location is guaranteed to be 1.0 after both threads have run,
> but this is no longer true with your replacement, because the
> following ordering of operations is possible:
>
> - Thread 2 loads -0.0, compares to +0.0 => comparison is equal
> - Thread 1 does cmpxchg, memory value is now changed to +0.0
> - Thread 2 does cmpxchg(-0.0, 1.0) now, testing whether the memory
> location is unchanged --> this fails, so the memory location stays
> +0.0

Thread 2 does the cmpxchg with the loaded value, not the value it is
tested for. So thread 2 would be using +0.0 as well.

Please re-read the sequence of events carefully. Thread 2 did read
-0.0, so that's the value it's using.

Cheers,
Nicolai

Right and so it can fail as a weak cmpxchg does. It will pick up the
correct value in the next iteration. This is not really that different
from spurious failures of LL/SC etc.

Joerg

> >
> > > > > We (C, C++, and LLVM) are generally moving towards supporting FP as a
> > > > > first-class thing with all atomic operations †, including cmpxchg. It’s
> > > > > indeed *usually* specified as a bitwise comparison, not a floating-point
> > > > > one, although IIRC AMD has an FP cmpxchg. Similarly, some of the
> > > > > operations are allowed to have separate FP state (say, atomic add won’t
> > > > > necessarily affect the scalar FP execution’s exception state, might
> > > > > have a different rounding mode, etc).
> > > >
> > > > We don't really FP cmpxchg in hardware to implement it, do we? It can be
> > > > lowered as load, FP compare, if not equal cmpxchg load?
> > >
> > > Two points here:
> > >
> > > 1. Hardware with native fcmpxchg already exists.
> > > 2. It's incorrect even if I replace your "if not equal" by "if equal"
> > > (which I assume is what you meant).
> > >
> > > On the latter, assume your float in memory is initially -0.0, thread 1
> > > does cmpxchg(-0.0, +0.0) and thread 2 does fcmpxchg(+0.0, 1.0). The
> > > memory location is guaranteed to be 1.0 after both threads have run,
> > > but this is no longer true with your replacement, because the
> > > following ordering of operations is possible:
> > >
> > > - Thread 2 loads -0.0, compares to +0.0 => comparison is equal
> > > - Thread 1 does cmpxchg, memory value is now changed to +0.0
> > > - Thread 2 does cmpxchg(-0.0, 1.0) now, testing whether the memory
> > > location is unchanged --> this fails, so the memory location stays
> > > +0.0
> >
> > Thread 2 does the cmpxchg with the loaded value, not the value it is
> > tested for. So thread 2 would be using +0.0 as well.
>
> Please re-read the sequence of events carefully. Thread 2 did read
> -0.0, so that's the value it's using.

Right and so it can fail as a weak cmpxchg does. It will pick up the
correct value in the next iteration. This is not really that different
from spurious failures of LL/SC etc.

Okay, I see what you mean. cmpxchg isn't weak by default though, so my
assumption is that a potential fcmpxchg would be the same. Can we
agree that your translation is correct for `fcmpxchg weak`, but not
for default `fcmpxchg`? You can make it correct by wrapping it in a
retry loop, at the cost of reducing performance even further compared
to a native fcmpxchg.

Cheers,
Nicolai

Sure, the point where this all started was the absense of native
fcmpxchg. So the short summary is that we can easily extend the current
IR to allow float types and provide a sane lowering.

Joerg

That’s correct, but I’m mainly interested in bitwise comparison. That’s what C, C++, and (IMO) LLVM IR mean when an FP value is passed to cmpxchg.

Separately, there are operations such as atomic fadd and atomic fsub which make senses, can be supported directly by HW, and can have separate FP state.

I bring this up because I believe the discussion which started this thread can benefit from a bit wider perspective than “the LangRef says exactly *this*”.