atomic intrinsics

I'm beginning to survey Chapter 29, <atomic> (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3092.pdf), to see what is actually required or desired in the way of compiler intrinsics for atomic synchronization. The required functionality appears to be a superset of the gcc __sync_* intrinsics (http://gcc.gnu.org/onlinedocs/gcc-4.5.0/gcc/Atomic-Builtins.html#Atomic-Builtins). I would like to start a conversation on exactly what intrinsics we would like to support in clang. Maybe we want the full set specified by <atomic>, or maybe we want a subset and build the rest on top of it. At this point I don't know, and I'm looking for people with more expertise in this area than I have.
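
For a quick flavor of the difference (my illustration, not from either
document): the gcc builtins generally imply a full barrier, while the
<atomic> operations take an explicit, possibly weaker, ordering:

#include <atomic>

int plain_counter = 0;
std::atomic<int> counter(0);

void demo()
{
    // gcc builtin: read-modify-write with an implied full barrier.
    int old1 = __sync_fetch_and_add(&plain_counter, 1);

    // C++0x <atomic>: same operation, explicitly relaxed ordering.
    int old2 = counter.fetch_add(1, std::memory_order_relaxed);
    (void)old1; (void)old2;
}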

<atomic> specifies 14 operations, crossed with various memory-ordering constraints; I've summarized them below:

void store(volatile type* x, type source) : memory_order_relaxed,
                                            memory_order_release,
                                            memory_order_seq_cst

type load(volatile type* x) : memory_order_relaxed,
                              memory_order_consume,
                              memory_order_acquire,
                              memory_order_seq_cst

type exchange(volatile type* x, type source) : memory_order_relaxed,
                                               memory_order_consume,
                                               memory_order_acquire,
                                               memory_order_release,
                                               memory_order_acq_rel,
                                               memory_order_seq_cst

bool compare_exchange_strong(volatile type* x, type* expected, type desired,
                             success_order, failure_order)
                             : memory_order_relaxed, ...
                               memory_order_consume, ...
                               memory_order_acquire, ...
                               memory_order_release, ...
                               memory_order_acq_rel, ...
                               memory_order_seq_cst, ...

bool compare_exchange_weak(volatile type* x, type* expected, type desired,
                           success_order, failure_order)
                           : memory_order_relaxed, ...
                             memory_order_consume, ...
                             memory_order_acquire, ...
                             memory_order_release, ...
                             memory_order_acq_rel, ...
                             memory_order_seq_cst, ...

type fetch_add(volatile type* x, type y) : memory_order_relaxed,
                                           memory_order_consume,
                                           memory_order_acquire,
                                           memory_order_release,
                                           memory_order_acq_rel,
                                           memory_order_seq_cst

type fetch_sub(volatile type* x, type y) : memory_order_relaxed,
                                           memory_order_consume,
                                           memory_order_acquire,
                                           memory_order_release,
                                           memory_order_acq_rel,
                                           memory_order_seq_cst

type fetch_or(volatile type* x, type y) : memory_order_relaxed,
                                          memory_order_consume,
                                          memory_order_acquire,
                                          memory_order_release,
                                          memory_order_acq_rel,
                                          memory_order_seq_cst

type fetch_xor(volatile type* x, type y) : memory_order_relaxed,
                                           memory_order_consume,
                                           memory_order_acquire,
                                           memory_order_release,
                                           memory_order_acq_rel,
                                           memory_order_seq_cst

type fetch_and(volatile type* x, type y) : memory_order_relaxed,
                                           memory_order_consume,
                                           memory_order_acquire,
                                           memory_order_release,
                                           memory_order_acq_rel,
                                           memory_order_seq_cst

bool test_and_set(volatile flag* x) : memory_order_relaxed,
                                      memory_order_consume,
                                      memory_order_acquire,
                                      memory_order_release,
                                      memory_order_acq_rel,
                                      memory_order_seq_cst

void clear(volatile flag* x) : memory_order_relaxed,
                               memory_order_consume,
                               memory_order_acquire,
                               memory_order_release,
                               memory_order_acq_rel,
                               memory_order_seq_cst

void fence() : memory_order_relaxed,
               memory_order_consume,
               memory_order_acquire,
               memory_order_release,
               memory_order_acq_rel,
               memory_order_seq_cst

void signal_fence() : memory_order_relaxed,
                      memory_order_consume,
                      memory_order_acquire,
                      memory_order_release,
                      memory_order_acq_rel,
                      memory_order_seq_cst
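
(For reference, and purely as my own illustration, here is roughly how
a few of these operations surface through the C++0x library interface,
including the separate success/failure orders on compare_exchange:)

#include <atomic>

void demo(std::atomic<int>& x, std::atomic_flag& f)
{
    x.store(1, std::memory_order_release);
    int v = x.load(std::memory_order_acquire);

    int expected = 1;
    // success order acq_rel crossed with failure order acquire
    bool ok = x.compare_exchange_strong(expected, 2,
                                        std::memory_order_acq_rel,
                                        std::memory_order_acquire);

    bool was_set = f.test_and_set(std::memory_order_acquire);
    f.clear(std::memory_order_release);
    (void)v; (void)ok; (void)was_set;
}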

-Howard

A couple of us have been working, sporadically, on the matching LLVM
intrinsics: http://docs.google.com/Doc?docid=0AYWBeVVqyP7dZGRiNG1oeHpfMjJkejVnOThkZA&hl=en.
We suspect, although we're not certain, that we can get away with just
atomic load, store, add, exchange, compare_exchange, and fence, and
have the backend recognize certain cmpxchg-loops and lower them to the
appropriate native atomic sequence when one is available. Even add and
exchange are redundant given compare_exchange; we include them only
because we expect them to be more common than the other operations, so
they may benefit from a smaller encoding.
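
To illustrate the kind of cmpxchg-loop we mean (my sketch, not code
from the proposal), here is fetch_or built from compare_exchange; a
backend with a native atomic OR could pattern-match this back into a
single instruction:

#include <atomic>

int fetch_or_via_cas(std::atomic<int>& x, int mask,
                     std::memory_order order)
{
    int old = x.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously, hence the loop;
    // on failure it reloads 'old' with the current value.
    while (!x.compare_exchange_weak(old, old | mask, order,
                                    std::memory_order_relaxed))
    {
    }
    return old;
}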

We aren't modeling the difference between cmpxchg_strong and
cmpxchg_weak, on the assumption that the backend can figure out
whether the surrounding code could observe the difference. (I haven't
thought hard about whether that's true.) We have only a single order
argument to cmpxchg, and we've omitted memory_order_consume to keep
things simple in the first version.
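
To make the strong/weak distinction concrete (again my example, not
the proposal's): weak may fail spuriously even when the value matches,
so a single unlooped call is the one place code could observe the
difference:

#include <atomic>

void demo(std::atomic<int>& a)
{
    int expected = 0;
    // strong: fails only if 'a' really didn't hold 'expected'
    bool s = a.compare_exchange_strong(expected, 1);

    expected = 0;
    // weak: may fail spuriously even when a == expected, which is
    // why it normally appears only inside a retry loop
    bool w = a.compare_exchange_weak(expected, 1);
    (void)s; (void)w;
}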

We haven't converted the proposal to use instructions yet (we started
out with intrinsics), so the format below is just rough, but the basic
schema is:

%old = i32 atomic_exchange i32* %ptr, i32 %new, order,
       single_thread|cross_thread [, volatile]

The instructions can take a pointer to any type, up to a
target-dependent maximum size. Clang would be responsible for turning
its intrinsics into library calls when their arguments get too big.
Note that the maximum size becomes part of the platform ABI, because
you can't mix lock-based library calls and atomic instructions on the
same address.

The single_thread|cross_thread argument lets us merge
atomic_signal_fence() with atomic_thread_fence(). It may also be
useful for things like atomic-increments that only need to be atomic
to communicate with a signal handler—such an increment could compile
to inc instead of lock;xadd.

So ... here's what I'd propose for clang's builtins:

__atomic_foo(T*, args, memory_order, bool cross_thread)

foo is: load, store, add, exchange, compare_exchange_weak, fence,
test_and_set, clear
args and the return type depend on foo.
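
Purely to illustrate (these spellings are my guesses; none of these
builtins exist yet), usage might look like:

// Hypothetical sketch only: builtin names and signatures are guesses.
void demo(int* x, int* counter)
{
    // Ordinary cross-thread acquire load.
    int v = __atomic_load(x, memory_order_acquire, /*cross_thread*/ true);

    // Increment that only needs to be atomic with respect to a signal
    // handler; on x86 this could lower to a plain inc instead of
    // lock xadd.
    __atomic_add(counter, 1, memory_order_relaxed, /*cross_thread*/ false);
    (void)v;
}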

I'm including test_and_set and clear because the right lowering to
LLVM IR might depend on the target. (If I remember correctly, on
PA-RISC test_and_set is an exchange(0), while everywhere else it's an
exchange(1).)
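
A sketch of that target difference (my hypothetical lowering, with the
PA-RISC convention inferred from its ldcw load-and-clear instruction):

#include <atomic>

// "set" is conventionally 0 on PA-RISC (ldcw loads a word and clears
// it), and 1 everywhere else.
bool test_and_set_lowering(std::atomic<int>* f)
{
#ifdef __hppa__
    return f->exchange(0) == 0;  // previously set if old value was 0
#else
    return f->exchange(1) == 1;  // previously set if old value was 1
#endif
}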

The volatile bit on the instruction would be set according to whether T is volatile.

When the order and cross_thread arguments aren't obviously constant,
Clang would emit a switch to pick between the appropriate
instructions. Alternatively, Clang could specify that they have to be
constant, and the library would emit the switch. Or, if you really
don't want the switch, we could change the proposed LLVM intrinsics to
take variable arguments for those and have the backend emit the
conditionals if necessary.
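
For concreteness, the switch for a load with a runtime order might
look like this (my sketch; orders that are invalid for a load are
simply strengthened here):

#include <atomic>

int load_dispatch(std::atomic<int>* p, std::memory_order mo)
{
    switch (mo)
    {
    case std::memory_order_relaxed:
        return p->load(std::memory_order_relaxed);
    case std::memory_order_consume:
    case std::memory_order_acquire:
        return p->load(std::memory_order_acquire);
    default:
        return p->load(std::memory_order_seq_cst);
    }
}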

Jeffrey

Thanks much for the very detailed answer, Jeffrey!

Are there any changes to the C++0x working draft:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3092.pdf

that you believe need to be made in this area in order to significantly increase the quality of the llvm/clang implementation? Personally, I'm wondering about ways to ensure that the memory order is a compile-time constant. This question has a tight deadline: I'm turning in national body comments on the C++0x FCD in about 8 hours.

-Howard

While you should certainly try to turn in any comments you can in the next 8 hours, the UK is waiting slightly longer, so you could pass comments through the UK national body (which includes me) if you find anything later.

Chris

I've been talking to Lawrence Crowl about the atomics along the way,
so I'm pretty happy with the state of the standard. Even if LLVM
requires cmpxchg-loops for most of these operations, I think the extra
operations on the atomics provide a better user-facing programming
model. The idea of allowing signal-only operations on things other
than fences is something of an experiment, so I'm not sure it belongs
in the standard yet.

On constants, I believe their thinking was that users will generally
pass a constant in voluntarily, and that compilers are pretty good at
propagating constants these days, so the fact that the order is
formally a runtime variable shouldn't hurt things. The tricky bit for
clang seems to be that LLVM is good at propagating constants but clang
isn't, so we may have to generate more IR than if the memory_order
were passed in, say, a template argument. To me, that's not a good
enough argument to change the user-facing syntax.
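
For reference, the template-argument alternative alluded to above
might look like this (illustrative only; it is not what the FCD
specifies):

#include <atomic>

// The order becomes a guaranteed compile-time constant.
template <std::memory_order Order>
int load_with(const std::atomic<int>& a)
{
    return a.load(Order);
}

// usage: int v = load_with<std::memory_order_acquire>(x);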

Ok, thanks.

-Howard