atomic intrinsics

I'm working on libc++'s <atomic>. This header requires close cooperation with the compiler. I set out to implement most of this header using only the gcc atomic intrinsics that clang already implements. The experience was not satisfying. :wink:

The needs of <atomic> are great. I've identified many intrinsics that need to optionally exist, and their existence needs to be individually detectable. If any individual intrinsic doesn't exist, <atomic> can lock a mutex and do the job. But <atomic> needs to know how to ask the question: do you have this atomic intrinsic for this type? (The type is an integral type or void*.)

The atomic intrinsics are basically:

load
store
exchange
compare_exchange
fetch_add
fetch_sub
fetch_or
fetch_and
fetch_xor

The first four must work on all integral types plus void*. The arithmetic operations work on the integral types except bool; void* supports only fetch_add and fetch_sub.

The really complicating point is that these mostly support six different "memory orderings":

relaxed
consume
acquire
release
acq_rel
seq_cst

(cleverly spelled to always take 6 chars ;-)). Some of the operations above need to work with only a subset of these orderings. compare_exchange comes in two flavors, strong and weak, and takes two orderings, not one: one ordering for success, one for failure. And only certain combinations are legal.
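
For reference, here is what a library-level call supplying both orderings looks like (this uses the std::atomic interface from the working paper; the particular orderings are just an illustration):

   atomic<int> a;
   int expected = 0;
   a.compare_exchange_weak(expected, 1,
                           memory_order_acq_rel,   // ordering on success
                           memory_order_acquire);  // ordering on failure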

The definitions of the orderings are here:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3126.pdf

I thought about trying to summarize them here, but knew I would get it wrong. I've put together a comprehensive list of intrinsics below, each specialized to an operation and to an ordering, or combination of orderings. I've included below only intrinsics with "legal orderings". The library can take care of detecting illegal memory orderings if that is desired.

I suggest that we take advantage of clang's __has_feature macro to detect if an intrinsic for a type exists. For example if:

   bool __atomic_load_relaxed(const volatile bool* atomic_obj);

exists, I suggest that:

   __has_feature(__atomic_load_relaxed_bool)

returns true, and false otherwise. Note that it is possible on some platforms that __has_feature(__atomic_load_relaxed_bool) might return true while __has_feature(__atomic_load_relaxed_long_long) returns false.

Below is the list of intrinsics (holding breath). Is this a direction that the clang community can rally around?

-Howard

I'm still working on <atomic>, as described below. But I paused my development to write a "Compiler writer's guide to <atomic>" which I've placed here:

http://libcxx.llvm.org/atomic_design.html

This details exactly which intrinsics must appear, in what form, and which are optional. The document also describes how the library deals with optional intrinsics that are not supplied. In a nutshell, the library calls the best intrinsic the compiler supplies, and if none is supplied, locks a mutex to do the job.
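
To make "calls the best intrinsic" concrete, here is a hedged sketch of the cascade for one operation (the feature-name spellings are illustrative, and __not_atomic_mut() is the mutex accessor from the draft header):

inline _LIBCPP_INLINE_VISIBILITY
int
__choose_load_acquire(int const volatile* __obj)
{
#if __has_feature(__atomic_load_acquire_int)
   return __atomic_load_acquire(__obj);       // exact ordering requested
#elif __has_feature(__atomic_load_seq_cst_int)
   return __atomic_load_seq_cst(__obj);       // stronger ordering is still correct
#else
   unique_lock<mutex> _(__not_atomic_mut());  // no intrinsic: lock and do the job
   return *__obj;
#endif
}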

Comments welcome. Is this a design that the clang community can rally around?

-Howard

It mostly looks good to me, but I wonder if the intrinsics should be
organized by size rather than argument type. In particular, x86-64 can
handle pair<void*,void*> atomically using cmpxchg16b, but there's no
primitive type that large (unless you want to use an mmx type?).

I actually did think about doing this by size, but decided the __atomic_* API was easier for me if I did it by type. I agree that size is probably what the compiler writer is more concerned about.

Caveat: There is a generalized atomic<T> template, which I haven't coded yet, but I was thinking about testing the size and pod-ness of T and reinterpreting T as a scalar when appropriate, in order to get pair<void*,void*> lock-free when possible. Though as you point out, on x86-64 that will never happen with my design. Hmm... I'll think on this more.
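
A rough sketch of the kind of test I have in mind (entirely hypothetical, names and all, and assuming the unsigned int intrinsic exists on the target):

// Hypothetical sketch: if _Tp is POD and the same size as an integral
// scalar, reinterpret it so the scalar intrinsics apply.
template <class _Tp>
bool
__try_scalar_load(_Tp const volatile* __obj, _Tp& __result)
{
   if (is_pod<_Tp>::value && sizeof(_Tp) == sizeof(unsigned int))
   {
       unsigned int __tmp = __atomic_load_seq_cst(
                    reinterpret_cast<unsigned int const volatile*>(__obj));
       __result = reinterpret_cast<_Tp&>(__tmp);
       return true;
   }
   return false;  // caller falls back to the locking path
}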

-Howard

How about something like this:

__atomic_load_seq_cst(__obj)

that is processed by the front end into an llvm IR intrinsic (unless it has special knowledge) and then either emitted by the back end as a call to that function or inlined as atomic code, if the back end knows how to do that. That way the compiler can make the choice, but we also don't get people using the "intrinsics" in a non-portable way, thinking they're the actual API; they just use the API, and the front end knows what to do.

This would then get us, instead of:

// load

template <class _Tp>
_Tp
__load_seq_cst(_Tp const volatile* __obj)
{
   // Lock-based fallback: not lock-free, but always correct.
   unique_lock<mutex> _(__not_atomic_mut());
   return *__obj;
}

// load bool

inline _LIBCPP_INLINE_VISIBILITY
bool
__choose_load_seq_cst(bool const volatile* __obj)
{
#if __has_feature(__atomic_load_seq_cst_b)
   return __atomic_load_seq_cst(__obj);  // the compiler supplies the intrinsic
#else
   return __load_seq_cst(__obj);         // fall back to the locking template
#endif
}

just a library call to __atomic_load_seq_cst that the back end will emit if it doesn't know how to inline the operation, so you only have to write:

// load

template <class _Tp>
_Tp
__atomic_load_seq_cst(_Tp const volatile* __obj)
{
   unique_lock<mutex> _(__not_atomic_mut());
   return *__obj;
}

and the front end would process it based on name and type, depending on what it can do. The back end can then implement 0, 1, N, or all of the intrinsics that can be lowered to target code.

You were mentioning a bit more to this in private mail, I'll let you summarize that here :slight_smile:

-eric

<nod> Thanks Eric. I should state right up front that I'm fine with this direction. But it appears to me to need much more support from the front end (which is why I didn't propose it). If we go in this direction, there are no optional intrinsics for the front end: the front end has to implement essentially everything specified in <atomic>. Here is the list:

type: bool, char, signed char, unsigned char, short, unsigned short, int,
     unsigned int, long, unsigned long, long long, unsigned long long,
     char16_t, char32_t, wchar_t, void*

type __atomic_load_relaxed(const volatile type* atomic_obj);
type __atomic_load_consume(const volatile type* atomic_obj);
type __atomic_load_acquire(const volatile type* atomic_obj);
type __atomic_load_seq_cst(const volatile type* atomic_obj);

void __atomic_store_relaxed(volatile type* atomic_obj, type desired);
void __atomic_store_release(volatile type* atomic_obj, type desired);
void __atomic_store_seq_cst(volatile type* atomic_obj, type desired);

type __atomic_exchange_relaxed(volatile type* atomic_obj, type desired);
type __atomic_exchange_consume(volatile type* atomic_obj, type desired);
type __atomic_exchange_acquire(volatile type* atomic_obj, type desired);
type __atomic_exchange_release(volatile type* atomic_obj, type desired);
type __atomic_exchange_acq_rel(volatile type* atomic_obj, type desired);
type __atomic_exchange_seq_cst(volatile type* atomic_obj, type desired);

bool __atomic_compare_exchange_weak_relaxed_relaxed(volatile type* atomic_obj,
                                                 type* expected, type desired);

bool __atomic_compare_exchange_weak_consume_relaxed(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_weak_consume_consume(volatile type* atomic_obj,
                                                 type* expected, type desired);

bool __atomic_compare_exchange_weak_acquire_relaxed(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_weak_acquire_consume(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_weak_acquire_acquire(volatile type* atomic_obj,
                                                 type* expected, type desired);

bool __atomic_compare_exchange_weak_release_relaxed(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_weak_release_consume(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_weak_release_acquire(volatile type* atomic_obj,
                                                 type* expected, type desired);

bool __atomic_compare_exchange_weak_acq_rel_relaxed(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_weak_acq_rel_consume(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_weak_acq_rel_acquire(volatile type* atomic_obj,
                                                 type* expected, type desired);

bool __atomic_compare_exchange_weak_seq_cst_relaxed(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_weak_seq_cst_consume(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_weak_seq_cst_acquire(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_weak_seq_cst_seq_cst(volatile type* atomic_obj,
                                                 type* expected, type desired);

bool __atomic_compare_exchange_strong_relaxed_relaxed(volatile type* atomic_obj,
                                                 type* expected, type desired);

bool __atomic_compare_exchange_strong_consume_relaxed(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_strong_consume_consume(volatile type* atomic_obj,
                                                 type* expected, type desired);

bool __atomic_compare_exchange_strong_acquire_relaxed(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_strong_acquire_consume(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_strong_acquire_acquire(volatile type* atomic_obj,
                                                 type* expected, type desired);

bool __atomic_compare_exchange_strong_release_relaxed(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_strong_release_consume(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_strong_release_acquire(volatile type* atomic_obj,
                                                 type* expected, type desired);

bool __atomic_compare_exchange_strong_acq_rel_relaxed(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_strong_acq_rel_consume(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_strong_acq_rel_acquire(volatile type* atomic_obj,
                                                 type* expected, type desired);

bool __atomic_compare_exchange_strong_seq_cst_relaxed(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_strong_seq_cst_consume(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_strong_seq_cst_acquire(volatile type* atomic_obj,
                                                 type* expected, type desired);
bool __atomic_compare_exchange_strong_seq_cst_seq_cst(volatile type* atomic_obj,
                                                 type* expected, type desired);
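
To show how the library would consume these, here is a hedged sketch (not actual <atomic> code; the function name is made up) dispatching a runtime memory_order onto the statically-named intrinsics above:

// Hypothetical dispatch: map the runtime ordering onto the static names.
// release and acq_rel are illegal for loads, so they fall into the default.
inline int
__atomic_load_explicit_int(const volatile int* __obj, memory_order __o)
{
   switch (__o)
   {
   case memory_order_relaxed: return __atomic_load_relaxed(__obj);
   case memory_order_consume: return __atomic_load_consume(__obj);
   case memory_order_acquire: return __atomic_load_acquire(__obj);
   default:                   return __atomic_load_seq_cst(__obj);
   }
}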

For what it's worth, an implementation based on size would be very likely to automatically work for address-space-qualified pointers, which are clearly foremost on everyone's mind. :slight_smile:

More relevantly, it would avoid the need for seven different intrinsics to support int, unsigned int, long, unsigned long, char32_t, wchar_t, and void* in the common case where these are exactly the same operation.
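
Hypothetically (these spellings are illustrative, not a proposal), size-keyed intrinsics might look like:

unsigned int       __atomic_load_relaxed_4(const volatile void* atomic_obj);
unsigned long long __atomic_load_relaxed_8(const volatile void* atomic_obj);

and a 16-byte flavor would cover pair<void*,void*> on x86-64 via cmpxchg16b.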

John.

Why isn't automatic fallback possible in this case when the intrinsic for a particular size is unsupported?

Sebastian

In the current design (http://libcxx.llvm.org/atomic_design.html) the library asks whether an intrinsic exists or not, since the intrinsics are optional. Thus, in the eyes of the compiler writer (front and back ends), the fallback is automatic. Not every intrinsic need be implemented.

Eric is suggesting moving that branch into the front end. The compiler intrinsic exists in the front end, and the library calls it. The front end asks the back end whether it is supported and decides what to do. I.e., to the library and to the back end, the fallback is automatic, because the front end is now making the decision.

-Howard

Right, I mixed up Eric's suggestion and the suggestion about distinguishing intrinsic support by size instead of type.

Sebastian

I don't know for sure yet, but I'm guessing that the front end design would more easily handle "unusual sizes" like struct {char _[128];}. The only operations needed for such objects are load, store, exchange, and compare_exchange_strong/weak (no arithmetic).
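
For instance, a single generic entry point (hypothetical spelling) could cover any size, with the front end choosing between inlined code and a locked fallback:

// Hypothetical generic form for arbitrary sizes: the result is written
// through the second pointer, since the object may be too large to
// return by value.
void __atomic_load_seq_cst_n(const volatile void* atomic_obj,
                             void* result, size_t size);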

-Howard

If other compilers want to adopt clang-compatible atomic intrinsics, then they will most likely adopt the entire set, or none.

The other issue is that a particular target might only support some subset of these operations. In this case, it's probably still cleaner to have the fallback code in the back end, because then optimisations can work more easily, with the high-level knowledge of the operations.

David

-- Sent from my Difference Engine

I've been assuming that if we go this direction (and I think it is a very good idea), the fallback (locking) code would be in compiler-rt, though I have no strong opinion about where it should be.

-Howard

In the hopes of clarification, I've put three design descriptions up at:

http://libcxx.llvm.org/atomic_design.html

A and/or B is what Eric proposed. C is what I've been working toward. A is my preference. But I can't implement A without buy-in from the front end team. Heck, I can't actually implement any of them without said buy-in. :slight_smile: I chose C originally because that was where I thought I could most likely get buy-in.

If any of those docs are unclear, just let me know and I'll attempt to clarify what is intended.

-Howard

To be more specific for A:

The front end handles every intrinsic listed, and the library provides a default, unoptimized implementation. The back end provides a better, optimized implementation if it can; otherwise it falls back to a call to the library function.

-eric

In the above sentence I presume "the library" refers to compiler-rt; otherwise clang would be tied to libc++.

-Howard

Either/or. It depends on whether you want these intrinsics to be generally useful or tied to the implementation of <atomic>.

-eric

I think if we do a good job with this, other implementations will follow. Therefore I recommend compiler-rt (similar to how we deal with complex arithmetic, count leading zeros, etc.).
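
A minimal sketch of what such a compiler-rt fallback might look like (hypothetical; among other things, a real implementation would want a lock keyed by object address rather than one global mutex):

#include <pthread.h>

// Hypothetical compiler-rt fallback: one process-wide lock guards
// operations the target cannot perform natively.
static pthread_mutex_t __atomic_fallback_mut = PTHREAD_MUTEX_INITIALIZER;

extern "C" int __atomic_load_seq_cst_int(const volatile int* obj)
{
   pthread_mutex_lock(&__atomic_fallback_mut);
   int result = *obj;
   pthread_mutex_unlock(&__atomic_fallback_mut);
   return result;
}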

-Howard

Sure :slight_smile:

You get to write it either way :wink:

-eric

Fine with me, that part's easy. Someone else gets to write the lock-free assembly. :slight_smile:

-Howard