atomic intrinsics

Sure :slight_smile:

You get to write it either way :wink:

Fine with me, that part's easy. Someone else gets to write the lock-free assembly. :slight_smile:

Heh. :slight_smile:

-eric

* volatile: The optimizers can do a better job with non-volatile
atomic operations. I believe in your initial atomic_flag
implementation you forwarded the argument's volatility to the
intrinsic, and if the compiler implements the intrinsics it can detect
whether the argument is volatile and optimize accordingly, so that's
all good. But in these API proposals you only have the volatile
intrinsics listed, which might mislead compiler authors to think that
every atomic op has to be volatile. I think it'd be worth mentioning
that notionally there are volatile and non-volatile versions of each
intrinsic.

* "POD" isn't quite the right restriction. [atomics.types.generic]p1
says "The type of the template argument T shall be trivially
copyable." There may be an NB comment suggesting an extra restriction,
but I don't remember exactly what it said.

* I'm not sure if you want this in the API doc, but the choice of
mutex vs. lock-free instructions to implement each size/type is an ABI
decision, and that needs to be noted somewhere for each target. That
is, if clang-2.9 implements the 16-byte atomic by calling out to
compiler-rt, which uses a spinlock to make it atomic, and then
clang-2.10 implements it by using the cmpxchg16b instruction, that's
an ABI change. To avoid ABI changes, we'd need to implement all the
atomics up to the largest platform-supported size using instructions
in the first version that supports them at all.

Jeffrey

Thanks Jeffrey, comments below:

In the hopes of clarification, I've put three design descriptions up at:

http://libcxx.llvm.org/atomic_design.html

A and/or B is what Eric proposed. C is what I've been working toward. A is my preference. But I can't implement A without buy-in from the front end team. Heck, I can't actually implement any of them without said buy-in. :slight_smile: I chose C originally because that was where I thought I could most likely get buy-in.

If any of those docs are unclear, just let me know and I'll attempt to clarify what is intended.

* volatile: The optimizers can do a better job with non-volatile
atomic operations. I believe in your initial atomic_flag
implementation you forwarded the argument's volatility to the
intrinsic, and if the compiler implements the intrinsics it can detect
whether the argument is volatile and optimize accordingly, so that's
all good. But in these API proposals you only have the volatile
intrinsics listed, which might mislead compiler authors to think that
every atomic op has to be volatile. I think it'd be worth mentioning
that notionally there are volatile and non-volatile versions of each
intrinsic.

Updated only Design A:

http://libcxx.llvm.org/atomic_design_a.html

* "POD" isn't quite the right restriction. [atomics.types.generic]p1
says "The type of the template argument T shall be trivially
copyable." There may be an NB comment suggesting an extra restriction,
but I don't remember exactly what it said.

Updated Design A:

http://libcxx.llvm.org/atomic_design_a.html

* I'm not sure if you want this in the API doc, but the choice of
mutex vs. lock-free instructions to implement each size/type is an ABI
decision, and that needs to be noted somewhere for each target. That
is, if clang-2.9 implements the 16-byte atomic by calling out to
compiler-rt, which uses a spinlock to make it atomic, and then
clang-2.10 implements it by using the cmpxchg16b instruction, that's
an ABI change. To avoid ABI changes, we'd need to implement all the
atomics up to the largest platform-supported size using instructions
in the first version that supports them at all.

I haven't taken action yet on this as I'm not yet sure how to modify the design doc. Your point reminded me of something else:

The library has functions that look like:

    bool atomic_is_lock_free(const volatile type*);
    bool atomic_is_lock_free(const atomic_itype*);

To implement this, the library must be able to ask the compiler this question in some way. So perhaps I should add to the list of intrinsics:

    bool __atomic_is_lock_free(const type*); // volatile implied by comment just added to A

Does the existence of this intrinsic mitigate your ABI concern? I.e. if the client cares about this ABI, he has to query it.

-Howard

I don't think it mitigates the ABI concern, although I could be wrong.
Say you have:

TU1.cc

I see what you mean. I'll mention this in the design doc. I believe it will be an issue for any design we choose (A, B, or C).

-Howard

I've updated:

http://libcxx.llvm.org/atomic_design.html

with this concern. And I've added:

bool __atomic_is_lock_free(const type* atomic_obj);

to:

http://libcxx.llvm.org/atomic_design_a.html
http://libcxx.llvm.org/atomic_design_b.html

-Howard

After some thought, I think I personally prefer proposal A, slightly modified such that __atomic_is_lock_free is a pseudofunction like __is_pod(). The builtins have exactly those names and are "overloaded"; memory ordering parameters are required to be constant expressions but not necessarily literals. Arguments not of fundamental type, or not exactly matching a permitted type for the intrinsic, are not required to be supported by the implementation (so calling swap with a long* and an int, or a Base** and a Derived*, or a ConvertibleToIntPtr and a ConvertibleToInt, may or may not have sensible results).

I'm not worried about the amount of work this imposes on the compiler writers because (1) most of that work is actually pretty much copy-and-paste and (2) it's all required for an ABI-conformant implementation anyway. So the fact that proposal B nicely adapts as the compiler incrementally improves support is not interesting to me because the compiler isn't actually permitted to incrementally improve support. :slight_smile: Also I really don't like the idea of this library requiring almost a hundred new differently-named builtins.

This doesn't strictly matter for you, but we would implement this in Clang by having the frontend do the ABI-compliant lowerings for the locking implementations and then emitting everything that can be done lock-free as an LLVM intrinsic call.

Anyway, just my vote.

John.

I like the changes, and yeah, I don't think anyone liked B :slight_smile:

-eric

Thanks for looking at this John.

I see what you mean. I'll mention this in the design doc. I believe it will be an issue for any design we choose (A, B, or C).

After some thought, I think I personally prefer proposal A, slightly modified such that __atomic_is_lock_free is a pseudofunction like __is_pod().

I see, like:

   bool __atomic_is_lock_free(type);

?

The builtins have exactly those names and are "overloaded"; memory ordering parameters are required to be constant expressions but not necessarily literals.

If the memory orderings are required to be constant expressions, this is really Design B, just with some renaming. It means the library must still switch() to convert the run time memory orderings it receives into compile time orderings. E.g.:

inline _LIBCPP_INLINE_VISIBILITY
bool atomic_compare_exchange_weak_explicit(volatile atomic_bool* __obj,
                                           bool* __exp, bool __desr,
                                           memory_order __s, memory_order __f)
{
    __f = __translate_memory_order(__f);
    switch (__s)
    {
    case memory_order_relaxed:
        return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 0, 0);
    case memory_order_consume:
        switch (__f)
        {
        case memory_order_relaxed:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 1, 0);
        case memory_order_consume:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 1, 1);
        }
    case memory_order_acquire:
        switch (__f)
        {
        case memory_order_relaxed:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 2, 0);
        case memory_order_consume:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 2, 1);
        case memory_order_acquire:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 2, 2);
        }
    case memory_order_release:
        switch (__f)
        {
        case memory_order_relaxed:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 3, 0);
        case memory_order_consume:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 3, 1);
        case memory_order_acquire:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 3, 2);
        }
    case memory_order_acq_rel:
        switch (__f)
        {
        case memory_order_relaxed:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 4, 0);
        case memory_order_consume:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 4, 1);
        case memory_order_acquire:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 4, 2);
        }
    case memory_order_seq_cst:
        switch (__f)
        {
        case memory_order_relaxed:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 5, 0);
        case memory_order_consume:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 5, 1);
        case memory_order_acquire:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 5, 2);
        case memory_order_seq_cst:
            return __atomic_compare_exchange_weak(&__obj->__v_, __exp, __desr, 5, 5);
        }
    }
}

This is not a problem for libc++. However I've heard rumblings that this won't be ok for the C library. Want to comment Blaine?

Arguments not of fundamental type, or not exactly matching a permitted type for the intrinsic, are not required to be supported by the implementation (so calling swap with a long* and an int, or a Base** and a Derived*, or a ConvertibleToIntPtr and a ConvertibleToInt, may or may not have sensible results).

I'm afraid my spec is ambiguous, sorry. In:

bool __atomic_compare_exchange_strong(type* atomic_obj,
                                      type* expected, type desired,
                                      int mem_success, int mem_failure);

It was my intention that in each argument that specifies "type", the types are all required to be the same. In C++ templates, this would be:

template <class type>
bool __atomic_compare_exchange_strong(type* atomic_obj,
                                      type* expected, type desired,
                                      int mem_success, int mem_failure);

Concerning arguments "type" that are not scalars: I believe I can handle this in libc++ with some fancy template meta programming, and pass them through to the front end type-punned as scalars if there is a scalar of the appropriate size, and otherwise route them to a library-lock function. I can do this because "type" is a-priori known to be trivially copyable.

However, and again, this is significant functionality at the library level that I don't know will be addressible for the C library. Also it misses the optimization Jeffrey mentioned for pair<void*,void*> on x86-64. I.e. x86-64 can atomically handle 16 byte objects, but there is no 16 byte scalar which can be used to slip it to the front end (unless the front end accepts long double I guess).

This doesn't strictly matter for you, but we would implement this in Clang by having the frontend do the ABI-compliant lowerings for the locking implementations and then emitting everything that can be done lock-free as an LLVM intrinsic call.

Does this translate to a call to compiler-rt for the locking implementations? I ask because I'm probably going to be the one to implement these in compiler-rt if they are needed.

Anyway, just my vote.

Thanks much John.

-Howard

Thanks for looking at this John.

I see what you mean. I'll mention this in the design doc. I believe it will be an issue for any design we choose (A, B, or C).

After some thought, I think I personally prefer proposal A, slightly modified such that __atomic_is_lock_free is a pseudofunction like __is_pod().

I see, like:

  bool __atomic_is_lock_free(type);

?

That's what I was thinking.

The builtins have exactly those names and are "overloaded"; memory ordering parameters are required to be constant expressions but not necessarily literals.

If the memory orderings are required to be constant expressions, this is really Design B, just with some renaming.

I see what you mean. I hadn't actually realized that the library actually takes these as runtime parameters; that seems very un-C++-ish to me. Certainly it's not very "pay for what you use", although I guess the presumption is that it will quickly optimize to the right thing.

This is not a problem for libc++. However I've heard rumblings that this won't be ok for the C library. Want to comment Blaine?

Are the committees actually going to agree enough here to permit re-use? The main advantage of requiring the parameters to be constant expressions is that the builtins don't become intimately tied to a specific library. I'm very concerned about building compiler APIs on top of competing and incomplete standards.

If we accept this as a runtime argument, we will invoke undefined behavior on out-of-range enumerators. Is that acceptable to C and C++?

Arguments not of fundamental type, or not exactly matching a permitted type for the intrinsic, are not required to be supported by the implementation (so calling swap with a long* and an int, or a Base** and a Derived*, or a ConvertibleToIntPtr and a ConvertibleToInt, may or may not have sensible results).

I'm afraid my spec is ambiguous, sorry. In:

bool __atomic_compare_exchange_strong(type* atomic_obj,
                                     type* expected, type desired,
                                     int mem_success, int mem_failure);

It was my intention that in each argument that specifies "type", the types are all required to be the same.

Right, I just clarifying that.

In C++ templates, this would be:

template <class type>
bool __atomic_compare_exchange_strong(type* atomic_obj,
                                     type* expected, type desired,
                                     int mem_success, int mem_failure);

Concerning arguments "type" that are not scalars: I believe I can handle this in libc++ with some fancy template meta programming, and pass them through to the front end type-punned as scalars if there is a scalar of the appropriate size, and otherwise route them to a library-lock function. I can do this because "type" is a-priori known to be trivially copyable.

Okay. If the libraries need to work with arbitrary trivially-copyable types, we should definitely do that in the builtins instead of in the library. I'm willing to have the builtins take trivially-copyable types.

This doesn't strictly matter for you, but we would implement this in Clang by having the frontend do the ABI-compliant lowerings for the locking implementations and then emitting everything that can be done lock-free as an LLVM intrinsic call.

Does this translate to a call to compiler-rt for the locking implementations? I ask because I'm probably going to be the one to implement these in compiler-rt if they are needed.

Multiple calls, I think; an explicit acquire and release call. Unless you think there's some benefit in having your compiler-rt function implement memcpy?

Note that these functions become platform ABI requirements.

John.

I see, like:

bool __atomic_is_lock_free(type);

?

That's what I was thinking.

Thanks, I'll update the spec.

The builtins have exactly those names and are "overloaded"; memory ordering parameters are required to be constant expressions but not necessarily literals.

If the memory orderings are required to be constant expressions, this is really Design B, just with some renaming.

I see what you mean. I hadn't actually realized that the library actually takes these as runtime parameters; that seems very un-C++-ish to me. Certainly it's not very "pay for what you use", although I guess the presumption is that it will quickly optimize to the right thing.

Agreed. This wouldn't have been the direction I would've designed. On the other hand, I suspect the front end can handle it.

This is not a problem for libc++. However I've heard rumblings that this won't be ok for the C library. Want to comment Blaine?

Are the committees actually going to agree enough here to permit re-use?

I suspect so. There are some formidable forces on both sides aimed at making it so. I am not one of them (I don't attend C). But it is my estimation that the individuals involved in assuring C/C++ compatibility in this department will succeed.

The main advantage of requiring the parameters to be constant expressions is that the builtins don't become intimately tied to a specific library. I'm very concerned about building compiler APIs on top of competing and incomplete standards.

Very valid concern. As far as I can tell, the C and C++ committees are converging on run time memory orderings. Am I correct Blaine?

If we accept this as a runtime argument, we will invoke undefined behavior on out-of-range enumerators. Is that acceptable to C and C++?

Yes. Nasal demons encouraged. :slight_smile: This is a very low-level API. Hand holding is not expected. Ultimate efficiency is.

Arguments not of fundamental type, or not exactly matching a permitted type for the intrinsic, are not required to be supported by the implementation (so calling swap with a long* and an int, or a Base** and a Derived*, or a ConvertibleToIntPtr and a ConvertibleToInt, may or may not have sensible results).

I'm afraid my spec is ambiguous, sorry. In:

bool __atomic_compare_exchange_strong(type* atomic_obj,
                                    type* expected, type desired,
                                    int mem_success, int mem_failure);

It was my intention that in each argument that specifies "type", the types are all required to be the same.

Right, I just clarifying that.

In C++ templates, this would be:

template <class type>
bool __atomic_compare_exchange_strong(type* atomic_obj,
                                    type* expected, type desired,
                                    int mem_success, int mem_failure);

Concerning arguments "type" that are not scalars: I believe I can handle this in libc++ with some fancy template meta programming, and pass them through to the front end type-punned as scalars if there is a scalar of the appropriate size, and otherwise route them to a library-lock function. I can do this because "type" is a-priori known to be trivially copyable.

Okay. If the libraries need to work with arbitrary trivially-copyable types, we should definitely do that in the builtins instead of in the library. I'm willing to have the builtins take trivially-copyable types.

That's great. Anything else sent in is undefined behavior.

This doesn't strictly matter for you, but we would implement this in Clang by having the frontend do the ABI-compliant lowerings for the locking implementations and then emitting everything that can be done lock-free as an LLVM intrinsic call.

Does this translate to a call to compiler-rt for the locking implementations? I ask because I'm probably going to be the one to implement these in compiler-rt if they are needed.

Multiple calls, I think; an explicit acquire and release call. Unless you think there's some benefit in having your compiler-rt function implement memcpy?

Note that these functions become platform ABI requirements.

I am ignorant in this department and will happily implement whatever specs/API the front end deems necessary.

-Howard

Is that sufficient? The library spec contains a per-object is_lock_free(), presumably so that it can return false for unaligned objects. Are we confident that all such cases can be resolved in the library, so that the only query to the compiler needs to be about the underlying type?

Sebastian

Thanks for looking at this John.

I see what you mean. I'll mention this in the design doc. I believe it will be an issue for any design we choose (A, B, or C).

After some thought, I think I personally prefer proposal A, slightly modified such that __atomic_is_lock_free is a pseudofunction like __is_pod().

I see, like:

bool __atomic_is_lock_free(type);

?

That's what I was thinking.

Is that sufficient? The library spec contains a per-object is_lock_free(), presumably so that it can return false for unaligned objects. Are we confident that all such cases can be resolved in the library, so that the only query to the compiler needs to be about the underlying type?

Good question. I think we're ok on the C++ side. I don't know about the C side. In N3126 (the current draft) 29.4 [atomics.lockfree]/2 says:

The function atomic_is_lock_free (29.6) indicates whether the object is lock-free. In any given program execution, the result of the lock-free query shall be consistent for all pointers of the same type.

This part of the atomic_is_lock_free specification is separated from the rest of the atomic_is_lock_free specification by 10 pages because putting the specification all in one place would be just what the reader expected. :wink:

-Howard

The library spec says it returns true/false consistently for all objects of that type.

John.

I checked with Lawrence Crowl concerning the C side. The C spec currently says that different invocations of atomic_is_lock_free for the same pointer type may return different results. Lawrence tells me that the C spec is based on an old C++ spec and is due to be changed to be consistent with C++. Therefore I believe we're good to go with bool __atomic_is_lock_free(type).

I've updated Design A under:

http://libcxx.llvm.org/atomic_design.html

with the above change, and a few more details. I'm hearing consensus on Design A. If you feel differently please speak up. I plan on commencing work on libc++'s <atomic> immediately, assuming Design A.

One question about Design A that has yet to be discussed is the possible defaulting of memory orderings in the intrinsics. This defaulting, if it exists, is described at the very bottom of:

http://libcxx.llvm.org/atomic_design_a.html

I have no strong feeling either way. I can easily live with or without the memory order defaults. But I should either remove the "If desired" from the description (mandating the defaults), or remove the description of the defaults altogether (banning them). Please weigh in with your opinions.

Thanks,
Howard

I've also updated:

http://libcxx.llvm.org/atomic_design_a.html

with C++ implementations of each operation for the purpose of documenting the required semantics of said operations.

Corrected a type-o in __atomic_store: Its return should be void, not type.

-Howard

I checked with Lawrence Crowl concerning the C side. The C spec currently says that different invocations of atomic_is_lock_free for the same pointer type may return different results. Lawrence tells me that the C spec is based on an old C++ spec and is due to be changed to be consistent with C++. Therefore I believe we're good to go with bool __atomic_is_lock_free(type).

Excellent.

One question about Design A that has yet to be discussed is the possible defaulting of memory orderings in the intrinsics. This defaulting, if it exists, is described at the very bottom of:

http://libcxx.llvm.org/atomic_design_a.html

I have no strong feeling either way. I can easily live with or without the memory order defaults. But I should either remove the "If desired" from the description (mandating the defaults), or remove the description of the defaults altogether (banning them). Please weigh in with your opinions.

The defaulting scheme for the two-parameter versions is very complicated and precludes implementing the type-checking for these builtins with a single C++ function declaration. If we're going to allow defaulting, I think they should both default to 5.

John.

This would make it very easy to drop into undefined behavior. For example:

while (__atomic_compare_exchange_strong(&atomic_obj, &expected, desired, memory_order_acquire))
    ...

In the above example, only mem_failure is defaulted (to 5 as you suggest), and this violates the requirement that mem_failure <= mem_success (mem_success is 2).

-Howard

Sorry, took Friday off.

I'm actively corresponding with Lawrence on atomics with the goal of simplifying both standards - and of course interoperable semantics and in fact some limited interoperable syntax as well.

I wasn't, however, aware of the above issue so I'll check that out.

Blaine

Here's a brief C++ simulation of the "default behavior" that the C++ <atomic> header is specified to have for compare_exchange_strong and compare_exchange_weak. This functionality either goes into the lib, or into the intrinsic. Which is best, I don't really have an opinion. But if the intrinsics are to have defaults, I think they should match the behavior of the standard.

The demo below models undefined behavior with an assert:

#include <iostream>
#include <cassert>

typedef enum memory_order
{
    memory_order_relaxed, memory_order_consume, memory_order_acquire,
    memory_order_release, memory_order_acq_rel, memory_order_seq_cst
} memory_order;

int
translate_memory_order(int o)
{
    switch (o)
    {
    case 4:
        return 2;
    case 3:
        return 0;
    }
    return o;
}

void test(int mem_success = memory_order_seq_cst)
{
    int mem_failure = translate_memory_order(mem_success);
    std::cout << "mem_success = " << mem_success << '\n';
    std::cout << "mem_failure = " << mem_failure << '\n';
    assert(mem_success >= mem_failure);
    assert(mem_failure != memory_order_release);
    assert(mem_failure != memory_order_acq_rel);
    std::cout << "------------------------\n";
}

void test(int mem_success, int mem_failure)
{
    std::cout << "mem_success = " << mem_success << '\n';
    std::cout << "mem_failure = " << mem_failure << '\n';
    assert(mem_success >= mem_failure);
    assert(mem_failure != memory_order_release);
    assert(mem_failure != memory_order_acq_rel);
    std::cout << "------------------------\n";
}

int main()
{
    std::cout << "test()\n";
    test();
    std::cout << "test(memory_order_relaxed)\n";
    test(memory_order_relaxed);
    std::cout << "test(memory_order_consume)\n";
    test(memory_order_consume);
    std::cout << "test(memory_order_acquire)\n";
    test(memory_order_acquire);
    std::cout << "test(memory_order_release)\n";
    test(memory_order_release);
    std::cout << "test(memory_order_acq_rel)\n";
    test(memory_order_acq_rel);
    std::cout << "test(memory_order_seq_cst)\n";
    test(memory_order_seq_cst);
    std::cout << "test(memory_order_acquire, memory_order_consume)\n";
    test(memory_order_acquire, memory_order_consume);
    std::cout << "test(memory_order_acquire, memory_order_seq_cst)\n";
    test(memory_order_acquire, memory_order_seq_cst);
}

test()
mem_success = 5
mem_failure = 5