Atomic Operation and Synchronization Proposal v2

Hello,

This is the second major revision of the atomic proposal for LLVM. I
will try and give a brief overview of the motivating changes, but a
greater portion of the text has changed, along with some changes to
the proposed additions.

http://chandlerc.net/llvm_atomics.html

- The proposal has been rewritten to better delineate the goals and
purposes of LLVM, and of these additions to LLVM. The why and the
intended purpose have been a source of confusion, and this is addressed
directly.
- The explanation of memory barriers and why the representation was
chosen has been reworked to address concerns raised.
- The explanation of load-linked, store-conditional atomic
operations has been expanded and reworded to be more clear and
precise. The paper covering the theoretical details has been
added to the references.
- Both the hardware survey and all implementation tables have been
updated to include Alpha architecture stuff.
- The entire set of additions has been moved to intrinsic functions
instead of instructions. If these merit and can benefit from moving up
to instructions, they may, but this eases the implementation burden
significantly.
- The types supported have been narrowed to only integers. The sizes
of integers supported are determined by the target implementation.
- GCC builtins have been added with rough outlines of how they could
be lowered to the proposed intrinsic functions.
- Other minor (and not-so-minor) cosmetic cleaning took place to make
these integrate smoothly and present a more cohesive proposal.

A few things were not changed because they can easily be added later
if and when desired, and would complicate the initial implementation.
Incremental is good!

- Vector type support
- Other type support
- Acquire and release implied memory constraints on atomic operations

This revision grew directly out of the feedback provided. Please
continue to review this, and hopefully we will see a first-rate
representation of these constructs in LLVM soon.

Thank you again,
-Chandler Carruth

Here are some comments, quotes are from the draft.

> an operation based constraint cannot guard other operations

I think constraints associated with a particular instruction usually apply
to this instruction and previous/subsequent instructions, so this wouldn't
be true. This is the case in the atomic_ops model, and also on ia64 I
think.

> The single instruction constraints can, at their most flexible, constrain
> any set of possible pairings of loads from memory and stores to memory

I'm not sure about this, but can we get issues due to "special" kinds of
data transfers (such as vector stuff, DMA, ...)? Memcpy implementations
could be one thing to look at.
This kind of breaks down to how universal you want the memory model to be.

Regarding swap(): Which uses do you have in mind? To me, support for TAS
would be more useful as it is used for spinlocks (and no concurrent algo
immediately comes to my mind that uses swap() and couldn't use TAS). The
difference is that TAS can have a weaker interface, because it operates on
a boolean value only, making it easier to map on different hardware (and
TAS is faster than CAS on some architectures I think). For example, for TAS
a byte is sufficient, whereas with swap you probably would require that the
exchange has machine-word size.

> These implementations assume a very conservative interpretation.
> result = __sync_fetch_and_add( <ty>* ptr, <ty> value )
>   call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
>   %result = call <ty> @llvm.atomic.las( <ty>* %ptr, <ty> %value )

Shouldn't you have a second membar after the las() to be very conservative
(i.e., if las() is supposed to really be linearizable)? Otherwise, the
effects of the las() can be reordered with respect to effects of subsequent
instructions.

This also shows that you get additional overhead in the codegen if barriers
are not associated with an operation: To emit efficient code, codegen would
have to check whether membars are next to an operation, and whether they
can be merged with the operation. If codegens don't do it, then runtime
overhead will be higher due to the unnecessary barriers. This also implies
that there is no reordering of las() with other operations until codegen
phase, so you would at least have to have some sort of compiler barrier, or
require the las() target to be volatile (do LLVM volatile ordering
guarantees apply to non-volatile loads/stores also, or just to volatile
ones?)

If you use the other approach instead (ordering constraint attached to
instructions), then you have to support more intrinsics, but you don't need
this kind of analysis, and you wouldn't need compiler reordering barriers
assuming that they are implicit for anything that carries any reordering
constraint.

I would guess that with the second approach (constraints for operations),
codegen implementations could be actually easier because you can
concentrate on just one operation with constraints.

> result = __sync_fetch_and_or( <ty>* ptr, <ty> value )

or/and/... should be added to the list of supported intrinsics at some time.
x86 has built-in support for that and doesn't need the CAS loop.

Torvald

Right. For example, the Cray X1 has a much richer set of memory ordering
instructions than anything on the commodity micros:

http://tinyurl.com/3agjjn

The memory ordering intrinsics in the current llvm proposal can't take
advantage of them because they are too coarse-grained.

Now, I don't expect we'll see an llvm-based X1 code generator, but looking at
what the HPC vendors are doing in this area will go a long way toward
informing the kind of operations we may want to include in llvm. The trend is
for vendors to include ever more finely targeted semantics to allow scaling to
machines with millions of cores.

If we can incrementally refine the size of the memory ordering hammers, I'm
ok with that. If it's simply a matter of adding finer-grained intrinsics
later, that's cool. But I don't want to get us into a situation where llvm
requires stricter memory ordering than is strictly necessary and we can't get
out from under the stone.

                                                      -Dave

I guess the descriptions on that page are, heh, a little terse ;-). The
Cray X1 has a dimension of synchronization that isn't covered in this
proposal, and that's the set of observers that need to observe the ordering.
For example you can synchronize a team of streams in a multi-streaming
processor without requiring that the ordering of memory operations be
observed by the entire system. That's what motivates most of the variety
in that list.

There's one other specific aspect I'd like to point out here. There's an
"acquire" which orders prior *scalar* loads with *all* subsequent memory
accesses, and a "release" which orders *all* prior accesses with subsequent
*scalar* stores. The Cray X1's interest in distinguishing scalar accesses
from vector accesses is specific to its architecture, but in general, it
is another case that motivates having more granularity than just "all
loads" and "all stores".

Overall though, I'm quite happy to see that the newest revision of the
proposal has switched from LLVM instructions to LLVM intrinsics. That
will make it easier to experiment with extensions in the future. And
having the string "atomic" right there in the names of each operation
is very much appreciated :-).

Dan

> Here are some comments, quotes are from the draft.

> > an operation based constraint cannot guard other operations

> I think constraints associated with a particular instruction usually apply
> to this instruction and previous/subsequent instructions, so this wouldn't
> be true. This is the case in the atomic_ops model, and also on ia64 I
> think.

"guard other operations" means guarding operations other than the
atomic intrinsic functions. Specifically, loads from and stores to
shared memory are perfectly legal, and inherently atomic. However,
they might need memory synchronization routines. Tying synchronization
to only operations would force typing them to nearly every operation.
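
For illustration, here is a minimal sketch of that point: a producer
publishing data through two ordinary stores, with only a standalone
barrier between them. The flag order (load-load, load-store,
store-load, store-store) is an assumption about the membarrier
arguments, not something this thread pins down:

    ; publish %val, then set a ready flag; only store-store ordering
    ; is needed between the two plain stores
    store i32 %val, i32* %data
    ; assumed flag order: ( ll, ls, sl, ss )
    call void @llvm.atomic.membarrier( i1 false, i1 false, i1 false, i1 true )
    store i32 1, i32* %ready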

All that aside, as I stated in the proposal, if the _hardware_ demands
it, these atomic operations could be extended to provide built-in
memory synchronization semantics. A standalone memory barrier would
still be required (precisely as it is in hardware), but lowerings
could use the potentially better optimized versions of the operations.
I see this as an extension of the existing proposal, not a problem.

> > The single instruction constraints can, at their most flexible, constrain
> > any set of possible pairings of loads from memory and stores to memory

> I'm not sure about this, but can we get issues due to "special" kinds of
> data transfers (such as vector stuff, DMA, ...)? Memcpy implementations
> could be one thing to look at.
> This kind of breaks down to how universal you want the memory model to be.

Issues? Perhaps. On the target architectures, these barriers would be
treated conservatively. *All* memory accesses are protected. If the
hardware provides atypical memory access mechanisms, it should also
provide mechanisms for protecting those. These memory barriers would
then be implemented to utilize those. The model proposed is extremely
universal, because it is extremely general. Implementations for
specialized hardware should take a conservative approach. If target
architectures need more specialized memory constraint constructs,
these could again be added incrementally in order to better support
fine tuning those architecture's performance characteristics.

> Regarding swap(): Which uses do you have in mind? To me, support for TAS
> would be more useful as it is used for spinlocks (and no concurrent algo
> immediately comes to my mind that uses swap() and couldn't use TAS). The
> difference is that TAS can have a weaker interface, because it operates on
> a boolean value only, making it easier to map on different hardware (and
> TAS is faster than CAS on some architectures I think). For example, for TAS
> a byte is sufficient, whereas with swap you probably would require that the
> exchange has machine-word size.

I cannot express this enough. This is *not*, repeat *not* an API.
Programmers will *not* be using these constructs directly. This is an
interface to the hardware. So what is more useful to algorithms is not
the primary consideration. Rather, what the hardware provides for
algorithms is.

Furthermore, TAS on most architectures is implemented with a swap.
Specifically, the architecture with the closest thing to a "more
efficient" test and set is x86, and the atomic_ops package, which I
did look at, uses "xchgb". This is precisely the lowering x86 would
provide for the LLVM intrinsic function "i8 @llvm.atomic.swap( i8*
%ptr, i8 swap )". Thus it makes perfect sense for LLVM to support a
"swap" intrinsic function in order to implement an API which provides
test-and-set locks. I cannot see the benefit of limiting the LLVM
representation to that of the API, when a common abstraction of how to
implement that API is available on every architecture targeted.

Lastly, most of the architectures which support any size selection at
all support it for all of the instructions necessary to implement
these operations. 'cas' and 'swap' will lower to single-byte
instructions on x86, for example.
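
As a hedged sketch of that lowering path, a test-and-set style
spinlock acquire built on the proposed swap intrinsic might look like
the following; on x86 the swap call could lower to a single "xchgb",
and the barrier flags again assume the ( ll, ls, sl, ss ) order:

    define void @spin_acquire( i8* %lock ) {
    entry:
      br label %try
    try:
      ; returns the previous lock value; 0 means the lock was free
      %old = call i8 @llvm.atomic.swap( i8* %lock, i8 1 )
      %was_free = icmp eq i8 %old, 0
      br i1 %was_free, label %locked, label %try
    locked:
      ; a standalone barrier supplies the acquire ordering
      call void @llvm.atomic.membarrier( i1 true, i1 true, i1 false, i1 false )
      ret void
    }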

> > These implementations assume a very conservative interpretation.
> > result = __sync_fetch_and_add( <ty>* ptr, <ty> value )
> >   call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
> >   %result = call <ty> @llvm.atomic.las( <ty>* %ptr, <ty> %value )

> Shouldn't you have a second membar after the las() to be very conservative
> (i.e., if las() is supposed to really be linearizable)? Otherwise, the
> effects of the las() can be reordered with respect to effects of subsequent
> instructions.

You are probably right here. It was very late, and as mentioned, the
GCC spec is extremely ambiguous on the precise semantics for these
intrinsics. I'm going to move to what I think are more appropriate
semantics for the intrinsics, rather than trusting GCC's descriptions.
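
Concretely, the fully conservative lowering would then bracket the
operation with a barrier on each side (a sketch; the flags simply
repeat the all-true form quoted above):

    call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
    %result = call <ty> @llvm.atomic.las( <ty>* %ptr, <ty> %value )
    call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )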

> This also shows that you get additional overhead in the codegen if barriers
> are not associated with an operation: To emit efficient code, codegen would
> have to check whether membars are next to an operation, and whether they
> can be merged with the operation. If codegens don't do it, then runtime
> overhead will be higher due to the unnecessary barriers. This also implies
> that there is no reordering of las() with other operations until codegen
> phase, so you would at least have to have some sort of compiler barrier, or
> require the las() target to be volatile (do LLVM volatile ordering
> guarantees apply to non-volatile loads/stores also, or just to volatile
> ones?)

This is not true in any situation for any architecture but Itanium.
Furthermore, given the specs of GCC's intrinsics, it is not true even
for Itanium in this specific situation. The GCC specification (such
that it is) for these intrinsics indicates that they must form a "full
barrier". Itanium's instruction-based barriers cannot provide this
functionality, forcing it to fall back on a full fence on either side
of the instruction. Is this poor code? Yes it is, but it is required
to match the semantics provided by GCC.

On every other architecture, there _are_ no instruction-based
barriers. x86 almost has instruction-based barriers, as for some chips
_every_ instruction is a full barrier, but this is not what you are
aiming for either, and it is slowly changing in newer x86 implementations.
For the other architectures, memory barriers _are_ separate instructions,
and x86 too has separate instructions for memory barriers.

The memory barriers, as proposed, will provide all necessary barriers
against compiler reordering.

> If you use the other approach instead (ordering constraint attached to
> instructions), then you have to support more intrinsics, but you don't need
> this kind of analysis, and you wouldn't need compiler reordering barriers
> assuming that they are implicit for anything that carries any reordering
> constraint.

> I would guess that with the second approach (constraints for operations),
> codegen implementations could be actually easier because you can
> concentrate on just one operation with constraints.

As stated above, this assumes that all architectures work this way,
which they simply do not. This is a hardware interface, and as such
follows the hardware. The codegen will therefore not be easier or
harder in either of these cases.

> > result = __sync_fetch_and_or( <ty>* ptr, <ty> value )

> or/and/... should be added to the list of supported intrinsics at some time.
> x86 has built-in support for that and doesn't need the CAS loop.

It doesn't have built-in support. These are "fetch and *" operations.
x86 has support for an atomic "or" to memory which does not include
any fetch. x86 has "xadd", which allows fetch-and-add, and fetch-and-sub
(through negation), without the compare-and-swap; however, neither
"or" nor "and" is available in this fashion. Perhaps I am missing
something in the x86 instruction set, but I can't find any other way
to perform these operations. The many non-fetching atomic operations
in x86 are geared more toward dispatching and joining threads than
toward synchronizing contending threads.
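
For reference, a sketch of the CAS loop such a lowering needs. The
compare-and-swap intrinsic's exact name and operand order are my
assumptions (spelled after the draft's 'cas' operation), and the plain
load is fine because, per the proposal, ordinary loads are inherently
atomic:

    define i32 @fetch_and_or( i32* %ptr, i32 %mask ) {
    entry:
      br label %retry
    retry:
      %old = load i32* %ptr
      %new = or i32 %old, %mask
      ; assumed spelling/signature for the draft's 'cas' operation
      %seen = call i32 @llvm.atomic.cas( i32* %ptr, i32 %old, i32 %new )
      %ok = icmp eq i32 %seen, %old
      br i1 %ok, label %done, label %retry
    done:
      ret i32 %old
    }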

-Chandler Carruth

> > > The single instruction constraints can, at their most flexible, constrain
> > > any set of possible pairings of loads from memory and stores to memory
> >
> > I'm not sure about this, but can we get issues due to "special" kinds of
> > data transfers (such as vector stuff, DMA, ...)? Memcpy implementations
> > could be one thing to look at.
> > This kind of breaks down to how universal you want the memory model to be.

> Right. For example, the Cray X1 has a much richer set of memory ordering
> instructions than anything on the commodity micros:
>
> http://tinyurl.com/3agjjn

Thanks for this link! Very interesting to see an architecture which
pays much more attention to its memory ordering.

> The memory ordering intrinsics in the current llvm proposal can't take
> advantage of them because they are too coarse-grained.

From what I can glean, this coarseness comes in two flavors -- global
v. local memory access, and type-based granularities. Is this a
correct interpretation? (I'm clearly not going to be an expert on the
X1. ;])

> Now, I don't expect we'll see an llvm-based X1 code generator, but looking at
> what the HPC vendors are doing in this area will go a long way toward
> informing the kind of operations we may want to include in llvm. The trend is
> for vendors to include ever more finely targeted semantics to allow scaling to
> machines with millions of cores.

Absolutely! Like I said, it's great to see this kind of information. A
few points about the current proposal:

1) It currently only deals with integers in order to make it simple to
implement, and representable across all architectures. While this is
limiting, I think it remains a good starting point, and shouldn't
cause any problems for later expansion to more type-aware
interpretations.

2) The largest assumption made is that all memory is just "memory".
After that, the most fine-grained interpretation of barriers available
was chosen (note that only SPARC can do all the various
combinations... most only use one big fence...). The only major thing
I can see that would increase this granularity is to treat different
types differently, or treat them as going into different parts of
"memory"? Really not sure here, but it definitely is something to look
into. However, I think this may require a much later proposal when the
hardware is actively being used at this level, and we can try and find
a more fine-grained way of targeting all the available architectures.
For the time being, it seems that the current proposal hits all the
architectures very neatly.

> If we can incrementally refine the size of the memory ordering hammers, I'm
> ok with that. If it's simply a matter of adding finer-grained intrinsics
> later, that's cool. But I don't want to get us into a situation where llvm
> requires stricter memory ordering than is strictly necessary and we can't get
> out from under the stone.

With the current version you can specify exactly what ordering you
desire. The only thing ignored is the type of the various loads and
stores. I think adding that level of granularity to the existing
highly granular pairing selection would be a smooth incremental
update. Is there another update you see needed that would be less
smooth?

Again, thanks for the information on the X1's memory architecture,
very interesting... I'm going to try and get into it a bit more in a
response to Dan Gohman's email below... =]

-Chandler

>
> > > > The single instruction constraints can, at their most flexible, constrain
> > > > any set of possible pairings of loads from memory and stores to memory
> > >
> > > I'm not sure about this, but can we get issues due to "special" kinds of
> > > data transfers (such as vector stuff, DMA, ...)? Memcpy implementations
> > > could be one thing to look at.
> > > This kind of breaks down to how universal you want the memory model to be.
> >
> > Right. For example, the Cray X1 has a much richer set of memory ordering
> > instructions than anything on the commodity micros:
> >
> > http://tinyurl.com/3agjjn
> >
> > The memory ordering intrinsics in the current llvm proposal can't take
> > advantage of them because they are too coarse-grained.

> I guess the descriptions on that page are, heh, a little terse ;-).

A bit. ;] I was glad to see your clarification.

> The
> Cray X1 has a dimension of synchronization that isn't covered in this
> proposal, and that's the set of observers that need to observe the ordering.
> For example you can synchronize a team of streams in a multi-streaming
> processor without requiring that the ordering of memory operations be
> observed by the entire system. That's what motivates most of the variety
> in that list.

This is fascinating to me, personally. I don't know how reasonable it
is to implement directly in LLVM, however, could a codegen for the X1
in theory establish if the "shared memory" was part of a stream in a
multi-streaming processor, and use those local synchronization
routines? I'm not sure how reasonable this is. Alternatively, to
target this specific of an architecture, perhaps the LLVM code could
be annotated to show where it is operating on streams, versus across
processors, and allow that to guide the codegen decision as to which
type of synchronization to utilize. As LLVM doesn't really understand
the parallel implementation the code is running on, it seems like it
might be impossible to build this into LLVM without it being
X1-type-system specific... but perhaps you have better ideas how to do
such things from working on it for some time?

> There's one other specific aspect I'd like to point out here. There's an
> "acquire" which orders prior *scalar* loads with *all* subsequent memory
> accesses, and a "release" which orders *all* prior accesses with subsequent
> *scalar* stores. The Cray X1's interest in distinguishing scalar accesses
> from vector accesses is specific to its architecture, but in general, it
> is another case that motivates having more granularity than just "all
> loads" and "all stores".

This clarifies some of those instructions. Here is my thought on how
to fit this behavior in with the current proposal:

You're still ordering load-store pairings; there is just the added
dimensionality of types. This seems like an easy extension to the
existing proposal to combine the load and store pairings with a type
dimension to achieve finer-grained control. Does this make sense as an
incremental step from your end with much more experience comparing
your hardware to LLVM's IR?

> Overall though, I'm quite happy to see that the newest revision of the
> proposal has switched from LLVM instructions to LLVM intrinsics. That
> will make it easier to experiment with extensions in the future. And
> having the string "atomic" right there in the names of each operation
> is very much appreciated :-).

The atomic in the name is nice. It does make the syntax a bit less
elegant, but it'll get the ball rolling faster, and that's far more
important! Thanks for the input, and I really love the X1 example for
a radically different memory model from the architectures LLVM is
currently targeting.

-Chandler Carruth

> > > Right. For example, the Cray X1 has a much richer set of memory
> > > ordering instructions than anything on the commodity micros:
> > >
> > > http://tinyurl.com/3agjjn
> > >
> > > The memory ordering intrinsics in the current llvm proposal can't take
> > > advantage of them because they are too coarse-grained.
> >
> > I guess the descriptions on that page are, heh, a little terse ;-).

> A bit. ;] I was glad to see your clarification.

Yeah, sorry. I was heading out the door to a meeting when I posted the link.
I'm glad Dan clarified some things. Unfortunately, I could only link to
publicly-available documents. Our internal ISA book explains this all much
better. :)

> > The
> > Cray X1 has a dimension of synchronization that isn't covered in this
> > proposal, and that's the set of observers that need to observe the ordering.
> > For example you can synchronize a team of streams in a multi-streaming
> > processor without requiring that the ordering of memory operations be
> > observed by the entire system. That's what motivates most of the variety
> > in that list.

> This is fascinating to me, personally. I don't know how reasonable it
> is to implement directly in LLVM, however, could a codegen for the X1
> in theory establish if the "shared memory" was part of a stream in a
> multi-streaming processor, and use those local synchronization
> routines?

Absolutely. The X1 compiler is responsible for partitioning loops to
run on multiple streams and synchronizing among the streams as
necessary. That synchronization is at a level "above" general system
memory ordering. The X1 has multiple levels of parallelism:

- Vectorization

- Decoupled vector/scalar execution (this is where the lsyncs come in)

- Multistreaming (the msync operations)

- Multiprocessing (global machine-wide synchronization via gsync)

The compiler is basically responsible for the first three levels while the
user does the fourth via MPI, OpenMP, CAF, UPC, etc. Sometimes the
user inserts directives to help the compiler with 1-3, but in general
the compiler gets a lot of cases on its own automatically.

> I'm not sure how reasonable this is. Alternatively, to
> target this specific of an architecture, perhaps the LLVM code could
> be annotated to show where it is operating on streams, versus across
> processors, and allow that to guide the codegen decision as to which
> type of synchronization to utilize. As LLVM doesn't really understand
> the parallel implementation the code is running on, it seems like it
> might be impossible to build this into LLVM without it being
> X1-type-system specific... but perhaps you have better ideas how to do
> such things from working on it for some time?

In a parallelizing compiler, the compiler must keep track of where it placed
data when it parallelized code as it must know how to handle dependencies
and insert synchronizations. In the case of the X1, the compiler partitioned
a loop to run on multiple cores, so it knows to use msyncs when that code
accesses data shared among the cores. The compiler determined which
data to share among the cores and which to keep private in each core.
Similarly, when it vectorizes, it knows the dependencies between vector and
scalar operations and inserts the necessary lsyncs.

PGI, Pathscale and Intel, for example, are starting to talk about automatic
OpenMP. They will need to insert synchronizations across cores similarly to
what's done on the X1. Those will probably be some form of MFENCE.

The abstraction here is the "level" of parallelism. Vectorization is very
fine-grained. Most implementations in hardware do not need explicit software
syncs between scalar and vector code.

The next level up is multithreading (we call that multistreaming on the X1
for historical reasons). Depending on architecture, this could happen within
a single core (MTA style) or across multiple cores (X1 style), providing two
distinct levels of parallelism and possibly two distinct sets of sync
instructions in the general case.

Then you've got a level of parallelism around the set of sockets that are
cache coherent (so-called "local" processors, or a "node" in X1 parlance).
You might have another set of sync instructions for this (the X1 does not).
Then you have the most general case of parallelism across "the system"
where communication time between processors is extremely long. This is
the "gsync" level on the X1.

Other more sophisticated architectures may have even more levels of
parallelism.

So in thinking about extending your work (which again, I want to stress is
not immediately necessary, but still good to think about), I would suggest
we think in terms of level of parallelization or perhaps "distance among
participants." It's not a good idea to hard-code things like "vector-scalar
sync" but I can imagine intrinsics that say, "order memory among these
participants," or "order memory at this level," or "order memory between these
levels," where the levels are defined by the target architecture. If a target
doesn't have as many levels as used in the llvm code, then it can just choose
to use a more expensive sync instruction. In X1 terms, a gsync is a really
big hammer, but it can always be used in place of an lsync.
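
To make that concrete, one purely hypothetical spelling (the name and
the level parameter are invented here for illustration, not part of
the proposal) would thread a target-defined level through the existing
barrier flags:

    ; level numbering defined by the target; on the X1 something like
    ; 0 = lsync, 1 = msync, 2 = gsync
    call void @llvm.atomic.membarrier.level( i32 1, i1 true, i1 true, i1 true, i1 true )

A target with fewer levels than the code names would simply round up
to its nearest stronger barrier, as described above.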

I don't know if any plans exist to incorporate parallelizing transformations
into llvm, but I can certainly imagine building an auto-parallelizing
infrastructure above it. That infrastructure would have to communicate
information down to llvm so it could generate code properly. How to do that
is an entirely different can of worms. :)

> > There's one other specific aspect I'd like to point out here. There's an
> > "acquire" which orders prior *scalar* loads with *all* subsequent memory
> > accesses, and a "release" which orders *all* prior accesses with
> > subsequent *scalar* stores. The Cray X1's interest in distinguishing
> > scalar accesses from vector accesses is specific to its architecture, but
> > in general, it is another case that motivates having more granularity
> > than just "all loads" and "all stores".

> This clarifies some of those instructions. Here is my thought on how
> to fit this behavior in with the current proposal:
>
> You're still ordering load-store pairings; there is just the added
> dimensionality of types. This seems like an easy extension to the
> existing proposal to combine the load and store pairings with a type
> dimension to achieve finer-grained control. Does this make sense as an
> incremental step from your end with much more experience comparing
> your hardware to LLVM's IR?

This would work for X1-style lsyncs, but we should think about whether this is
too architecture-specific. Decoupled execution doesn't fit completely snugly
into the "levels of parallelism" model I outlined above, so it's a bit of an
oddball. It's parallelism, but of a different form. Commodity micros have
decoupled execution but they handle syncs in hardware (thus moving to/from
a GPR and XMM is expensive).

The X1 fsync falls into the same category. It's there because the X1 does not
have precise traps for floating point code and doesn't really have anything to
do with parallelization. Ditto isync (all modern processors have some form
of this to guard against self-modifying code).

The bottom line is that I don't have easy cut-and-dry answers. I suspect this
will be an organic process and we'll learn how to abstract these things in an
architecture-independent manner as we go.

                                                          -Dave

I take that back. Maybe.

If by "type" you literally mean the data type of the value (int, float, etc.)
and the extent of the data (vector or scalar), then it won't handle the
X1 case where integer scalar instructions feed floating point vector
instructions and similar combinations.

If by "type" you only mean the extent of the data, then it would work fine.

                                         -Dave

Please ignore this statement; keep get-and-set. I should implement more
locks ...
Nevertheless, I'd still like to see test-and-set.

torvald

> > Here are some comments, quotes are from the draft.
> >
> > > an operation based constraint cannot guard other operations
> >
> > I think constraints associated with a particular instruction usually
> > apply to this instruction and previous/subsequent instructions, so this
> > wouldn't be true. This is the case in the atomic_ops model, and also on
> > ia64 I think.

"guard other operations" means guarding operations other than the
atomic intrinsic functions. Specifically, loads from and stores to
shared memory are perfectly legal, and inherently atomic. However,
they might need memory synchronization routines. Tying synchronization
to only operations would force typing them to nearly every operation.

You can provide intrinsics for load/store (and perhaps vector stuff, or
anything else that goes to "memory"), in the same way you do for CAS etc.
This of course assumes that everything besides memory (as seen from the
LLVM IR perspective) is thread-private.

> memory synchronization semantics. A standalone memory barrier would
> still be required (precisely as it is in hardware), but lowerings
> could use the potentially better optimized versions of the operations.
> I see this as an extension of the existing proposal, not a problem.

I know that we're discussing a detail here, but it could be important. I
would say that you don't need standalone barriers, because barriers are
always used to order operations that interact with shared memory. We see
all operations that interact with memory (loads/stores/...). If we
didn't, then one could argue that LLVM's view of memory is missing
something.
For example, reading from a real-time clock could be such a case. There, you
could want to make sure that a store is visible before you get a timestamp.
But in this case, the clock should actually be something like a read-only
memory location.

> > Regarding swap(): Which uses do you have in mind? To me, support for
> > TAS would be more useful as it is used for spinlocks (and no concurrent
> > algo immediately comes to my mind that uses swap() and couldn't use
> > TAS). The difference is that TAS can have a weaker interface, because
> > it operates on a boolean value only, making it easier to map on
> > different hardware (and TAS is faster than CAS on some architectures I
> > think). For example, for TAS a byte is sufficient, whereas with swap
> > you probably would require that the exchange has machine-word size.

> I cannot express this enough. This is *not*, repeat *not* an API.
> Programmers will *not* be using these constructs directly. This is an
> interface to the hardware. So what is more useful to algorithms is not
> the primary consideration. Rather, what the hardware provides for
> algorithms is.

I do not agree at all. Let me explain why.
Assume I want to implement a concurrent algorithm that essentially
describes how instructions can be interleaved. You have thread-private
instructions that can be ignored, so only consider the ones that are used
to communicate between threads. These are atomic.
Up to now the set of operations (or possible invocations of operations) is
ordered from the perspective of a single thread (so, one could assume that
a thread sees its own previous updates), but they're not sequentially
consistent or anything. So I then add reordering constraints to enforce the
orderings that are necessary for correctness. One can certainly argue about
the interface here, but it will consist of load/store, test-and-set,
get-and-set, compare-and-set, fetch-and-add/sub/inc/dec (these are the ones
that I have seen being used in concurrent algorithms), and more that I am
missing. The point here is that this is an API.
As a next step, I have to select implementations for the atomic ops in the
algorithm, and somehow declare the reordering constraints.

On the other side (at the hardware), there are instructions that I can use
and that I cannot use, and some means of specifying barriers. Plus all the
weird, architecture-specific things: do we need a certain alignment for
data that atomic ops operate on? Up to which size can operands be?

Now the question for us is where to put the interface at, or how many
interfaces do we need?

What you seem to favour is that we need the interface right at the hardware
level (what the hardware provides). I don't like this because it doesn't
help me in mapping my algorithmic view to hardware, especially because it
doesn't weaken the dependency on which architecture my program is going to
run on.
There is obviously a trade-off between 1) doing the mapping completely at
the application level and 2) doing it at the LLVM level.

For the latter, you wouldn't need to offer more than a load and a CAS (you
can't get more universal than that). But if you do that, you're slow, and
developers will not be able to use LLVM's atomic ops if performance matters
(it does most of the time for the people that are doing concurrency stuff).

If you're doing the mapping at the application level, then you select
implementations in an architecture-dependent way (in this case, at the
front-end/preprocessor). But then, where is the use in having LLVM atomic
ops? LLVM handles ASM quite well, so I can just use, e.g., the atomic_ops
package.

So, IMO everything is about the API. You need to map between the abstract
view and the hardware offerings somewhere, and you don't want to do that at
the application level.
So we can't dodge this problem; we need to find a good spot in the
trade-off. If we don't, then it's going to be of limited use to the people
who actually need it.

Besides this single-interface view, one could also add several interfaces.
For example, the one you propose closer to the hardware, and a library /
gcc intrinsics / ... at the application level. If you take this approach, then
you could assume that the interface close to the application puts serious
effort into selecting instructions, and that any coarsening abstraction at
the lower interface will just lead to slowdown. If it doesn't, then we're
back at the previous scenario.
This scenario might also cover the case in which a user compiles some sort
of parallelism language directly to LLVM, but I'm not sure about this.

> > This also shows that you get additional overhead in the codegen if
> > barriers are not associated with an operation: To emit efficient code,
> > codegen would have to check whether membars are next to an operation,
> > and whether they can be merged with the operation. If codegens don't do
> > it, then runtime overhead will be higher due to the unnecessary
> > barriers. This also implies that there is no reordering of las() with
> > other operations until codegen phase, so you would at least have to
> > have some sort of compiler barrier, or require the las() target to be
> > volatile (do LLVM volatile ordering guarantees apply to non-volatile
> > loads/stores also, or just to volatile ones?)

> This is not true in any situation for any architecture but Itanium.
> Furthermore, given the specs of GCC's intrinsics, it is not true even
> for Itanium in this specific situation. The GCC specification (such
> that it is) for these intrinsics indicates that they must form a "full
> barrier". Itanium's instruction-based barriers cannot provide this
> functionality, forcing it to fall back on a full fence on either side
> of the instruction. Is this poor code? Yes it is, but it is required
> to match the semantics provided by GCC.

atomic_ops has the following generalization for IA64 CAS_full:
# define AO_compare_and_swap_full(addr, old, new_val) \
        (AO_nop_full(), AO_compare_and_swap_acquire(addr, old, new_val))
So, only one full barrier and one "built-in", but that is not the point I
want to make. The thing I'd like to highlight is that the best
implementation is not straightforward, and that you will have to combine
membars with operations at least on some architectures in order to get good
performance.

> On every other architecture, there _are_ no instruction-based
> barriers. x86 almost has instruction-based barriers, as for some chips
> _every_ instruction is a full barrier, but this is not what you are
> aiming for either, and it is slowly changing in newer x86 implementations.
> For the other architectures, memory barriers _are_ separate instructions,
> and x86 too has separate instructions for memory barriers.

The x86 case is another example. To be fast, you would need to merge the
full barriers with the instruction (because the instruction already does
it...). Doing this in the backends requires quite some effort, I'd say. I'm
not sure how this will develop in the future, though.

Torvald