Proposal for atomic and synchronization instructions

Hello,

After a fair amount of research and work, I have put together a
concrete proposal for LLVM representations of atomic operations and
synchronization constructs. These aim to provide the minimal
functionality in the IR for representing the hardware constructs that
threading libraries and parallel programming rely on.

http://chandlerc.net/llvm_atomics.html

While I am no expert on the various architectures, I've done my best
to provide baseline implementations for each instruction. I am sure
these will need tweaking and fixing, but they should provide a very good
starting point for the targets.

Comments are more than welcome, especially suggestions for
improvement. I hope this provides a sound background and a good
starting place for bringing these features to LLVM. Many thanks also
go to Reid who has helped tremendously with putting this together, and
several of the people at Aerospace who helped in the research phase.

-Chandler Carruth

Chandler Carruth wrote:

Hello,

After a fair amount of research and work, I have put together a
concrete proposal for LLVM representations of atomic operations and
synchronization constructs. These aim to provide the minimal
functionality in the IR for representing the hardware constructs that
threading libraries and parallel programming rely on.

http://chandlerc.net/llvm_atomics.html

While I am no expert on the various architectures, I've done my best
at providing base-line implementations for each instruction. I am sure
these will need tweaking and fixing, but should provide a very good
starting point for the targets.

Comments are more than welcome, especially suggestions for
improvement. I hope this provides a sound background and a good
starting place for bringing these features to LLVM. Many thanks also
go to Reid who has helped tremendously with putting this together, and
several of the people at Aerospace who helped in the research phase.
  

This looks good; this is basically what we came up with for the SVA-OS work.

Some comments:

1) You may want to consider adding atomic load-<bitwise operation>-store
instructions in addition to load-<add/subtract> instructions. The Linux
kernel uses these for atomic bit setting/clearing, and on many systems
they can be implemented more efficiently using special assembly
instructions. That said, I have never run experiments to verify that
such instructions are critical to kernel performance.
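A minimal sketch of the kind of load-&lt;bitwise operation&gt;-store John describes, using GCC's `__sync` builtins purely as stand-ins for the proposed instructions (the builtin names are not part of the proposal). On x86 the OR-based version can lower to a single LOCK BTS; elsewhere it becomes an LL/SC or compare-and-swap loop.

```c
#include <assert.h>

/* Atomically set bit `bit` in *word; return whether it was already set.
 * __sync_fetch_and_or is a GCC/Clang builtin, used here only as a
 * stand-in for a proposed atomic load-or-store instruction. */
static int atomic_test_and_set_bit(unsigned long *word, unsigned bit)
{
    unsigned long mask = 1UL << bit;
    unsigned long old = __sync_fetch_and_or(word, mask);
    return (old & mask) != 0;
}

/* Atomically clear bit `bit` in *word; return whether it was set. */
static int atomic_test_and_clear_bit(unsigned long *word, unsigned bit)
{
    unsigned long mask = 1UL << bit;
    unsigned long old = __sync_fetch_and_and(word, ~mask);
    return (old & mask) != 0;
}
```

The Linux kernel's test_and_set_bit/test_and_clear_bit primitives have exactly this shape.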

2) You need a strategy for handling architectures that can't handle
atomic operations on certain LLVM data types. For example, 64 bit
machines can operate atomically on 64 bit operands, but some 32 bit
machines cannot. I think you can fix this with spin locking, but you
need to be careful on how you do it. I think in order to do it
correctly, you need a separate spinlock variable for each individual
instruction that requires a spinlock.
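The per-instruction spinlock idea can be sketched as follows; this is only an illustration of the lowering strategy, not proposed IR, and the lock variable and function names are hypothetical. A test-and-set lock guards a 64-bit read-modify-write on a target assumed to lack native 64-bit atomics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lowering of a 64-bit atomic add for a 32-bit target:
 * each lowered instruction gets its OWN lock variable, as suggested
 * above, so unrelated atomic operations never contend on, or deadlock
 * over, a shared lock. */
static volatile int lock_for_this_site;  /* one per lowered instruction */

static uint64_t emulated_atomic_add64(uint64_t *addr, uint64_t val)
{
    /* spin until the test-and-set returns the unlocked value (0) */
    while (__sync_lock_test_and_set(&lock_for_this_site, 1))
        ;
    uint64_t old = *addr;   /* plain (non-atomic) 64-bit accesses,  */
    *addr = old + val;      /* made safe by the surrounding lock    */
    __sync_lock_release(&lock_for_this_site);
    return old;
}
```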

3) You say that your operations work on all first class LLVM data
types. The LLVM vector type is considered a first class type. Should
LLVM support atomic vector operations?

4) For consistency, you may need to specify that the LLVM volatile
keyword can be added to the atomic operations to prevent optimizations
from modifying or removing them. I'm not aware of any optimization
that works on atomic ops in practice, but I don't see why they couldn't
be written.

5) With the addition of membar instructions, it may be time to re-think
what the LLVM volatile keyword means. Currently, volatile prevents
removing, modifying, or reordering of a load or store. However, membar
also prevents reordering. Perhaps it would be useful to redefine
volatile to mean that a load/store cannot be removed or modified but can
be reordered, and membar alone prevents reordering.
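Under the split proposed here, a publisher would look something like the sketch below: volatile keeps the two stores from being removed, while an explicit full barrier (GCC's `__sync_synchronize`, standing in for the proposed membar) pins their order.

```c
#include <assert.h>

/* volatile: the stores may not be deleted or merged.
 * __sync_synchronize: a full memory barrier (stand-in for the proposed
 * membar) keeps the store to `data` ordered before the store to `ready`. */
static volatile int data, ready;

static void publish(int v)
{
    data = v;
    __sync_synchronize();   /* no memory access may cross this point */
    ready = 1;
}
```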

-- John T.

Hi,

I'd like to see support for something like this. I have some comments, and I
think there is existing work that you can reuse.

TAS and CAS are _not_ theoretically equivalent. TAS is weaker because it can
solve consensus in a nonblocking way only for 2 threads (it has consensus
number 2), whereas CAS can solve consensus for any number of threads
(infinite consensus number).

"While the processor may spin and attempt the atomic operation more than once
before it is successful, research indicates this is extremely uncommon."
I don't understand this sentence, what do you mean?

You probably don't need to require CAS (compare-and-set) to return the
previous value (I think some architectures don't); it could just return a
boolean value (success/failure).
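GCC later exposed exactly these two flavors, which makes the trade-off concrete (the builtins are shown only for illustration; they are not part of this proposal). The value-returning form subsumes the boolean one, since success is simply "returned value == expected".

```c
#include <assert.h>

/* Boolean compare-and-set: did the swap happen? */
static int cas_bool(int *p, int expected, int desired)
{
    return __sync_bool_compare_and_swap(p, expected, desired);
}

/* Value-returning compare-and-swap: what was actually in memory?
 * Success iff the return value equals `expected`. */
static int cas_val(int *p, int expected, int desired)
{
    return __sync_val_compare_and_swap(p, expected, desired);
}
```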

Why did you pick the Load/Store model for barriers rather than some other
kind (e.g., acquire/release/...)?

Did you have a look at the atomic_ops project?
http://www.hpl.hp.com/research/linux/atomic_ops/
It already has implementations for several architectures and several
compilers. It uses a different consistency model (different set of
constraints for operations) and groups necessary memory barriers with
instructions (helpful on some architectures). It supports a few more
operations. The author (Hans Boehm) seems to also be active in the area of
C/C++ memory models (or some support for this).

Torvald

Torvald Riegel wrote:

Hi,

I'd like to see support for something like this. I have some comments, and I
think there is existing work that you can reuse.

"reuse within the compiler."

"While the processor may spin and attempt the atomic operation more than once
before it is successful, research indicates this is extremely uncommon."
I don't understand this sentence, what do you mean?

I'm not sure I can pinpoint the paper on which the statement is based,
but I seem to recall something similar in the original LL/SC papers
(Maurice Herlihy, DEC Western Research Labs?). It's a foundation for
lock-free algorithms.

You probably don't need to require CAS (compare-and-set) to return the
previous value (I think some architectures don't), but just return a boolean
value (success/failure).

compare and swap?

Why did you pick the Load/Store model for barriers rather than some other
kind (e.g., acquire/release/...)?

Chandler looked at what the various current LLVM architectures support
and summarized what he found: the memory barriers that the various
processors offer.

Did you have a look at the atomic_ops project?
http://www.hpl.hp.com/research/linux/atomic_ops/
It already has implementations for several architectures and several
compilers. It uses a different consistency model (different set of
constraints for operations) and groups necessary memory barriers with
instructions (helpful on some architectures). It supports a few more
operations. The author (Hans Boehm) seems to also be active in the area of
C/C++ memory models (or some support for this).

LLVM doesn't emit external library calls -- there is no "-lllvm" to
which programs have to link, so adding an atomic operation library is
likely to be a non-starter. LLVM is interested in emitting instructions
to make atomic operations (and higher level concurrency primitives)
possible, which is why Chandler's work is useful and important.

-scooter

That being said, it probably would not be difficult to add an optional pointer to the membarrier instructions. Thus one could map acquire/release semantics onto the membarrier, and it would be up to the various optimizations to conservatively determine whether code motion is legal given aliasing information.

Torvald Riegel wrote:
> Hi,
>
> I'd like to see support for something like this. I have some comments,
> and I think there is existing work that you can reuse.

"reuse within the compiler."

within the LLVM compiler framework, to be precise.

> "While the processor may spin and attempt the atomic operation more than
> once before it is successful, research indicates this is extremely
> uncommon." I don't understand this sentence, what do you mean?

I'm not sure I can pinpoint the paper on which the statement is based,
but I seem to recall something similar in the original LL/SC papers
(Maurice Herlihy, DEC Western Research Labs?). It's a foundation for
lock-free algorithms.

Well, the statement says that often you have low contention. But that's
something you want, not necessarily something you will get, and depends on
the workload/algorithm. I'm missing the context. Is the actual statement as
obvious as that you should try to use the atomic instructions offered by your
processor, instead of doing blocking algorithms?

> You probably don't need to require CAS (compare-and-set) to return the
> previous value (I think some architectures don't), but just return a
> boolean value (success/failure).

compare and swap?

Well, do you need the swap, or is a compare-and-set sufficient most of the
time? What do other architectures offer?

> What are the reasons because of which you picked the Load/Store model for
> barriers and not some other kind (e.g., acquire/release/...)?

Chandler looked at what the various current LLVM architectures support
and summarized what he found: the memory barriers that the various
processors offer.

What you would want is a model that is (1) easy to use for developers
and (2) close to what the hardware offers. L/S membars are easy to
use, but I think some architectures such as Itanium offer different membars
with different costs. So if you pick the wrong model and have to use stronger
membars (mfence, Itanium) to implement your model, then you pay for that with
decreased performance.

> Did you have a look at the atomic_ops project?
> http://www.hpl.hp.com/research/linux/atomic_ops/
> It already has implementations for several architectures and several
> compilers. It uses a different consistency model (different set of
> constraints for operations) and groups necessary memory barriers with
> instructions (helpful on some architectures). It supports a few more
> operations. The author (Hans Boehm) seems to also be active in the area
> of C/C++ memory models (or some support for this).

LLVM doesn't emit external library calls -- there is no "-lllvm" to
which programs have to link, so adding an atomic operation library is
likely to be a non-starter. LLVM is interested in emitting instructions
to make atomic operations (and higher level concurrency primitives)
possible, which is why Chandler's work is useful and important.

Please have a real look at atomic_ops first. It does have a library part to
it -- but that's just for a nonblocking stack.

All the atomic operations are macros and asm for the specific
compiler/architecture pairs.
So if you reuse that in the LLVM code generators, you save one large part of
the work. Of course you can redo all this work, surely giving you a very fast
start...

Second, I guess there has been some serious effort put into selecting the
specific model. So, for example, if you look at some of Hans' published
slides etc., there are some arguments in favor of associating membars with
specific instructions. Do you know reasons why LLVM shouldn't do this?

Has anyone looked at the memory models currently under discussion for C/C++?
Although there is no consensus yet (AFAIK), it would be good for LLVM to stay
close.

And please observe that I didn't state that the work is not important or not
useful. We should just strive to select the best model we have, and reuse
work if we can. And if we can reuse a tested implementation and model, that
is a good thing.

Torvald

Poor alpha, no code examples or entries in your tables.

Andrew

Poor alpha, no code examples or entries in your tables.

But that said, it uses a load-locked, store-conditional and has
various memory barriers which are sufficient to implement all your
proposal.

Andrew

That is good to hear. The lack of Alpha was purely a lack of
time/knowledge of the architecture on my part; hopefully I can start
learning more about how it functions. If you send me a pointer to
appropriate documentation, I would be happy to add the appropriate
information to the page. If you can provide implementations, it would
save time, but I don't mind doing a fair portion of the grunt work.

-Chandler Carruth

> > "While the processor may spin and attempt the atomic operation more than
> > once before it is successful, research indicates this is extremely
> > uncommon." I don't understand this sentence, what do you mean?
>
> I'm not sure I can pinpoint the paper from which the statement is based,
> but I seem to recall something similar in the original LL-SC papers
> (Maurice Herlihy, DEC Western Research Labs?) It's a foundation for
> lock-free algorithms.

Well, the statement says that often you have low contention. But that's
something you want, not necessarily something you will get, and depends on
the workload/algorithm. I'm missing the context. Is the actual statement as
obvious as that you should try to use the atomic instructions offered by your
processor, instead of doing blocking algorithms?

LL/SC is not a blocking algorithm. I'm going to be changing some of
the nomenclature on the page to reflect this, but while it spins, it
does not actually lock. The idea (as I understand it, and I'm still
hoping Scott can find the reference he gave me to a DEC paper
outlining this) is that the spin only occurs if another process does
something breaking atomicity for that _particular_ LL/SC pairing. Even
when the spin occurs, it should only occur until that particular
process gets through the op without interruption. This needs some
statistical analysis however, and hopefully the research in the
literature can be located and referenced.

> > You probably don't need to require CAS (compare-and-set) to return the
> > previous value (I think some architectures don't), but just return a
> > boolean value (success/failure).
>
> compare and swap?

Well, do you need the swap, or is a compare-and-set sufficient most of the
time? What do other architectures offer?

All of the architectures offer swap, or have no penalty for swapping.
This allows much easier algorithm development to my mind, as you have
the best of both worlds -- success/failure information, and the value
from memory.
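The "best of both worlds" point can be made concrete: when a value-returning CAS fails, the returned value is the fresh contents of memory, so the retry loop needs no second load. A sketch of an atomic maximum (a hypothetical helper; GCC's `__sync_val_compare_and_swap` stands in for the proposed cas):

```c
#include <assert.h>

/* Atomically raise *p to at least `candidate`; return the prior value.
 * On CAS failure, `seen` already holds the fresh contents of memory,
 * so there is no extra load at the top of the retry loop. */
static int atomic_max(int *p, int candidate)
{
    int old = *p;
    for (;;) {
        if (old >= candidate)
            return old;            /* nothing to do */
        int seen = __sync_val_compare_and_swap(p, old, candidate);
        if (seen == old)
            return old;            /* our swap took effect */
        old = seen;                /* lost a race; retry with fresh value */
    }
}
```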

> > What are the reasons because of which you picked the Load/Store model for
> > barriers and not some other kind (e.g., acquire/release/...)?
>
> Chandler looked at what the various current LLVM architectures and
> summarized what he found. What he found are the memory barriers that the
> various processors support.

What you would want is a model that is (1) easy to use for developers
and (2) close to what the hardware offers. L/S membars are easy to
use, but I think some architectures such as Itanium offer different membars
with different costs. So if you pick the wrong model and have to use stronger
membars (mfence, Itanium) to implement your model, then you pay for that with
decreased performance.

Itanium was the only architecture to offer these semantics, while the
L/S membars are offered to varying levels of detail on several
architectures. As Itanium is not yet a fully functional target, it was
not prioritized.

Moreover, as the only instructions (to my knowledge) on Itanium to
have memory synchronization components are cmpxchg and fetchadd, these
could be implemented correctly when implementing the lowering for the
instructions in this proposal, while still providing full memory
barriers when needed outside of the atomic instructions. If there is
serious demand for building memory semantics into the atomic
instructions, "acquire" and "release" flags could be used, and
implementations could handle them appropriately. This doesn't seem to
annul the need for non-operation-based memory barriers.

> > Did you have a look at the atomic_ops project?
> > http://www.hpl.hp.com/research/linux/atomic_ops/
> > It already has implementations for several architectures and several
> > compilers. It uses a different consistency model (different set of
> > constraints for operations) and groups necessary memory barriers with
> > instructions (helpful on some architectures). It supports a few more
> > operations. The author (Hans Boehm) seems to also be active in the area
> > of C/C++ memory models (or some support for this).
>
> LLVM doesn't emit external library calls -- there is no "-lllvm" to
> which programs have to link, so adding an atomic operation library is
> likely to be a non-starter. LLVM is interested in emitting instructions
> to make atomic operations (and higher level concurrency primitives)
> possible, which is why Chandler's work is usefully important.

Please have a real look at atomic_ops first. It does have a library part to
it -- but that's just for a nonblocking stack.

All the atomic operations are macros and asm for the specific
compiler/architecture pairs.
So if you reuse that in the LLVM code generators, you save one large part of
the work. Of course you can redo all this work, surely giving you a very fast
start...

The implementations for the current proposal came from architecture
manuals, the Linux kernel, and the Apache Portable Runtime. I will
definitely be looking at the atomic_ops implementations to see if
there are improvements that can be made, but ultimately it provides
another model, not a re-usable component, as this must be done
through codegen at some point.

Second, I guess there has been some serious effort put into selecting the
specific model. So, for example, if you look at some of Hans' published
slides etc., there are some arguments in favor of associating membars with
specific instructions. Do you know reasons why LLVM shouldn't do this?

My reason for not associating them is that the majority of hardware
implementations do not associate them. The overriding motive was to
remain very close to the hardware. Could libraries and intrinsic
functions in the FE provide these different interfaces to the
constructs?

Has anyone looked at the memory models currently under discussion for C/C++?
Although there is no consensus yet (AFAIK), it would be good for LLVM to stay
close.

Not that LLVM should shun C/C++, but those aren't its only languages.
I think it's better to approach the problem from the hardware rather
than the language. This keeps LLVM an accurate layer for expressing
hardware operations, and allows languages to translate their constructs
to use the hardware appropriately.

And please observe that I didn't state that the work is not important or not
useful. We should just strive to select the best model we have, and reuse
work if we can. And if we can reuse a tested implementation and model, that
is a good thing.

I absolutely agree. This is why every aspect of the current proposal
came from a hardware representation, combined with the Linux kernel
representations. We deviated toward the hardware to ensure the ability
of these intrinsics to fully exploit and expose the hardware
capabilities while remaining lowerable (if that were a word) across
all targets.

Does this clarify some? I am quite open to trying to add support for
Itanium-style hardware representations if this is a significant issue,
it was simply not a priority, and not a problem that I well
understand. (The Linux kernel does not use these semantics that I
could find, but then it may not be the best example.)

-Chandler Carruth

1) You may want to consider adding atomic load-<bitwise operation>-store
instructions in addition to load-<add/subtract> instructions. The Linux
kernel uses these for atomic bit setting/clearing, and on many systems
they can be implemented more efficiently using special assembly
instructions. That said, I have never run experiments to verify that
such instructions are critical to kernel performance.

I was trying to keep the set of operations as small as possible.
Supporting all of the bitwise operations on the x86 architecture would
be very difficult due to their number. (BTS, BTR, BTC...) Several of
these also have no easy emulation available for other architectures.
(Imagine doing an atomic test and set of a single bit, without
affecting the rest of the bits, on SPARC...) Others do not fit well
into the SSA representation of LLVM. This is particularly true of the
non-exchanging operations on x86. Because x86 can work directly with
memory, avoiding loads and stores, many locked operations are
practical there without saving out the old value as part of the atomic
instruction. Representing this in an SSA fashion would be very
difficult I think, especially when trying to maintain atomicity across
the entire LLVM instruction which isn't necessarily implementable as a
single instruction on x86.

All the same, if there is a demand for these bitwise operations, I am
more than willing to try and work up ways to include them. Do other
people have ideas for implementing them cleanly, and/or ideas about
the demand for them over using "cas" to achieve the functionality?
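For reference, here is how a target without native atomic bitwise instructions could synthesize one of the bitwise operations from "cas" alone, which is also roughly what a SPARC-like target would have to do anyway (a sketch; GCC's `__sync_val_compare_and_swap` stands in for the proposed cas instruction):

```c
#include <assert.h>

/* fetch-and-or built from a compare-and-swap loop; returns the old
 * value.  This is the fallback lowering for targets that have cas but
 * no dedicated atomic bitwise instructions. */
static unsigned atomic_fetch_or_via_cas(unsigned *p, unsigned mask)
{
    unsigned old = *p;
    for (;;) {
        unsigned seen = __sync_val_compare_and_swap(p, old, old | mask);
        if (seen == old)
            return old;    /* swap succeeded */
        old = seen;        /* retry with the value that was there */
    }
}
```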

2) You need a strategy for handling architectures that can't handle
atomic operations on certain LLVM data types. For example, 64 bit
machines can operate atomically on 64 bit operands, but some 32 bit
machines cannot. I think you can fix this with spin locking, but you
need to be careful on how you do it. I think in order to do it
correctly, you need a separate spinlock variable for each individual
instruction that requires a spinlock.

Indeed, which is very dangerous with potential for deadlock, etc.
After discussing this on IRC, I have adjusted the proposal to reflect
the idea that the target implementation can reject any instructions
with operands which they cannot effectively lower to an atomic
operation. This makes the available types for the instruction
architecture dependent, but when relying on the atomic architecture
implementation, this seems part of the package.

3) You say that your operations work on all first class LLVM data
types. The LLVM vector type is considered a first class type. Should
LLVM support atomic vector operations?

Thanks for pointing this out! Continuing with the changes above, I
have changed the proposal to state that these instructions only accept
integer types. This should provide the needed functionality of these
operations, while keeping a clear lowering path to the target
architecture.

4) For consistency, you may need to specify that the LLVM volatile
keyword can be added to the atomic operations to prevent optimizations
from modifying or removing them. I'm not aware of any optimization
that works on atomic ops in practice, but I don't see why they couldn't
be written.

Why would they need this keyword? If an optimization pass works on
atomic ops, wouldn't it need to understand the semantics involved, and
only modify/remove them when those semantics allow it?
Specifically, constant propagation should be able to replace the
values in these operations with constants if it can do so, no? Perhaps
I don't understand what situation you're trying to avoid/prepare for
with this.

5) With the addition of membar instructions, it may be time to re-think
what the LLVM volatile keyword means. Currently, volatile prevents
removing, modifying, or reordering of a load or store. However, membar
also prevents reordering. Perhaps it would be useful to redefine
volatile to mean that a load/store cannot be removed or modified but can
be reordered, and membar alone prevents reordering.

I personally think this would be very interesting to do, as the
barriers provide a very fine grained level of control over load/store
ordering. However, they have implications if they are used for this
purpose -- should the order enforcement of loads and stores without
interprocess implications cause the bus events that these memory
barriers do? That could result in a huge slowdown.

More likely, these barriers should provide another piece of
information about load/store reordering, not superseding volatile. If
people see value in using a barrier-type set of constraints on
volatile reordering, this should be provided separately I think.

-Chandler Carruth

This probably isn't needed for the mainstream architectures listed in the proposal, but wouldn't it be good to let all of the instructions support types based on the specific target rather than just limiting them to integer types? If the target supports atomic operations on, say, vectors, then I'd think that should be allowed.

Torvald Riegel wrote:

Torvald Riegel wrote:

Hi,

I'd like to see support for something like this. I have some comments,
and I think there is existing work that you can reuse.

"reuse within the compiler."

within the LLVM compiler framework, to be precise.

"While the processor may spin and attempt the atomic operation more than
once before it is successful, research indicates this is extremely
uncommon." I don't understand this sentence, what do you mean?

I'm not sure I can pinpoint the paper on which the statement is based,
but I seem to recall something similar in the original LL/SC papers
(Maurice Herlihy, DEC Western Research Labs?). It's a foundation for
lock-free algorithms.

Well, the statement says that often you have low contention. But that's
something you want, not necessarily something you will get, and depends on
the workload/algorithm. I'm missing the context. Is the actual statement as
obvious as that you should try to use the atomic instructions offered by your
processor, instead of doing blocking algorithms?

As Chandler pointed out, LL/SC isn't blocking. It belongs to the
optimistic concurrency class of constructs. One of the earliest papers
(IIRC, the first paper) on LL/SC was:

Herlihy, M. 1993. A methodology for implementing highly concurrent data
objects. ACM Trans. Program. Lang. Syst. 15, 5 (Nov. 1993), 745-770.
DOI= http://doi.acm.org/10.1145/161468.161469

LL/SC on the various RISC architectures are used for spin locks, but
they don't have to be used that way. I suspect that current work on
software transactional memory is LL/SC-like on memory regions -- if you
look at the paper, there is a chunk of code in the examples that rolls
back or restarts a computation if the SC operation fails.

Please have a real look at atomic_ops first. It does have a library part to
it -- but that's just for a nonblocking stack.

It's a lot like Apple's (and gcc's) work to reconcile the Intel and PPC
vector intrinsics. Nice work but an unnecessary dependency, in my
personal and not so humble opinion.

Second, I guess there has been some serious effort put into selecting the
specific model. So, for example, if you look at some of Hans' published
slides etc., there are some arguments in favor of associating membars with
specific instructions. Do you know reasons why LLVM shouldn't do this?

You mean the papers that don't have to do with garbage collection? :-)

Seriously, I think that's the overall purpose for some of this work so
that llvm can do a better job in instruction-level parallelism.

Has anyone looked at the memory models currently under discussion for C/C++?
Although there is no consensus yet (AFAIK), it would be good for LLVM to stay
close.

Even when consensus is achieved, it still has to be implemented on the
hardware. As you point out, LL/SC is used to create spinlocks. But LL/SC
is somewhat more powerful than that.

-scooter

As Chandler pointed out, LL/SC isn't blocking. It belongs to the
optimistic concurrency class of constructs. One of the earliest papers
(IIRC, the first paper) on LL/SC was:

Herlihy, M. 1993. A methodology for implementing highly concurrent data
objects. ACM Trans. Program. Lang. Syst. 15, 5 (Nov. 1993), 745-770.
DOI= http://doi.acm.org/10.1145/161468.161469

LL/SC on the various RISC architectures are used for spin locks, but
they don't have to be used that way. I suspect that current work on
software transactional memory is LL/SC-like on memory regions -- if you
look at the paper, there is a chunk of code in the examples that rolls
back or restarts a computation if the SC operation fails.

First of all, I know LL/SC. Did I say it's equivalent to get-and-set? No.
So what are you trying to say? Why is the paragraph in the proposal? You
seem to be speculating about architectures in general in one paragraph.
IMHO, I wouldn't try that, because I would either have to be imprecise or
not state anything new.

> Please have a real look at atomic_ops first. It does have a library
> part to it -- but that's just for a nonblocking stack.

It's a lot like Apple's (and gcc's) work to reconcile the Intel and PPC
vector intrinsics. Nice work but an unnecessary dependency, in my
personal and not so humble opinion.

So when you are reinventing the wheel, it doesn't give you a dependency on
the wheel, is that what you're saying?

The idea is to review the atomic_ops model, and if it makes sense, just
reuse it. (e.g., atomic_ops seems to have (basic?) support for Alpha).

> Second, I guess there has been some serious effort put into selecting
> the specific model. So, for example, if you look at some of Hans'
> published slides etc., there are some arguments in favor of associating
> membars with specific instructions. Do you know reasons why LLVM
> shouldn't do this?

You mean the papers that don't have to do with garbage collection? :-)

Seriously, I think that's the overall purpose for some of this work so
that llvm can do a better job in instruction-level parallelism.

Can I get an answer to the actual question, please?

> Has anyone looked at the memory models that are being in discussion for
> C/C++? Although there is no consensus yet AFAIK, it should be good for
> LLVM to stay close.

Even when consensus is achieved, it still has to be implemented on the
hardware. As you point out, LL/SC is used to create spinlocks. But LL/SC
is somewhat more powerful than that.

Well, the proposals actually consider the implementation (and, e.g.,
atomic_ops is an actual implementation). How stupid would a model be
that doesn't map well to hardware?
And, sorry, I don't see how the particular LL/SC mechanism is related to my
question at all. Could we stay on topic, please?

torvald

atomic_ops has at least a basic implementation for alpha (and various other
architectures).

torvald

I was trying to keep the set of operations as small as possible.
Supporting all of the bitwise operations on the x86 architecture would
be very difficult due to their number. (BTS, BTR, BTC...) Several of
these also have no easy emulation available for other architectures.
(Imagine doing an atomic test and set of a single bit, without
affecting the rest of the bits, on SPARC..) Others do not fit well
into the SSA representation of LLVM. This is particularly true of the
non-exchanging operations on x86. Because x86 can work directly with
memory, avoiding loads and stores, many locked operations are
practical there without saving out the old value as part of the atomic
instruction. Representing this in an SSA fashion would be very
difficult I think, especially when trying to maintain atomicity across
the entire LLVM instruction which isn't necessarily implementable as a
single instruction on x86.

atomic_ops has code for more operations; you can probably find useful
information there...

All the same, if there is a demand for these bitwise operations, I am
more than willing to try and work up ways to include them. Do other
people have ideas for implementing them cleanly, and/or ideas about
the demand for them over using "cas" to achieve the functionality?

I think specialized instructions (e.g., inc/dec, or, TAS, ...) are less
costly on some architectures, compared to a full CAS. And AFAIK, CAS often
includes a full barrier (or equivalent constraints), whereas with the
specialized operations, you can select the kind of ordering constraints
that you really need.
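The cost argument here, phrased in the C11 terms that were later standardized (shown only as an illustration of per-operation ordering constraints, not as part of the proposal): a plain event counter needs atomicity but no ordering, so a relaxed fetch-add avoids paying for the full barrier a CAS usually implies.

```c
#include <stdatomic.h>
#include <assert.h>

static atomic_int hits;  /* statically zero-initialized */

/* Atomic, but imposes no ordering on surrounding memory accesses:
 * exactly the "select the constraints you really need" idea. */
static void count_hit(void)
{
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}

static int read_hits(void)
{
    return atomic_load_explicit(&hits, memory_order_relaxed);
}
```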

> 2) You need a strategy for handling architectures that can't handle
> atomic operations on certain LLVM data types. For example, 64 bit
> machines can operate atomically on 64 bit operands, but some 32 bit
> machines cannot. I think you can fix this with spin locking, but you
> need to be careful on how you do it. I think in order to do it
> correctly, you need a separate spinlock variable for each individual
> instruction that requires a spinlock.

Indeed, which is very dangerous with potential for deadlock, etc.
After discussing this on IRC, I have adjusted the proposal to reflect
the idea that the target implementation can reject any instructions
with operands which they cannot effectively lower to an atomic
operation. This makes the types available to the instruction
architecture-dependent, but when relying on the architecture's atomic
support, that restriction seems part of the package.

I would rather not see it provide blocking implementations there; that's
not good, because the rest of the design is nonblocking. Deadlock is an
issue, though, only if you are handling signals (otherwise you never hold
more than one of these locks, so you should be okay).

torvald
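For reference, the spinlock fallback discussed above might look like the following sketch (a 64-bit atomic add for a 32-bit target, guarded by a lock word dedicated to this one lowered instruction; the GCC __sync builtins are used, and the names are illustrative, not part of the proposal).

```c
#include <stdint.h>

/* One dedicated lock per lowered instruction, as the thread suggests. */
static volatile int add64_lock;

/* Blocking fallback for 64-bit fetch-and-add on hardware without 64-bit
 * CAS. Note the deadlock hazard mentioned above: a signal handler that
 * re-enters this while the lock is held will spin forever. */
static uint64_t blocking_fetch_add64(volatile uint64_t *p, uint64_t v)
{
    while (__sync_lock_test_and_set(&add64_lock, 1)) /* acquire via TAS */
        ;                                            /* busy-wait */
    uint64_t old = *p;
    *p = old + v;
    __sync_lock_release(&add64_lock);                /* release barrier */
    return old;
}
```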

> > > "While the processor may spin and attempt the atomic operation more
> > > than once before it is successful, research indicates this is
> > > extremely uncommon." I don't understand this sentence, what do you
> > > mean?
> >
> > I'm not sure I can pinpoint the paper from which the statement is
> > based, but I seem to recall something similar in the original LL-SC
> > papers (Maurice Herlihy, DEC Western Research Labs?) It's a
> > foundation for lock-free algorithms.
>
> Well, the statement says that often you have low contention. But that's
> something you want, not necessarily something you will get, and depends
> on the workload/algorithm. I'm missing the context. Is the actual
> statement as obvious as that you should try to use the atomic
> instructions offered by your processor, instead of doing blocking
> algorithms?

LL/SC is not a blocking algorithm. I'm going to be changing some of
the nomenclature on the page to reflect this, but while it spins, it
does not actually lock. The idea (as I understand it, and I'm still
hoping Scott can find the reference he gave me to a DEC paper
outlining this) is that the spin only occurs if another process does
something breaking atomicity for that _particular_ LL/SC pairing. Even
when the spin occurs, it should only occur until that particular
process gets through the op without interruption. This needs some
statistical analysis however, and hopefully the research in the
literature can be located and referenced.

Okay, now I know what you wanted to say (and btw, I know that). But I still
wouldn't put that into the proposal because we don't care how the hardware
implements it. CAS could be implemented the same way. The point is whether
you got something as powerful as CAS or LL/SC that is still nonblocking (or
looks nonblocking).

I'd suggest having a look at the so-called consensus hierarchy (and why
TAS/get-and-set is less powerful than CAS).

> > > You probably don't need to require CAS (compare-and-set) to return
> > > the previous value (I think some architectures don't), but just
> > > return a boolean value (success/failure).
> >
> > compare and swap?
>
> Well, do you need the swap, or is a compare-and-set sufficient most of
> the time? What do other architectures offer?

All of the architectures offer swap, or have no penalty for swapping.
This allows much easier algorithm development to my mind, as you have
the best of both worlds -- success/failure information, and the value
from memory.

Here is an example of what can happen if you want to support an
architecture whose CAS or LL/SC only returns a success flag, not the value
(no swap).

1) To return the value, you could return the result of a load performed
before or after the CAS. But without blocking, you can't (easily?) make
that load atomic with the operation.

2) Suppose a user wants to use CAS for a simple spin-lock (because you
don't offer TAS, for example). You need to return a value instead of the
boolean success flag. Neither the previous nor the later load is useful, so
you have to pick a "different" value (e.g., the expected value + 1, or
something like that).

3) Now the user wants to use CAS for something else, for example, swinging
a pointer. How do you pick a "different" value then? You can, but the user
can't rely on that value for anything else.

As a result, you can't support the CAS for this hardware without blocking.
Because of that, I would choose the conservative way of returning just the
success flag. And most of the concurrent implementations I've seen have the
load in them anyway (something like while(1) { load; set up; if (CAS) break; }).
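That load-then-CAS retry pattern, in the value-returning form the proposal favors, looks like this sketch (GCC __sync builtins; the function name is made up). Note how the failed CAS already hands back the current value, so the retry path needs no extra load.

```c
#include <stdint.h>

/* The "load; set up; CAS; retry" loop from the discussion above,
 * written against a value-returning compare-and-swap. */
static uint32_t atomic_increment(volatile uint32_t *p)
{
    uint32_t old = *p;                 /* one initial load */
    for (;;) {
        uint32_t seen = __sync_val_compare_and_swap(p, old, old + 1);
        if (seen == old)               /* success: the swap happened */
            return old + 1;
        old = seen;                    /* failure: reuse the returned value,
                                          no reload required */
    }
}
```

With a success-flag-only CAS, the `old = seen;` line would instead have to be a fresh load, which is exactly the extra load being debated here.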

The overhead that you can get is one more load, if you need to retry.
However, I don't think that is an issue, both regarding ease of use and
performance-wise.
You usually won't need a membar for this load, because some value from after
the CAS is sufficient in most algorithms, and the CAS usually ensures that.
If the CAS failed and your CAS is a swap, then the new value is already
either in the cache (load is cheap) or in a register. You could try to
supply subsequent loads from the register, but that might be difficult to
implement in the code generator. In any case, have you performed
measurements to test whether the potential additional load actually matters
performance-wise?

> What you would want is to have a model that is (1) easy-to-use for the
> developers and (2) close to what the hardware offers. L/S membars are
> easy to use, but I think some architectures such as Itanium offer
> different membars with different costs. So if you pick the wrong model
> and have to use stronger membars (a full fence like mf on Itanium) to
> implement your model, then you pay for that with decreased performance.

Itanium was the only architecture to offer these semantics, while the
L/S membars are offered to varying levels of detail on several
architectures. As Itanium is not yet a fully functional target, it was
not prioritized.

Moreover, as the only instructions (to my knowledge) on Itanium to
have memory synchronization components are cmpxchg and fetchadd, these
could be implemented correctly when implementing the lowering for the
instructions in this proposal, while still providing full memory
barriers when needed outside of the atomic instructions. If there is
serious demand for building memory semantics into the atomic
instructions, "acquire" and "release" flags could be used, and
implementations could handle them appropriately. This doesn't seem to annul
the need for non-operation-based memory barriers.

At the programming level, the membar is always conceptually associated to
some operation. You need the ordering at some point in your algorithm (when
the processor would execute a particular operation). If you wouldn't need
ordering constraints for any operation, then you wouldn't need the membar
at all.
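The point above can be made concrete: even a standalone barrier instruction is, in the algorithm, tied to a particular memory operation. A sketch of the classic publish/consume flag pattern (GCC __sync builtins; single-threaded here, the names are illustrative):

```c
static int payload;
static volatile int ready;

/* The barrier conceptually belongs to the store of `ready`: it orders
 * the payload write before the flag becomes visible. */
static void publish(int value)
{
    payload = value;
    __sync_synchronize();   /* full barrier, "attached" to the next store */
    ready = 1;
}

/* Symmetrically, the consumer's barrier belongs to the load of `ready`. */
static int consume(void)
{
    while (!ready)
        ;                   /* spin until the flag is observed */
    __sync_synchronize();   /* barrier "attached" to the preceding load */
    return payload;
}
```

An instruction set with acquire/release operations would fold each barrier into the flag access itself; the standalone form merely makes the association explicit.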

The implementations for the current proposal came from architecture
manuals, the Linux kernel, and the Apache Portable Runtime. I will
definitely be looking at the atomic_ops implementations to see if
there are improvements that can be made; ultimately, though, atomic_ops
provides another model rather than a reusable component, as this must be
done through codegen at some point.

To me, letting codegen emit the same instructions as the macros do _is_
reuse.
You might even be able to write a script that generates codegen-code from
the macros...

> Second, I guess there has been some serious effort put into selecting
> the specific model. So, for example, if you look at some of Hans'
> published slides etc., there are some arguments in favor of associating
> membars with specific instructions. Do you know reasons why LLVM
> shouldn't do this?

My reason for not associating them is due to the majority of hardware
implementations not associating them. The override motive was to
remain very close to the hardware. Could libraries and intrinsic
functions in the FE provide these different interfaces to the
constructs?

It might be more difficult because the membars modify operations, and can't
be easily "added" as new instructions, calls, or intrinsics.

Regarding the frontends, we're back at the language integration question, I
guess. If you don't want to change syntax or add modifiers, I guess the
best way would be to provide a function call interface for CAS, loads, ...
(either plus membars or with membars attached as in atomic_ops), and
then "inline"/lower the code to the to-be-added LLVM IR constructs in the
frontend, or in an extra pass. You might even get a fallback to library
code (requires lib to be linked, of course), when the codegen doesn't
support the atomic ops for a certain architecture.

> Has anyone looked at the memory models that are being in discussion for
> C/C++? Although there is no consensus yet AFAIK, it should be good for
> LLVM to stay close.

Not that LLVM should shun C/C++, but those aren't its only languages.
I think it's better to approach the problem from the hardware than from
the language. This keeps LLVM an accurate layer for expressing hardware
operations, and allows languages to translate their constructs to use the
hardware appropriately.

I don't know of any language besides Java that has an official memory
model. Are there any?
In all other languages I've seen so far, synchronization is either only via
critical sections, or they revert to using some C/C++ library (that then
uses asm code).

I absolutely agree. This is why every aspect of the current proposal
came from a hardware representation, combined with the Linux kernel
representations. We deviated toward the hardware to ensure the ability
of these intrinsics to fully exploit and expose the hardware
capabilities while remaining lowerable (if that were a word) across
all targets.

Does this clarify some? I am quite open to trying to add support for
Itanium-style hardware representations if this is a significant issue,
it was simply not a priority, and not a problem that I well
understand. (The Linux kernel does not use these semantics that I
could find, but then it may not be the best example.)

I really do thank you for the effort you put into this. All that I want is
that LLVM gets the best option.

The whole memory-model issue is really difficult, and if you do it wrong,
you're going to pay. Just look at Java and its memory-model troubles...
For this reason, I'd like to see more discussion about it, about
alternatives, models, implementations, etc. And the decisions should be
documented somewhere (e.g., by archives of mailing lists...).

torvald

Torvald Riegel wrote:

First of all, I know LL/SC. Did I say it's equivalent to get-and-set? No.
So what are you trying to say, why is the paragraph in the proposal? You
seem to be speculating about architectures in general in one paragraph.
IMHO, I wouldn't try that, because I would have to be either imprecise or
state nothing new.

I was rebutting your point regarding spin locks going through the loop
once; spinning for more than one iteration is generally rare. And no, I
am not speculating about architectures in general, for that matter. I
simply like LL/SC and think it's superior to most other primitives, though
that is a matter of taste.

BTW: It's not my proposal. I merely work with Chandler.

Please have a real look at atomic_ops first. It does have a library
part to it -- but that's just for a nonblocking stack.

It's a lot like Apple's (and gcc's) work to reconcile the Intel and PPC
vector intrinsics. Nice work but an unnecessary dependency, in my
personal and not so humble opinion.

So when you are reinventing the wheel, it doesn't give you a dependency on
the wheel, is that what you're saying?

No. I'm not saying that at all. If you actually took a look at LLVM,
you'd notice that it stands alone. It has very few dependencies upon
outside code and it generates __no__ dependencies to outside code or
libraries. From previous experience developing for LLVM, I happen to
know that unnecessary dependencies are not viewed favorably.

The idea is to review the atomic_ops model, and if it makes sense, just
reuse it. (e.g., atomic_ops seems to have (basic?) support for Alpha).

atomic_ops may have interesting ideas on how Chandler might proceed and
implement, but using its code is very unlikely.

Second, I guess there has been some serious effort put into selecting
the specific model. So, for example, if you look at some of Hans'
published slides etc., there are some arguments in favor of associating
membars with specific instructions. Do you know reasons why LLVM
shouldn't do this?

You mean the papers that don't have to do with garbage collection? :)

Seriously, I think that's the overall purpose for some of this work so
that llvm can do a better job in instruction-level parallelism.

Can I get an answer to the actual question, please?

Being argumentative for the sake of being argumentative isn't going to
motivate me to answer your question. LLVM is a machine IR and its
instructions have to map to something that exists and has reasonable
properties. Asking for acquire/release, which only exists on one
architecture, doesn't fit the pattern (and is arguably not such a great
idea within that particular implementation.)

Could we stay on topic, please?

Bite me.

-scooter

Comments are more than welcome, especially suggestions for
improvement. I hope this provides a sound background and a good
starting place for bringing these features to LLVM. Many thanks also
go to Reid who has helped tremendously with putting this together, and
several of the people at Aerospace who helped in the research phase.

This looks good; this is basically what we came up with for the SVA-OS work.

Some comments:

1) You may want to consider adding atomic load-<bitwise operation>-store
instructions in addition to load-<add/subtract> instructions. The Linux

Yep, this is also important for supporting the GCC synch builtins:
http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html#Atomic-Builtins

despite the description, these are not itanium specific.

We will want to support these with GCC 4.2 to support OpenMP.
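A few of those GCC builtins, exercised directly; each corresponds to a load-op-store (or compare-and-swap) that the proposed instructions would need to lower. The wrapper function here is made up for illustration.

```c
/* Exercising the GCC __sync builtins linked above: fetch-and-add,
 * a bitwise fetch-and-and (the load-<bitwise op>-store case from
 * comment 1), and a boolean compare-and-swap. */
static int demo_sync_builtins(void)
{
    int v = 10;
    int old = __sync_fetch_and_add(&v, 5);             /* v becomes 15, returns 10 */
    __sync_fetch_and_and(&v, 0xF);                     /* bitwise form: 15 & 0xF = 15 */
    int ok = __sync_bool_compare_and_swap(&v, 15, 99); /* succeeds, v becomes 99 */
    return old + v + ok;                               /* 10 + 99 + 1 */
}
```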

2) You need a strategy for handling architectures that can't handle
atomic operations on certain LLVM data types. For example, 64 bit

Yep.

3) You say that your operations work on all first class LLVM data
types. The LLVM vector type is considered a first class type. Should
LLVM support atomic vector operations?

Yep :), potentially through spinlocking.

5) With the addition of membar instructions, it may be time to re-think
what the LLVM volatile keyword means. Currently, volatile prevents

The volatile marker and synchronization are two different things. Please don't mess with volatile :)

-Chris

The atomic_ops library does look very interesting. Parts of its model could be adapted, and the various operations seem like they would turn into a few inlined instructions each (not libcalls).

-Chris