Reviving the new LLVM concurrency model

There was some discussion a while back about adding a C++0x-style
memory model and atomics to LLVM
(http://thread.gmane.org/gmane.comp.compilers.llvm.devel/31295), but
it stalled. I'm going to try to restart progress on it.

Attached are two patches; the first adds a section to LangRef with
just the memory model, without directly changing the documentation or
implementation of atomic intrinsics. This mostly comes from
https://docs.google.com/View?docID=ddb4mhxz_22dz5g98dd&revision=_latest,
but it's been modified a bit, so please look at the attached version.
The second fixes the one place I know of where LLVM violates that
proposed memory model.

I would appreciate any reviews (primarily for the LangRef bits; I'm
reasonably confident the patch to LICM is correct).

There was some discussion about allowing targets to implement
unaligned stores by widening, but I've left that out. No current
backend uses such an implementation, so there isn't a substantial
benefit, and allowing it would break some current optimizations (like
transforming a memset to an unaligned store). See also the FIXME in
the LangRef patch about architectures without a proper single-byte
store instruction.
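
To illustrate the hazard the memory model is guarding against, here is a
minimal C++ sketch (not from the patches; the struct and thread functions
are hypothetical) of how widening a store to cover a byte the program never
writes could clobber another thread's data:

#include <cstdint>
#include <thread>

struct Pair {
  uint8_t a;  // written only by thread1
  uint8_t b;  // written only by thread2
};

Pair p;

void thread1() {
  // The program stores only to p.a. If a backend widened this into a 16-bit
  // read-modify-write covering both p.a and p.b, it would introduce a store
  // to p.b that the source never performs...
  p.a = 1;
}

void thread2() {
  p.b = 2;  // ...which could race with, and silently undo, this store.
}

int main() {
  std::thread t1(thread1), t2(thread2);
  t1.join();
  t2.join();
  // Under the proposed model both writes must survive: p.a == 1 and p.b == 2.
}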

-Eli

atomicp1.txt (4.84 KB)

atomicp2.txt (4.66 KB)

I noticed the patch was already merged into the current LLVM language
reference manual with the new memory instructions fence, cmpxchg, and
atomicrmw. Will the instructions be available in LLVM 3.0?

The current memory model section ends with the following discussions:

"Note that in cases where none of the atomic intrinsics are used, this
model places only one restriction on IR transformations on top of what
is required for single-threaded execution: introducing a store to a
byte which might not otherwise be stored to can introduce undefined
behavior.... "

Why is introducing additional loads allowed? For example, in a program
that already contains two unordered stores to a location l, if we
introduce an unordered load from l, the load may see more than one
store, so it must return undef.

The question is whether the memory model treats the two unordered
stores as a data race that already makes the program's behavior
undefined. The spec does not make this point clear.

My second question is whether the "undefined behavior" in "...
introducing a store to a byte which might not otherwise be stored to
can introduce undefined behavior...." is the same as the ``undef'' in
the definition of R_{byte}, which can be undef in several cases. Does
that undef mean the LLVM undef value
(http://llvm.org/docs/LangRef.html#undefvalues)? If "undefined
behavior" means a program can do anything, why does R_{byte} have to
return undef? Thanks.

The current memory model section ends with the following discussions:

"Note that in cases where none of the atomic intrinsics are used, this
model places only one restriction on IR transformations on top of what
is required for single-threaded execution: introducing a store to a
byte which might not otherwise be stored to can introduce undefined
behavior.... "

Why is introducing additional loads allowed? For example, in a program
that already contains two unordered stores to a location l, if we
introduce an unordered load from l, the load may see more than one
store, so it must return undef.

The intent of the model is that if the behavior of a program doesn't
depend on the value of a load, it's okay if it is undef. This allows,
for example, hoisting loads which are conditionally executed out of
loops.
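
As a concrete picture of the kind of transformation being allowed, here is a
hand-written C++ sketch (the function and variable names are illustrative,
not from any patch):

int sum_flagged(const int *p, const bool *flag, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    if (flag[i])
      sum += *p;        // the load executes only when some flag[i] is true
  return sum;
}

// After hoisting the load out of the loop (written by hand to mimic what the
// optimizer would do):
int sum_flagged_hoisted(const int *p, const bool *flag, int n) {
  int v = *p;           // speculative load; if no iteration needed it, the
                        // value may be undef, but it is then never used
  int sum = 0;
  for (int i = 0; i < n; ++i)
    if (flag[i])
      sum += v;
  return sum;
}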

My second question is whether the "undefined behavior" in "...
introducing a store to a byte which might not otherwise be stored to
can introduce undefined behavior...." is the same as the ``undef'' in
the definition of R_{byte}, which can be undef in several cases. Does
that undef mean the LLVM undef value
(http://llvm.org/docs/LangRef.html#undefvalues)? If "undefined
behavior" means a program can do anything, why does R_{byte} have to
return undef? Thanks.

"can introduce undefined behavior", I meant "Specifically, in the case
where another thread might write to and read from an address,
introducing a store can change a load that may see exactly one write
into a load that may see multiple writes." Any suggestion for how to
phrase that paragraph more clearly?
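
A small C++ sketch of that situation (the names are mine, for illustration
only):

int x = 0;
bool cond = false;

void threadA() {
  // Original code: no store to x happens when cond is false.
  if (cond)
    x = 1;
  // A transformation that rewrites this as
  //   int tmp = x; if (cond) tmp = 1; x = tmp;
  // introduces a store to x on the path where cond is false.
}

int threadB() {
  x = 2;
  // Before the transformation this load can see exactly one write (the 2
  // just above); afterwards it may also see threadA's introduced store, so
  // under the model it returns undef.
  return x;
}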

-Eli

Hopefully, yes; the implementation is in progress.

-Eli

The current memory model section ends with the following discussions:

"Note that in cases where none of the atomic intrinsics are used, this
model places only one restriction on IR transformations on top of what
is required for single-threaded execution: introducing a store to a
byte which might not otherwise be stored to can introduce undefined
behavior.... "

Why is introducing additional loads allowed? For example, in a program
that already contains two unordered stores to a location l, if we
introduce an unordered load from l, the load may see more than one
store, so it must return undef.

The intent of the model is that if the behavior of a program doesn't
depend on the value of a load, it's okay if it is undef. This allows,
for example, hoisting loads which are conditionally executed out of
loops.

In C++0x, if a program has a data race, then its behavior is
undefined. Does the LLVM memory model define program behavior along
similar lines? If so, hoisting additional loads out of a loop that may
not execute at runtime (say the loop trip count is 0) can turn a
program from defined to undefined, which would let later compilation
passes transform the program arbitrarily, since their input program is
undefined. Is that expected? (I am not sure whether existing LLVM
transformations actually do this.)

My second question is whether the "undefined behavior" in "...
introducing a store to a byte which might not otherwise be stored to
can introduce undefined behavior...." is the same as the ``undef'' in
the definition of R_{byte}, which can be undef in several cases. Does
that undef mean the LLVM undef value
(http://llvm.org/docs/LangRef.html#undefvalues)? If "undefined
behavior" means a program can do anything, why does R_{byte} have to
return undef? Thanks.

"can introduce undefined behavior", I meant "Specifically, in the case
where another thread might write to and read from an address,
introducing a store can change a load that may see exactly one write
into a load that may see multiple writes." Any suggestion for how to
phrase that paragraph more clearly?

Undefined behavior usually means any possible behavior is legal,
whereas a load returning undef (by the definition) because it sees
more than one write is a much more precise behavior than undefined
behavior. It would help in understanding the specification if this
point were made more precise, consistent with the discussion at
https://docs.google.com/View?docID=ddb4mhxz_24gh4ksvgh&revision=_latest

I noticed the patch was already merged into the current LLVM language
reference manual with the new memory instructions fence, cmpxchg, and
atomicrmw. Will the instructions be available in LLVM 3.0?

Hopefully, yes; the implementation is in progress.

Will all the intrinsic functions
(http://llvm.org/docs/LangRef.html#int_atomics) eventually be replaced
by the new instructions, or will two versions of concurrent memory
operations remain?

The current memory model section ends with the following discussions:

"Note that in cases where none of the atomic intrinsics are used, this
model places only one restriction on IR transformations on top of what
is required for single-threaded execution: introducing a store to a
byte which might not otherwise be stored to can introduce undefined
behavior.... "

Why is introducing additional loads allowed? For example, in a program
that already contains two unordered stores to a location l, if we
introduce an unordered load from l, the load may see more than one
store, so it must return undef.

The intent of the model is that if the behavior of a program doesn't
depend on the value of a load, it's okay if it is undef. This allows,
for example, hoisting loads which are conditionally executed out of
loops.

In C++0x, if a program has a data race, then its behavior is
undefined. Does the LLVM memory model define program behavior along
similar lines?

No.

My second question is whether the "undefined behavior" in "...
introducing a store to a byte which might not otherwise be stored to
can introduce undefined behavior...." is the same as the ``undef'' in
the definition of R_{byte}, which can be undef in several cases. Does
that undef mean the LLVM undef value
(http://llvm.org/docs/LangRef.html#undefvalues)? If "undefined
behavior" means a program can do anything, why does R_{byte} have to
return undef? Thanks.

"can introduce undefined behavior", I meant "Specifically, in the case
where another thread might write to and read from an address,
introducing a store can change a load that may see exactly one write
into a load that may see multiple writes." Any suggestion for how to
phrase that paragraph more clearly?

Undefined behavior usually means any possible behavior is legal,
whereas a load returning undef (by the definition) because it sees
more than one write is a much more precise behavior than undefined
behavior. It would help in understanding the specification if this
point were made more precise, consistent with the discussion at
https://docs.google.com/View?docID=ddb4mhxz_24gh4ksvgh&revision=_latest

K; I'll take another look.

-Eli

Yes, they will be replaced.

-Eli

The current memory model section ends with the following discussions:

"Note that in cases where none of the atomic intrinsics are used, this
model places only one restriction on IR transformations on top of what
is required for single-threaded execution: introducing a store to a
byte which might not otherwise be stored to can introduce undefined
behavior.... "

Why is introducing additional loads allowed? For example, in a program
that already contains two unordered stores to a location l, if we
introduce an unordered load from l, the load may see more than one
store, so it must return undef.

The intent of the model is that if the behavior of a program doesn't
depend on the value of a load, it's okay if it is undef. This allows,
for example, hoisting loads which are conditionally executed out of
loops.

In C++0x, if a program has a data race, then its behavior is
undefined. Does the LLVM memory model define program behavior along
similar lines? If so, hoisting additional loads out of a loop that may
not execute at runtime (say the loop trip count is 0) can turn a
program from defined to undefined, which would let later compilation
passes transform the program arbitrarily, since their input program is
undefined. Is that expected? (I am not sure whether existing LLVM
transformations actually do this.)

Hoisting loads out of non-executing loops is intended to preserve
defined behavior. This is intentionally different from the C++0x
memory model, along the lines that Hans Boehm suggested in
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2338.html: "a
compiler that introduces a racing load at the machine code level, and
then discards the result of the load, generally will not have changed
the semantics of the program. … The author of C++ source code provides
an implicit promise to the compiler: There are no data races on
ordinary variables. This allows the compiler to safely perform some
common optimizations that would otherwise be invalid. If the compiler
itself introduces potentially racing loads in intermediate code, it
can, and has to, remember that these loads may race, and have to be
treated as such by subsequent optimizations."

In LLVM, we've decided to say that all loads may race rather than
tracking which loads were in the original source vs. were introduced
by optimizations. If that causes optimization problems in the future,
it could be revisited, but for now it provides a simpler model for the
IR.

The C++ and Java memory models impose restrictions on locks and
unlocks, such as that a thread releasing a lock must have acquired it,
or that the number of lock operations in a thread must be at least as
large as the number of unlocks, in order to enable some optimizations,
for example simplifying trylocks
(http://www.hpl.hp.com/techreports/2008/HPL-2008-56.html) and moving
some instructions inside lock acquires
(http://www.hpl.hp.com/techreports/2005/HPL-2005-217R1.html). What is
the rationale for the LLVM memory model ignoring such restrictions?

The LLVM memory model doesn't include locks at all, so it can't
include restrictions about them. One would implement locks, whether
C++0x, posix, or something else, using the atomic instructions in
Eli's patches, along with syscalls like futex or kqueue.

C++0x's restrictions on locks allow an implementor to choose cheaper
LLVM instructions than they'd otherwise have to. For example, the fact
that try_lock is allowed to fail spuriously allows lock and try_lock
to be implemented with acquire cmpxchg instructions rather than
acq_rel cmpxchg instructions.

The current cmpxchg instruction still leaves a small amount of
performance on the table by having only a single ordering argument,
rather than the two arguments that C++0x takes. In the case of
try_lock, I expect this to manifest as an unnecessary fence on the
failing path on ARM and PPC. Only time will tell whether this matters
in practice, but if it does matter, I expect it to be straightforward
to add the second argument.
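
A C++0x-flavoured sketch of what Jeffrey describes (my own illustration; the
lock structure and names are hypothetical):

#include <atomic>

struct SpinLock {
  std::atomic<bool> locked{false};

  // try_lock is allowed to fail spuriously, so a weak compare-exchange with
  // an acquire success ordering is sufficient; nothing needs acq_rel here.
  bool try_lock() {
    bool expected = false;
    // The two-ordering form: acquire on success, relaxed on failure. A
    // cmpxchg with a single ordering argument would effectively use acquire
    // on the failing path as well, which on ARM/PPC can show up as an
    // unnecessary fence after a failed attempt.
    return locked.compare_exchange_weak(expected, true,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed);
  }

  void unlock() { locked.store(false, std::memory_order_release); }
};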

Jeffrey

In the definition of 'monotonic' ordering,
... "If an address is written monotonically by one thread, and other
threads monotonically read that address repeatedly, the other threads
must eventually see the write"...

Does this mean that if a thread performs multiple monotonic writes,
monotonic reads from other threads must see all of them? Intuitively,
it seems fine for a read to ``miss'' some of the writes, as long as
the writes seen are monotonic in the sense that later reads see either
the same write that earlier reads saw, or some write that comes
monotonically after the writes already seen.

In the case where there is only one monotonic write, what does
'eventually' mean? Can we know that the write must be seen once some
condition holds, for example after a certain number of instructions
have executed, or after the writing thread executes a fence, ...?

The C++ memory model does not have ``unordered'' or "monotonic", but
it has "modification ordering" (is that the same as the relaxed atomic
operations the LLVM IR mentions?). If I am compiling C++ to LLVM, can
all such relaxed atomic operations be compiled to monotonic? And when
should "unordered" be used?

In the definition of 'monotonic' ordering,
... "If an address is written monotonically by one thread, and other
threads monotonically read that address repeatedly, the other threads
must eventually see the write"...

It's supposed to mean that if you have something that looks like a
spinloop with monotonic reads, it shouldn't spin forever if the value
changes. I'll take another look at rewording that.
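
For example, the guarantee is aimed at code shaped roughly like this (C++0x
notation; the assumption that 'monotonic' corresponds to
memory_order_relaxed here is mine, based on the surrounding discussion):

#include <atomic>

std::atomic<bool> ready{false};

void writer() {
  ready.store(true, std::memory_order_relaxed);   // a "monotonic" store
}

void reader() {
  // The forward-progress wording is meant to guarantee that this loop cannot
  // spin forever once the writer's store has executed; the new value must
  // eventually become visible to the monotonic loads.
  while (!ready.load(std::memory_order_relaxed)) {
    // spin
  }
}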

Does this mean that if a thread performs multiple monotonic writes,
monotonic reads from other threads must see all of them? Intuitively,
it seems fine for a read to ``miss'' some of the writes, as long as
the writes seen are monotonic in the sense that later reads see either
the same write that earlier reads saw, or some write that comes
monotonically after the writes already seen.

In the case where there is only one monotonic write, what does
'eventually' mean? Can we know that the write must be seen once some
condition holds, for example after a certain number of instructions
have executed, or after the writing thread executes a fence, ...?

The C++ memory model does not have ``unordered'' or "monotonic", but
it has "modification ordering" (is that the same as the relaxed atomic
operations the LLVM IR mentions?). If I am compiling C++ to LLVM, can
all such relaxed atomic operations be compiled to monotonic? And when
should "unordered" be used?

http://llvm.org/docs/Atomics.html is an attempt to make things much
more straightforward than the stuff in LangRef.

-Eli

In the definition of 'monotonic' ordering,
... "If an address is written monotonically by one thread, and other
threads monotonically read that address repeatedly, the other threads
must eventually see the write"...

It's supposed to mean that if you have something that looks like a
spinloop with monotonic reads, it shouldn't spin forever if the value
changes. I'll take another look at rewording that.

Does this mean that if a thread performs multiple monotonic writes,
monotonic reads from other threads must see all of them? Intuitively,
it seems fine for a read to ``miss'' some of the writes, as long as
the writes seen are monotonic in the sense that later reads see either
the same write that earlier reads saw, or some write that comes
monotonically after the writes already seen.

In the case where there is only one monotonic write, what does
'eventually' mean? Can we know that the write must be seen once some
condition holds, for example after a certain number of instructions
have executed, or after the writing thread executes a fence, ...?

The C++ memory model does not have ``unordered'' or "monotonic", but
it has "modification ordering" (is that the same as the relaxed atomic
operations the LLVM IR mentions?). If I am compiling C++ to LLVM, can
all such relaxed atomic operations be compiled to monotonic? And when
should "unordered" be used?

http://llvm.org/docs/Atomics.html is an attempt to make things much
more straightforward than the stuff in LangRef.

This is cool.

At the end of the "optimization outside atomic" section there is a
discussion about "returning undef". Is the following correct?
* a store/store data race in LLVM leads to undefined behavior,
* a store/load data race does not result in undefined behavior, but
the load returns undef,
* if two memory accesses form a data race, then at least one of them
is NonAtomic.

My question is: suppose a load L and a store S have a data race, L
runs earlier than S in some execution, and L is well ordered with
earlier writes by happens-before. At the point when L runs, but S has
not run yet, should L still return undef, or whatever write it can see
without races so far?

Although unsynchronized writes from other threads may propagate to a
given thread in different orders, the writes that a read can see from
a different thread should already have executed before the read (in
global time). So in the above case, it seems fine to allow the load
to return a 'defined' value. Is there any case that makes 'undef'
possible?

In the definition of 'monotonic' ordering,
... "If an address is written monotonically by one thread, and other
threads monotonically read that address repeatedly, the other threads
must eventually see the write"...

It's supposed to mean that if you have something that looks like a
spinloop with monotonic reads, it shouldn't spin forever if the value
changes. I'll take another look at rewording that.

Does this mean that if a thread performs multiple monotonic writes,
monotonic reads from other threads must see all of them? Intuitively,
it seems fine for a read to ``miss'' some of the writes, as long as
the writes seen are monotonic in the sense that later reads see either
the same write that earlier reads saw, or some write that comes
monotonically after the writes already seen.

In the case where there is only one monotonic write, what does
'eventually' mean? Can we know that the write must be seen once some
condition holds, for example after a certain number of instructions
have executed, or after the writing thread executes a fence, ...?

The C++ memory model does not have ``unordered'' or "monotonic", but
it has "modification ordering" (is that the same as the relaxed atomic
operations the LLVM IR mentions?). If I am compiling C++ to LLVM, can
all such relaxed atomic operations be compiled to monotonic? And when
should "unordered" be used?

http://llvm.org/docs/Atomics.html is an attempt to make things much
more straightforward than the stuff in LangRef.

This is cool.

At the end of the "optimization outside atomic" section there is a
discussion about "returning undef". Is the following correct?
* a store/store data race in LLVM leads to undefined behavior,

What exactly is a store-store "race"? That sounds wrong.

* a store/load data race does not result in undefined behavior, but
the load returns undef,
* if two memory accesses form a data race, then at least one of them
is NonAtomic.

The model isn't really defined in terms of races, but these two sound
roughly correct.

My question is: suppose a load L and a store S have a data race, L
runs earlier than S in some execution, and L is well ordered with
earlier writes by happens-before. At the point when L runs, but S has
not run yet, should L still return undef, or whatever write it can see
without races so far?

Although unsynchronized writes from other threads may propagate to a
given thread in different orders, the writes that a read can see from
a different thread should already have executed before the read (in
global time). So in the above case, it seems fine to allow the load
to return a 'defined' value. Is there any case that makes 'undef'
possible?

If there is a load and a store to the same address with no
happens-before relationship, the load returns undef. "L runs earlier
than S in an execution" doesn't make sense; if the load and store
aren't atomic, the compiler is allowed to, for example, rematerialize
the load, so that it happens both before and after the store.
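
A sketch of what "rematerialize the load" can look like in practice
(hand-written C++, names invented for illustration):

// Source: a single non-atomic load of *p, used twice.
int use_twice(int *p, int a, int b) {
  int v = *p;
  return (v + a) * (v + b);
}

// The compiler may instead reload *p at each use (for example, to reduce
// register pressure). If another thread stores to *p in between, the two
// reads can observe different values, so there is no single point at which
// "the load" happened; this is why "L runs earlier than S" is not well
// defined for non-atomic accesses.
int use_twice_rematerialized(int *p, int a, int b) {
  return (*p + a) * (*p + b);
}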

-Eli

In the definition of 'monotonic' ordering,
... "If an address is written monotonically by one thread, and other
threads monotonically read that address repeatedly, the other threads
must eventually see the write"...

It's supposed to mean that if you have something that looks like a
spinloop with monotonic reads, it shouldn't spin forever if the value
changes. I'll take another look at rewording that.

Does this mean that if a thread performs multiple monotonic writes,
monotonic reads from other threads must see all of them? Intuitively,
it seems fine for a read to ``miss'' some of the writes, as long as
the writes seen are monotonic in the sense that later reads see either
the same write that earlier reads saw, or some write that comes
monotonically after the writes already seen.

In the case where there is only one monotonic write, what does
'eventually' mean? Can we know that the write must be seen once some
condition holds, for example after a certain number of instructions
have executed, or after the writing thread executes a fence, ...?

The C++ memory model does not have ``unordered'' or "monotonic", but
it has "modification ordering" (is that the same as the relaxed atomic
operations the LLVM IR mentions?). If I am compiling C++ to LLVM, can
all such relaxed atomic operations be compiled to monotonic? And when
should "unordered" be used?

http://llvm.org/docs/Atomics.html is an attempt to make things much
more straightforward than the stuff in LangRef.

This is cool.

At the end of the "optimization outside atomic" section there is a
discussion about "returning undef". Is the following correct?
* a store/store data race in LLVM leads to undefined behavior,

What exactly is a store-store "race"? That sounds wrong.

* a store/load data race does not result in undefined behavior, but
the load returns undef,
* if two memory accesses form a data race, then at least one of them
is NonAtomic.

The model isn't really defined in terms of races, but these two sound
roughly correct.

My question is: suppose a load L and a store S have a data race, L
runs earlier than S in some execution, and L is well ordered with
earlier writes by happens-before. At the point when L runs, but S has
not run yet, should L still return undef, or whatever write it can see
without races so far?

Although unsynchronized writes from other threads may propagate to a
given thread in different orders, the writes that a read can see from
a different thread should already have executed before the read (in
global time). So in the above case, it seems fine to allow the load
to return a 'defined' value. Is there any case that makes 'undef'
possible?

If there is a load and a store to the same address with no
happens-before relationship, the load returns undef. "L runs earlier
than S in an execution" doesn't make sense; if the load and store
aren't atomic, the compiler is allowed to, for example, rematerialize
the load, so that it happens both before and after the store.

I agree that if we consider all possible executions of the program,
"L runs earlier than S in an execution" doesn't make sense, since L
and S can run in any order.

My confusion was more about a concrete execution of the program. If
the execution schedules S before L, then it is reasonable for L to
return any value (at least the write from S, or the most recent write
that happens before L), so undef is fine with me.

If the execution runs L earlier than S, and L can only see one write
up to that point, can L still return any value? L can return any value
in terms of the C++ DRF memory model, because L eventually has a data
race with the future S, which makes the program's behavior undefined,
so reading any value is an acceptable outcome.

In this case, LLVM allows a particular set of behaviors, namely that
the load can return any value; but returning a value that no earlier
write stored does not seem consistent with that execution. Did I
misunderstand a concept here?

For non-atomic loads, we don't guarantee that a load corresponds to a
single hardware memory access. For example, a 64-bit load on x86-32
will generally be split. See also
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2176.html#undefined .
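
To make the splitting concrete, here is a hand-written C++ approximation of
how a non-atomic 64-bit load is typically lowered on a 32-bit little-endian
target (illustrative only, not generated code):

#include <cstdint>

uint64_t shared;   // plain, non-atomic, written concurrently by another thread

uint64_t read_shared_split() {
  const uint32_t *halves = reinterpret_cast<const uint32_t *>(&shared);
  uint32_t lo = halves[0];
  uint32_t hi = halves[1];   // a concurrent store can land between the loads
  return (static_cast<uint64_t>(hi) << 32) | lo;
}

// A racing read can therefore observe a mix of two different stores, i.e. a
// value that no single store ever wrote, which is one reason the model says
// such a racing load returns undef.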

-Eli

In the definition of 'monotonic' ordering,
... "If an address is written monotonically by one thread, and other
threads monotonically read that address repeatedly, the other threads
must eventually see the write"...

It's supposed to mean that if you have something that looks like a
spinloop with monotonic reads, it shouldn't spin forever if the value
changes. I'll take another look at rewording that.

Does this mean that if a thread performs multiple monotonic writes,
monotonic reads from other threads must see all of them? Intuitively,
it seems fine for a read to ``miss'' some of the writes, as long as
the writes seen are monotonic in the sense that later reads see either
the same write that earlier reads saw, or some write that comes
monotonically after the writes already seen.

In the case where there is only one monotonic write, what does
'eventually' mean? Can we know that the write must be seen once some
condition holds, for example after a certain number of instructions
have executed, or after the writing thread executes a fence, ...?

The C++ memory model does not have ``unordered'' or "monotonic", but
it has "modification ordering" (is that the same as the relaxed atomic
operations the LLVM IR mentions?). If I am compiling C++ to LLVM, can
all such relaxed atomic operations be compiled to monotonic? And when
should "unordered" be used?

http://llvm.org/docs/Atomics.html is an attempt to make things much
more straightforward than the stuff in LangRef.

This is cool.

At the end of the "optimization outside atomic" section there is a
discussion about "returning undef". Is the following correct?
* a store/store data race in LLVM leads to undefined behavior,

What exactly is a store-store "race"? That sounds wrong.

If a program has two stores that are not ordered by a happens-before
relation, does the program have undefined behavior?

At the end of the "optimization outside atomic" section

...
However, LLVM is not allowed to transform the former to the latter: it
could introduce undefined behavior if another thread can access x at
the same time.
...

Is the ``undefined behavior'' introduced by the additional write "x =
xtemp"? For example, if a[i] is always false in the original program,
there is no such write.
But I don't think "int xtemp = x;" can introduce undefined behavior,
if LLVM does not treat a load/store race as undefined. Am I correct?
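
For reference, the transformation the quoted text is talking about looks
roughly like this (my reconstruction in C++; see
http://llvm.org/docs/Atomics.html for the exact example):

void before(int *a, int &x, int n) {
  for (int i = 0; i < n; ++i)
    if (a[i])
      x += 1;            // x is stored only if some a[i] is nonzero
}

// The version LLVM is not allowed to produce:
void after(int *a, int &x, int n) {
  int xtemp = x;         // the speculative load itself is allowed
  for (int i = 0; i < n; ++i)
    if (a[i])
      xtemp += 1;
  x = xtemp;             // unconditional store: introduced even when no a[i]
                         // is nonzero, and this introduced store is what can
                         // break another thread's otherwise race-free access
}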

In the definition of 'monotonic' ordering,
... "If an address is written monotonically by one thread, and other
threads monotonically read that address repeatedly, the other threads
must eventually see the write"...

It's supposed to mean that if you have something that looks like a
spinloop with monotonic reads, it shouldn't spin forever if the value
changes. I'll take another look at rewording that.

Does this mean that if a thread performs multiple monotonic writes,
monotonic reads from other threads must see all of them? Intuitively,
it seems fine for a read to ``miss'' some of the writes, as long as
the writes seen are monotonic in the sense that later reads see either
the same write that earlier reads saw, or some write that comes
monotonically after the writes already seen.

In the case where there is only one monotonic write, what does
'eventually' mean? Can we know that the write must be seen once some
condition holds, for example after a certain number of instructions
have executed, or after the writing thread executes a fence, ...?

The C++ memory model does not have ``unordered'' or "monotonic", but
it has "modification ordering" (is that the same as the relaxed atomic
operations the LLVM IR mentions?). If I am compiling C++ to LLVM, can
all such relaxed atomic operations be compiled to monotonic? And when
should "unordered" be used?

http://llvm.org/docs/Atomics.html is an attempt to make things much
more straightforward than the stuff in LangRef.

This is cool.

At the end of the "optimization outside atomic" section there is a
discussion about "returning undef". Is the following correct?
* a store/store data race in LLVM leads to undefined behavior,

What exactly is a store-store "race"? That sounds wrong.

If a program has two stores that are not ordered by a happens-before
relation, does the program have undefined behavior?

In the LLVM memory model, you cannot get undefined behavior from any
combination of non-atomic loads and stores to valid memory locations;
the undefined behavior only comes into play when you try to use an
'undef' (a load that races with such stores would return 'undef', and
anything that subsequently examines that result sees 'undef').
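
A tiny C++ sketch of the store/store case being asked about (as C++ source
this is itself a data race, so it is only meant to picture the IR-level
situation being discussed):

#include <thread>

int x = 0;   // plain, non-atomic

void writer1() { x = 1; }   // no happens-before ordering between
void writer2() { x = 2; }   // these two stores

int main() {
  std::thread t1(writer1), t2(writer2);
  t1.join();
  t2.join();
  // In the LLVM model the racing stores are not themselves undefined
  // behavior. A load that raced with them would return undef; the load
  // below is ordered after both stores by the joins, so it simply sees
  // either 1 or 2.
  return x;
}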

At the end of the "optimization outside atomic" section

...
However, LLVM is not allowed to transform the former to the latter: it
could introduce undefined behavior if another thread can access x at
the same time.
...

Is the ``undefined behavior'' introduced by the additional write "x =
xtemp"? For example, if a[i] is always false in the original program,
there is no such write.

Yes (indirectly).

But I don't think "int xtemp = x;" can introduce undefined behavior,
if LLVM does not treat a load/store race as undefined. Am I correct?

Yes, the speculative load is allowed.

-Eli