Meaning of loads/stores marked both atomic and volatile

Hi llvm-dev,

I read about volatile and atomic modifiers in the docs[1], and I feel
they make sense to me individually.
However, I noticed that store[2] and load[3] instructions can be
marked as both volatile and atomic.

What's the use case for using both volatile and atomic on an
instruction? Isn't it the case that atomic implies volatile? I guess
it isn't, but I don't understand why.

I'm guessing that while both volatile and atomic restrict reorderings,
volatile prevents any kind of load or store elimination optimizations,
but atomic doesn't make such a guarantee.
E.g. I suspect that an atomic load, which can be implemented as a pair
of a plain load and a fence instruction, can be optimized down to only
the fence instruction. If it were both volatile and atomic, such an
optimization would have been illegal.
In other words, probably very imprecisely: volatile tells the compiler
what it cannot do, while atomic tells the CPU what it should do to
guarantee a certain memory model (and it also implies extra constraints
on what the compiler can do).

My other guess is that it's only to order an 'atomic' instruction with
a 'volatile' instruction, thus making the former 'atomic volatile'.

I'd appreciate links to any resources on the topic.

[1] https://llvm.org/docs/LangRef.html#volatile-memory-accesses
[2] https://llvm.org/docs/LangRef.html#store-instruction
[3] https://llvm.org/docs/LangRef.html#i-load

Hi Paweł,

What's the use case for using both volatile and atomic on an
instruction? Isn't it the case that atomic implies volatile?

You pretty much got the semantics right straight after this. The
compiler isn't allowed to add, remove or reorder volatile accesses,
but it may do so for some atomics, as long as no other thread could
observe the difference.

There are only a couple of valid uses for volatile these days (since
everyone realised that using it for inter-thread synchronization was a
bad idea); the main one is talking to memory-mapped hardware in an OS
kernel or something. I could see someone using an atomic volatile
there for something like talking to a DMA engine: write your buffer
with normal instructions, then do a store-volatile-release to tell the
DMA to start copying. I've not checked if that actually works for any
architectures I know about though.

Cheers.

Tim.

I would say there are transforms that can be done on atomics that can't be done on volatile memory ops. For example, LLVM should be able to mem2reg unescaped atomics because it knows they cannot be modified by other threads, whereas volatile operations will pin things in memory, for use cases that are mostly outside the abstract model.

Hi Tim,

There are only a couple of valid uses for volatile these days

Do you mean volatile used alone or also the combination 'atomic volatile'?

I think that 'atomic volatile' is very useful. Consider the following
pseudo-code examples, where all loads and stores are atomic (with some
memory ordering constraints) but not volatile.

Example 1.

// shared variable
int i = 0;

// call this method from 10 threads
void foo(){
    int i = rand() % 2;
    int j = i;
    while(i == j){
        printf("In the loop\n");
    }
}

I claim that the loop can be optimized to an infinite loop by a
compiler, because apparently j == i at all times in a single threaded
program.
If the loads and stores (particularly the read in the loop predicate)
were also marked as volatile, that wouldn't have been possible.
Is this correct?

Example 2.

// shared variable
int i = 0;

void signalHandler(){
    i = 1;
}

void main(){
    while(i == 0){
        printf("In the loop\n");
    }
}

Here I also claim that the loop can be optimized into an infinite loop
if volatile is not used.
Is this correct?

Hi Reid,

I mean a local variable that is _Atomic qualified whose address does not
escape the function that allocates it. An unused _Atomic int, for example,
can be removed. If it were volatile, the storage and any loads and stores
would have to be preserved.

In other words, atomics come with a threading model, semantics, and rules
that permit certain transformations. Volatile still acts as an escape hatch
to throw that out the window.

> There are only a couple of valid uses for volatile these days
Do you mean volatile used alone or also the combination 'atomic volatile'?

Volatile alone.

Example 1.

// shared variable
int i = 0;

// call this method from 10 threads
void foo(){
    int i = rand() % 2;
    int j = i;
    while(i == j){
        printf("In the loop\n");
    }
}

I claim that the loop can be optimized to an infinite loop by a
compiler, because apparently j == i at all times in a single threaded
program.

The global variable i is shadowed by the local there and I can't be
sure exactly what you intended so I won't comment on it directly.

But in general terms atomic LLVM operations with at least "monotonic"
ordering forbid unrestricted store-forwarding within a thread (which I
think would be the first step in eliminating the loop). See
https://llvm.org/docs/LangRef.html#atomic-memory-ordering-constraints
where it's explicitly called out: "If an address is written
monotonic-ally by one thread, and other threads monotonic-ally read
that address repeatedly, the other threads must eventually see the
write."

Example 2.

// shared variable
int i = 0;

void signalHandler(){
    i = 1;
}

void main(){
    while(i == 0){
        printf("In the loop\n");
    }
}

Here I also claim that the loop can be optimized into an infinite loop
if volatile is not used.

This is an interesting one. Monotonic atomic is again sufficient to
synchronize with another thread (or signal handler I'd argue). But if
this is a signal handler within a thread then that is actually one of
the other valid uses of volatile in C (nearly, it has to be a
sig_atomic_t too). In LLVM IR I think you'd use an atomic with
syncscope("singlethread") for that instead.

Cheers.

Tim.

The global variable i is shadowed by the local there and I can't be
sure exactly what you intended so I won't comment on it directly.

My mistake. I intended it to be a write to the shared variable i. Let me fix it.

Example 1.

// shared variable
int i = 0;

// call this method from 10 threads
void foo(){
    i = rand() % 2; // was: int i = rand() % 2;
    int j = i;
    while(i == j){
        printf("In the loop\n");
    }
}

But in general terms atomic LLVM operations with at least "monotonic"
ordering forbid unrestricted store-forwarding within a thread

"If an address is written
monotonic-ally by one thread, and other threads monotonic-ally read
that address repeatedly, the other threads must eventually see the
write."

Ok, let's say that in Example 1 monotonic atomic prevents the loop
optimization because the compiler assumes the existence of other threads.
(The compiler effectively assumes the existence of other threads the
moment one starts using at least monotonic atomic loads/stores?)

Example 2.

// shared variable
int i = 0;

void signalHandler(){
    i = 1;
}

void main(){
    while(i == 0){
        printf("In the loop\n");
    }
}

This is an interesting one. Monotonic atomic is again sufficient to
synchronize with another thread (or signal handler I'd argue).

In Example 2 let's consider a signal handler in a single-threaded situation.
If monotonic atomic prevents the loop optimization as in Example 1, then
I'd say it does the same in Example 2.
That's because the compiler cannot know that the 'signalHandler' function
is a signal handler, so it must assume it might be executed in another
thread.

Hi,

Ok, let's say that in Example 1 monotonic atomic prevents the loop
optimization because the compiler assumes the existence of other threads.
(The compiler effectively assumes the existence of other threads the
moment one starts using at least monotonic atomic loads/stores?)

I think that's pretty accurate, though the assumptions are of course
limited to those monotonic (or stronger) stores.

In Example 2 let's consider a signal handler in a single-threaded situation.
If monotonic atomic prevents the loop optimization as in Example 1, then
I'd say it does the same in Example 2.

I agree, but in certain theoretical situations a general monotonic
operation might be stronger than what's actually needed there. You
could imagine a GPU or something needing a real barrier instruction to
guarantee even a monotonic store becomes visible to other cores, but
not necessarily for one in syncscope("singlethread").

Cheers.

Tim.