Intended semantics for ``fence seq_cst``

Hi,

TL;DR: should we add a new memory ordering to fences?

``fence seq_cst`` is currently used to represent two things:
- GCC-style builtin ``__sync_synchronize()`` [0][1].
- C11/C++11's sequentially-consistent thread fence
``std::atomic_thread_fence(std::memory_order_seq_cst)`` [2].

As far as I understand:
- The former orders all memory and emits an actual fence instruction.
- The latter only provides a total order with other
sequentially-consistent loads and stores, which means that it's
possible to move non-sequentially-consistent loads and stores around
it.

The GCC-style builtin effectively does the same as the C11/C++11
sequentially-consistent thread fence, surrounded by compiler barriers
(``call void asm sideeffect "", "~{memory}"``).

The LLVM language reference [3] describes ``fence seq_cst`` in terms
of the C11/C++11 primitive, but it looks like LLVM's codebase treats
it like the GCC-style builtin. That's correct, since the latter is
strictly stronger, but it seems desirable to represent the GCC-style
builtin with a ninth LLVM-internal memory ordering that's stricter
than ``llvm::SequentiallyConsistent``. ``fence seq_cst`` could then
fully utilize C11/C++11's semantics without breaking the GCC-style
builtin.

From C11/C++11's point of view this other memory ordering isn't useful
because the primitives offered are sufficient to express valid and
performant code, but I believe that LLVM needs this new memory
ordering to accurately represent the GCC-style builtin while fully
taking advantage of the C11/C++11 memory model.

Am I correct?

I don't think it's worth implementing just yet since C11/C++11 are
still relatively new, but I'd like to be a bit forward looking for
PNaCl's sake.

Thanks,

JF

[0] http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/_005f_005fsync-Builtins.html
[1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36793
[2] C++11 Standard section 29.8 - Fences
[3] http://llvm.org/docs/LangRef.html#fence-instruction

> Hi,
>
> TL;DR: should we add a new memory ordering to fences?
>
> ``fence seq_cst`` is currently used to represent two things:
> - GCC-style builtin ``__sync_synchronize()`` [0][1].
> - C11/C++11's sequentially-consistent thread fence
> ``std::atomic_thread_fence(std::memory_order_seq_cst)`` [2].
>
> As far as I understand:
> - The former orders all memory and emits an actual fence instruction.
> - The latter only provides a total order with other
> sequentially-consistent loads and stores, which means that it's
> possible to move non-sequentially-consistent loads and stores around
> it.

It still acts as an acquire/release fence for any other atomic
instruction. For non-atomic instructions, if you have a race, the
behavior is undefined anyway, so you can't get a stronger guarantee
than what "fence seq_cst" provides.

I think "fence seq_cst" is completely equivalent to
__sync_synchronize(), but you could convince me otherwise by providing
a sample program for which there's a difference.

struct {
  volatile int flag;
  int value;
} s;

int get_value_when_ready() {
  while (s.flag) ;      /* spin until the writer clears flag */
  __sync_synchronize(); /* keep the load below from moving above the spin */
  return s.value;
}

This is "valid" legacy code on some processors, yet it's not valid to
replace __sync_synchronize() with an atomic_thread_fence because, in
theory, LLVM could hoist the load of s.value above the spin loop. In
practice it currently doesn't, but it may in the future if my
understanding is correct.

My main point is that LLVM needs to support code that was written
before C and C++ got a memory model; it doesn't matter that it's
undefined behavior and relies on a GCC-style builtin to be "correct".
The current standards offer all you need to write new code that
expresses the above intended behavior, but __sync_synchronize isn't a
1:1 mapping to atomic_thread_fence(seq_cst): it has stronger
semantics, and that constrains which optimizations can be done on
``fence seq_cst``. LLVM therefore probably wants to distinguish the
two, so that it can fully optimize C++11 code without leaving legacy
code in a bad position.

Ok, so the semantics of your fence would be that it's a volatile
memory access (http://llvm.org/docs/LangRef.html#volatile-memory-accesses),
and that it provides happens-before edges for volatile accesses in the
same way that a seq_cst fence does for atomic accesses.

FWIW, I don't think we should add that, because it's an attempt to
define behavior that's undefined for other reasons (the data race on
the volatile).

If you (PNaCl?) explicitly want to define the behavior of legacy code
that used 'volatile' for synchronization (which has always been
undefined behavior outside of assembly, even before the memory model;
it just happened to work in many cases), could you compile volatile
accesses to 'atomic volatile monotonic' accesses? Then the normal
memory model would apply, and I don't think the instructions emitted
would change at all on the platforms I'm familiar with.

Jeffrey

> FWIW, I don't think we should add that, because it's an attempt to
> define behavior that's undefined for other reasons (the data race on
> the volatile).

I had a discussion with Chandler and others, and something I
misunderstood was pointed out: it is not an explicit goal of LLVM to
support, or continue supporting, legacy code that did whatever it had
to in order to express working concurrent code. It may happen to work
now, but unless enough LLVM users express interest this may break one
day, and __sync_synchronize may not order anything (it may just emit a
fence without forcing any ordering). It was also pointed out that
__sync_synchronize doesn't have a clear spec, and that implementing it
properly in LLVM may not be tractable or worthwhile.

> If you (PNaCl?) explicitly want to define the behavior of legacy code
> that used 'volatile' for synchronization (which has always been
> undefined behavior outside of assembly, even before the memory model;
> it just happened to work in many cases), could you compile volatile
> accesses to 'atomic volatile monotonic' accesses? Then the normal
> memory model would apply, and I don't think the instructions emitted
> would change at all on the platforms I'm familiar with.

I actually go further for now and promote volatiles to seq_cst
atomics. This promotion happens after opt, but before most
architecture-specific optimizations. I could have used relaxed
ordering, but as a conservative first approach went with seq_cst. For
PNaCl it's correct because we only support 8/16/32/64-bit types,
require natural alignment (though we should provide better
diagnostics), allow packed volatiles to be split, don't allow direct
device access, and don't allow mapping the same physical address at
multiple virtual addresses. We could relax this at a later time, but
we'd really be better off if C++11's better-defined memory model were
used in the code.

This thread (and the side discussion) answered my concern, though with
a solution I didn't expect: trying to make user code that doesn't use
volatile or atomic, but does use __sync_synchronize, work as intended
isn't one of LLVM's goals.

You will need to do this in the frontend. The target independent optimizers
are allowed to use the memory model.

> You will need to do this in the frontend. The target independent
> optimizers are allowed to use the memory model.

We discussed doing this, and concluded that doing it pre-opt was overly restrictive on correct code. Doing it post-opt bakes the behavior into the portable code, so in a way it’ll be reliably broken but won’t penalize good code.

FWIW it’s easy to change from one to the other: move one line of code. I hope my explanation makes sense, and it doesn’t look like I’m dismissing your comment on implementation issues.

It doesn't really make sense to me. The most likely way for the optimizer
to break any of this is in the middle end. By only fixing it afterward, I
don't see what the advantage of fixing it at all is...

As Jeffrey pointed out, the penalty is relatively low on x86.

> It doesn't really make sense to me. The most likely way for the
> optimizer to break any of this is in the middle end. By only fixing
> it afterward, I don't see what the advantage of fixing it at all is...

Actually I think you're right: we also transform atomics to stable
intrinsics, which we then transform back to LLVM IR on the user-side. Using
these intrinsics pre-opt would be detrimental to overall performance, but
doing volatile->atomic pre-opt and doing atomic->intrinsic post-opt should
be OK.

> As Jeffrey pointed out, the penalty is relatively low on x86.

Yes, we discussed performance of our approach extensively for x86-32,
x86-64, ARM, MIPS and other potential targets. Our current approach
isn't the best one for performance, but it's definitely conservative
and potentially more portable in the face of bugs, while still being
fast-ish and allowing us to loosen things up in future releases. It
seems like a nice tradeoff for a first launch.