__sync_synchronize doesn't generate a memory barrier

Why does __sync_synchronize() not generate an mfence instruction on x86
and x86_64? I recognize that Apple gcc does not emit one either, but I
believe that is a bug in Apple gcc as well. More recent versions of
gcc implement the correct behavior (mfence on x86_64 and lock orl $0,
(%esp) on x86), but clang emits no code at all for this operation.

LLVM supports an instruction that emits the correct memory barrier:
  call void @llvm.memory.barrier(i1 true, i1 true, i1 true, i1 true, i1 true)
but Clang uses the following, which seems to have no effect on x86:
  call void @llvm.memory.barrier(i1 true, i1 true, i1 true, i1 true, i1 false)

This matters for multi-threaded code, since memory barriers are the
only way we can force an ordering on loads and stores.


Adding Owen :slight_smile:


I think Jim worked on this.


If you have a standalone __sync_synchronize that's failing to generate an mfence, that is almost certainly a bug. That said, there are a lot of circumstances where mfences aren't actually necessary. x86 implements a very strong memory model ([1] and [2]), guaranteeing the following:

  • Loads are not reordered with other loads.
  • Stores are not reordered with other stores.
  • Stores are not reordered with older loads.
  • In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
  • In a multiprocessor system, stores to the same location have a total order.
  • In a multiprocessor system, locked instructions have a total order.
  • Loads and stores are not reordered with locked instructions.

Based on the last one, it is legal to eliminate mfences immediately preceding and immediately following locked instructions, typically in the mfence/atomic-op/mfence pattern. The compiler does this automatically, and that could be the cause of your missing mfence.

The only context where you really want to generate an mfence is where you need to prevent a store and a subsequent load (not to the same address) from being commuted: store-load reordering is the one relaxation the model above permits, and it is exactly what breaks sequentially consistent algorithms such as Dekker's mutual exclusion.


[1] http://www.multicoreinfo.com/research/papers/2008/damp08-intel64.pdf
[2] http://support.amd.com/us/Processor_TechDocs/24593.pdf

I've been most active on it for ARM, but I'm somewhat familiar with how x86 handles this stuff, yes. I strongly suspect this behaviour is entirely for compatibility purposes with GCC.

Other than that, I'm not aware of any reason we shouldn't generate actual instructions for the intrinsics.