ASM output with JIT / codegen barriers

In working on an LLVM backend for SBCL (a lisp compiler), there are certain sequences of code that must be atomic with regards to async signals. So, for example, on x86, a single SUB on a memory location should be used, not a load/sub/store sequence. LLVM's IR doesn't currently have any way to express this kind of constraint (...and really, that's essentially impossible since different architectures have different possibilities, so I'm not asking for this...).

All I really would like is to be able to specify the exact instruction sequence to emit there. I'd hoped that inline asm would be the way to do so, but LLVM doesn't appear to support asm output when using the JIT compiler. Is there any hope for inline asm being supported with the JIT anytime soon? Or is there an alternative suggested way of doing this? I'm using llvm.atomic.load.sub.i64.p0i64 for the moment, but that's both more expensive than I need as it has an unnecessary LOCK prefix, and is also theoretically incorrect. While it generates correct code currently on x86-64, LLVM doesn't actually *guarantee* that it generates a single instruction, that's just "luck".

Additionally, I think there will be some situations where a particular ordering of memory operations is required. LLVM makes no guarantees about the order of stores, unless there's some way that you could tell the difference in a linear program. Unfortunately, I don't have a linear program, I have a program which can run signal handlers between arbitrary instructions. So, I think I'll need something like an llvm.memory.barrier of type "ss", except only affecting the codegen, not actually inserting a processor memory barrier.

Is there already some way to insert a codegen-barrier with no additional runtime cost (beyond the opportunity-cost of not being able to reorder/delete stores across the barrier)? If not, can such a thing be added? On x86, this is a non-issue, since the processor already implicitly has inter-processor store-store barriers, so using:
   call void @llvm.memory.barrier(i1 0, i1 0, i1 0, i1 1, i1 0)
is fine: it's a noop at runtime but ensures the correct sequence of stores...but I'm thinking ahead here to other architectures where that would actually require expensive instructions to be emitted.

Thanks,
James

In working on an LLVM backend for SBCL (a lisp compiler), there are
certain sequences of code that must be atomic with regards to async
signals. So, for example, on x86, a single SUB on a memory location
should be used, not a load/sub/store sequence. LLVM's IR doesn't
currently have any way to express this kind of constraint (...and
really, that's essentially impossible since different architectures
have different possibilities, so I'm not asking for this...).

Why do you want to do this? As far as I'm aware, there's no guarantee that a memory-memory SUB will be observed atomically across all processors. Remember that most processors are going to be breaking X86 instructions up into micro-ops, which might get reordered/interleaved in any number of different ways.

All I really would like is to be able to specify the exact instruction
sequence to emit there. I'd hoped that inline asm would be the way to
do so, but LLVM doesn't appear to support asm output when using the
JIT compiler. Is there any hope for inline asm being supported with
the JIT anytime soon? Or is there an alternative suggested way of
doing this? I'm using llvm.atomic.load.sub.i64.p0i64 for the moment,
but that's both more expensive than I need as it has an unnecessary
LOCK prefix, and is also theoretically incorrect. While it generates
correct code currently on x86-64, LLVM doesn't actually *guarantee*
that it generates a single instruction, that's just "luck".

It's not luck. That's exactly what the atomic intrinsics guarantee: that no other processor can observe an intermediate state of the operation. What they don't guarantee per the LangRef is sequential consistency. If you care about that, you need to use explicit fencing.

--Owen

In working on an LLVM backend for SBCL (a lisp compiler), there are
certain sequences of code that must be atomic with regards to async
signals. So, for example, on x86, a single SUB on a memory location
should be used, not a load/sub/store sequence. LLVM's IR doesn't
currently have any way to express this kind of constraint (...and
really, that's essentially impossible since different architectures
have different possibilities, so I'm not asking for this...).

Why do you want to do this? As far as I'm aware, there's no guarantee that a memory-memory SUB will be observed atomically across all processors. Remember that most processors are going to be breaking X86 instructions up into micro-ops, which might get reordered/interleaved in any number of different ways.

I'm assuming 'memory-memory' there is a typo, and we're just talking
about a 'sub' instruction with a memory destination. In that case,
I'll go further: the Intel IA-32 manual explicitly tells you that x86
processors are allowed to do the read and write halves of that single
instruction interleaved with other writes to that memory location from
other processors (see section 8.2.3.1 of [1]). =[ I can tell you from
bitter experience debugging code that assumed this: it does in fact
happen. I have watched reference counters miss both increments and
decrements because of it on both Intel and AMD systems.

All I really would like is to be able to specify the exact instruction
sequence to emit there. I'd hoped that inline asm would be the way to
do so, but LLVM doesn't appear to support asm output when using the
JIT compiler. Is there any hope for inline asm being supported with
the JIT anytime soon? Or is there an alternative suggested way of
doing this? I'm using llvm.atomic.load.sub.i64.p0i64 for the moment,
but that's both more expensive than I need as it has an unnecessary
LOCK prefix, and is also theoretically incorrect.

As I've mentioned above, I assure you the LOCK prefix matters. The
strange thing is that you think this is inefficient. Modern processors
don't lock the bus given this prefix to a 'sub' instruction; they just
lock the cache and use the coherency model to resolve the issue. This
is much cheaper than, say, an 'xchg' instruction on an x86 processor.
What is the performance problem you are actually trying to solve here?

What they don't guarantee per the LangRef is sequential consistency. If you care about that, you need to use explicit fencing.

Side note: I regret greatly that I didn't know enough of the
sequential consistency concerns here to address them more fully when I
was working on this. =/ Even explicit fencing has subtle problems with
it as currently specified. Is this causing problems for people (other
than jyasskin who clued me in on the whole matter)?

Talking about memory consistency is always painful. In particular, there's a disconnect between how consistency models think about reorderings, versus how the compiler and hardware actually perform them.

There's a natural tension between sanity (make all atomic ops sequentially consistent) and performance (no consistency by default, frontend must supply it via fences). So far we've been pursuing the latter approach: C-level atomic intrinsics are emitted as fence-atomicop-fence. The X86 backend then has some knowledge (thanks to X86's comparatively strong memory model) of instances where fences can be folded away.

--Owen

Responding to the original email...

In working on an LLVM backend for SBCL (a lisp compiler), there are
certain sequences of code that must be atomic with regards to async
signals.

Can you define exactly what 'atomic with regards to async signals'
entails? Your descriptions led me to think you may mean something
other than the POSIX definition, but maybe I'm just misinterpreting
it. Are these signals guaranteed to run in the same thread? On the
same processor? Is there concurrent code running in the address space
when they run?

<snip, this seems to be well handled on sibling email...>

Additionally, I think there will be some situations where a particular
ordering of memory operations is required. LLVM makes no guarantees
about the order of stores, unless there's some way that you could tell
the difference in a linear program. Unfortunately, I don't have a
linear program, I have a program which can run signal handlers between
arbitrary instructions. So, I think I'll need something like an
llvm.memory.barrier of type "ss", except only affecting the codegen,
not actually inserting a processor memory barrier.

The processor can reorder memory operations as well (within limits).
Consider that 'memset' to zero is often codegened to a non-temporal
store to memory. This exempts it from all ordering considerations
except for an explicit memory fence in the processor. If code were to
execute between those two instructions, the contents of the memory
could read "andthenumberofcountingshallbethree", or 'feedbeef', or
'0000...' or '1111...'; there's just no telling.

Is there already some way to insert a codegen-barrier with no
additional runtime cost (beyond the opportunity-cost of not being able
to reorder/delete stores across the barrier)? If not, can such a thing
be added? On x86, this is a non-issue, since the processor already
implicitly has inter-processor store-store barriers, so using:
call void @llvm.memory.barrier(i1 0, i1 0, i1 0, i1 1, i1 0)
is fine: it's a noop at runtime but ensures the correct sequence of
stores...but I'm thinking ahead here to other architectures where that
would actually require expensive instructions to be emitted.

But... if it *did* require expensive instructions, wouldn't you want
them?!?! The reason we don't emit on x86 is because of its memory
ordering guarantees. If it didn't have them, we would emit
instructions to impose one because otherwise the wrong thing might
happen. I think you should trust LLVM to only emit expensive
instructions to achieve the ordering semantics you specify when they
are necessary for the architecture, and file bugs if it ever fails.

The only useful thing I can think of is if you happen to know that you
execute on some "uniprocessor" with at most one thread of execution;
and thus gain memory ordering constraints beyond those which can be
assumed across an entire architecture (this is certainly true for
x86). If it is useful to leverage this to optimize codegen, it should
be at the target level, with some target options to specify that
consistency assumptions should be greater than normal. The intrinsics
and semantics should remain the same regardless.

Responding to the original email...

In working on an LLVM backend for SBCL (a lisp compiler), there are
certain sequences of code that must be atomic with regards to async
signals.

Can you define exactly what 'atomic with regards to async signals'
entails? Your descriptions led me to think you may mean something
other than the POSIX definition, but maybe I'm just misinterpreting
it. Are these signals guaranteed to run in the same thread? On the
same processor? Is there concurrent code running in the address space
when they run?

Hi, thanks everyone for all the comments. I think maybe I wasn't clear that I *only* care about atomicity w.r.t. a signal handler interruption in the same thread, *not* across threads. Therefore, many of the problems of cross-CPU atomicity are not relevant. The signal handler gets invoked via pthread_kill, and is thus necessarily running in the same thread as the code being interrupted. The memory in question can be considered thread-local here, so I'm not worried about other threads touching it at all.

I also realize I had (at least :) one error in my original email: of course, the atomic operations llvm provides *ARE* guaranteed to do the right thing w.r.t. atomicity against signal handlers...they in fact just do more than I need, not less. I'm not sure why I thought they were both more and less than I needed before, and sorry if it confused you about what I'm trying to accomplish.

Here's a concrete example, in hopes it will clarify matters:

@pseudo_atomic = thread_local global i64 0
declare i64* @alloc(i64)
declare void @do_pending_interrupt()
declare i64 @llvm.atomic.load.sub.i64.p0i64(i64* nocapture, i64) nounwind
declare void @llvm.memory.barrier(i1, i1, i1, i1, i1)

define i64* @foo() {
   ;; Note that we're in an allocation section
   store i64 1, i64* @pseudo_atomic
   ;; Barrier only to ensure instruction ordering, not needed as a true memory barrier
   call void @llvm.memory.barrier(i1 0, i1 0, i1 0, i1 1, i1 0)

   ;; Call might actually be inlined, so cannot depend upon unknown call causing correct codegen effects.
   %obj = call i64* @alloc(i64 32)
   %obj_header = getelementptr i64* %obj, i64 0
   store i64 5, i64* %obj_header ;; store obj type (5) in header word
   %obj_len = getelementptr i64* %obj, i64 1
   store i64 2, i64* %obj_len ;; store obj length (2) in length slot
   ...etc...

   ;; Check if we were interrupted:
   %res = call i64 @llvm.atomic.load.sub.i64.p0i64(i64* @pseudo_atomic, i64 1)
   %was_interrupted = icmp eq i64 %res, 2 ;; the handler sets @pseudo_atomic to 2 when a signal arrives
   br i1 %was_interrupted, label %do-interruption, label %continue

continue:
   ret i64* %obj

do-interruption:
   call void @do_pending_interrupt()
   br label %continue
}

A signal handler will check the thread-local @pseudo_atomic variable: if it was already set it will just change the value to 2 and return, waiting to be reinvoked by do_pending_interrupt at the end of the pseudo-atomic section. This is because it may get confused by the proto-object being built up in this code.

This sequence that SBCL does today with its internal codegen is basically like:
MOV <pseudo_atomic>, 1
[[do allocation, fill in object, etc]]
XOR <pseudo_atomic>, 1
JEQ continue
<<call do_pending_interrupt>>
continue:
...

The important things here are:
1) Stores cannot be migrated from within the MOV/XOR instructions to outside by the codegen.
2) There's no way an interruption can be missed: the XOR is atomic with regards to signals executing in the same thread, it's either fully executed or not (both load+store). But I don't care whether it's visible on other CPUs or not: it's a thread-local variable in any case.

Those are the two properties I'd like to get from LLVM, without actually ever invoking superfluous processor synchronization.

The processor can reorder memory operations as well (within limits).
Consider that 'memset' to zero is often codegened to a non-temporal
store to memory. This exempts it from all ordering considerations

My understanding is that processor reordering only affects what you might see from another CPU: the processor will undo speculatively executed operations if the sequence of instructions actually executed is not the sequence it predicted, so within a single CPU you should never be able to tell the difference.

But I must admit I don't know anything about non-temporal stores. Within a single thread, if I do a non-temporal store, followed by a load, am I not guaranteed to get back the value I stored?

James

Hi, thanks everyone for all the comments. I think maybe I wasn't clear that
I *only* care about atomicity w.r.t. a signal handler interruption in the
same thread, *not* across threads. Therefore, many of the problems of
cross-CPU atomicity are not relevant. The signal handler gets invoked via
pthread_kill, and is thus necessarily running in the same thread as the code
being interrupted. The memory in question can be considered thread-local
here, so I'm not worried about other threads touching it at all.

Ok, this helps make sense, but it still is confusing to phrase this as
"single threaded". While the signal handler code may execute
exclusively to any other code, it does not share the stack frame, etc.
I'd describe this more as two threads of mutually exclusive execution
or some such. I'm not familiar with what synchronization occurs as
part of the interrupt process, but I'd verify it before making too
many assumptions.

This sequence that SBCL does today with its internal codegen is basically
like:
MOV <pseudo_atomic>, 1
[[do allocation, fill in object, etc]]
XOR <pseudo_atomic>, 1
JEQ continue
<<call do_pending_interrupt>>
continue:
...

The important things here are:
1) Stores cannot be migrated from within the MOV/XOR instructions to outside
by the codegen.

Basically, this is merely the problem that x86 places a stricter
requirement on memory ordering than LLVM. Where x86 requires that
stores occur in program order, LLVM reserves the right to change that.
I have no idea if it is worthwhile to support memory barriers solely
within the flow of execution, but it seems highly suspicious. On at
least some non-x86 architectures, I suspect you'll need a memory
barrier here anyways, so it seems reasonable to place one anyways. I
*highly* doubt these fences are an overriding performance concern on
x86, do you have any benchmarks that indicate they are?

2) There's no way an interruption can be missed: the XOR is atomic with
regards to signals executing in the same thread, it's either fully executed
or not (both load+store). But I don't care whether it's visible on other
CPUs or not: it's a thread-local variable in any case.

Those are the two properties I'd like to get from LLVM, without actually
ever invoking superfluous processor synchronization.

Before we start extending LLVM to support expressing the finest points
of the x86 memory model in an optimal fashion given a single thread of
execution, I'd really need to see some compelling benchmarks that it
is a major performance problem. My understanding of the implementation
of these aspects of the x86 architecture is that they shouldn't have a
particularly high overhead.

The processor can reorder memory operations as well (within limits).
Consider that 'memset' to zero is often codegened to a non-temporal
store to memory. This exempts it from all ordering considerations

My understanding is that processor reordering only affects what you might
see from another CPU: the processor will undo speculatively executed
operations if the sequence of instructions actually executed is not the
sequence it predicted, so within a single CPU you should never be able to tell
the difference.

But I must admit I don't know anything about non-temporal stores. Within a
single thread, if I do a non-temporal store, followed by a load, am I not
guaranteed to get back the value I stored?

If you read the *same address*, then the ordering is guaranteed, but
the Intel documentation specifically exempts these instructions from
the general rule that writes will not be reordered with other writes.
This means that a non-temporal store might be reordered to occur after
the "xor" to your atomic integer, even if the instruction came prior
to the xor.

Hi, thanks everyone for all the comments. I think maybe I wasn't clear that
I *only* care about atomicity w.r.t. a signal handler interruption in the
same thread, *not* across threads. Therefore, many of the problems of
cross-CPU atomicity are not relevant. The signal handler gets invoked via
pthread_kill, and is thus necessarily running in the same thread as the code
being interrupted. The memory in question can be considered thread-local
here, so I'm not worried about other threads touching it at all.

Ok, this helps make sense, but it still is confusing to phrase this as
"single threaded". While the signal handler code may execute
exclusively to any other code, it does not share the stack frame, etc.
I'd describe this more as two threads of mutually exclusive execution
or some such.

I'm pretty sure James's way of describing it is accurate. It's a
single thread with an asynchronous signal, and C allows things in that
situation that it disallows for the multi-threaded case. In
particular, global objects of type "volatile sig_atomic_t" can be read
and written between signal handlers in a thread and that thread's main
control flow without locking. C++0x also defines an
atomic_signal_fence(memory_order) that only synchronizes with signal
handlers, in addition to the atomic_thread_fence(memory_order) that
synchronizes to other threads. See [atomics.fences]

I'm not familiar with what synchronization occurs as
part of the interrupt process, but I'd verify it before making too
many assumptions.

This sequence that SBCL does today with its internal codegen is basically
like:
MOV <pseudo_atomic>, 1
[[do allocation, fill in object, etc]]
XOR <pseudo_atomic>, 1
JEQ continue
<<call do_pending_interrupt>>
continue:
...

The important things here are:
1) Stores cannot be migrated from within the MOV/XOR instructions to outside
by the codegen.

Basically, this is merely the problem that x86 places a stricter
requirement on memory ordering than LLVM. Where x86 requires that
stores occur in program order, LLVM reserves the right to change that.
I have no idea if it is worthwhile to support memory barriers solely
within the flow of execution, but it seems highly suspicious.

It's needed to support std::atomic_signal_fence. gcc will initially
implement that with
  asm volatile("":::"memory")
but as James points out, that kills the JIT, and probably will keep
doing so until llvm-mc is finished or someone implements a special
case for it.

On at
least some non-x86 architectures, I suspect you'll need a memory
barrier here anyways, so it seems reasonable to place one anyways. I
*highly* doubt these fences are an overriding performance concern on
x86, do you have any benchmarks that indicate they are?

Memory fences are as expensive as atomic operations on x86 (quite
expensive), but you're right that benchmarks are a good idea anyway.

2) There's no way an interruption can be missed: the XOR is atomic with
regards to signals executing in the same thread, it's either fully executed
or not (both load+store). But I don't care whether it's visible on other
CPUs or not: it's a thread-local variable in any case.

Those are the two properties I'd like to get from LLVM, without actually
ever invoking superfluous processor synchronization.

Before we start extending LLVM to support expressing the finest points
of the x86 memory model in an optimal fashion given a single thread of
execution, I'd really need to see some compelling benchmarks that it
is a major performance problem. My understanding of the implementation
of these aspects of the x86 architecture is that they shouldn't have a
particularly high overhead.

The processor can reorder memory operations as well (within limits).
Consider that 'memset' to zero is often codegened to a non-temporal
store to memory. This exempts it from all ordering considerations

My understanding is that processor reordering only affects what you might
see from another CPU: the processor will undo speculatively executed
operations if the sequence of instructions actually executed is not the
sequence it predicted, so within a single CPU you should never be able to tell
the difference.

But I must admit I don't know anything about non-temporal stores. Within a
single thread, if I do a non-temporal store, followed by a load, am I not
guaranteed to get back the value I stored?

If you read the *same address*, then the ordering is guaranteed, but
the Intel documentation specifically exempts these instructions from
the general rule that writes will not be reordered with other writes.
This means that a non-temporal store might be reordered to occur after
the "xor" to your atomic integer, even if the instruction came prior
to the xor.

It exempts these instructions from the cross-processor guarantees, but
I don't see anything saying that, for example, a temporal store in a
single processor's instruction stream after a non-temporal store may
be overwritten by the non-temporal store. Do you see something I'm
missing? If not, for single-thread signals, I think it's only compiler
reordering James has to worry about.

Hi, thanks everyone for all the comments. I think maybe I wasn't clear that
I *only* care about atomicity w.r.t. a signal handler interruption in the
same thread, *not* across threads. Therefore, many of the problems of
cross-CPU atomicity are not relevant. The signal handler gets invoked via
pthread_kill, and is thus necessarily running in the same thread as the code
being interrupted. The memory in question can be considered thread-local
here, so I'm not worried about other threads touching it at all.

Ok, this helps make sense, but it still is confusing to phrase this as
"single threaded". While the signal handler code may execute
exclusively to any other code, it does not share the stack frame, etc.
I'd describe this more as two threads of mutually exclusive execution
or some such.

I'm pretty sure James's way of describing it is accurate. It's a
single thread with an asynchronous signal, and C allows things in that
situation that it disallows for the multi-threaded case. In
particular, global objects of type "volatile sig_atomic_t" can be read
and written between signal handlers in a thread and that thread's main
control flow without locking. C++0x also defines an
atomic_signal_fence(memory_order) that only synchronizes with signal
handlers, in addition to the atomic_thread_fence(memory_order) that
synchronizes to other threads. See [atomics.fences]

Very interesting, and thanks for the clarifications. I'm not
particularly familiar with either those parts of C or C++0x, although
it's on the list... =D

I'm not familiar with what synchronization occurs as
part of the interrupt process, but I'd verify it before making too
many assumptions.

This sequence that SBCL does today with its internal codegen is basically
like:
MOV <pseudo_atomic>, 1
[[do allocation, fill in object, etc]]
XOR <pseudo_atomic>, 1
JEQ continue
<<call do_pending_interrupt>>
continue:
...

The important things here are:
1) Stores cannot be migrated from within the MOV/XOR instructions to outside
by the codegen.

Basically, this is merely the problem that x86 places a stricter
requirement on memory ordering than LLVM. Where x86 requires that
stores occur in program order, LLVM reserves the right to change that.
I have no idea if it is worthwhile to support memory barriers solely
within the flow of execution, but it seems highly suspicious.

It's needed to support std::atomic_signal_fence. gcc will initially
implement that with
asm volatile("":::"memory")
but as James points out, that kills the JIT, and probably will keep
doing so until llvm-mc is finished or someone implements a special
case for it.

Want to propose an extension to the current atomics of LLVM? Could we
potentially clarify your previous concern regarding the pairing of
barriers to operations, as it seems like they would involve related
bits of the lang ref? Happy to work with you on that sometime this Q
if you're interested; I'll certainly have more time. =]

On at
least some non-x86 architectures, I suspect you'll need a memory
barrier here anyways, so it seems reasonable to place one anyways. I
*highly* doubt these fences are an overriding performance concern on
x86, do you have any benchmarks that indicate they are?

Memory fences are as expensive as atomic operations on x86 (quite
expensive), but you're right that benchmarks are a good idea anyway.

2) There's no way an interruption can be missed: the XOR is atomic with
regards to signals executing in the same thread, it's either fully executed
or not (both load+store). But I don't care whether it's visible on other
CPUs or not: it's a thread-local variable in any case.

Those are the two properties I'd like to get from LLVM, without actually
ever invoking superfluous processor synchronization.

Before we start extending LLVM to support expressing the finest points
of the x86 memory model in an optimal fashion given a single thread of
execution, I'd really need to see some compelling benchmarks that it
is a major performance problem. My understanding of the implementation
of these aspects of the x86 architecture is that they shouldn't have a
particularly high overhead.

The processor can reorder memory operations as well (within limits).
Consider that 'memset' to zero is often codegened to a non-temporal
store to memory. This exempts it from all ordering considerations

My understanding is that processor reordering only affects what you might
see from another CPU: the processor will undo speculatively executed
operations if the sequence of instructions actually executed is not the
sequence it predicted, so within a single CPU you should never be able to tell
the difference.

But I must admit I don't know anything about non-temporal stores. Within a
single thread, if I do a non-temporal store, followed by a load, am I not
guaranteed to get back the value I stored?

If you read the *same address*, then the ordering is guaranteed, but
the Intel documentation specifically exempts these instructions from
the general rule that writes will not be reordered with other writes.
This means that a non-temporal store might be reordered to occur after
the "xor" to your atomic integer, even if the instruction came prior
to the xor.

It exempts these instructions from the cross-processor guarantees, but
I don't see anything saying that, for example, a temporal store in a
single processor's instruction stream after a non-temporal store may
be overwritten by the non-temporal store. Do you see something I'm
missing? If not, for single-thread signals, I think it's only compiler
reordering James has to worry about.

The exemption I'm referring to (Section 8.2.2 of System Programming
Guide from Intel) is to the write-write ordering of the
*single-processor* model. Reading the referenced section on the
non-temporal behavior for these instructions (10.4.6 of volume 1 of
the architecture manual) doesn't entirely clarify the matter for me
either. It specifically says that the non-temporal writes may occur
outside of program order, but doesn't seem to clarify exactly what the
result of overlapping temporal writes is without fences within the
same program thread. The only examples I'm finding are for
multiprocessor scenarios. =/

The important things here are:
1) Stores cannot be migrated from within the MOV/XOR instructions to outside
by the codegen.

Basically, this is merely the problem that x86 places a stricter
requirement on memory ordering than LLVM. Where x86 requires that
stores occur in program order, LLVM reserves the right to change that.
I have no idea if it is worthwhile to support memory barriers solely
within the flow of execution, but it seems highly suspicious.

It's needed to support std::atomic_signal_fence. gcc will initially
implement that with
asm volatile("":::"memory")
but as James points out, that kills the JIT, and probably will keep
doing so until llvm-mc is finished or someone implements a special
case for it.

Want to propose an extension to the current atomics of LLVM? Could we
potentially clarify your previous concern regarding the pairing of
barriers to operations, as it seems like they would involve related
bits of the lang ref? Happy to work with you on that sometime this Q
if you're interested; I'll certainly have more time. =]

I have some ideas for that, and will be happy to help.

The processor can reorder memory operations as well (within limits).
Consider that 'memset' to zero is often codegened to a non-temporal
store to memory. This exempts it from all ordering considerations

My understanding is that processor reordering only affects what you might
see from another CPU: the processor will undo speculatively executed
operations if the sequence of instructions actually executed is not the
sequence it predicted, so within a single CPU you should never be able to tell
the difference.

But I must admit I don't know anything about non-temporal stores. Within a
single thread, if I do a non-temporal store, followed by a load, am I not
guaranteed to get back the value I stored?

If you read the *same address*, then the ordering is guaranteed, but
the Intel documentation specifically exempts these instructions from
the general rule that writes will not be reordered with other writes.
This means that a non-temporal store might be reordered to occur after
the "xor" to your atomic integer, even if the instruction came prior
to the xor.

It exempts these instructions from the cross-processor guarantees, but
I don't see anything saying that, for example, a temporal store in a
single processor's instruction stream after a non-temporal store may
be overwritten by the non-temporal store. Do you see something I'm
missing? If not, for single-thread signals, I think it's only compiler
reordering James has to worry about.

The exemption I'm referring to (Section 8.2.2 of System Programming
Guide from Intel) is to the write-write ordering of the
*single-processor* model. Reading the referenced section on the
non-temporal behavior for these instructions (10.4.6 of volume 1 of
the architecture manual) doesn't entirely clarify the matter for me
either. It specifically says that the non-temporal writes may occur
outside of program order, but doesn't seem to clarify exactly what the
result of overlapping temporal writes is without fences within the
same program thread. The only examples I'm finding are for
multiprocessor scenarios. =/

Yeah, it's not 100% clear. I'm pretty sure that x86 maintains the
fiction of a linear "instruction stream" within each processor, even
in the presence of interrupts (which underly pthread_kill and OS-level
thread switching). For example, in 6.6, we have "The ability of a P6
family processor to speculatively execute instructions does not affect
the taking of interrupts by the processor. Interrupts are taken at
instruction boundaries located during the retirement phase of
instruction execution; so they are always taken in the “in-order”
instruction stream."

But I'm not an expert in non-temporal anything.

Hm...off topic from my original email since I think this is only relevant for multithreaded code...

But from what I can tell, an implementation of memset that does not contain an sfence after using movnti is considered broken. Callers of memset would not (and should not need to) know that they must use an actual memory barrier (sfence) after the memset call to get the usual x86 store-store guarantee.

Thread describing that bug in glibc memset implementation:
http://sourceware.org/ml/libc-alpha/2007-11/msg00017.html

Redhat errata including that fix in a stable update:
http://rhn.redhat.com/errata/RHBA-2008-0083.html

Then there's a recent discussion on the topic of who is responsible for calling sfence on the gcc mailing list:
http://www.mail-archive.com/gcc@gcc.gnu.org/msg45939.html

Unfortunately, that thread didn't seem to have any firm conclusion, but ISTM that the current default assumption is (b): anything that uses movnti is assumed to surround such uses with memory fences so that other code doesn't need to.

James

I didn't mean to imply that the fence was missing after the
non-temporal store (yikes!!), rather that it was an example of a not
uncommon situation where fencing (may be) required even in
single-threaded x86 code. That said, Jeffrey raised good points that
it isn't entirely clear at all to what extent non-temporal stores
deviate from the ordering constraints of typical x86 code. From the
threads you cite, there is also dispute about the best way to manage
those deviations from the ordering constraints. At least w.r.t.
memset, I would agree with you and assume that it is providing the
fencing needed.