RFC: non-temporal fencing in LLVM IR

Hello, fencing enthusiasts!

TL;DR: We’d like to propose an addition to the LLVM memory model requiring non-temporal accesses be surrounded by non-temporal load barriers and non-temporal store barriers, and we’d like to add such orderings to the fence IR opcode.

We are open to different approaches, hence this email instead of a patch.

Who’s “we”?

Philip Reames brought this to my attention, and we’ve had numerous discussions with Hans Boehm on the topic. Any mistakes below are my own, all the clever bits are theirs.

Why?

Ignore non-temporals for a moment: on most x86 targets LLVM generates an mfence for seq_cst atomic fencing. One could instead use a locked idempotent atomic access to top-of-stack such as lock or4i [RSP-8] 0. Philip has measured this as equivalent on micro-benchmarks, but as ~25% faster in macro-benchmarks (other codebases confirm this). There’s one problem with this approach: non-temporal accesses on x86 are only ordered by fence instructions! This means that code using non-temporal accesses can’t rely on LLVM’s fence opcode to do the right thing; it instead has to rely on architecture-specific _mm*fence intrinsics.
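
For concreteness, here is the kind of IR involved, with the two candidate x86 lowerings from above restated as comments (a sketch only; the function name is made up and the exact encoding of the locked access is up to the backend):

    define void @publish(i32* %flag) {
      store i32 1, i32* %flag, align 4
      fence seq_cst   ; today (with SSE2): mfence
                      ; proposed alternative: a locked idempotent OR of 0 into
                      ; top-of-stack, i.e. lock or4i [RSP-8] 0
      ret void
    }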

But wait! Who said developers need to issue any type of fence when using non-temporals?

Well, the LLVM memory model sure didn’t. The x86 memory model does (volume 3 section 8.2.2 Memory Ordering) but LLVM targets more than x86 and the backends are free to ignore the !nontemporal metadata, and AFAICT the x86 backend doesn’t add those fences.

Therefore, even without the above optimization, the LLVM language reference is incorrect: non-temporals should be bracketed by barriers. This applies even without threading! Non-temporal accesses aren’t guaranteed to interact well with regular accesses, which means that regular loads cannot move “down” past a non-temporal barrier, and regular stores cannot move “up” past a non-temporal barrier.
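
To make the hole concrete, here is roughly what today’s IR looks like (this is valid IR today; the comment describes current x86 behavior as I understand it, and the function name is just for illustration):

    define void @nt_store(<4 x float>* %p, <4 x float> %v) {
      ; Lowered to a non-temporal store (e.g. movntps) on x86, but AFAICT no
      ; backend emits a trailing sfence for it, so ordering against ordinary
      ; stores is left to the programmer via _mm_sfence and friends.
      store <4 x float> %v, <4 x float>* %p, align 16, !nontemporal !0
      ret void
    }
    !0 = !{i32 1}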

Why not just have the compiler add the fences?

LLVM could do this, either as a per-backend thing or in a hookable pass such as AtomicExpandPass. It seems more natural to ask the programmer to express intent, just as is done with atomics. In fact, a backend is currently free to ignore !nontemporal on load and store and could therefore generate only half of what’s requested, leading to incorrect code. That would of course be silly; backends should either honor all !nontemporal accesses or none of them, but who knows what the middle-end does.

Put another way: some optimized C libraries use non-temporal accesses (when string instructions aren’t du jour) and they terminate their copying with an sfence. It’s a de-facto convention (the ABI doesn’t say anything), but let’s avoid divergence.

Aside: one day we may live in the fence elimination promised land where fences are exactly where they need to be, no more, no less.

Isn’t x86’s lfence just a no-op?

Yes, but we’re proposing the addition of a target-independent non-temporal load barrier. It’ll be up to the x86 backend to make it an X86ISD::MEMBARRIER and other backends to get it right (hint: it’s not always a no-op).

Won’t this optimization cause coherency misses? C++ code accesses the thread stack concurrently all the time!

Maybe, but then it isn’t much of an optimization if it’s slowing code down. LLVM doesn’t just target C++, and it’s really up to the backend to decide whether one fence type is better than another (on x86, whether a locked top-of-stack idempotent operation is better than mfence). Other languages have private stacks where this isn’t an issue, and where the stack top can reasonably be assumed to be in cache.

How will this affect non-user-mode code (i.e. kernel code)?

Kernel code still has to ask for _mm_mfence if it wants mfence: C11 and C++11 barriers aren’t specified as a specific instruction.

Is it safe to access top-of-stack?

AFAIK yes, and the ABI-specified red zone has our back (or front if the stack grows up :) ).

What about non-x86 architectures?

Architectures such as ARMv8 support non-temporal instructions and require barriers such as DMB nshld to order loads and DMB nshst to order stores.

Even ARM’s address-dependency rule (a.k.a. the ill-fated std::memory_order_consume) fails to hold with non-temporals:

LDR X0, [X3]

LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!

Who uses non-temporals anyways?

That’s an awfully personal question!

Just to clarify: the proposal to change the implementation of seq_cst is arguably separate from this proposal. It will go through normal patch review once the semantics are addressed. Whatever we end up doing with seq_cst, we currently have a semantic hole in our specification around non-temporals that needs to be addressed. Another approach would be to define the current fences as fencing non-temporals and to introduce new ones that don’t. Either approach is workable. I believe that new fences for non-temporals are the appropriate choice, given that this would more closely match existing practice. We could also consider forward-serializing bitcode to the stronger form, whichever choice we make. That would be the conservatively correct thing to do for older bitcode which might be assuming stronger semantics than our barriers explicitly provided.


What exactly do you mean by ‘X0 may not be loaded’ in your example here? If you mean that the LDNP could start executing with the value of X0 from before the LDR, e.g. initially X0=0x100, the LDR loads X0=0x200 but the LDNP uses the old value of X0=0x100, then I don’t think that’s true. According to section C3.2.4 of the ARMv8 ARMARM other observers may observe the LDR and the LDNP in the wrong order, but the CPU executing the instructions will observe them in program order.

I have no idea if that affects anything in this RFC though.

John

I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal
details for that ISA. I lifted this example from here:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html

Which is correct?

FWIW, I agree with John here. The example I'd give for the unexpected
behaviour allowed in the spec is:

.Lwait_for_data:
    ldr x0, [x3]
    cbz x0, .Lwait_for_data
    ldnp x2, x1, [x0]

where another thread first writes to a buffer then tells us where that
buffer is. For a normal ldp, the address dependency rule means we
don't need a barrier or acquiring load to ensure we see the real data
in the buffer. For ldnp, we would need a barrier to prevent stale
data.

I suspect this is actually even closer to the x86 situation than what
the guide implies (which looks like a straight-up exposed pipeline to
me, beyond even what Alpha would have done).

Cheers.

Tim.

I agree with Tim’s assessment for ARM. That’s interesting; I wasn’t previously aware of that instruction.

My understanding is that Alpha would have the same problem for normal loads.

I’m all in favor of more systematic handling of the fences associated with x86 non-temporal accesses.

AFAICT, nontemporal loads and stores seem to have different fencing rules on x86, none of them very clear. Nontemporal stores should probably ideally use an SFENCE. Locked instructions seem to be documented to work with MOVNTDQA. In both cases, there seems to be only empirical evidence as to which side(s) of the nontemporal operations they should go on?

I finally decided that I was OK with using a LOCKed top-of-stack update as a fence in Java on x86. I’m significantly less enthusiastic for C++. I also think that risks unexpected coherence miss problems, though they would probably be very rare. But they would be very surprising if they did occur.

Hi JF, Philip,

Clang currently has __builtin_nontemporal_store and __builtin_nontemporal_load. How will the usage model for those change?

Thanks again,
Hal

I think you would use them in the same way, but you'd have to also use
__builtin_nontemporal_store_fence and __builtin_nontemporal_load_fence.

Unless we have LLVM automagically figure out where non-temporal fences
should go, which I think isn't as good of an approach.


So we'll add new fence intrinsics. That makes sense.

> Unless we have LLVM automagically figure out where non-temporal fences should go, which I think isn't as good of an approach.

I agree. Such a determination is likely to be too conservative in practice.

-Hal


Today's LLVM already emits 'lock or %eax, (%esp)' for 'fence
seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST) when
targeting 32-bit x86 machines which do not support mfence. What
instruction sequence should we be using instead?

> So we'll add new fence intrinsics. That makes sense.

Correct, and I propose that this translate to an LLVM IR barrier, with a
new type of memory ordering (non-temporal load, and non-temporal store). It
can't be metadata, but it could be an attribute instead (akin to how
load/store have atomic and volatile attributes).
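
As a strawman for the IR surface (purely hypothetical syntax; nothing below is valid LLVM today):

    ; Strawman: new orderings on the existing fence instruction.
    fence nontemporal_load
    fence nontemporal_store
    ; Metadata (something like "fence seq_cst, !nontemporal !0") wouldn't work,
    ; since metadata may legally be dropped, silently losing the ordering.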

We could then add the same concept to C++ but I won't tip my hand too much ;)

> I agree. Such a determination is likely to be too conservative in practice.

Indeed, user control seems better here, especially when it comes to knowing what memory aliases and therefore where the fence matters.


> Today's LLVM already emits 'lock or %eax, (%esp)' for 'fence seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST) when targeting 32-bit x86 machines which do not support mfence. What instruction sequence should we be using instead?

Do they have non-temporal accesses in the ISA?


I thought not but there appear to be instructions like movntps. mfence was
introduced in SSE2 while movntps and sfence were introduced in SSE.


So the new builtin could be sfence? I think the codegen you point out for
SEQ_CST is fine if we fix the memory model as suggested.

I agree that it's fine to use a locked instruction as a seq_cst fence if
MFENCE is not available. If you have to dirty a cache line, (%esp) seems
like relatively safe one. (I'm assuming that CPUID is appreciably slower
and out of the running? I haven't tried. But it also probably clobbers
too many registers.) It's only the idea of writing to a memory location
when MFENCE is available, and could be used instead, that seems
questionable.

What exactly would the non-temporal fences be? It seems that on x86, the
load and store case may differ. In theory, there's also a before vs. after
question. In practice code using MOVNTA seems to assume that you only need
an SFENCE afterwards. I can't back that up with spec verbiage. I don't
know about MOVNTDQA. What about ARM?

> I agree that it's fine to use a locked instruction as a seq_cst fence if MFENCE is not available.

It's not clear to me this is true if the seq_cst fence is expected to fence non-temporal stores. I think in practice, you'd be very unlikely to notice a difference, but I can't point to anything in the Intel docs which justifies a lock prefixed instruction as sufficient to fence any non-temporal access.

> If you have to dirty a cache line, (%esp) seems like a relatively safe one.

Agreed. As we discussed previously, it is possible to get false sharing in C++, but this would require one thread to be accessing information stored in the last frame of another running thread's stack. That seems sufficiently unlikely to be ignored.

> (I'm assuming that CPUID is appreciably slower and out of the running? I haven't tried. But it also probably clobbers too many registers.)

This is my belief. I haven't actually tried this experiment, but I've seen no reports that CPUID is a good choice here.

> It's only the idea of writing to a memory location when MFENCE is available, and could be used instead, that seems questionable.

While in principle I agree, it appears in practice that this tradeoff is worthwhile. The hardware doesn't seem to optimize for the MFENCE case whereas lock prefix instructions appear to be handled much better.

> What exactly would the non-temporal fences be?

I'll leave this to JF to answer. I'm not knowledgeable enough about non-temporals to answer without substantial research first.

> > I agree that it's fine to use a locked instruction as a seq_cst fence if MFENCE is not available.
>
> It's not clear to me this is true if the seq_cst fence is expected to fence non-temporal stores. I think in practice, you'd be very unlikely to notice a difference, but I can't point to anything in the Intel docs which justifies a lock prefixed instruction as sufficient to fence any non-temporal access.

Correct, that's why changing the memory model is critical: a seq_cst fence wouldn't have any guarantee w.r.t. non-temporals.

> > What exactly would the non-temporal fences be? It seems that on x86, the load and store case may differ. In theory, there's also a before vs. after question. In practice code using MOVNTA seems to assume that you only need an SFENCE afterwards. I can't back that up with spec verbiage. I don't know about MOVNTDQA. What about ARM?
>
> I'll leave this to JF to answer. I'm not knowledgeable enough about non-temporals to answer without substantial research first.

I'm proposing two builtins:
- __builtin_nontemporal_load_fence
- __builtin_nontemporal_store_fence

If I've got this right, on x86 they would respectively be a nop and an sfence.

They otherwise act as memory code motion barriers unless accesses are
proven to not alias. I think it may be possible to loosen the rule so they
act closer to acquire/release (allowing accesses to move into the pair) but
I'm not convinced that this works for every ISA so I'd err on the side of
caution (since this can be loosened later).
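
Concretely, a streaming copy might then look something like this at the IR level (strawman syntax only; the fence placement relative to the accesses is still an open question, the function name is made up, and the lowerings in the comments just restate the guesses above):

    define void @stream_copy(<4 x float>* %dst, <4 x float>* %src) {
      fence nontemporal_load        ; hypothetical: nop on x86, DMB nshld on ARMv8
      %v = load <4 x float>, <4 x float>* %src, align 16, !nontemporal !0
      store <4 x float> %v, <4 x float>* %dst, align 16, !nontemporal !0
      fence nontemporal_store       ; hypothetical: sfence on x86, DMB nshst on ARMv8
      ; Regular loads/stores may not be reordered across these fences unless
      ; they're proven not to alias the non-temporal accesses.
      ret void
    }
    !0 = !{i32 1}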

> I haven’t touched ARMv8 in a few years so I’m rusty on the non-temporal details for that ISA. I lifted this example from here:
>
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html
>
> Which is correct?

I’ve confirmed that this example in the Cortex-A programmers guide is wrong, and it should hopefully be corrected in a future version.

John

It seems to me the intent of that section is intelligible to those of us who have been spending too much time dealing with these issues, but the text will seem wrong to everyone else: if another thread updates [X0] and then [X3] (with an intervening fence), this thread may see the new value of [X3] but the old value of [X0], violating the data dependence. This makes it incorrect to use such a load for e.g. Java final fields without a fence. I agree that the text is at best unclear, but presumably that was indeed the intent?

> It's not clear to me this is true if the seq_cst fence is expected to fence non-temporal stores. I think in practice, you'd be very unlikely to notice a difference, but I can't point to anything in the Intel docs which justifies a lock prefixed instruction as sufficient to fence any non-temporal access.

Agreed. I think it's not guaranteed. And the most rational explanation
for the fact that LOCK; X is faster than MFENCE seems to be that LOCK only
deals with normal write-back cacheable accesses, and hence may not work for
cases like this.

> > If you have to dirty a cache line, (%esp) seems like a relatively safe one.
>
> Agreed. As we discussed previously, it is possible to get false sharing in C++, but this would require one thread to be accessing information stored in the last frame of another running thread's stack. That seems sufficiently unlikely to be ignored.

I disagree with the reasoning, but not really with the conclusion.
Starting a thread with a lambda that captures locals by reference is likely
to do this, and is a common C++ idiom, especially in textbook examples.
This is aggravated by the fact that I don't understand the hardware
prefetcher, and that it sometimes seems to fetch an adjacent line. (Note
that C, unlike C++, allows implementations to make thread stacks
inaccessible to other threads. Some of us consider that a bug and would
refuse to use a general purpose implementation that actually did this. I
suspect there are enough of us that it doesn't matter.)

I think a stronger argument is that the compiler is always allowed to push
temporaries on the stack. So this looks exactly as though a sequentially
consistent fence required a stack temporary.

> > It's only the idea of writing to a memory location when MFENCE is available, and could be used instead, that seems questionable.
>
> While in principle I agree, it appears in practice that this tradeoff is worthwhile. The hardware doesn't seem to optimize for the MFENCE case whereas lock prefix instructions appear to be handled much better.

The concern is that it is actually fairly easy to get contention as a
result in C++. And programmers might think they know that certain fences
shouldn't use temporaries and the rest of their code should run in
registers. But I agree this is not a completely clear call. I wish x86
provided a plain fence instruction that handled the common case
efficiently, so we could avoid these trade-offs. (A "sequentially
consistent store" instruction might be even better, in that it should
largely eliminate fences and allows other optimizations.)

Hans