[RFC] Add compiler scheduling barriers

Hi all,

I'm currently working on implementing ACLE extensions for ARM. There
are some memory barrier intrinsics, i.e. __dsb and __isb, that require
the compiler not to reorder instructions around their corresponding
built-in intrinsics (__builtin_arm_dsb, __builtin_arm_isb), including
non-memory-access instructions.[1] This is currently not possible.

It is sometimes useful to prevent the compiler from reordering
memory-access instructions as well. The only way to do that in both
GCC and LLVM is using an inline assembly hack:
  asm volatile("" ::: "memory")
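
In practice this tends to get wrapped up as a macro, for example (the
macro name here is just illustrative):

  /* Compiler-only barrier: emits no instruction, but the compiler may
     not move memory accesses across it. */
  #define barrier() asm volatile("" ::: "memory")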

I propose adding two compiler scheduling barrier intrinsics to LLVM:
__schedule_barrier_memory and __schedule_barrier_full. The former only
prevents memory-access instructions from being reordered around the
barrier, while the latter prevents all reordering. __isb, for example,
could then be implemented something like:
  inline void __isb() {
    __schedule_barrier_full();
    __builtin_arm_isb();
    __schedule_barrier_full();
  }

To implement these intrinsics, I think the best method is to add
target-independent pseudo-instructions with appropriate
properties (hasSideEffects for the memory barrier and isTerminator for
the full barrier) and a pseudo-instruction elimination pass after the
scheduling pass.

What do people think of this idea?

Cheers,

Yi

This sounds similar to the problem I want to solve with the nomemfence
attribute: http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-January/069129.html

I had this about half implemented in December, but I haven’t gotten
back to finishing it yet.

-Matt

C11 defines the atomic_thread_fence() function for the memory-only part. Clang exposes this as __c11_atomic_thread_fence(). This is more flexible than that part of your proposal, as it allows relaxing the restrictions based on the memory model.
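
For the memory-only part, a minimal sketch (function names and the
orderings chosen are purely for illustration):

  #include <stdatomic.h>

  void publish(int *data, atomic_int *flag)
  {
      *data = 42;
      /* Compiler barrier plus, on weakly ordered targets, a machine
         fence; weaker orderings relax both accordingly. */
      atomic_thread_fence(memory_order_release);
      atomic_store_explicit(flag, 1, memory_order_relaxed);
  }

  /* Clang also exposes the builtin directly:
     __c11_atomic_thread_fence(__ATOMIC_RELEASE); */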

I can see the use case for preventing any reordering, but it seems somewhat specialised. Wouldn't it be simpler to just model the wfi operation as loading from memory (which, effectively, it does, if you have a memory-mapped interrupt controller)?

David

It is sometimes useful to prevent the compiler from reordering
memory-access instructions as well. The only way to do that in both
GCC and LLVM is using an inline assembly hack:
asm volatile("" ::: "memory")

I propose adding two compiler scheduling barrier intrinsics to LLVM:
__schedule_barrier_memory and __schedule_barrier_full. The former only
prevents memory-access instructions from being reordered around the
barrier, while the latter prevents all reordering. __isb, for example,
could then be implemented something like:
inline void __isb() {
   __schedule_barrier_full();
   __builtin_arm_isb();
   __schedule_barrier_full();
}

To implement these intrinsics, I think the best method is to add
target-independent pseudo-instructions with appropriate
properties (hasSideEffects for the memory barrier and isTerminator for
the full barrier) and a pseudo-instruction elimination pass after the
scheduling pass.

What do people think of this idea?

C11 defines the atomic_thread_fence() function for the memory-only part. Clang exposes this as __c11_atomic_thread_fence(). This is more flexible than that part of your proposal, as it allows relaxing the restrictions based on the memory model.

atomic_thread_fence() always inserts a machine memory fence on weakly
ordered architectures, which is different from simply stopping compiler
reordering. Machine fences are expensive and might be overkill.
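
For example, on a typical ARM target I would expect something like this
(a sketch, not verified compiler output):

  void machine_fence(void)
  {
      /* seq_cst thread fence: expect a DMB instruction on ARM. */
      __c11_atomic_thread_fence(__ATOMIC_SEQ_CST);
  }

  void compiler_fence(void)
  {
      /* Compiler-only barrier: expect no instruction at all. */
      asm volatile("" ::: "memory");
  }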

We can leave this intrinsic out for now if it's not particularly
useful, as the inline assembly hack does work, although it's not elegant.

I can see the use case for preventing any reordering, but it seems somewhat specialised. Wouldn't it be simpler to just model the wfi operation as loading from memory (which, effectively, it does, if you have a memory-mapped interrupt controller)?

__WFI() is just one of the instructions where reordering isn't
allowed. There are more examples in the DAI0321A document. I think it
might be required on other architectures as well.

From: "Yi Kong" <kongy.dev@gmail.com>
To: "LLVM Dev" <llvmdev@cs.uiuc.edu>
Sent: Thursday, June 19, 2014 11:35:05 AM
Subject: [LLVMdev] [RFC] Add compiler scheduling barriers

Hi all,

I'm currently working on implementing ACLE extensions for ARM. There
are some memory barrier intrinsics, i.e. __dsb and __isb, that require
the compiler not to reorder instructions around their corresponding
built-in intrinsics (__builtin_arm_dsb, __builtin_arm_isb), including
non-memory-access instructions.[1] This is currently not possible.

It is sometimes useful to prevent the compiler from reordering
memory-access instructions as well. The only way to do that in both
GCC and LLVM is using an inline assembly hack:
  asm volatile("" ::: "memory")

I propose adding two compiler scheduling barrier intrinsics to LLVM:
__schedule_barrier_memory and __schedule_barrier_full. The former only
prevents memory-access instructions from being reordered around the
barrier, while the latter prevents all reordering. __isb, for example,
could then be implemented something like:
  inline void __isb() {
    __schedule_barrier_full();
    __builtin_arm_isb();
    __schedule_barrier_full();
  }

To implement these intrinsics, I think the best method is to add
target-independent pseudo-instructions with appropriate
properties (hasSideEffects for the memory barrier and isTerminator for
the full barrier) and a pseudo-instruction elimination pass after the
scheduling pass.

What do people think of this idea?

I don't believe that we currently support calls that are terminators, and doing so would be a large change to the infrastructure. I think, however, that declaring an intrinsic that is marked such that mayHaveSideEffects() is true will also prevent reordering of anything around the intrinsic that touches observable state. Is there a reason why this would be insufficient?

-Hal

Hi all,

I'm currently working on implementing ACLE extensions for ARM. There
are some memory barrier intrinsics, i.e. __dsb and __isb, that require
the compiler not to reorder instructions around their corresponding
built-in intrinsics (__builtin_arm_dsb, __builtin_arm_isb), including
non-memory-access instructions.[1] This is currently not possible.

It is sometimes useful to prevent the compiler from reordering
memory-access instructions as well. The only way to do that in both
GCC and LLVM is using an inline assembly hack:
   asm volatile("" ::: "memory")

I propose adding two compiler scheduling barrier intrinsics to LLVM:
__schedule_barrier_memory and __schedule_barrier_full. The former only
prevents memory-access instructions from being reordered around the
barrier, while the latter prevents all reordering. __isb, for example,
could then be implemented something like:
   inline void __isb() {
     __schedule_barrier_full();
     __builtin_arm_isb();
     __schedule_barrier_full();
   }

Given your examples are in C, I want to ask a clarification question. Are you proposing adding such intrinsics to the LLVM IR? Or to some runtime library? If the latter, *specifically* which one? Or at the MachineInstr layer?

I'm going to run under the assumption you're using C pseudo code for IR. If this is not the case, the rest of this will be off base.

I'm not familiar with the exact semantics of an "isb" barrier, but I think you should look at the existing fence IR instructions. These restrict memory reorderings in the IR. Depending on the platform, they may imply hardware barriers, but they always imply compiler barriers.

If all you want is a compiler barrier with the existing fence semantics w.r.t. reordering, we could consider extending fence with a "compiler only" (bikeshed needed!) attribute.

If you're describing a new memory ordering for existing fences, that would seem like a reasonable extension.

I'm not familiar with how we currently handle intrinsics for architecture specific memory barriers. Can anyone else comment on that? Is there a way to tag a particular intrinsic function as *also* being a full fence?

To implement these intrinsics, I think the best method is to add
target-independent pseudo-instructions with appropriate
properties (hasSideEffects for the memory barrier and isTerminator for
the full barrier) and a pseudo-instruction elimination pass after the
scheduling pass.

Why would your barrier need to be a basic block terminator? That doesn't parse for me. Could you explain?

What do people think of this idea?

I'm honestly unclear on what your problem is and what you're trying to propose. It may take a few rounds of conversation to clarify.

Philip

Hi all,

I'm currently working on implementing ACLE extensions for ARM. There
are some memory barrier intrinsics, i.e. __dsb and __isb, that require
the compiler not to reorder instructions around their corresponding
built-in intrinsics (__builtin_arm_dsb, __builtin_arm_isb), including
non-memory-access instructions.[1] This is currently not possible.

It is sometimes useful to prevent the compiler from reordering
memory-access instructions as well. The only way to do that in both
GCC and LLVM is using an inline assembly hack:
   asm volatile("" ::: "memory")

I propose adding two compiler scheduling barrier intrinsics to LLVM:
__schedule_barrier_memory and __schedule_barrier_full. The former only
prevents memory-access instructions from being reordered around the
barrier, while the latter prevents all reordering. __isb, for example,
could then be implemented something like:
   inline void __isb() {
     __schedule_barrier_full();
     __builtin_arm_isb();
     __schedule_barrier_full();
   }

Given your examples are in C, I want to ask a clarification question. Are
you proposing adding such intrinsics to the LLVM IR? Or to some runtime
library? If the latter, *specifically* which one? Or at the MachineInstr
layer?

I'm going to run under the assumption you're using C pseudo code for IR. If
this is not the case, the rest of this will be off base.

Yes, IR.

I'm not familiar with the exact semantics of an "isb" barrier, but I think
you should look at the existing fence IR instructions. These restrict
memory reorderings in the IR. Depending on the platform, they may imply
hardware barriers, but they always imply compiler barriers.

If all you want is a compiler barrier with the existing fence semantics
w.r.t. reordering, we could consider extending fence with a "compiler only"
(bikeshed needed!) attribute.

AFAIK, there isn't an existing fence strong enough for the memory
barrier intrinsics. The current strongest fence still allows
register-register data-processing instructions to be reordered across
it. For DSB and ISB, no instruction should be allowed to cross.

If you're describing a new memory ordering for existing fences, that would
seem like a reasonable extension.

I'm not familiar with how we currently handle intrinsics for architecture
specific memory barriers. Can anyone else comment on that? Is there a way
to tag a particular intrinsic function as *also* being a full fence?

I'm interested in this as well.

To implement these intrinsics, I think the best method is to add
target-independent pseudo-instructions with appropriate
properties (hasSideEffects for the memory barrier and isTerminator for
the full barrier) and a pseudo-instruction elimination pass after the
scheduling pass.

Why would your barrier need to be a basic block terminator? That doesn't
parse for me. Could you explain?

The compiler shouldn't allow instructions to be reordered between
basic blocks. Implementing the full barrier as a basic block terminator
would stop any instruction from being reordered across it.

I'm not very familiar with LLVM; can you propose the correct way of
implementing it?

Hi Yi,

If all you want is a compiler barrier with the existing fence semantics
w.r.t. reordering, we could consider extending fence with a "compiler only"
(bikeshed needed!) attribute.

AFAIK, there isn't an existing fence strong enough for the memory
barrier intrinsics. The current strongest fence still allows
register-register data-processing instructions to be reordered across
it. For DSB and ISB, no instruction should be allowed to cross.

I disagree. DSB and ISB may need to be insulated against reordering
with other instructions with unmodelled side-effects (e.g. MSR/MRS),
but they don't care about normal instructions. There's no reason the
compiler shouldn't be free to move an arithmetic add around them, for
example. Fortunately, I think LLVM already handles this.
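
Concretely, in a sketch like the following the add is register-only, so
even the "memory" clobber on the inline asm doesn't (and shouldn't) pin
it:

  unsigned add_near_dsb(unsigned a, unsigned b)
  {
      asm volatile("dsb" ::: "memory");  /* DSB SY via inline asm */
      return a + b;  /* pure register arithmetic; free to move across */
  }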

The main argument I'm aware of in favour of stronger barriers is for
things like performance counters where the intervening instructions
are sometimes critical to what's being measured. I'm not entirely
convinced by that either, since I don't think the compiler can ever
provide the needed guarantees anyway.

Why would your barrier need to be a basic block terminator? That doesn't
parse for me. Could you explain?

The compiler shouldn't allow instructions to be reordered between
basic blocks. Implementing the full barrier as a basic block terminator
would stop any instruction from being reordered across it.

It should, and does. The idealised view is that basic blocks should be
no extra barrier at all to reordering. Of course, we don't reach that,
because the control-flow is harder to analyse, so it's simply easier to
do optimisations within a basic block.

But you shouldn't be taking that as a point of principle when
designing anything.

Cheers.

Tim.

Hi all,

I'm currently working on implementing ACLE extensions for ARM. There
are some memory barrier intrinsics, i.e. __dsb and __isb, that require
the compiler not to reorder instructions around their corresponding
built-in intrinsics (__builtin_arm_dsb, __builtin_arm_isb), including
non-memory-access instructions.[1] This is currently not possible.

It is sometimes useful to prevent the compiler from reordering
memory-access instructions as well. The only way to do that in both
GCC and LLVM is using an inline assembly hack:
    asm volatile("" ::: "memory")

I propose adding two compiler scheduling barrier intrinsics to LLVM:
__schedule_barrier_memory and __schedule_barrier_full. The former only
prevents memory-access instructions from being reordered around the
barrier, while the latter prevents all reordering. __isb, for example,
could then be implemented something like:
    inline void __isb() {
      __schedule_barrier_full();
      __builtin_arm_isb();
      __schedule_barrier_full();
    }

Given your examples are in C, I want to ask a clarification question. Are
you proposing adding such intrinsics to the LLVM IR? Or to some runtime
library? If the latter, *specifically* which one? Or at the MachineInstr
layer?

I'm going to run under the assumption you're using C pseudo code for IR. If
this is not the case, the rest of this will be off base.

Yes, IR.

I'm not familiar with the exact semantics of an "isb" barrier, but I think
you should look at the existing fence IR instructions. These restrict
memory reorderings in the IR. Depending on the platform, they may imply
hardware barriers, but they always imply compiler barriers.

If all you want is a compiler barrier with the existing fence semantics
w.r.t. reordering, we could consider extending fence with a "compiler only"
(bikeshed needed!) attribute.

AFAIK, there isn't an existing fence strong enough for the memory
barrier intrinsics. The current strongest fence still allows
register-register data-processing instructions to be reordered across
it. For DSB and ISB, no instruction should be allowed to cross.

From what I can read about the ISB barrier, this is untrue (in its full generality). There are certainly some instructions which can't be moved past an ISB (memory accesses, CPU control instructions). It's not clear to me that *all* instruction reorderings need to be blocked.

Aside: From the documentation, it's actually unclear how strong a barrier ISB actually is. The "fetched after the ISB" gives a lot of room for interpretation. (i.e. is it legal for a cpu to fetch every instruction in a function, queue them for execution under wait conditions, *then* fetch the ISB? This would be a *really* weird implementation, but would it be legal according to this spec? If so, the ISB provides *no* guarantees.)

Reference:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html

If you're describing a new memory ordering for existing fences, that would
seem like a reasonable extension.

I'm not familiar with how we currently handle intrinsics for architecture
specific memory barriers. Can anyone else comment on that? Is there a way
to tag a particular intrinsic function as *also* being a full fence?

I'm interested in this as well.

To implement these intrinsics, I think the best method is to add
target-independent pseudo-instructions with appropriate
properties (hasSideEffects for the memory barrier and isTerminator for
the full barrier) and a pseudo-instruction elimination pass after the
scheduling pass.

Why would your barrier need to be a basic block terminator? That doesn't
parse for me. Could you explain?

The compiler shouldn't allow instructions to be reordered between
basic blocks.

Er, no. Any decent optimizing compiler can, should, and does.

Implementing the full barrier as a basic block terminator would stop
any instruction from being reordered across it.

I'm not very familiar with LLVM; can you propose the correct way of
implementing it?

This seems like a fundamentally flawed approach. Conceptually, there should be no difference between:
bb: inst1; inst2;
and
bb: inst1; goto next
next: inst2;

If there is, you've created a nearly impossible to optimize IR.

Given a strong enough barrier, you also don't need such a construct. These two will always have the same semantics:
bb: inst1; barrier; inst2;
and
bb: inst1; barrier; goto next;
next: inst2

I would suggest you simply drop the bit about being a terminator.

Hi Philip,

Aside: From the documentation, it's actually unclear how strong a barrier
ISB actually is. The "fetched after the ISB" gives a lot of room for
interpretation. (i.e. is it legal for a cpu to fetch every instruction in a
function, queue them for execution under wait conditions, *then* fetch the
ISB? This would be a *really* weird implementation, but would it be legal
according to this spec? If so, the ISB provides *no* guarantees.)

Reference:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html

You've actually linked to the assembler (as in "armasm") reference
manual there. The architectural documentation of these barriers is
better, though still not airtight in my opinion. From section A3.8.3,
the bit about "program order" is new, and it's after the ISB has
*completed*:

"An ISB instruction flushes the pipeline in the processor, so that all
instructions that come after the ISB instruction in program order are
fetched from cache or memory only after the ISB instruction has
completed."

It's behind a registration wall, but the v7 version can be had as a
.pdf here: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0406c/index.html

Cheers.

Tim.

Hi Philip,

Aside: From the documentation, it's actually unclear how strong a barrier
ISB actually is. The "fetched after the ISB" gives a lot of room for
interpretation. (i.e. is it legal for a cpu to fetch every instruction in a
function, queue them for execution under wait conditions, *then* fetch the
ISB? This would be a *really* weird implementation, but would it be legal
according to this spec? If so, the ISB provides *no* guarantees.)

Reference:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html

You've actually linked to the assembler (as in "armasm") reference
manual there. The architectural documentation of these barriers is
better, though still not airtight in my opinion. From section A3.8.3,
the bit about "program order" is new, and it's after the ISB has
*completed*:

Figured there had to be something better, but that's what Google found for me. Thanks.

"An ISB instruction flushes the pipeline in the processor, so that all
instructions that come after the ISB instruction in program order are
fetched from cache or memory only after the ISB instruction has
completed."

I haven't seen the full text, but this still looks incomplete. I think we can infer what was actually meant here, so it doesn't make much of a practical difference.

It's behind a registration wall, but the v7 version can be had as a
.pdf here: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0406c/index.html

Also behind a registration wall. I might bother to find my old account info at some point, but not right now.

Philip

After some internal discussions, we concluded that moving
non-memory-access, non-side-effecting instructions around memory
barrier intrinsics is fine in almost any typical situation. On the
rare occasions where reordering is undesirable, the programmer should
use inline asm instead.

Thus I'm dropping this RFC. Thanks for your comments.

-Yi

Hi Yi,

A key use of ISB is where the access or cacheability semantics of the
instruction stream being executed may change (e.g. updating MPU or MMU
access attributes, enabling or disabling caches) and you want to force
the pipeline to be flushed and refilled with fetches against the new
configuration. If the instruction that immediately follows is a simple
instruction, you have to use inline asm.
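
A minimal sketch of that inline asm fallback (the SCTLR write is just
an illustration; ARMv7 assumed):

  static inline void write_sctlr_and_sync(unsigned long val)
  {
      asm volatile("mcr p15, 0, %0, c1, c0, 0\n\t"  /* write SCTLR */
                   "isb"                  /* resynchronise the pipeline */
                   : : "r" (val) : "memory");
  }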

Also, for self-modifying code, reordering around ISB can be
problematic. But you probably need to write it in asm in the first
place anyway.

-Yi