memory scopes in atomic instructions

Hi all,

OpenCL 2.0 introduced the notion of memory scope in atomic operations to global memory. These scopes are a hint to the underlying platform to optimize how synchronization is achieved. HSAIL also has a notion of memory scopes that is compatible with OpenCL 2.0. Currently, the LLVM IR uses a binary value (SingleThread/CrossThread) to represent synchronization scope on atomic instructions. This makes it difficult to translate OpenCL 2.0 atomic operations to LLVM IR, and also to implement HSAIL memory scopes in the proposed HSAIL backend for LLVM.

We would like to enhance the representation of memory scopes in LLVM IR to allow more values than just the current two. The intention of this email is to invite comments before we start prototyping. Here’s what we have in mind:

  1. Update the synchronization scope field in atomic instructions from a single bit to a wider field, say 32-bit unsigned integer.

  2. Retain the current default of zero as “system scope”, replacing the current “cross thread” scope.

  3. All other values are target-defined.

  4. The use of “single thread scope” is not clear. If it is required in target-independent transforms, then it could be encoded as just “1”, or as “all ones” in the wider field. The latter option is a bit weird, because most targets will have very few scopes. But it is useful in case the next point is included in LLVM IR.

  5. Possibly add the following constraint on memory scopes: “The scope represented by a larger value is nested inside (is a proper subset of) the scope represented by a smaller value.” This would also imply that the value used for single-thread scope must be the largest value used by the target.
    This constraint on “nesting” is easily satisfied by HSAIL (and also OpenCL), where synchronization scopes increase from a single work-item to the entire system. But it is conceivable that other targets do not have this constraint. For example, a platform may define synchronization scopes in terms of overlapping sets instead of proper subsets.

  6. The impact of this change is limited to increasing the number of bits used to store synchronization scope. Future optimizations on atomics may need to interpret scopes in target-defined ways. When the synchronization scopes of two atomic instructions do not match, these optimizations must query the target for validity.

Relation with SPIR: SPIR defines an enumeration for memory scopes, but it does not support LLVM atomic instructions. So memory scopes in SPIR are independent of the representation finally chosen in LLVM IR. A compiler that translates SPIR to native LLVM IR will have to translate memory scopes wherever appropriate.

Sameer.

Can you send a plain-text version of this email? It's easier to read
and reply to.

-Tom

Hi Sameer,

I support this proposal, and have discussed more or less the same idea with various people in the past. Regarding point #5, I believe address spaces may already provide the functionality needed to express overlapping constraints. I’m also not aware of any systems that would really need that functionality anyways.

—Owen

Sorry about that! Here's the plain text (I hope!):

Hi all,

OpenCL 2.0 introduced the notion of memory scope in atomic operations to global memory. These scopes are a hint to the underlying platform to optimize how synchronization is achieved. HSAIL also has a notion of memory scopes that is compatible with OpenCL 2.0. Currently, the LLVM IR uses a binary value (SingleThread/CrossThread) to represent synchronization scope on atomic instructions. This makes it difficult to translate OpenCL 2.0 atomic operations to LLVM IR, and also to implement HSAIL memory scopes in the proposed HSAIL backend for LLVM.

We would like to enhance the representation of memory scopes in LLVM IR to allow more values than just the current two. The intention of this email is to invite comments before we start prototyping. Here's what we have in mind:

1. Update the synchronization scope field in atomic instructions from a
    single bit to a wider field, say 32-bit unsigned integer.
2. Retain the current default of zero as "system scope", replacing the
    current "cross thread" scope.
3. All other values are target-defined.
4. The use of "single thread scope" is not clear. If it is required in
    target-independent transforms, then it could be encoded as just "1",
    or as "all ones" in the wider field. The latter option is a bit
    weird, because most targets will have very few scopes. But it is
    useful in case the next point is included in LLVM IR.
5. Possibly add the following constraint on memory scopes: "The scope
    represented by a larger value is nested inside (is a proper subset
    of) the scope represented by a smaller value." This would also imply
    that the value used for single-thread scope must be the largest
    value used by the target.
    This constraint on "nesting" is easily satisfied by HSAIL (and also
    OpenCL), where synchronization scopes increase from a single
    work-item to the entire system. But it is conceivable that other
    targets do not have this constraint. For example, a platform may
    define synchronization scopes in terms of overlapping sets instead
    of proper subsets.
6. The impact of this change is limited to increasing the number of
    bits used to store synchronization scope. Future optimizations on
    atomics may need to interpret scopes in target-defined ways. When
    the synchronization scopes of two atomic instructions do not match,
    these optimizations must query the target for validity.
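To make the proposal above concrete, here is a minimal C++ sketch of the encoding described in points 1-5. The names (SynchScope, SystemScope, SingleThreadScope, isContainedIn) are illustrative only, not actual LLVM API; it simply shows how the optional nesting constraint of point 5 reduces scope containment to an integer comparison.

```cpp
#include <cstdint>

// Illustrative encoding (hypothetical names, not LLVM API):
// zero is the default "system scope" (point 2); other values are
// target-defined (point 3); under the nesting constraint of point 5,
// a larger value denotes a smaller (nested) scope, so single-thread
// scope must be the maximum representable value (point 4).
using SynchScope = uint32_t;
constexpr SynchScope SystemScope = 0;
constexpr SynchScope SingleThreadScope = UINT32_MAX;

// With the nesting constraint, "scope a is contained in scope b"
// reduces to a simple integer comparison.
bool isContainedIn(SynchScope a, SynchScope b) { return a >= b; }
```

Without the constraint of point 5, no such comparison is possible and a target query would be needed instead, which is exactly the trade-off debated below.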

Relation with SPIR: SPIR defines an enumeration for memory scopes, but it does not support LLVM atomic instructions. So memory scopes in SPIR are independent of the representation finally chosen in LLVM IR. A compiler that translates SPIR to native LLVM IR will have to translate memory scopes wherever appropriate.

Sameer.

Copying #5 here for reference:

> 5. Possibly add the following constraint on memory scopes: "The scope
> represented by a larger value is nested inside (is a proper subset
> of) the scope represented by a smaller value." This would also imply
> that the value used for single-thread scope must be the largest
> value used by the target.
> This constraint on "nesting" is easily satisfied by HSAIL (and also
> OpenCL), where synchronization scopes increase from a single
> work-item to the entire system. But it is conceivable that other
> targets do not have this constraint. For example, a platform may
> define synchronization scopes in terms of overlapping sets instead
> of proper subsets.

I support this proposal, and have discussed more or less the same idea with various people in the past.

Thanks! Good to know that.

Regarding point #5, I believe address spaces may already provide the functionality needed to express overlapping constraints.

I am not sure how address spaces can solve the need for such a constraint. Memory scopes are orthogonal to address spaces. An agent that atomically accesses a shared location in one address space may specify a different memory scope on different instructions.

I’m also not aware of any systems that would really need that functionality anyways.

Agreed that there are no known systems with overlapping memory scopes, but it might be premature to dismiss the possibility outright. For example, there could be a system with a fixed number of agents, say four. Each agent can choose to synchronize with one, two, or all three peers on different instructions. The resulting memory scopes cannot be ordered as proper subsets.

Sameer.

It is already the case that address spaces can (potentially) alias. As such, the combination of address spaces and memory scopes can represent any combination where the sharing properties of memory are statically known, simply by having (potentially aliasing) address spaces represent memory pools that are only shared with specific combinations of agents. One can imagine a GPU that worked like this, and GPU programming models do generally differentiate various sharing pools statically.

The case that this doesn’t handle is when the sharing properties are not known statically. However, I question the utility of designing for this, since there are no known systems that require it. We should design the representation to cover all reasonably anticipated systems, not ones that do not exist and have no prospect of existing.

—Owen

It is already the case that address spaces can (potentially) alias. As such, the combination of address spaces and memory scopes can represent any combination where the sharing properties of memory are statically known, simply by having (potentially aliasing) address spaces represent memory pools that are only shared with specific combinations of agents. One can imagine a GPU that worked like this, and GPU programming models do generally differentiate various sharing pools statically.

I am trying to understand this with a concrete example. OpenCL 2.0 allows atomic instructions in the global address space, which is encoded as "1" in the SPIR target. The possible memory scopes are work_item, work_group, device and all_svm_devices. We could resolve the global address space into four statically known "synchronization pools", say "global_work_item", "global_work_group", etc. They would all alias with the real global address space, and could be encoded as new address spaces, is that correct? Then we wouldn't even need the memory scope argument on the atomic instruction, right?

Note that "global_work_item" isn't even a real address space, i.e., it is not a well-defined sequence of addresses that is located somewhere in the global address space. It's actually the set of all global locations that can potentially be accessed by atomic instructions using "work_item" memory scope in a given program. It is not required to be contiguous, and can alias with the entire global address space in the worst case.

So this is what it looks like to me: The proposal is to encode memory scopes as a new field that is orthogonal to address spaces. Address spaces are defined on locations, while memory scopes are defined on operations. Every combination of an address space and a memory scope represents a set of instructions synchronizing with a set of agents through a set of locations in that address space. The first two sets are statically known (not considering the effect of control flow on the instructions). But the set of locations is dynamic, and could span the whole address space in the absence of aliasing information.

The case that this doesn’t handle is when the sharing properties are not known statically. However, I question the utility of designing for this, since there are no known systems that require it. We should design the representation to cover all reasonably anticipated systems, not ones that do not exist and have no prospect of existing.

Sure. But we could just leave this undefined for now, without losing the ability to express what we need. The idea is to not specify any semantics on non-zero memory scopes (such as assuming that they have a nesting order).

Sameer.

1. Update the synchronization scope field in atomic instructions from a
   single bit to a wider field, say 32-bit unsigned integer.

I think this should be an arbitrary bit width integer. I think baking any
size into this is a mistake unless that size is "1".

2. Retain the current default of zero as "system scope", replacing the
   current "cross thread" scope.

I would suggest, address-space scope.

3. All other values are target-defined.

You need to define single-thread scope.

4. The use of "single thread scope" is not clear.

Consider trying to read from memory written in a thread from a signal
handler delivered to that thread. Essentially, there may be a need to write
code which we know will execute in a single hardware thread, but where the
compiler optimizations precluded by atomics need to be precluded as the
control flow within the hardware thread may arbitrarily move from one
sequence of instructions to another.

If it is required in
   target-independent transforms,

Yes, it is. sig_atomic_t.

then it could be encoded as just "1",
   or as "all ones" in the wider field. The latter option is a bit
   weird, because most targets will have very few scopes. But it is
   useful in case the next point is included in LLVM IR.

If we go with your proposed constraint below, I think it is essential to
model single-thread-scope as the maximum integer. It should be a strict
subset of all inter-thread scopes.

5. Possibly add the following constraint on memory scopes: "The scope
   represented by a larger value is nested inside (is a proper subset
   of) the scope represented by a smaller value." This would also imply
   that the value used for single-thread scope must be the largest
   value used by the target.
   This constraint on "nesting" is easily satisfied by HSAIL (and also
   OpenCL), where synchronization scopes increase from a single
   work-item to the entire system. But it is conceivable that other
   targets do not have this constraint. For example, a platform may
   define synchronization scopes in terms of overlapping sets instead
   of proper subsets.

I think this is the important thing to settle on in the design. I'd really
like to hear from a diverse set of vendors and folks operating in the GPU
space to understand whether having this constraint is critically important
or problematic for any reasons.

I think (unfortunately) it would be hard to add this later...

-Chandler

These seem mutually contradictory.

I am not aware of any systems (including GPUs) that would need non-nested memory scopes. If such exist, I might expect them to be some kind of clustered NUMA HPC machine.

—Owen

    1. Update the synchronization scope field in atomic instructions
    from a
       single bit to a wider field, say 32-bit unsigned integer.

I think this should be an arbitrary bit width integer. I think baking any size into this is a mistake unless that size is "1".

I noticed that the LRM never specifies a width for address spaces, but the implementation uses "unsigned" everywhere, which is clearly not an arbitrary width integer. Is this how memory scopes should also be implemented?

    4. The use of "single thread scope" is not clear.

Consider trying to read from memory written in a thread from a signal handler delivered to that thread. Essentially, there may be a need to write code which we know will execute in a single hardware thread, but where the compiler optimizations precluded by atomics need to be precluded as the control flow within the hardware thread may arbitrarily move from one sequence of instructions to another.

    If it is required in
       target-independent transforms,

Yes, it is. sig_atomic_t.

Thanks! This also explains why SingleThread is baked into tsan. I couldn't find a way to work around __tsan_atomic_signal_fence if I removed SingleThread as a well-known memory scope.

    5. Possibly add the following constraint on memory scopes: "The scope
       represented by a larger value is nested inside (is a proper subset
       of) the scope represented by a smaller value." This would also
    imply
       that the value used for single-thread scope must be the largest
       value used by the target.
       This constraint on "nesting" is easily satisfied by HSAIL (and also
       OpenCL), where synchronization scopes increase from a single
       work-item to the entire system. But it is conceivable that other
       targets do not have this constraint. For example, a platform may
       define synchronization scopes in terms of overlapping sets instead
       of proper subsets.

I think this is the important thing to settle on in the design. I'd really like to hear from a diverse set of vendors and folks operating in the GPU space to understand whether having this constraint is critically important or problematic for any reasons.

I think "heterogenous systems" (in general, and not just HSA) might be a better term since it covers more than just GPU devices.

Also, I don't see why this constraint in the general LLVM IR could be critically important to some target. But I can see why it could be problematic for a target! If I understand this correctly, the main issue is that if we do not build nested scopes into the IR, then we can never have target-independent optimizations that work with multiple memory scopes. Is that correct? Is that really so important? What happens when we do have a target that does not have nested memory scopes? Will it not be harder to remove this assumption from the target-independent optimizations?

I think (unfortunately) it would be hard to add this later...

I am not sure I understand this part. The only effect I see is that targets might use enumerations that do not follow a strict order in their list of memory scopes. We can always encourage a future-looking convention to list the memory scopes in nesting order. And in the worst case, the enumerations can be reordered when the need arises, right?

Sameer.

I asked around inside AMD. The feedback that I got is that assuming nested scopes for GPUs is reasonable, but the concern about future targets is valid on general principles.

I still feel that it is a bit premature to build this assumption into LLVM ... we can at least wait until we encounter the first analysis or transformation that needs to depend on nested scopes! Note that this only affects the comparison of atomic accesses to each other, and that too, only when their scopes do not match. It does not affect how non-atomic accesses are optimized with respect to atomic accesses.

Meanwhile, I will start working on the required changes. The only decision required right now is whether to use "1" for SingleThread, or the largest integer value.

Sameer.

I am currently trying to understand how the bitcode reader and writer work, and it will certainly take some time. Besides that, the SDNode class needs to be updated to use a separate unsigned field for the synchronization scope.

Meanwhile, here's a patch that shows the immediate effect of allowing more than two scopes. There seem to be only two users for now --- the X86 target and tsan. The patch attempts to fix them in slightly different ways:

1. The X86 target understands only two scopes --- SingleThread or
    CrossThread (to be renamed later). All other "not SingleThread"
    scopes will be treated as CrossThread. I am not sure if this needs
    an assertion instead ... other scopes should not occur in the
    program anyway!

    Interestingly, this sort of "scope promotion" is supposed to be okay
    for OpenCL and will not introduce any new data races. The frontend
    can do this explicitly when translating OpenCL programs to targets
    with just two scopes.

2. tsan has a similar problem, but it should definitely assert for
    unknown scopes. If it attempted to promote them instead, then it
    will hide data races that it was supposed to find! The assertion can
    be replaced if/when tsan learns how to model other scopes for a
    given target.
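The two strategies in the patch description can be sketched side by side. This is not the patch itself; the helper names are hypothetical, and it assumes (per the earlier discussion) that single-thread scope is encoded as the largest representable value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical single-thread scope value, assumed to be the maximum.
constexpr uint32_t SingleThreadScope = UINT32_MAX;

// X86-style promotion: every scope other than SingleThread is treated
// as CrossThread. For OpenCL this widening is safe: synchronizing at a
// larger scope than requested cannot introduce new data races.
bool isCrossThread(uint32_t Scope) { return Scope != SingleThreadScope; }

// tsan-style strictness: silently promoting an unknown scope could hide
// exactly the races the tool exists to find, so fail loudly instead.
uint32_t tsanScope(uint32_t Scope) {
    assert((Scope == 0 || Scope == SingleThreadScope) &&
           "tsan cannot model this synchronization scope");
    return Scope;
}
```

The asymmetry is the point: a backend may legally over-synchronize, but a race detector must not pretend it understood a scope it does not model.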

Sameer.

compare-synch-scopes.patch (2.37 KB)

I managed to implement wider synchronization scopes in the IR as follows. Would like to put the changes up for review, but I still need some high-level inputs.

1. The synchronization scope is already implemented as an unsigned
    integer in the bitcode, as far as I can tell (but I am still new at
    understanding BitcodeWriter). Only BitcodeReader needs to change so
    that it is aware that CrossThread and SingleThread are not the only
    two values possible.

2. My change redefines CrossThread to be "0" from the current "1", and
    SingleThread to "~0U" from the current "0".

3. The LLVM assembly language needs a way to specify all the scopes.
    The proposed syntax is:
     1. singlethread: This represents "the largest integer representable
        in the underlying implementation" and hence "the smallest
        scope". LLParser will interpret this "~0U".
     2. synchscope(n): Used to specify any synchronization scope that is
        larger than a single thread. The notation is similar to that
        used for address spaces.
     3. The default scope is the same as specifying "synchscope(0)".

4. But the above encoding is actually incompatible with the current
    encoding; I had got this wrong in my original post. In particular,
    the following tests failed because the meaning of 0/1 changed:

    LLVM :: Bitcode/cmpxchg-upgrade.ll
    LLVM :: Bitcode/memInstructions.3.2.ll
    LLVM :: Bitcode/weak-cmpxchg-upgrade.ll

    All three tests contain comments on making sure that older bitcode
    files can be read correctly.

    The simplest approach is to invert the new encoding and say that "0"
    remains as SingleThread, and CrossThread is now the largest integer.
    This will need a new token "crossthread" in the assembly, and point
    (3) above will change accordingly. This breaks away from the general
    tradition of "zero as default", but this was not true with
    synchronization scopes anyway.

5. And yes, "crossthread" needs to be renamed to something better, like
    "allthreads", or "systemscope", or "addressspacescope". The last one
    is accurate but takes a small effort to see why. I would go for
    "allthreads".

Sameer.

Hi Sameer,

Did you ever get any further with this proposal?

—Owen

Hi Owen,

Sorry about the silence on that front! There's certainly an intention to make it happen, but can't commit anything just yet.

Sameer.