RFC: Element-atomic memory intrinsics

Greetings all,
I am picking up the work that was started in https://reviews.llvm.org/D27133 — adding support for an element-atomic memcpy/memset/memmove to LLVM. I would appreciate suggestions/thoughts/advice/comments on how to best proceed with this work in a way that will be acceptable to the LLVM group.

I apologize in advance; this is going to be a long one…

Background

Loads/stores of shared data in Java must be done in an atomic-like manner — it must essentially look like the entire variable is loaded/stored in a single operation. So, no doing stuff like splitting the load of a uint32 into four loads of uint8’s, or two loads of uint16’s; such a splitting could result in a data-race where each of the split loads can load a different version of the variable’s value (due to the variable being stored-to in other threads of execution). In LLVM, the atomic-like behaviour is achieved by emitting an atomic load/store with the ‘unordered’ ordering constraint.

In LLVM there is a loop idiom recognition pass that will convert loops that look like memcpy/memmove/memset into a direct call to the corresponding intrinsic — which allows the rest of the LLVM passes to reason about the operation better, and for codegen to materialize probably-better optimized versions of the loop (ex: one that uses wider loads/stores). But, as one might expect, this idiom recognition doesn’t touch atomic operations — i.e. if a loop that’s doing a memcpy is doing so with unordered loads & stores then loop idiom should not convert that into a memcpy. Reason being that memcpy/memset/memmove implementations are allowed to operate on the byte-level when doing the op, and that would break our requirement that our data be loaded/stored in an atomic-like manner if we’re, say, copying arrays of uint32’s.

Goal

We would like for loop idiom to be able to convert unordered load/store loops into memory intrinsics, but doing so requires that we have some way of expressing that the resulting memory intrinsic be unordered and what the element-size is. We would also like all/most of the existing LLVM passes that understand the semantics of the memory intrinsics to understand the semantics of these ‘unordered’ versions.

Method

Clearly we are going to have to teach LLVM about unordered memory intrinsics. There are, as I can see it, a couple of ways to accomplish this. I’d like your opinions on which are preferable, or if you can think of any other options. In no particular order…

Option 1)
Introduce a new unordered/element-atomic version of each of the memory intrinsics.
Ex: @llvm.memcpy_element_atomic — work was already started to introduce this one in D27133, but could be backed out and restarted.
Intrinsic prototype: @llvm.memcpy_element_atomic.(* dest, * src, len, i32 align, i2 isunordered, i16 element_size)
Semantics:

  • Will do a memcpy of len bytes from src to dest.
  • len must = k * lcm( #bytes in dest type, #bytes in src type), for some non-negative integer k [note: lcm = least-common multiple]
  • load/store size given by the constant power-of-2 parameter “element_size”; expected to be the lcm(sizeof(dest_ty), sizeof(src_ty))
  • isunordered param: bit 0 == 1 => stores to dest must be marked ‘unordered’; bit 1 == 1 => loads from src must be marked ‘unordered’
  • expanded memcpy code has this sort of form (ignoring alignment for simplicity):
    template<typename len_ty, uint16_t element_size>
    void memcpy_element_atomic(char dest, char src, len_ty len) {
    copy_ty = <int type with sizeof(type) == element_size>;
    len_ty num_iters = len / sizeof(copy_ty);
    for (len_ty i=0; i < num_iters; i++)
    (copy_ty)(dest + i
    sizeof(copy_ty)) = (copy_ty)(src + i
    sizeof(copy_ty));
    // Note: restriction on len means we don’t need a residual loop.
    }

We would inject the new intrinsic into the MemIntrinsic introspection hierarchy (IR/InstrinsicInst.h).

Pros:

  • New intrinsic doesn’t interfere with existing memory intrinsics code.
  • Adding new intrinsic to MemIntrinsic hierarchy means that many LLVM passes will handle the new intrinsic “for free” (though, not really free; some work would need to be done)

Cons:

  • Passes that check intrinsic ID against Intrinsics::memcpy, etc would need to be updated to handle the new intrinsic ID.
  • When new code is introduced by others to exploit/handle memcpy, then they may not remember to handle the memcpy_element_atomic version.

Option 2)
Expand the current/existing memory intrinsics to identify the unordered constraint, if one exists, in much the same way that volatility is expressed — i.e. add an ‘isunordered’ parameter(s) to the intrinsic.
This option has the same semantics as option 1; the only difference is, literally, that we expand the existing memcpy/memset/memmove intrinsics to have an ‘isunordered’ parameter and an ‘element_size’ parameter, so the prototype becomes something like:
@llvm.memcpy.(* dest, * src, len, i32 align, i1 isvolatile, i2 isunordered, i16 element_size)

Pros:

  • Minimal extra work to handle the new version in existing passes — only need to change passes that create calls to memory intrinsics, expand memory intrinsics, or that need to care about unordered (which none should that are reasoning about memory intrinsic semantics).
  • New code that’s introduced by others to exploit/handle memory intrinsics should just handle unordered for free — unordered being a part of the memory intrinsic means it’s more likely that the person will realize that they have to think about it (i.e. it raises the profile of unordered memory intrinsics).

Cons:

  • Breaks backward compatibility of the IR — is there a mechanism for migrating old IR into a new IR version when loading the IR into LLVM?
  • Memory intrinsic prototypes changing could break some 3rd party consumers of LLVM-IR until they update their parsers.
  • Some risk that existing passes (LLVM or 3rd party) don’t handle the new attribute of memory intrinsic properly — introducing bugs. Though, I suspect that this risk is small/tiny as a functional issue would only materialize if they’re materializing unordered-atomic load/stores (which only Java does, I think).

Note on generalizing to other orderings

We had some internal discussion on generalizing this to the other LLVM atomic orderings: monotonic, acquire, release, acquire_release, and sequentially_consistent.

We don’t think that it’s worth generalizing the memory intrinsics to general memory atomic-ordering constraints. The only benefit that we can see is that doing so, and teaching loop idiom about them, would allow passes to reason about memcpy/memset/memmove semantics via the intrinsic instead of inferring info from the loop behaviour — arguably, a big benefit. However, there are complications in doing so:

i) The atomic ordering constraints are memory barriers — this would mean that all passes that have to know about atomic-load/store vs load/store would also have to be taught about atomic memory intrinsics. i.e. The memory intrinsic itself would become a memory barrier, the type of which would be a union of the src & dest atomic-orderings.

ii) Rematerialzing the loop from the memory intrinsic will be tricky and have a lot of special cases. The rematerialized loop would have to exactly match the loop that the memory intrinsic was created from due to the requirements of the memory ordering barriers. We would probably have to have a separate atomic-ordering for the src & dest parameters of the memory intrinsic to be able to faithfully reproduce the original loop semantics.

iii) The loop would always have to be rematerialized from the intrinsic in codegen — if the intrinsic ever becomes a lib call, then there’s a bug. No different from the unordered case that we want to handle, really, but it’s worth pointing out since this bug would show as a data race and would be a beast to debug.

iv) I doubt that any real-world code will ever exist that would become one of these atomic-ordered memory intrinsics, so it’s probably not worth the trouble.

Thanks for reading! Thoughts, questions, and comments are welcome.

-Daniel

Hi Daniel,

[+CC Mehdi, Vedant for the auto upgrade issue]

**Method**

Clearly we are going to have to teach LLVM about unordered memory
intrinsics. There are, as I can see it, a couple of ways to accomplish this.
I’d like your opinions on which are preferable, or if you can think of any
other options. In no particular order…

Option 1)
Introduce a new unordered/element-atomic version of each of the memory
intrinsics.
  Ex: @llvm.memcpy_element_atomic — work was already started to introduce
this one in D27133, but could be backed out and restarted.
  Intrinsic prototype: @llvm.memcpy_element_atomic.<overload desc>(<ty>*
dest, <ty>* src, <ty> len, i32 align, i2 isunordered, i16 element_size)
    Semantics:
       * Will do a memcpy of len bytes from src to dest.
       * len must = k * lcm( #bytes in dest type, #bytes in src type), for
some non-negative integer k [note: lcm = least-common multiple]
       * load/store size given by the constant power-of-2 parameter
“element_size”; expected to be the lcm(sizeof(dest_ty), sizeof(src_ty))

I'm not sure if sizeof(dest_ty) and sizeof(src_ty) adds anything here.

LLVM is moving towards "typeless pointers" (i.e. pointers will not
have pointee types, instead they will just be a "generic pointer" in
some address space), so working in the types of dest and src into the
specification seems awkward.

Also, does the non-overlap restriction of src and dest (as in the
regular llvm.memcpy) apply here as well?

       * isunordered param: bit 0 == 1 => stores to dest must be marked
‘unordered’; bit 1 == 1 => loads from src must be marked ‘unordered'

What if the bits are zero -- will the stores / loads (depending on
which bit) be "ordered" in that case, or something stronger?

Option 2)
  Expand the current/existing memory intrinsics to identify the unordered
constraint, if one exists, in much the same way that volatility is expressed
— i.e. add an ‘isunordered’ parameter(s) to the intrinsic.
  This option has the same semantics as option 1; the only difference is,
literally, that we expand the existing memcpy/memset/memmove intrinsics to
have an ‘isunordered’ parameter and an ‘element_size’ parameter, so the
prototype becomes something like:
   @llvm.memcpy.<overload desc>(<ty>* dest, <ty>* src, <ty> len, i32 align,
i1 isvolatile, i2 isunordered, i16 element_size)

Pros:
   * Minimal extra work to handle the new version in existing passes — only
need to change passes that create calls to memory intrinsics, expand memory
intrinsics, or that need to care about unordered (which none should that are
reasoning about memory intrinsic semantics).
   * New code that’s introduced by others to exploit/handle memory
intrinsics should just handle unordered for free — unordered being a part of
the memory intrinsic means it’s more likely that the person will realize
that they have to think about it (i.e. it raises the profile of unordered
memory intrinsics).

I like the second point, but (unfortunately) I suspect in practice
you'll see new code do:

  if (MCI->isOrdered())
    return false; // be conservative

Cons:
   * Breaks backward compatibility of the IR — is there a mechanism for
migrating old IR into a new IR version when loading the IR into LLVM?

I think the migration here will be fairly straightforward -- you can
just auto-upgrade old calls to memcpy to pass in 0 for the isordered
argument. But I've CC'd Mehdi and Vedant to help shed some light on
this.

-- Sanjoy

Hi Daniel,

[+CC Mehdi, Vedant for the auto upgrade issue]

**Method**

Clearly we are going to have to teach LLVM about unordered memory
intrinsics. There are, as I can see it, a couple of ways to accomplish this.
I’d like your opinions on which are preferable, or if you can think of any
other options. In no particular order…

Option 1)
Introduce a new unordered/element-atomic version of each of the memory
intrinsics.
Ex: @llvm.memcpy_element_atomic — work was already started to introduce
this one in D27133, but could be backed out and restarted.

I'm curious about this -- do you know why the decision was made in D27133 to go with a new intrinsic, instead of extending the existing one? Would the same rationale apply here?

Intrinsic prototype: @llvm.memcpy_element_atomic.<overload desc>(<ty>*
dest, <ty>* src, <ty> len, i32 align, i2 isunordered, i16 element_size)
  Semantics:
     * Will do a memcpy of len bytes from src to dest.
     * len must = k * lcm( #bytes in dest type, #bytes in src type), for
some non-negative integer k [note: lcm = least-common multiple]
     * load/store size given by the constant power-of-2 parameter
“element_size”; expected to be the lcm(sizeof(dest_ty), sizeof(src_ty))

I'm not sure if sizeof(dest_ty) and sizeof(src_ty) adds anything here.

LLVM is moving towards "typeless pointers" (i.e. pointers will not
have pointee types, instead they will just be a "generic pointer" in
some address space), so working in the types of dest and src into the
specification seems awkward.

Also, does the non-overlap restriction of src and dest (as in the
regular llvm.memcpy) apply here as well?

     * isunordered param: bit 0 == 1 => stores to dest must be marked
‘unordered’; bit 1 == 1 => loads from src must be marked ‘unordered'

What if the bits are zero -- will the stores / loads (depending on
which bit) be "ordered" in that case, or something stronger?

Option 2)
Expand the current/existing memory intrinsics to identify the unordered
constraint, if one exists, in much the same way that volatility is expressed
— i.e. add an ‘isunordered’ parameter(s) to the intrinsic.
This option has the same semantics as option 1; the only difference is,
literally, that we expand the existing memcpy/memset/memmove intrinsics to
have an ‘isunordered’ parameter and an ‘element_size’ parameter, so the
prototype becomes something like:
@llvm.memcpy.<overload desc>(<ty>* dest, <ty>* src, <ty> len, i32 align,
i1 isvolatile, i2 isunordered, i16 element_size)

Pros:
* Minimal extra work to handle the new version in existing passes — only
need to change passes that create calls to memory intrinsics, expand memory
intrinsics, or that need to care about unordered (which none should that are
reasoning about memory intrinsic semantics).
* New code that’s introduced by others to exploit/handle memory
intrinsics should just handle unordered for free — unordered being a part of
the memory intrinsic means it’s more likely that the person will realize
that they have to think about it (i.e. it raises the profile of unordered
memory intrinsics).

I like the second point, but (unfortunately) I suspect in practice
you'll see new code do:

if (MCI->isOrdered())
  return false; // be conservative

Cons:
* Breaks backward compatibility of the IR — is there a mechanism for
migrating old IR into a new IR version when loading the IR into LLVM?

I think the migration here will be fairly straightforward -- you can
just auto-upgrade old calls to memcpy to pass in 0 for the isordered
argument. But I've CC'd Mehdi and Vedant to help shed some light on
this.

LLVM has one test which for backwards-compatibility with old versions of the memcpy intrinsic, which provides limited coverage (standardCIntrinsic.3.2.ll).

Whichever option you choose, it would be helpful to add uses of the new intrinsic to test/Bitcode/compatibility.ll. If you choose Option 2, it would also help to copy these uses into the 4.0 bitcode compatibility test (with "is_unordered" dropped), and to re-generate the bitcode for the test. I can help out with this if you'd like.

vedant

Hi Sanjoy,
Responses inlined…

Hi Daniel,

[+CC Mehdi, Vedant for the auto upgrade issue]

**Method**

Clearly we are going to have to teach LLVM about unordered memory
intrinsics. There are, as I can see it, a couple of ways to accomplish this.
I’d like your opinions on which are preferable, or if you can think of any
other options. In no particular order…

Option 1)
Introduce a new unordered/element-atomic version of each of the memory
intrinsics.
Ex: @llvm.memcpy_element_atomic — work was already started to introduce
this one in D27133, but could be backed out and restarted.
Intrinsic prototype: @llvm.memcpy_element_atomic.<overload desc>(<ty>*
dest, <ty>* src, <ty> len, i32 align, i2 isunordered, i16 element_size)
   Semantics:
      * Will do a memcpy of len bytes from src to dest.
      * len must = k * lcm( #bytes in dest type, #bytes in src type), for
some non-negative integer k [note: lcm = least-common multiple]
      * load/store size given by the constant power-of-2 parameter
“element_size”; expected to be the lcm(sizeof(dest_ty), sizeof(src_ty))

I'm not sure if sizeof(dest_ty) and sizeof(src_ty) adds anything here.

LLVM is moving towards "typeless pointers" (i.e. pointers will not
have pointee types, instead they will just be a "generic pointer" in
some address space), so working in the types of dest and src into the
specification seems awkward.

Poor choice of wording on my part. By sizeof(<thing>) I mean the element-size of the load/store that appeared in the original loop from which the memory intrinsic was materialized — arguably, these will be the same in any real cases so it probably doesn’t make any sense to even mention it...

Also, does the non-overlap restriction of src and dest (as in the
regular llvm.memcpy) apply here as well?

Yes, I would think so.

      * isunordered param: bit 0 == 1 => stores to dest must be marked
‘unordered’; bit 1 == 1 => loads from src must be marked ‘unordered'

What if the bits are zero -- will the stores / loads (depending on
which bit) be "ordered" in that case, or something stronger?

This is partly why I prefer option 2. An ‘isunordered’ value of 0 is nonsense for the standalone atomic-unordered memory intrinsic. It would imply that neither the source nor dest needs to be loaded/stored via unordered-atomic ops, and so the memory intrinsic is identical to the ordinary non-atomic one.

Option 2)
Expand the current/existing memory intrinsics to identify the unordered
constraint, if one exists, in much the same way that volatility is expressed
— i.e. add an ‘isunordered’ parameter(s) to the intrinsic.
This option has the same semantics as option 1; the only difference is,
literally, that we expand the existing memcpy/memset/memmove intrinsics to
have an ‘isunordered’ parameter and an ‘element_size’ parameter, so the
prototype becomes something like:
  @llvm.memcpy.<overload desc>(<ty>* dest, <ty>* src, <ty> len, i32 align,
i1 isvolatile, i2 isunordered, i16 element_size)

Pros:
  * Minimal extra work to handle the new version in existing passes — only
need to change passes that create calls to memory intrinsics, expand memory
intrinsics, or that need to care about unordered (which none should that are
reasoning about memory intrinsic semantics).
  * New code that’s introduced by others to exploit/handle memory
intrinsics should just handle unordered for free — unordered being a part of
the memory intrinsic means it’s more likely that the person will realize
that they have to think about it (i.e. it raises the profile of unordered
memory intrinsics).

I like the second point, but (unfortunately) I suspect in practice
you'll see new code do:

if (MCI->isOrdered())
   return false; // be conservative

Yes, that would be an unfortunate reality, but one can hope. :slight_smile:

Cons:
  * Breaks backward compatibility of the IR — is there a mechanism for
migrating old IR into a new IR version when loading the IR into LLVM?

I think the migration here will be fairly straightforward -- you can
just auto-upgrade old calls to memcpy to pass in 0 for the isordered
argument. But I've CC'd Mehdi and Vedant to help shed some light on
this.

— Sanjoy

Thanks!

-Daniel

Hi Vedant,
Responses inline…

Hi Daniel,

[+CC Mehdi, Vedant for the auto upgrade issue]

Method

Clearly we are going to have to teach LLVM about unordered memory
intrinsics. There are, as I can see it, a couple of ways to accomplish this.
I’d like your opinions on which are preferable, or if you can think of any
other options. In no particular order…

Option 1)
Introduce a new unordered/element-atomic version of each of the memory
intrinsics.
Ex: @llvm.memcpy_element_atomic — work was already started to introduce
this one in D27133, but could be backed out and restarted.

I’m curious about this – do you know why the decision was made in D27133 to go with a new intrinsic, instead of extending the existing one? Would the same rationale apply here?

The rationale as I understand it basically boils down to minimizing risk — make the changes with the introduction of a new intrinsic that’ll be invisible in practice to anyone except Java, see how that looks/feels, and then extend the existing one (& remove the added one) once it looks like it’ll work. I have no problem with continuing that line, but I do prefer to move away from the separate intrinsic after the thing has been proven out.

Cons:

  • Breaks backward compatibility of the IR — is there a mechanism for
    migrating old IR into a new IR version when loading the IR into LLVM?

I think the migration here will be fairly straightforward – you can
just auto-upgrade old calls to memcpy to pass in 0 for the isordered
argument. But I’ve CC’d Mehdi and Vedant to help shed some light on
this.

LLVM has one test which for backwards-compatibility with old versions of the memcpy intrinsic, which provides limited coverage (standardCIntrinsic.3.2.ll).

Whichever option you choose, it would be helpful to add uses of the new intrinsic to test/Bitcode/compatibility.ll. If you choose Option 2, it would also help to copy these uses into the 4.0 bitcode compatibility test (with “is_unordered” dropped), and to re-generate the bitcode for the test. I can help out with this if you’d like.

Will do, and I’ll definitely reach out for assistance with the bitcode compatibility test when the time arrives. Thanks!

-Daniel

Hi Daniel,

Hi Sanjoy,