[RFC] Correct implementation of memcpy with metadata

I want to propose a new metadata annotation for loads and stores to help close a semantics gap around the @llvm.memcpy intrinsic, namely that it is not currently possible to implement it exactly in pure LLVM IR. This gap creates an unfortunate interaction between InstCombine and SROA that prevents some allocas from being eliminated that otherwise could be.

Motivation

Consider the following IR:

declare target("foo") @init()
declare void @use(target("foo"))
define void @f() {
  %a = alloca target("foo")
  %b = alloca target("foo")
  %a.in = call target("foo") @init()
  store target("foo") %a.in, ptr %a
  call void @llvm.memcpy.p0.p0.i64(ptr %b, ptr %a, i64 8, i1 false)
  %b.out = load target("foo"), ptr %b
  call void @use(target("foo") %b.out)
  ret void
}

declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)

When we run InstCombine and then SROA on this code, we get back the following IR:

define void @f() {
  %a = alloca target("foo")
  %b = alloca target("foo")
  %a.in = call target("foo") @init()
  store target("foo") %a.in, ptr %a
  %x = load i64, ptr %a
  store i64 %x, ptr %b
  %b.out = load target("foo"), ptr %b
  call void @use(target("foo") %b.out)
  ret void
}

However, if we were to first run SROA on this code, we would instead get the following, eliminating the allocations:

define void @f() {
  %a.in = call target("foo") @init()
  call void @use(target("foo") %a.in)
  ret void
}

Now, usually SROA runs on a function before InstCombine does, so the fact that InstCombine’s transformation blocks SROA from eliminating the allocations isn’t usually an issue. But there can be cases where the first run of SROA can’t eliminate the allocations while a later invocation could, primarily around inlining (imagine the call to memcpy were in a separate function). So the problem I need to solve is how to keep the InstCombine optimization of memcpy from blocking the later SROA allocation elimination.
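
To make the inlining scenario concrete, here is a hypothetical sketch (with @copy standing in for the out-of-line helper) where the first SROA run cannot touch the allocas because they escape into a call, and InstCombine can rewrite the memcpy before the inliner exposes it:

define void @g() {
  %a = alloca target("foo")
  %b = alloca target("foo")
  %a.in = call target("foo") @init()
  store target("foo") %a.in, ptr %a
  call void @copy(ptr %b, ptr %a)
  %b.out = load target("foo"), ptr %b
  call void @use(target("foo") %b.out)
  ret void
}

define void @copy(ptr %dst, ptr %src) {
  ; If InstCombine visits @copy before it is inlined into @g, this memcpy
  ; becomes a load/store i64 pair, and the post-inlining SROA run is blocked.
  call void @llvm.memcpy.p0.p0.i64(ptr %dst, ptr %src, i64 8, i1 false)
  ret void
}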

Fundamentally, this issue comes down to the fact that casting an LLVM type to an integer or vice versa isn’t a trivially round-trippable conversion, at least for some types. The issue is most salient for target extension types, where there is no general IR construct for converting a target extension type to an integer or vice versa, but the same problem exists for pointer types, as inttoptr(ptrtoint x) is not equivalent to x (though that does not stop us from performing the fold today).
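
To make the pointer half of that claim concrete, here is a minimal sketch of the round trip in question:

define ptr @roundtrip(ptr %p) {
  %i = ptrtoint ptr %p to i64
  %q = inttoptr i64 %i to ptr
  ; %q has the same address as %p, but the inttoptr does not recover the
  ; provenance of %p, so folding %q to %p is not a semantic no-op (even
  ; though LLVM performs this fold today).
  ret ptr %q
}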

At the end of the day, the IR instruction call void @llvm.memcpy.p0.p0.i64(ptr %b, ptr %a, i64 8, i1 false) is not exactly equivalent to a load/store i64 pair. Perhaps the best explanation of why this is the case is Nicolai Hähnle’s blog post on the byte types proposal. I’m not going to go into detail here, for I think it is only necessary to concede that there is a difference. (A final note: a load/store i64 pair is a valid pessimization of the memcpy. It is always correct to transform an 8-byte memcpy to a load/store i64, but it is not always correct to transform a load/store pair to a memcpy.)
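
For the second half of that parenthetical, here is a sketch (as I read the argument) of why the reverse rewrite is not sound in general:

%v = load i64, ptr %src   ; collapses the bytes into one plain integer
store i64 %v, ptr %dst    ; writes only that integer value back out

; Rewriting this pair as
;   call void @llvm.memcpy.p0.p0.i64(ptr %dst, ptr %src, i64 8, i1 false)
; is not sound in general: the memcpy also copies whatever extra state the
; bytes carry (e.g. pointer provenance), which the integer round trip had
; stripped, and it imposes a no-overlap requirement on %src and %dst.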

Memcpy with metadata

What I propose is that we add !noncasting metadata to load and store instructions that would allow optimizations to understand that a load/store pair can be converted back into a memcpy. Semantically, !noncasting metadata works like this (a minimal sketch of the proposed syntax follows the list):

  • A store marked with !noncasting whose value operand is a load marked with !noncasting is equivalent to a memcpy of the size of the type.
  • The store must directly use the load. If there is any intervening instruction, even a select or a phi, the back-conversion to a memcpy is illegal.
  • There are no restrictions on the type in play here, other than that it must have an integer number of bytes. You can noncasting-load an integer, a pointer, a vector, a float, whatever.
  • (I’m not sure this is actually relevant for anything, but if useful, it can be added) A noncasting-store of a zeroinitializer is equivalent to a memset.
  • For any other use of a noncasting-load, or any other value being stored in a noncasting-store, the semantics are identical to a regular load or store.
  • As a more formal-ish definition of !noncasting, I would suggest wording to the effect that a noncasting-load creates a value that has the same sort of shadow state memory has, and a noncasting-store can access that same shadow state, but no other operation (including selects and phis) can do so. The precise nature of the shadow state of memory is (to my understanding) still unsettled, but we don’t need to define it here to give good semantics to !noncasting, only accept that it exists.
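
Under these rules, the following pair (a sketch of the proposed syntax; !0 is an arbitrary metadata node) would be exactly equivalent to an 8-byte memcpy from %src to %dst:

%v = load i64, ptr %src, !noncasting !0
store i64 %v, ptr %dst, !noncasting !0
; Equivalent, per the rules above, to:
;   call void @llvm.memcpy.p0.p0.i64(ptr %dst, ptr %src, i64 8, i1 false)

!0 = !{}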

The main way I expect optimizations to take advantage of this metadata is that they have some “expected” type for the memory (in the case of SROA, this is the type of the relevant alloca), and if they come across a noncasting-load whose only uses are noncasting-stores and whose type is the same size as the expected type, they can simply mutate the type of the noncasting-load and not need to worry about any casting. (Indeed, opaque pointers help tremendously here, as there is no longer any need to think about casting the pointer operand’s type.)

The final rule should allow most optimizations to ignore the presence or absence of !noncasting. The main danger would come if some arithmetic expression of a noncasting-load is eliminated and the result is stored with a noncasting-store, but I think this can be avoided given the rarity of noncasting in the first place, plus an additional InstCombine change that removes !noncasting metadata from loads with uses other than noncasting-stores.
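
To make the danger concrete, here is a hypothetical sketch of the pattern that cleanup is meant to defuse:

%x = load i64, ptr %a, !noncasting !0
%y = xor i64 %x, 0                     ; identity; InstCombine folds %y to %x
store i64 %y, ptr %b, !noncasting !0

; Before the fold, the store does not directly use the load, so the pair is
; not convertible to a memcpy; after the fold, it suddenly is. The proposed
; cleanup instead strips !noncasting from %x first, because %x has a use
; (the xor) that is not a noncasting-store.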

Example

Here is a concrete example of how the changes would work. Given the starting definition of @f, a change to InstCombine would mean that the result of running InstCombine looks like this:

define void @f() {
  %a = alloca target("foo")
  %b = alloca target("foo")
  %a.in = call target("foo") @init()
  store target("foo") %a.in, ptr %a
  %x = load i64, ptr %a, !noncasting !0
  store i64 %x, ptr %b, !noncasting !0
  %b.out = load target("foo"), ptr %b
  call void @use(target("foo") %b.out)
  ret void
}

!0 = !{}

After this, SROA would rely on the new metadata to reduce its output to this:

define void @f() {
  %a.in = call target("foo") @init()
  call void @use(target("foo") %a.in)
  ret void
}

Obviously, without the new metadata, SROA would retain the allocas and the load and store instructions, as shown in prior sections. The rules I have for the SROA transformation are actually pretty simple (note the check that the load is the store’s value operand, not its pointer operand):

void retypeMemcpy(LoadInst &LI) {
  // Only loads carrying !noncasting metadata are candidates.
  MDNode *MemcpyNode = LI.getMetadata(LLVMContext::MD_noncasting);
  if (!MemcpyNode)
    return;
  // The load must be exactly the size of the new alloca type; otherwise
  // retyping it would change how many bytes the "memcpy" covers.
  if (DL.getTypeSizeInBits(LI.getType()) != DL.getTypeSizeInBits(NewAllocaTy))
    return;
  // Every use must be a noncasting-store with the same metadata node that
  // stores the loaded value itself (not a store that merely uses the load
  // as its pointer operand).
  for (User *U : LI.users()) {
    auto *SI = dyn_cast<StoreInst>(U);
    if (!SI || SI->getValueOperand() != &LI ||
        SI->getMetadata(LLVMContext::MD_noncasting) != MemcpyNode)
      return;
  }
  // Safe: retype the load; the stores pick up the new type from the value.
  LI.mutateType(NewAllocaTy);
}

Alternatives

Here is a list of alternatives I have considered, and the reasons why I have so far declined to pursue them further:

  • The byte type proposal. This has been proposed before and had a rather frosty reception at the time. In essence, I believe my proposal here solves one of the primary use cases of the byte type without needing the full complexity of new LLVM types.
  • Removing the small-memcpy optimization in InstCombine. I haven’t done enough of a performance analysis to understand the impacts of such a change.
  • Applying the changes to SROA, just not reliant on the presence of metadata. For my immediate use case of target extension types, this is sufficient and probably justifiable (given the way the relevant types work in practice, there are no pointer-provenance-like concerns, although this may not hold true for all putative types). But this change isn’t sound for pointer types, although we already do something like this for pointer types anyway.
  • Adjusting the layout type of target extension types so they are not caught by the small-memcpy optimization. This feels like a hack to work around an incorrect optimization, and it has other effects that might not be tolerable in general.
  • Changing the frontend (in this case, Clang) to not generate memcpy intrinsics for the relevant target extension types. Again, this feels like a hack to work around an incorrect optimization, and from my poking around at Clang codegen, it is far more difficult to make work than this change.

I would appreciate comments on the wisdom, utility, and especially correctness of this new metadata, especially from people who have a stronger grasp of the underdefined LLVM memory model than I do.

It’s a little hard for me to follow what your assumptions are here. As best as I can tell, they’re roughly the following:

  • Memory can have “extra” properties, like pointer provenance, or some target-specific properties related to target extension types.
  • A memcpy copies these properties.
  • Transforming a memcpy into an integer load/store is always legal; rules exist to re-synthesize any relevant properties, like pointer provenance, or whatever properties target types have.
  • As a corollary, store i64 (load i64, ptr %p), ptr %p is not a no-op. But under this proposal, if you mark the load and store !noncasting, it is a no-op (sketched below).
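
Rendered in IR, I read that corollary as (a sketch; !0 is an arbitrary node):

; Under the assumptions above, not a no-op: the pair launders per-byte
; state through a plain i64.
%v = load i64, ptr %p
store i64 %v, ptr %p

; Under the proposal, equivalent to an 8-byte memcpy from %p to %p, i.e.
; a removable no-op (modulo the overlap question raised further down).
%w = load i64, ptr %p, !noncasting !0
store i64 %w, ptr %p, !noncasting !0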

The assumption that re-synthesis is always possible seems dubious to me, particularly when non-integral pointers get involved.

I don’t know if you saw A memory model for LLVM IR supporting limited type punning at the time, but it’s sort of similar to your proposal.

This proposal doesn’t solve the issue with chars in C, which are universal holders. They can be used to implement memcpy (which your proposal handles if the copy is straightforward), but they can also be cast later to other things.
Also, I’m not sure the proposal allows vectorizing the loads/stores.

I know the byte type sounds like a big change, but we’ve introduced types in the IR before and LLVM didn’t break. I think we just need to bite the bullet and that’s it.

(This proposal is creating an implicit byte type, since we need to formalize the return value of a !noncasting load somehow, and that’s essentially a raw byte.)

Isn’t the opposite the case? Doing the load/store via type i64 might lose information (e.g. if some but not all the bytes are poison), whereas memcpy will exactly preserve the entire internal state of that range of memory.

With this proposal it seems like i64 can sometimes hold per-byte poison information, but not always. So when dealing with an i64 of unknown origin, it might be in a state where some bytes are poison and others are not, and transformations have to deal with that. That seems like a massive potential footgun?

The most obvious one is whether the byte is initialized or poison. memcpy has bytewise granularity for this (I assume), but i64 does not.

That is only true after you verify that *to is disjoint from *from, otherwise “all sorts of strange things can happen.”

Fair, memcpy is UB if they overlap.

But I don’t think “load/store i64 pair is a valid pessimization of the memcpy” is correct, unless you meant “with this new metadata applied” (i.e., basically a load/store with the byte type).
