I want to propose a new metadata annotation for loads and stores to help close a semantics gap around the `@llvm.memcpy` intrinsic, namely that it is not currently possible to implement it exactly in pure LLVM IR. This gap creates an unfortunate interaction between InstCombine and SROA that prevents some allocas from being eliminated that otherwise could be.
Motivation
Consider the following IR:
```llvm
declare target("foo") @init()
declare void @use(target("foo"))

define void @f() {
  %a = alloca target("foo")
  %b = alloca target("foo")
  %a.in = call target("foo") @init()
  store target("foo") %a.in, ptr %a
  call void @llvm.memcpy.p0.p0.i64(ptr %b, ptr %a, i64 8, i1 false)
  %b.out = load target("foo"), ptr %b
  call void @use(target("foo") %b.out)
  ret void
}

declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)
```
When we run InstCombine and then SROA on this code, we get back the following IR:
```llvm
define void @f() {
  %a = alloca target("foo")
  %b = alloca target("foo")
  %a.in = call target("foo") @init()
  store target("foo") %a.in, ptr %a
  %x = load i64, ptr %a
  store i64 %x, ptr %b
  %b.out = load target("foo"), ptr %b
  call void @use(target("foo") %b.out)
  ret void
}
```
However, if we were to first run SROA on this code, we would instead get the following, eliminating the allocations:
```llvm
define void @f() {
  %a.in = call target("foo") @init()
  call void @use(target("foo") %a.in)
  ret void
}
```
Now, SROA will usually run on a function before InstCombine does, so the fact that InstCombine's transformation blocks SROA from eliminating the allocas isn't usually an issue. But there are cases where the first run of SROA cannot eliminate the allocas while a later invocation could, primarily around inlining (imagine the call to `memcpy` were in a separate function). So the problem I need to solve is how to let InstCombine's optimization of `memcpy` not block the later SROA alloca elimination.
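To make the inlining scenario concrete, here is a hypothetical variant in which the `memcpy` lives in a helper function (`@copy` and `@g` are illustrative names, not part of the original example):

```llvm
define void @copy(ptr %dst, ptr %src) {
  call void @llvm.memcpy.p0.p0.i64(ptr %dst, ptr %src, i64 8, i1 false)
  ret void
}

define void @g() {
  %a = alloca target("foo")
  %b = alloca target("foo")
  %a.in = call target("foo") @init()
  store target("foo") %a.in, ptr %a
  call void @copy(ptr %b, ptr %a)
  %b.out = load target("foo"), ptr %b
  call void @use(target("foo") %b.out)
  ret void
}
```

The first SROA run cannot eliminate the allocas in `@g` because `%a` and `%b` escape into `@copy`, while InstCombine is free to rewrite the `memcpy` inside `@copy` into an `i64` load/store pair. Once `@copy` is inlined, the later SROA run sees only the integer load/store and is blocked.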
Fundamentally, this issue comes down to the fact that casting an LLVM type to an integer or vice versa isn't a trivial, round-trippable conversion, at least for some types. The issue is most salient for target extension types, where there is no general IR construct for converting a target extension type to an integer or vice versa, but the same problem exists for pointer types, as `inttoptr(ptrtoint x)` is not equivalent to `x` (though that does not stop us from doing the conversion anyway).
At the end of the day, the IR instruction `call void @llvm.memcpy.p0.p0.i64(ptr %b, ptr %a, i64 8, i1 false)` is not exactly equivalent to an `i64` load/store pair. Perhaps the best explanation of why is Nicolai Hähnle's blog post on the byte types proposal. I'm not going to go into detail here, for I think it is only necessary to concede that there is a difference. (A final note: an `i64` load/store pair is a valid pessimization of the `memcpy`. It is always correct to transform an 8-byte `memcpy` into an `i64` load/store pair, but it is not always correct to transform a load/store pair into a `memcpy`.)
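The one-directional legality can be sketched in IR, reusing the 8-byte copy from above:

```llvm
; Always legal (a pessimization):
call void @llvm.memcpy.p0.p0.i64(ptr %b, ptr %a, i64 8, i1 false)
; may be rewritten as
%x = load i64, ptr %a
store i64 %x, ptr %b
```

The reverse rewrite is the problematic direction: the integer load/store round-trips the bytes through an `i64` value, which is observable for values such as pointers or target extension types, whereas a `memcpy` moves the bytes without interpreting them.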
Memcpy with metadata
What I propose is that we add a `!noncasting` metadata to load and store instructions that would allow optimizations to understand that a load/store pair can be converted back into a `memcpy`. Semantically, `!noncasting` metadata works like this:
- A store marked with `!noncasting` that is storing a load marked with `!noncasting` is equivalent to a `memcpy` using the size of the type.
- The store must directly use the load. If there is any intervening instruction, even a `select` or a `phi`, the back-conversion to a `memcpy` is illegal.
- There are no restrictions on the type in play here, other than that it must occupy an integer number of bytes. You can noncasting-load an integer, a pointer, a vector, a float, whatever.
- (I'm not sure this is actually relevant for anything, but if useful, it can be added:) A noncasting-store of a `zeroinitializer` is equivalent to a `memset`.
- For any other use of a noncasting-load, or any other value being stored by a noncasting-store, the semantics are identical to a regular load or store.
- As a more formal-ish definition of `!noncasting`, I would suggest wording to the effect that a noncasting-load creates a value that has the same sort of shadow state memory has, and a noncasting-store can access that same shadow state, but no other operation (including selects and phis) can do so. The precise nature of the shadow state of memory is (to my understanding) still unsettled, but we don't need to define it here to give good semantics to `!noncasting`; we only need to accept that it exists.
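To illustrate the rules above, here is a hypothetical snippet (`!0` stands in for whatever node the metadata ends up taking):

```llvm
; Back-convertible to a memcpy: the store directly uses the load.
%x = load i64, ptr %a, !noncasting !0
store i64 %x, ptr %b, !noncasting !0

; Not back-convertible: a select intervenes, so both instructions
; behave exactly like a regular load and store.
%y = load i64, ptr %a, !noncasting !0
%z = select i1 %c, i64 %y, i64 0
store i64 %z, ptr %b, !noncasting !0
```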
The main way I expect optimizations to take advantage of this metadata is that they have some "expected" type of the memory (in the case of SROA, this is the type of the relevant alloca), and if they come across a noncasting-load whose only uses are noncasting-stores and whose type is the same size as the expected type, they can simply mutate the type of the noncasting-load and not need to worry about any casting. (Indeed, opaque pointers help tremendously here, as there is no longer any need to think about casting the pointer operand's type.)
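As a sketch of that mutation, assuming the relevant alloca has type `target("foo")` and that type occupies 8 bytes on the target:

```llvm
; Before retyping:
%x = load i64, ptr %a, !noncasting !0
store i64 %x, ptr %b, !noncasting !0

; After retyping: no casts needed, just a new type on the pair.
%x = load target("foo"), ptr %a, !noncasting !0
store target("foo") %x, ptr %b, !noncasting !0
```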
The final rule should allow most optimizations to ignore the presence or absence of `!noncasting`. The main danger would come if some arithmetic expression of a noncasting-load were eliminated and the result stored with a noncasting-store, but I think such cases can be avoided by the rarity of noncasting in the first place, plus an additional InstCombine change that removes `!noncasting` metadata from loads with uses other than noncasting-stores.
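That InstCombine cleanup, sketched as a hypothetical before/after pair:

```llvm
; Before: the load has a non-store use, so the metadata is unsafe to keep.
%x = load i64, ptr %a, !noncasting !0
%sum = add i64 %x, 1
store i64 %x, ptr %b, !noncasting !0

; After: with the metadata stripped from the load, no pass will try to
; treat the pair as a memcpy; the store behaves as a regular store.
%x = load i64, ptr %a
%sum = add i64 %x, 1
store i64 %x, ptr %b, !noncasting !0
```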
Example
A concrete example of how the changes would work. Given the starting definition of `@f`, a change to InstCombine would mean that its output looks like this:
```llvm
define void @f() {
  %a = alloca target("foo")
  %b = alloca target("foo")
  %a.in = call target("foo") @init()
  store target("foo") %a.in, ptr %a
  %x = load i64, ptr %a, !noncasting !0
  store i64 %x, ptr %b, !noncasting !0
  %b.out = load target("foo"), ptr %b
  call void @use(target("foo") %b.out)
  ret void
}
```
After this, SROA would rely on the new metadata to reduce its output to this:
```llvm
define void @f() {
  %a.in = call target("foo") @init()
  call void @use(target("foo") %a.in)
  ret void
}
```
Obviously, without the new metadata, it would retain the allocas and the load and store instructions, as shown in prior sections. The rules I have for the SROA transformation are actually pretty simple (note the check that the load is the store's value operand, not its pointer operand):

```cpp
void retypeMemcpy(LoadInst &LI) {
  MDNode *MemcpyNode = LI.getMetadata(LLVMContext::MD_noncasting);
  if (!MemcpyNode)
    return;
  if (DL.getTypeSizeInBits(LI.getType()) != DL.getTypeSizeInBits(NewAllocaTy))
    return;
  for (Use &U : LI.uses()) {
    auto *SI = dyn_cast<StoreInst>(U.getUser());
    // Every use must be the value operand (operand 0) of a store carrying
    // the same !noncasting node; a use as the pointer operand disqualifies.
    if (!SI || U.getOperandNo() != 0 ||
        SI->getMetadata(LLVMContext::MD_noncasting) != MemcpyNode)
      return;
  }
  LI.mutateType(NewAllocaTy);
}
```
Alternatives
Here is a list of alternatives I have considered, and the reasons why I have so far declined to pursue them further:
- The byte type proposal. This has been proposed before and had a rather frosty reception at the time. In essence, I believe the proposal here solves one of the primary use cases of the byte type without needing the full complexity of new LLVM types.
- Removing the small-memcpy optimization in InstCombine. I haven’t done enough of a performance analysis to understand the impacts of such a change.
- Applying the changes to SROA, just not reliant on the presence of metadata. For my immediate use case of target extension types, this is sufficient and probably justifiable (given the way the relevant types work in practice, there are no pointer provenance-like concerns, although this may not hold for all putative types). But this change isn't sound for pointer types, although we already do something like this for pointer types anyway.
- Adjusting the layout type of target extension types so that they are not caught by the `memcpy` optimization. This feels like a hack to work around an incorrect optimization, and it has other effects that might not be tolerable in general.
- Changing the frontend (in this case, Clang) to not generate `memcpy` intrinsics for the relevant target extension types. Again, this feels like a hack to work around an incorrect optimization, and from my poking around at Clang codegen, it is far more difficult to make work than this change.
I would appreciate comments on the wisdom, utility, and especially the correctness of this new metadata, particularly from people who have a stronger grasp of the underdefined LLVM memory model than I do.