[RFC] memfill and memtransfer intrinsics for generalized memset/memcpy

I want to propose adding memfill and memtransfer intrinsics that are similar to, but not exactly equivalent to, memset and memcpy/memmove, respectively. These intrinsics would enable load-store forwarding optimizations in scenarios where the existing functions cannot be used.

Motivation

Memset and memcpy/memmove are treated specially by several passes, usually for the purpose of load-store forwarding. However, there are situations where these functions cannot be used even though the frontend knows exactly what operation is happening and could communicate that information.

Memset

Memset sets all bytes in a memory region to the same value, which allows it to be used for initializing arrays with 0 or -1. However, for arrays of integers larger than one byte, memset is usually inapplicable (e.g. a uint64 array of 1s must be initialized by an explicit loop, since each element's representation mixes zero bytes and one bytes). As a result, optimizations cannot infer that loads from the memory always return the same value, whereas with memset they can and do infer the exact contents of the written memory.

Example: Compiler Explorer
In this example, the function memset_test is directly optimized to return 0, but no_memset_test cannot be optimized the same way because it writes 1s instead of 0s.
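To make the contrast concrete, here is a minimal LLVM IR sketch in the spirit of the linked example (the function names match, but the bodies are illustrative, not the exact Compiler Explorer code):

```llvm
declare void @llvm.memset.p0.i64(ptr, i8, i64, i1 immarg)

; With memset, the stored byte pattern is known, so the load can be folded to 0.
define i64 @memset_test(ptr %a) {
  call void @llvm.memset.p0.i64(ptr %a, i8 0, i64 800, i1 false)
  %v = load i64, ptr %a
  ret i64 %v
}

; Storing the 64-bit value 1 cannot be expressed as a memset, so the frontend
; emits an explicit loop and the load is no longer folded to a constant.
define i64 @no_memset_test(ptr %a) {
entry:
  br label %loop
loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %gep = getelementptr inbounds i64, ptr %a, i64 %i
  store i64 1, ptr %gep
  %i.next = add nuw nsw i64 %i, 1
  %done = icmp eq i64 %i.next, 100
  br i1 %done, label %exit, label %loop
exit:
  %v = load i64, ptr %a
  ret i64 %v
}
```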

Memcpy/Memmove

Memcpy/memmove are good bulk data copying functions, but they cannot be used when bulk copying non-integral pointers, because it is legal to convert a memcpy/memmove into integer load-store pairs. The optimizer can use this to type-pun a non-integral pointer into an integer, which we don’t want to happen. However, explicitly writing out all of the loads and stores to account for non-integral pointers results in a large amount of IR, which slows down the optimizer and suffers the same missed load-store forwarding as the memset case.
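A minimal sketch of the problem, assuming address space 1 is declared non-integral in the module's datalayout (the "ni:1" component); the function itself is illustrative:

```llvm
declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1 immarg)

; %src holds a single ptr addrspace(1) (non-integral) value. It is legal today
; to rewrite this small memcpy into an i64 load/store pair, which would
; type-pun the non-integral pointer through an integer.
define void @copy_gc_ref(ptr %dst, ptr %src) {
  call void @llvm.memcpy.p0.p0.i64(ptr %dst, ptr %src, i64 8, i1 false)
  ret void
}
```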

Proposal

Memfill intrinsic

This would be equivalent to memset, except that instead of only accepting byte values it would accept integers/pointers of any size, and the length parameter would count values to write rather than bytes. This would allow representing the concept "set this memory to this repeating pattern", and makes both initialization code and memset-like operations easy to analyze. Lowering this intrinsic could be done either by lowering directly to memset (where the value happens to be a splat of a single byte), or by inserting an explicit loop late in the optimization pipeline to account for the non-byte cases.
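Roughly, I'm imagining something like the following; the name, signature, and mangling are purely illustrative, not an existing intrinsic:

```llvm
; Fill a region with repeated copies of an element value. The third argument
; counts elements, not bytes.
declare void @llvm.memfill.p0.i64(ptr writeonly, i64, i64, i1 immarg)

define void @fill_with_ones(ptr align 8 %dst) {
  ; Write the i64 value 1 to 100 consecutive elements starting at %dst.
  call void @llvm.memfill.p0.i64(ptr align 8 %dst, i64 1, i64 100, i1 false)
  ret void
}
```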

Memtransfer intrinsic

This would be equivalent to memcpy/memmove, except that it would not be legal for the optimizer to convert the intrinsic into integer load-store pairs. This allows the memtransfer intrinsic to be used on data containing non-integral pointers without inserting large numbers of IR instructions, and allows optimizations to forward loads on the destination to stores on the source object. Since non-integral pointers are frontend/backend specific, introducing and lowering this intrinsic would depend on the frontend and backend as well. This unspecified lowering enables use of the intrinsic for copying structs with alternating pointer and byte data without inserting an instruction for every distinct region. Possible implementations could once again lower directly to memcpy or to some more specialized copy routine. To enable the frontend to communicate information to the backend for lowering this intrinsic, callsite attributes would be guaranteed to be undroppable for this intrinsic.
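Again purely illustrative (the name and signature are not an existing intrinsic):

```llvm
; Semantically like memcpy, except the optimizer may not rewrite the copy into
; integer load/store pairs, so non-integral pointers in the copied bytes are
; never type-punned through integers.
declare void @llvm.memtransfer.p0.p0.i64(ptr writeonly, ptr readonly, i64, i1 immarg)

define void @copy_struct(ptr %dst, ptr %src) {
  ; Copy 64 bytes that may contain a mix of non-integral pointers and plain data.
  call void @llvm.memtransfer.p0.p0.i64(ptr %dst, ptr %src, i64 64, i1 false)
  ret void
}
```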

Alternatives

I’ve considered a few other approaches, but I don’t think they’re worth considering further for various reasons.

  • For memfill

    • Introduce a llvm.assume-like intrinsic that represents an assumption about the contents of a region of memory. I think this is susceptible to miscompiles, since the original stores/loads can move or be altered by the optimizer in ways that make the assumption invalid, and maintaining its correctness would require checking downstream uses of the pointer for such assumptions and updating them appropriately. Since memfill replaces the loads and stores, moving the memfill would also move the assumption about memory contents without the need for a separate update.
    • Use ScalarEvolution in more places to be able to analyze loops directly for fill-like behavior. SCEV brings fairly large compile-time impacts wherever it is used, and I don’t think the optimizations here are worth the compile-time regressions with SCEV.
    • Ignore this optimization, since it might be relatively rare for real programs. I think this would be unfortunate, since this can affect the precision of alias analysis and other optimizations. In the example above, the store-load forwarding allowed elimination of a malloc and 100 useless stores, which can have real runtime impacts.
  • For memtransfer

    • Byte type proposal, which lists the small memcpy optimization as one of its key motivations. The byte type proposal is a much larger change than one new intrinsic, and likely has much larger downstream impact compared to the proposed change.
    • Removing the small memcpy optimization. This could take the form of reimplementing the optimization in every backend, where the semantics of the memcpy would be known, but otherwise it pessimizes perfectly fine IR just to account for this edge case.
    • Restricting the proposed memtransfer intrinsic to operate on a single type. This makes designing the lowering of the intrinsic simpler, but at the cost of not being able to use the intrinsic to copy structured data around as a single unit.
    • Not implementing this intrinsic, and requiring frontends to emit the explicit stores and loads of non-integral pointers. This makes analysis of where pointers are located easier, but can have a compile time impact in that the frontend needs to look up where the pointers are located in a struct definition when emitting struct copy operations. In addition, if structs are copied multiple times, this can result in large bloating of the IR.
    • Have the frontend emit an equivalent custom intrinsic and add its own pass to eliminate it. This forgoes the load-store forwarding optimizations, and can result in the same sort of optimization misses as not implementing the memfill intrinsic.

I appreciate any thoughts or comments anyone has to offer on this proposal, especially people who work with the relevant optimizations (InstCombine, SROA, escape analysis that would use the new intrinsics, etc).

I don’t think gluing the two proposals together is really helpful for either one. They present very different challenges.

For memfill, when SelectionDAG sees a “memfill” intrinsic, how do you expect it to be lowered? Apple targets have memset_pattern4/8/16, but other targets don’t have any equivalent functionality. And if you try to just expand it to a simple loop (presumably in an IR pass just before SelectionDAG), the performance of that loop is probably not great; you need to vectorize/unroll/specialize on alignment/etc. to get decent performance. But if you try to do that inline, you start having codesize issues, so you probably need some number of out-of-line helper functions. None of this is really unsolvable, but your proposal doesn’t really describe how you plan to deal with it.

That said, I think something along these lines is clearly worth implementing.


For memtransfer, adding an intrinsic with nearly identical semantics to memcpy/memmove doesn’t really solve the fundamental issue, which is “we want to keep values in registers for performance”. Adding an alternative which forbids putting the values in registers dodges the correctness issue, but it doesn’t really serve as a way forward for generating efficient code.

Maybe instead of something less specialized, you actually want something more specialized. If the frontend knows which byte ranges correspond to pointers, it can pass down the information in a metadata argument, which would allow expanding the intrinsic in IR without breaking the semantics. (I guess this doesn’t work so well if you need to handle constructs like unions, where it’s ambiguous what kind of pointer something is, but the simple cases are a lot more common.)


On second look, I agree that these proposals are kind of unrelated, so it’s probably better to consider them in isolation.

For memfill, the alternative encoding is just a simple loop with a gep+store, so I think SelectionDAG should just lower it directly to the equivalent simple loop in MIR. Unrolling and vectorization can probably be handled by the respective optimization passes directly, which can decide whether the transform is cost-effective for the given optimization level and arguments. As for alignment, I think one approach would be to require that the pointer alignment be at least the byte size of the integer type being stored, which makes the fallback loop simple to produce. If we want to handle arbitrary alignment in that loop, then we could maybe split the stores into alignment-sized chunks? I don’t think it’s worth complicating selection too much for this single intrinsic.
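For concreteness, the fallback expansion of a hypothetical call like llvm.memfill(ptr align 8 %dst, i64 %val, i64 %count) could look roughly like this (a sketch only, assuming the pointer is naturally aligned for the element type):

```llvm
define void @memfill_i64_expansion(ptr align 8 %dst, i64 %val, i64 %count) {
entry:
  ; Guard the loop against a zero-element fill.
  %empty = icmp eq i64 %count, 0
  br i1 %empty, label %exit, label %loop
loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %gep = getelementptr inbounds i64, ptr %dst, i64 %i
  store i64 %val, ptr %gep, align 8
  %i.next = add nuw i64 %i, 1
  %done = icmp eq i64 %i.next, %count
  br i1 %done, label %exit, label %loop
exit:
  ret void
}
```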


For memtransfer, I agree that maybe what I’m looking for is the ability to mark ranges as possibly containing pointers, and that therefore those ranges shouldn’t be eligible for memcpy optimization. I think non-debug metadata must be droppable, but maybe we could put the relevant information in an attribute? I don’t know if there’s good support for having an array of ranges or arrays in general as an attribute value though.

I like the idea of a memfill intrinsic.

PreISelIntrinsicLowering recently gained the ability to lower large/unknown-size memcpy/memset to loops for targets that don’t support these libcalls. It would be straightforward to extend that to memfill for non-Darwin targets.

Regarding code size, I think the comparison point here is “the frontend directly emits the loop”, which is what would happen right now. In that case, legalizing to a loop isn’t really worse than the current situation. We just need to be careful that the intrinsic is costed appropriately.


I’m not a fan of the memtransfer proposal. Especially when I see it in conjunction with [RFC] Correct implementation of memcpy with metadata, which is a proposal in the same area but with the reverse goal, it seems like we should really be focusing on adding the byte type.


I am left wondering what you expect to happen if/when someone wants to memfill a 5-byte or 7-byte pattern into an array of significant size (100-1000 bytes), and what you do with the excess in the last loop iteration.

That is: if you give people the abstraction of a byte length, sooner or later the specified length will not be a power of 2.

As a future direction, offer Clang custom intrinsics for complex lowerings and let the middle end handle them, i.e. as loops.

I think the best way to handle these cases is to just lower the intrinsic to the equivalent loop, without any extra performance optimizations on top. If the target supports 5- or 7-byte element types, then the equivalent loop is simple enough to write; otherwise, the memfill intrinsic would not be valid. Optimization passes would be free to transform the intrinsic into an explicit loop or a different form at any time if that is judged to be beneficial.

I generally like the idea of providing more information about intent to the compiler, but I worry that adding more and more custom intrinsics will cause large amounts of special casing for those intrinsics throughout optimization passes. This special casing already exists for memset, which is why I think generalizing to memfill shouldn’t require too large of a change to see benefits, but adding a new if/case per new custom intrinsic seems like a lot of code to me.

Sure. It is a trade-off.

Even if the type is naturally aligned, you need to worry about alignment to get efficient lowering: for example, if you’re filling a 2-byte value, an optimized lowering is going to use much larger stores, and there’s a performance penalty on many targets if those stores are unaligned. (You can solve this using a prologue loop, but that bloats the code even more, which makes out-of-line helpers even more important.)

In terms of semantics, I expect alignment for memfill to work the same way as it does for memset: the pointer parameter has an “align” attribute, and the expansion is allowed to assume that alignment.
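For reference, that is how memset already expresses it; a small illustrative snippet:

```llvm
declare void @llvm.memset.p0.i64(ptr, i8, i64, i1 immarg)

define void @zero64(ptr align 16 %dst) {
  ; The expansion may assume 16-byte alignment from the parameter attribute.
  call void @llvm.memset.p0.i64(ptr align 16 %dst, i8 0, i64 64, i1 false)
  ret void
}
```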

Well, if you’re talking about clang, it will almost never generate a loop at the moment; the competing lowerings are either straight-line code or a memcpy from a global variable. But sure, the initial implementation doesn’t need to be perfect to get some benefit.

I had rustc in mind here, which supports constructs like [v; n] (initializing an n-element array to v), which get lowered to an initialization loop. LLVM turns that into either an unrolled initialization or a vectorized loop.

Ah I see what you mean here. I agree that out-of-line helper functions would be nice, and we should lower to them whenever they are available, but in the absence of them we don’t have much choice beyond lowering the intrinsic to a loop. I think then it’s probably good to differentiate between lowering for performance vs lowering for correctness.

If a memfill survives until the PreISelIntrinsicLowering pass, then we should probably just forget about performance and lower it to the basic equivalent loop. Requiring at least natural alignment helps here in simplifying the basic loop (fwiw, memset also technically requires natural alignment, though that’s equivalent to no alignment since it operates on bytes).

Since we do care about performance, I think relying on other optimization passes to perform the prologue/epilogue expansion is a valid way to salvage performance. Specifically, I’m imagining LoopVectorize looking at the TLI, determining that we’re going to expand memfill to a loop anyways, and then turning the call into a proper vectorized loop with prologue/epilogue as necessary. When we’re optimizing for size, LV probably isn’t included anyways, so we won’t get the extra inline code bloat as when we just want fast code. I’m not familiar with LV/cost modeling for vectorization, but I’m guessing that it already takes into account underaligned pointers and emits the necessary instructions to realign the pointer (I’m also choosing to link performance of non-power-of-two initialization to LV’s capability of vectorizing such loops).

There would probably be other passes that would like to modify the intrinsic along the way (InstCombine/LoopUnroll probably). They can also do things like replace the memfill with stores, or overalign the pointer, etc. There’s probably a tradeoff there in that we lose some information when we adjust the memfill like that, but that’s probably a discussion point for later.

I’m having trouble understanding how this is different, because today LLVM doesn’t optimize memcpys to integer load-store pairs in IR: https://llvm.godbolt.org/z/WM16M1E9K. (Not even for memcpy.inline.)

It’s something that I’ve been trying to work around in Rust, in fact, because that optimization not happening sometimes means poor codegen: while the backend does lower each memcpy to reasonable instructions, it’s apparently too late for it to remove the alloca, and thus there’s completely unnecessary copying to the stack, as can be seen in the same example above. (Relatedly: llvm/llvm-project issue #52701, “`bcmp`/`memcmp` removal optimization should remove unneeded `alloca`s”.)

So, personally, I’d much prefer to see some kind of replacement for memcpy that can use SSA virtual registers, to enable more general optimizations for patterns like this, not one that makes it even more restricted. Thus at least for what I’d like to emit in Rust, the alternative of that 24-byte copy being a load/store of a b192 seems more appealing than memtransfer.

In our downstream pipeline we do late expansion of element-atomic memcpy at the IR level. We lower it to a naive scalar loop right before the loop vectorizer. This way we have an optimization-friendly representation of memcpy as an intrinsic early in the pipeline, but rely on the existing loop vectorizer to emit the best vector code for it.

We have plans to do the same for the memfill operation, so having a memfill intrinsic upstream would be very helpful. We would, however, need an element-wise atomic variant of it.

@pchintalapudi do you have any plans to move forward with this proposal? I’m specifically interested in the memfill part of it.