I want to propose adding a memfill and a memtransfer intrinsic that are similar to, but not exactly equivalent to, memset and memcpy/memmove respectively. These intrinsics would improve load-store forwarding in scenarios where the existing memory intrinsics cannot be used.
Motivation
Memset and memcpy/memmove are treated specially by several passes, usually for the purpose of load-store forwarding. However, there are situations where these functions cannot be used, but the frontend can provide additional information about the exact operation that is happening.
Memset
Memset sets all bytes in a memory region to the same value, which allows it to be used for initializing arrays with 0 or -1. However, for arrays of integers larger than one byte, memset is usually inapplicable (e.g. a uint64 array of 1s must be initialized by an explicit loop, since the value 1 contains both zero and non-zero bytes). This leaves optimizations unable to infer that loads from the memory will always return the same value, whereas with a memset, optimizations can and do infer the exact contents of the written memory.
Example: Compiler Explorer
In this example, the function memset_test is directly optimized to return 0, but no_memset_test cannot be optimized in the same way since we are writing 1s instead of 0s.
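At the IR level, the two cases look roughly like the sketch below (a simplified reconstruction, not the exact code behind the Compiler Explorer link; the function and value names are only illustrative):

```llvm
declare void @llvm.memset.p0.i64(ptr, i8, i64, i1 immarg)
declare ptr @malloc(i64)
declare void @free(ptr)

; Zero-fill: the frontend can emit a memset, so the later load is forwarded
; to 0 and the buffer (including the malloc) can be removed entirely.
define i64 @memset_test() {
  %buf = call ptr @malloc(i64 800)
  call void @llvm.memset.p0.i64(ptr %buf, i8 0, i64 800, i1 false)
  %slot = getelementptr inbounds i64, ptr %buf, i64 42
  %v = load i64, ptr %slot
  call void @free(ptr %buf)
  ret i64 %v                                ; folded to ret i64 0
}

; One-fill: no single byte value works, so the frontend emits a store loop
; that current load-store forwarding does not see through.
define i64 @no_memset_test() {
entry:
  %buf = call ptr @malloc(i64 800)
  br label %loop

loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %slot = getelementptr inbounds i64, ptr %buf, i64 %i
  store i64 1, ptr %slot
  %i.next = add nuw nsw i64 %i, 1
  %done = icmp eq i64 %i.next, 100
  br i1 %done, label %exit, label %loop

exit:
  %slot.42 = getelementptr inbounds i64, ptr %buf, i64 42
  %v = load i64, ptr %slot.42
  call void @free(ptr %buf)
  ret i64 %v                                ; not folded to ret i64 1
}
```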
Memcpy/Memmove
Memcpy/memmove are good bulk data copying operations, but they cannot be used when bulk-copying non-integral pointers, as it is legal to convert a memcpy/memmove into load-store pairs of integers. This can be used by the optimizer to type-pun a non-integral pointer into an integer, which we don’t want to happen. However, explicitly writing out all of the loads and stores to account for non-integral pointers results in a large amount of IR, which slows down the optimizer and suffers from the same missed load-store forwarding as the memset case.
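As a minimal sketch (assuming the bytes at %src hold a pointer in a non-integral address space), the transformation that causes the problem looks like this:

```llvm
; A pointer-sized copy expressed as a memcpy...
call void @llvm.memcpy.p0.p0.i64(ptr %dst, ptr %src, i64 8, i1 false)

; ...can legally be rewritten by the optimizer into an integer load/store
; pair, which type-puns the non-integral pointer stored at %src:
%tmp = load i64, ptr %src
store i64 %tmp, ptr %dst
```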
Proposal
Memfill intrinsic
This would be equivalent to memset, except that instead of only accepting byte values it would accept integers/pointers of any size, and the length parameter would be the number of values to write to memory. This would allow representing the concept of “set this memory to follow this repeating pattern”, and allows easy analysis of both initialization code and memset-like operations. Lowering this intrinsic could be done either by lowering directly to memset, or by inserting an explicit loop late in the optimization pipeline to account for the non-byte cases.
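One possible shape for the intrinsic, purely as an illustration (the name, overloading scheme, and exact parameters here are assumptions, not part of the proposal):

```llvm
; Hypothetical: fill 100 consecutive i64 slots starting at %buf with the
; value 1; the store loop from the earlier example becomes a single call.
call void @llvm.memfill.p0.i64(ptr %buf, i64 1, i64 100, i1 false)

; Hypothetical: the same intrinsic overloaded on a pointer fill value,
; e.g. initializing a pointer array in a non-integral address space.
call void @llvm.memfill.p0.p1(ptr %buf, ptr addrspace(1) %val, i64 16, i1 false)
```

Whether the memset lowering applies would presumably depend on the fill value being a splat of a single byte, with the explicit-loop lowering covering everything else.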
Memtransfer intrinsic
This would be equivalent to memcpy/memmove, except that it would not be legal for the optimizer to convert the intrinsic into load-store pairs of integers. This allows the memtransfer intrinsic to be used on data containing non-integral pointers without inserting large numbers of IR instructions, and allows optimizations to forward loads from the destination to stores to the source object. Since non-integral pointers are frontend/backend specific, introducing and lowering this intrinsic should depend on the frontend and backend as well. This unspecified lowering enables use of the intrinsic for copying structs with alternating pointer and byte data without inserting an instruction for every different region. Possible implementations could once again lower directly to memcpy or to some more specialized copy routine. To enable the frontend to communicate information to the backend for lowering this intrinsic, callsite attributes on this intrinsic would be guaranteed not to be dropped.
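Again purely as an illustration, with the intrinsic name and the attribute below being hypothetical:

```llvm
; Hypothetical: copy 64 bytes that may contain non-integral pointers.
; Unlike memcpy, the optimizer would not be allowed to split this into
; integer load/store pairs; the actual lowering is frontend/backend specific.
call void @llvm.memtransfer.p0.p0.i64(ptr %dst, ptr %src, i64 64, i1 false) #0

; Hypothetical callsite attribute carrying frontend-specific lowering
; information; per the proposal, such attributes could not be dropped.
attributes #0 = { "copy-lowering"="gc-aware" }
```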
Alternatives
I’ve considered a few other approaches, but I don’t think they’re worth pursuing further, for the reasons below.
- For memfill:
  - Introduce an llvm.assume-like intrinsic that represents an assumption about the contents of a region of memory. I think this is susceptible to miscompiles, since the original stores/loads can move or be altered by the optimizer in ways that make the assumption invalid, and maintaining its correctness would require checking downstream uses of the pointer for such assumptions and updating them appropriately. Since memfill replaces the loads and stores, moving the memfill would also move the assumption about memory contents without the need for a separate update.
  - Use ScalarEvolution in more places to be able to analyze loops directly for fill-like behavior. SCEV brings fairly large compile-time impacts wherever it is used, and I don’t think the optimizations here are worth the compile-time regressions.
  - Ignore this optimization, since it might be relatively rare in real programs. I think this would be unfortunate, since it can affect the precision of alias analysis and other optimizations. In the example above, the store-load forwarding allowed elimination of a malloc and 100 useless stores, which can have real runtime impacts.
- For memtransfer:
  - The byte type proposal, which lists the small memcpy optimization as one of its key motivations. The byte type proposal is a much larger change than one new intrinsic, and likely has a much larger downstream impact than the proposed change.
  - Removing the small memcpy optimization. This could take the form of reimplementing the optimization in all of the backends where the semantics of the memcpy would be known, but otherwise it pessimizes perfectly fine IR just to account for this edge case.
  - Restricting the proposed memtransfer intrinsic to operate on a single type. This makes designing the lowering of the intrinsic simpler, but at the cost of not being able to use the intrinsic to copy structured data around as a single unit.
  - Not implementing this intrinsic, and requiring frontends to emit the explicit loads and stores of non-integral pointers. This makes it easier to analyze where pointers are located, but has a compile-time cost in that the frontend needs to look up where the pointers are located in a struct definition when emitting struct copy operations. In addition, if structs are copied multiple times, this can result in large bloating of the IR.
  - Having a frontend emit an equivalent custom intrinsic and add its own pass to eliminate it. This forgoes the load-store forwarding optimizations, and can result in the same sort of missed optimizations as not implementing the memfill intrinsic.
I appreciate any thoughts or comments on this proposal, especially from people who work on the relevant optimizations (InstCombine, SROA, the escape analyses that would use the new intrinsics, etc.).