RFC: Introducing an llvm.memset_pattern.inline intrinsic

The problem

We have llvm.memset and llvm.memset.inline intrinsics, yet they can only be used for setting memory to a repeated byte-sized value. More complex initialisation patterns can’t be represented in this way.

As an example, consider the initialisation of array values to 0x0000ffff:

void foo(int *in) {
  for (int i = 0; i < 100; i++) {
    in[i] = 0xffff;
  }
}

If full or partial loop unrolling triggers, then the stores should be combined into 64-bit stores (with an appropriately generated constant). GCC handles such cases via SWAR, though the generated code is not as efficient as it could be.
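
For illustration, the kind of combined code one might hope for looks roughly like the following hand-written sketch (assuming unaligned 64-bit stores are cheap on the target; the constant works out the same on little- and big-endian because the pattern repeats at 4-byte granularity):

#include <stdint.h>
#include <string.h>

void foo_combined(int *in) {
  // Two adjacent 0x0000ffff elements packed into one 64-bit pattern.
  const uint64_t pat = 0x0000ffff0000ffffULL;
  for (int i = 0; i < 100; i += 2)
    memcpy(&in[i], &pat, sizeof(pat)); // one 8-byte store per pair of elements
}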

The solution

If the initialisation was to 0xffffffff (i.e. it could be represented by a memset), then the loop idiom recognition pass would have converted it. As it turns out, Apple platforms actually provide the memset_pattern{4,8,16} libcalls, and so LoopIdiomRecognize has logic for introducing memset_pattern16 calls on those targets. A straightforward and non-disruptive solution is to introduce an llvm.memset_pattern.inline intrinsic that LoopIdiomRecognize can use on targets without the relevant libcalls.
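
For reference, a minimal C-level sketch of what the existing Apple-only path amounts to (memset_pattern16 is declared in <string.h> on Darwin; LoopIdiomRecognize emits the equivalent libcall directly as IR, so this is only an illustration of its semantics):

#include <string.h> // declares memset_pattern16 on Apple platforms

void foo_pattern16(int *in) {
  // 16-byte pattern: four copies of 0x0000ffff.
  static const unsigned int pattern[4] = {0xffff, 0xffff, 0xffff, 0xffff};
  memset_pattern16(in, pattern, 100 * sizeof(int)); // length is in bytes
}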

This means that we can take advantage of pattern recognition logic that’s already implemented to convert such loops into a form that produces efficient code without relying on later transformations that may or may not happen.

An llvm.memset_pattern.inline is expanded to stores when lowering to the SelectionDAG, using logic similar to SelectionDAG::getMemset (and the getMemsetStores helper). This is an overloaded intrinsic, but an attempt to use an overload whose pattern size is not a power of two will result in an error.

Just like llvm.memset.inline, the length argument must be a constant, but there is no such restriction on the pattern value.

Input requested

I’d welcome any guidance or thoughts on this before it moves forward, in case minor modifications might enable other use cases.

An initial draft PR is available here.


I don’t think this should be a restriction, and the restriction should be removed from the existing memset.inline. The only reason to have it is as a workaround for SelectionDAG making it difficult to introduce lowering with control flow, but we can already handle this in PreISelIntrinsicLowering.

+1 to this RFC. This can also help with some optimizations in MemCpyOptimizer. Consider this case in rustc. When returning a heap vector of a constant number of elements, the pseudocode looks like

arr = alloca i32 x 1000
for i in 0..1000 { arr[i] = 1; }
heaparr = malloc(4 x 1000)   // 1000 i32 elements = 4000 bytes
memcpy(heaparr, arr, 4 x 1000)
return heaparr

Since 1 does not have the same value in each byte, we cannot lower this to a memset. With this RFC, we could have rustc lower directly to llvm.memset_pattern.inline, or have LoopIdiomRecognize transform the loop. Then MemCpyOptimizer could write the pattern directly into the heap array and avoid the stack allocation altogether.
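
A C-level sketch of the end state described above (function name is illustrative): once the fill is expressed as a pattern set, MemCpyOptimizer can apply it to the heap buffer directly and the temporary stack array disappears.

#include <stdlib.h>

int *make_ones(void) {
  int *heaparr = malloc(1000 * sizeof(int));
  if (!heaparr)
    return NULL;
  // Conceptually a single pattern fill of the 4-byte pattern 0x00000001
  // over 4000 bytes, written straight into the heap allocation.
  for (int i = 0; i < 1000; i++)
    heaparr[i] = 1;
  return heaparr;
}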


Generally in favor of this RFC. There is some previous discussion of a memfill intrinsic that does essentially the same thing in [RFC] memfill and memtransfer intrinsics for generalized memset/memcpy.

I think that one of the key takeaways from that discussion is that providing just an SDAG lowering for memset_pattern.inline is not sufficient: This intrinsic should be generally usable, even if N is large enough that a straight-line expansion is non-profitable. In that case, it should expand into an initialization loop (with appropriate vectorization and unrolling) pre-isel. Otherwise this intrinsic will not be usable by frontends.

The other thing I’m wondering about is why the focus here is on llvm.memset_pattern.inline specifically, rather than llvm.memset_pattern. If we are on a platform that does happen to have builtins for this operation available, wouldn’t we want to make use of them where profitable (that is, if N is large, instead of a loop expansion?)


I’ve posted [Intrinsics][PreISelIntrinsicLowering] llvm.memset.inline length no longer needs to be constant by asb · Pull Request #95397 · llvm/llvm-project · GitHub as one approach to remove that restriction (optimising for minimum disruption by only expanding in PreISelIntrinsicLowering the previously unsupported case of a memset.inline with a non-constant length).

Thanks, I had indeed missed the memfill discussion. One difference in semantics is that, like the memset_patternN functions, the length argument gives the number of bytes to write (and if that is not an integral multiple of the pattern size, then the remainder bytes from the pattern are written at the end). This obviously requires a bit more code to be generated if the length is non-constant or isn’t known to be a multiple of the pattern width, but it has the advantage of matching the semantics of the existing libcalls.
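
As a reference point, here is a sketch of how I read those byte-length semantics, modelled on the existing memset_patternN libcalls (whose documentation describes a trailing partial repetition as truncating the pattern); this is illustrative only, not the proposed lowering:

#include <stddef.h>
#include <string.h>

void memset_pattern_ref(void *dst, const void *pattern,
                        size_t pattern_size, size_t len) {
  unsigned char *d = dst;
  while (len >= pattern_size) {
    memcpy(d, pattern, pattern_size); // whole repetitions of the pattern
    d += pattern_size;
    len -= pattern_size;
  }
  memcpy(d, pattern, len); // tail: leading bytes of the pattern only
}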

Would your suggestion be to handle all expansion of llvm.memset_pattern.inline in PreISelIntrinsicLowering in order to allow maximum flexibility? I’d started down the SDAG expansion route as it matches memset.inline, and although I agree it can be advantageous for frontends to simply emit the intrinsic and hope the compiler lowers it in the optimal way, there’s also a counter-argument that low-level intrinsics should have an obvious expansion so frontends can make their own decisions. It seems inconsistent to handle this differently for memset.inline vs memset_pattern.inline, though perhaps memset.inline lowering is something that also needs updating.

In terms of why start with llvm.memset_pattern.inline – simply because that’s where the biggest and most addressable gap seems to be. It would be marginally better to have a memset_pattern intrinsic rather than the libcall, but the libcall is entirely usable by the logic in LoopIdiomRecognize. And only a small number of platforms support the underlying libcall anyway, so it’s not something that would be widely used any time soon. So the motivation is just about taking an incremental approach rather than a belief that a non-inline llvm.memset_pattern wouldn’t ever be useful.

How do you define “remainder bytes”? The current LangRef PR doesn’t explain that there might be a remainder, and which bytes from the pattern to put where when there is a remainder. I feel there are a few semantic options here, and it would be good to define which one you intend, rather than letting backends differ in their interpretation of the remainder.

Interesting, I hadn’t noticed that difference. Providing the length in bytes seems somewhat unfortunate, because we don’t really need that generality, but we may have to pay the price of an unnecessarily large expansion because of it.

I guess the size will usually be an explicit multiplication, so maybe we can infer that the tail is unnecessary in most cases.

My thinking here was along the lines of only having llvm.memset_pattern and not bother with llvm.memset_pattern.inline, at least for now. The intrinsic would either use the libcalls, a loop expansion or a straight-line expansion depending on what the platform provides and whether the size is known / how large it is.

The advantages I see are:

  • We have consolidated handling in LIR – we don’t need to choose between emitting a libcall or an inline intrinsic; we can always emit the intrinsic and leave expansion to the backend.
  • Frontends can emit the intrinsic and benefit from the libcalls for large memset_pattern intrinsic calls on platforms that have them, without having to implement their own knowledge that they exist.
  • Passes don’t need to handle both libcalls and intrinsics and can handle intrinsics only. I don’t think we have a lot of places that currently handle memset_pattern libcalls – if we start making more use of this, it would be nice to not have to implement additional optimizations for them in two different ways.

Thank you everyone for your feedback, I have switched to introducing llvm.memset_pattern, removed restrictions on arguments being constant, and moved to lowering via PreISelIntrinsicLowering. This is done in a new draft PR, which I hope to mark as ready for review after a round of cleanup and testing.

I’ve opted to start with the most general lowering (loop based, not customised depending on which arguments are constant) and would suggest following up with separately reviewed patches that alter this for particular cases (e.g. where the libcall is available on Apple platforms).

I believe the PR is now ready for review. I’ve also posted a PR that stacks on top of it that adds initial support for producing the intrinsic in LoopIdiomRecognize (though disabled by default for now).