New intrinsic for masked vector compress without store

There is an LLVM IR intrinsic for compressstore (see discussion for its introduction here). However, it doesn’t translate well directly to anything but Intel x86. The main issue is that its semantics are very specific: it writes only the bytes that have a 1 set in the mask and not the full vector, i.e., the intrinsic is not equivalent to a compress + store combination. On most targets, this intrinsic generates a lot of branches, which kill the performance of vectorized code. Also, this table shows that the vpcompress x86 instructions are very slow on AMD (up to 100 cycles!). So it doesn’t even always make sense to generate this intrinsic’s exact instruction on AMD x86.
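
For reference, this is roughly how the existing intrinsic is used today (a minimal, untested sketch; the fixed-width type is chosen purely for illustration):

; The existing intrinsic stores only the selected elements, packed
; contiguously at %ptr, and writes nothing for lanes whose mask bit is 0.
declare void @llvm.masked.compressstore.v4i32(<4 x i32>, ptr, <4 x i1>)

define void @store_selected(<4 x i32> %value, ptr %ptr, <4 x i1> %mask) {
  call void @llvm.masked.compressstore.v4i32(<4 x i32> %value, ptr %ptr, <4 x i1> %mask)
  ret void
}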

There are svcompact_* instructions for SVE that operate entirely in vector registers, RISC-V has a vector compress within registers (vcompress.vm), and AMD’s vpcompress is a lot more efficient when it does not store to memory. For 16-byte vectors on x86 before AVX-512, or on ARM NEON, it may even be conceivable to generate a small lookup table from mask value to shuffle indices and perform a shuffle, given that the number of elements is low (e.g., <= 4 or 8); a sketch of this idea follows below. All of this is only possible if we remove the constraint of writing exactly n values for n 1s in the mask.
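
To illustrate the lookup-table idea in generic IR (untested sketch; @compress_lut and compress_via_lut are made-up names): each of the 16 possible mask values maps to a precomputed vector of source-lane indices, and the per-lane extracts below stand in for the single table-lookup/byte-shuffle instruction (e.g., NEON tbl or x86 pshufb) a backend would actually emit.

; Hypothetical 16-entry table: entry i holds the source-lane indices of the
; compressed result for mask value i; tail entries can be arbitrary.
@compress_lut = external constant [16 x <4 x i32>]

define <4 x i32> @compress_via_lut(<4 x i32> %vec, <4 x i1> %mask) {
entry:
  ; Turn the 4-bit mask into an integer table index.
  %mask.bits = bitcast <4 x i1> %mask to i4
  %idx = zext i4 %mask.bits to i64
  ; Load the precomputed source-lane indices for this mask value.
  %lut.gep = getelementptr inbounds [16 x <4 x i32>], ptr @compress_lut, i64 0, i64 %idx
  %indices = load <4 x i32>, ptr %lut.gep, align 16
  ; Apply the "shuffle" lane by lane; a backend would emit one shuffle here.
  %i0 = extractelement <4 x i32> %indices, i64 0
  %i1 = extractelement <4 x i32> %indices, i64 1
  %i2 = extractelement <4 x i32> %indices, i64 2
  %i3 = extractelement <4 x i32> %indices, i64 3
  %e0 = extractelement <4 x i32> %vec, i32 %i0
  %e1 = extractelement <4 x i32> %vec, i32 %i1
  %e2 = extractelement <4 x i32> %vec, i32 %i2
  %e3 = extractelement <4 x i32> %vec, i32 %i3
  %r0 = insertelement <4 x i32> poison, i32 %e0, i64 0
  %r1 = insertelement <4 x i32> %r0, i32 %e1, i64 1
  %r2 = insertelement <4 x i32> %r1, i32 %e2, i64 2
  %r3 = insertelement <4 x i32> %r2, i32 %e3, i64 3
  ret <4 x i32> %r3
}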

Given these conditions, I think it might make sense to have a masked compress (without store) LLVM IR intrinsic, which could be supported efficiently on a lot more targets. For many workloads, I assume it is fine to logically perform a compress and then store the full vector, even if its tail contains junk, as that junk will just be overwritten in the next iteration.
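
To make the proposal concrete, here is a minimal sketch of how I imagine it being used; @compress is just a placeholder name and signature, not a decided design:

; Hypothetical compress-in-registers intrinsic with an unspecified tail.
declare <4 x i32> @compress(<4 x i32>, <4 x i1>)

define void @compress_then_store(<4 x i32> %vec, <4 x i1> %mask, ptr %out) {
  ; Compress in registers, then store the full vector; lanes past the
  ; compressed elements may hold junk that a later iteration overwrites.
  %c = call <4 x i32> @compress(<4 x i32> %vec, <4 x i1> %mask)
  store <4 x i32> %c, ptr %out
  ret void
}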

As a compress intrinsic is applicable to a lot more targets than just x86, I think it would also make sense to expose this in Clang, so C/C++ vector intrinsics can leverage this logic explicitly, as I guess it is not trivial to detect this pattern from general C/C++ code.

What are your thoughts on this?

The notion of a compress operation as an intrinsic seems reasonable to me.

Can you spell out what the alternate representation in IR looks like today? With some quick thought, I didn’t come up with anything clean, but I haven’t thought about it much. I’m asking mostly for completeness’ sake, as knowing the alternatives is a big part of evaluating a proposal.

I don’t really have a nice way to do this in mind right now either. So far, I’ve been doing it much like I would when writing to memory, which is something along the lines of the code below. I’m basically just checking whether the current mask value is true and advancing the output “pointer” by one if it is. I’m sure there are better ways to do this, and when generating this from the equivalent C++ code, the IR is slightly different but with the same intention.

Note: untested code, godbolt

define <4 x i32> @compress(<4 x i32> %vec, <4 x i1> %mask) {
entry:
  ; Extract all vector elements 
  %vec0 = extractelement <4 x i32> %vec, i64 0
  %vec1 = extractelement <4 x i32> %vec, i64 1
  %vec2 = extractelement <4 x i32> %vec, i64 2
  %vec3 = extractelement <4 x i32> %vec, i64 3

  ; Extract all bools from mask
  %mask0 = extractelement <4 x i1> %mask, i64 0
  %mask1 = extractelement <4 x i1> %mask, i64 1
  %mask2 = extractelement <4 x i1> %mask, i64 2
  %mask3 = extractelement <4 x i1> %mask, i64 3

  ; Convert mask values to output positions
  %conv0 = zext i1 %mask0 to i32
  %conv1 = zext i1 %mask1 to i32
  %conv2 = zext i1 %mask2 to i32

  ; Compute output positions
  %pos1 = add i32 0, %conv0
  %pos2 = add i32 %pos1, %conv1
  %pos3 = add i32 %pos2, %conv2

  ; Insert into output vector
  %out0 = insertelement <4 x i32> undef, i32 %vec0, i32 0
  %out1 = insertelement <4 x i32> %out0, i32 %vec1, i32 %pos1
  %out2 = insertelement <4 x i32> %out1, i32 %vec2, i32 %pos2
  %out3 = insertelement <4 x i32> %out2, i32 %vec3, i32 %pos3

  ret <4 x i32> %out3
}

There are probably much better ways to do this, but I don’t think they differ a lot conceptually. Maybe some variants use ifs to add elements conditionally. But I think in the end, we always need to track how many values have been written before and where to write the next value.

As there are many slightly different ways of doing this, I think it might be rather hard to detect this pattern in a pass for all cases.

I also believe the code above could be a simple fallback implementation for ISAs that don’t have a compress-equivalent instruction (e.g., ARM NEON).

What would be the output of the intrinsic for the elements past the compressed range? I suspect there isn’t a consistent behavior across targets.

RISC-V can preserve whatever value was in the destination register before, but that would require an operand on the intrinsic to provide the source to preserve from. There is no option to force the remaining elements to 0 or another known value without an extra instruction to set the output register to a known value first.

X86 can set the elements to 0 or preserve them, but again preserving would require a second input.

I don’t know about SVE.

SVE “compact” zeros the remaining elements.

So it looks like we have instructions that fill with 0, some that preserve the input, and some that copy from a second operand.

Does this behavior need to be specified exactly for an LLVM intrinsic? I’m not sure which level of semantics we want/need to provide here, but would it be possible to just say that all remaining lanes contain unspecified values? shufflevector also allows for undefined lanes to enable certain optimizations. My thought here was to have this as a significantly more performant version of compressstore, because that one has such strong semantics.

X86 can set the elements to 0 or preserve them, but again preserving would require a second input.

We could just pass the input vector twice. I’m not sure that would make sense though, because the preserved lanes really are “garbage” data then. But it does sound a bit like what RISC-V does.

If we do need to specify this, there are two thoughts that I have right now:

  • Filling with zero is probably the most “obvious” choice to users of the intrinsic, because it is very easy to reason about the output.
  • What is the cost of “fill with zeros” vs. “keep the source input” for ISAs that don’t offer both options? For x86, I guess it does not make a lot of difference, because we can just choose the corresponding instruction. For SVE, keeping the source would need something like a vector select (svsel_*) between the compacted result and the input, which is just one additional instruction. For RISC-V, following their example, we would need to create a zeroed destination register to write to, which then keeps all zeros, so this is also probably just one extra instruction. For my simple fallback implementation above, keeping the source data is slightly more efficient, as we always write the first element, which we would need to zero out again if its mask bit is 0. But there are probably also different ways to implement this for different semantics. So it seems like we can support either choice rather efficiently, if we need to fully specify this; a sketch of building “keep the source” on top of a compress with an unspecified tail follows below.
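
As a rough illustration of the second point (untested; @compress is again a placeholder whose tail lanes are unspecified), “keep the source” can be layered on top with a popcount of the mask, a lane-index compare, and a select, which should map to roughly one extra select-style instruction on the targets mentioned above:

declare <4 x i32> @compress(<4 x i32>, <4 x i1>)
declare i32 @llvm.vector.reduce.add.v4i32(<4 x i32>)

define <4 x i32> @compress_keep_source(<4 x i32> %vec, <4 x i1> %mask) {
  %c = call <4 x i32> @compress(<4 x i32> %vec, <4 x i1> %mask)
  ; Number of compressed elements = popcount of the mask.
  %mask.zext = zext <4 x i1> %mask to <4 x i32>
  %n = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %mask.zext)
  ; Lanes [0, n) take the compressed values, the rest keep the source input.
  %n.ins = insertelement <4 x i32> poison, i32 %n, i64 0
  %n.splat = shufflevector <4 x i32> %n.ins, <4 x i32> poison, <4 x i32> zeroinitializer
  %head = icmp ult <4 x i32> <i32 0, i32 1, i32 2, i32 3>, %n.splat
  %res = select <4 x i1> %head, <4 x i32> %c, <4 x i32> %vec
  ret <4 x i32> %res
}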

Given the direction VP intrinsics have taken, I think the most natural way would be to set the remaining lanes to poison. If we want to represent a passthru value, then we could use llvm.vp.merge with the evl set to the sum of the mask:

%x = call <4 x i32> @compress(<4 x i32> %v, <4 x i1> %mask)
%mask.zext = zext <4 x i1> %mask to <4 x i32>
%evl = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %mask.zext)
%y = call <4 x i32> @llvm.vp.merge(<4 x i1> splat(i1 true), <4 x i32> %x, <4 x i32> %v, i32 %evl)

Then either a codegen pass like InterleavedAccessPass would detect this and provide hooks for lowering it into a target-specific intrinsic, or the target backend could fold the passthru into the compress. The latter is what the RISC-V backend currently does for VP intrinsics where we want to preserve the masked-off values, e.g.

define <vscale x 2 x i32> @vpmerge_vpadd(<vscale x 2 x i32> %passthru, <vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> %m, i32 zeroext %vl) {
  %a = call <vscale x 2 x i32> @llvm.vp.add.nxv2i32(<vscale x 2 x i32> %x, <vscale x 2 x i32> %y, <vscale x 2 x i1> splat (i1 -1), i32 %vl)
  %b = call <vscale x 2 x i32> @llvm.vp.merge.nxv2i32(<vscale x 2 x i1> %m, <vscale x 2 x i32> %a, <vscale x 2 x i32> %passthru, i32 %vl)
  ret <vscale x 2 x i32> %b
}

gets lowered to

vsetvli zero, a0, e32, m1, tu, mu
vadd.vv v8, v9, v10, v0.t        

I had been assuming the poison-tail semantics, and I consider either that or an explicit passthru completely reasonable. I think both could be made to work, and we have precedent for both designs (e.g., the vp example Luke gave and masked.load, respectively).

I would lean against requiring a zero tail, but I don’t think this choice greatly matters, so that’s a weak preference.

As no further issues were raised in the discussion, I’ll start implementing a first version of the intrinsic and see if we forgot something.