There is an LLVM IR intrinsic for `compressstore` (see discussion for its introduction here). However, it doesn't translate well directly to anything but Intel x86. The main issue is that its semantics are very specific: it writes only the elements whose mask bit is set and not the full vector, i.e., the intrinsic is not equivalent to a `compress` + `store` combination. On most targets, this intrinsic generates a lot of branches, which kill the performance of vectorized code. Also, this table shows that the `vpcompress` x86 instructions are very slow on AMD (up to 100 cycles!). So it doesn't even always make sense to generate this intrinsic's exact instruction on AMD x86.
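For illustration, here is a minimal C++ sketch (names are my own) of the scalar semantics a backend without native support has to preserve: only the selected elements may be written, packed contiguously, and nothing past them may be touched, which forces a branch per lane.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar semantics that a compressstore lowering must preserve on targets
// without a native instruction: write only the active elements, packed
// contiguously, and touch nothing else. The per-lane branch on the mask bit
// is what ends up in the generated code.
void compressstore_scalar(const int32_t* vec, uint8_t mask, size_t lanes,
                          int32_t* out) {
  size_t j = 0;
  for (size_t i = 0; i < lanes; ++i) {
    if (mask & (uint8_t{1} << i)) {  // branch per lane
      out[j++] = vec[i];             // store only the selected element
    }
  }
}
```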
There are `svcompact_*` instructions for SVE that operate purely on vector registers, RISC-V has a vector compress within registers, and AMD's `vpcompress` without storing to memory is a lot more efficient. For 16-byte vectors on pre-AVX-512 x86 or ARM NEON, it may even be conceivable to generate a small lookup table from mask to shuffle indices to perform the compaction as a shuffle, given that the number of elements is low (e.g., <= 4 or 8). All of this is only possible if we remove the constraint of writing only `n` values for `n` 1s in the mask.
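As a sketch of that lookup-table idea (my own illustration, assuming an SSSE3 build and 4 x i32 lanes): the 4-bit mask indexes a precomputed `pshufb` control that packs the selected lanes to the front, leaving the tail lanes as don't-care.

```cpp
#include <immintrin.h>  // _mm_shuffle_epi8 (SSSE3); assumes an x86 build
#include <cstdint>

// Mask -> pshufb-control lookup table for compressing 4 x i32 lanes in a
// 128-bit register. Built once at startup here for brevity; real code would
// bake it in as a constant table (16 entries x 16 bytes).
struct CompressLUT {
  alignas(16) uint8_t ctrl[16][16];
  CompressLUT() {
    for (int mask = 0; mask < 16; ++mask) {
      int out = 0;
      for (int lane = 0; lane < 4; ++lane) {
        if (mask & (1 << lane)) {
          for (int b = 0; b < 4; ++b)
            ctrl[mask][out * 4 + b] = uint8_t(lane * 4 + b);
          ++out;
        }
      }
      // 0x80 makes pshufb zero the remaining bytes; junk there is fine.
      for (int b = out * 4; b < 16; ++b) ctrl[mask][b] = 0x80;
    }
  }
};
static const CompressLUT lut;

// Compress the active i32 lanes of `v` to the front; the tail lanes are junk.
static inline __m128i compress_epi32(__m128i v, unsigned mask4) {
  __m128i ctrl =
      _mm_load_si128(reinterpret_cast<const __m128i*>(lut.ctrl[mask4 & 0xF]));
  return _mm_shuffle_epi8(v, ctrl);
}
```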
Given these conditions, I think it might make sense to have a masked `compress` (without `store`) LLVM IR intrinsic, which can be supported efficiently on a lot more targets. For many workloads, I assume it is fine to logically perform a compress and then store the full vector, even if the tail lanes contain junk, as that junk will simply be overwritten in the next iteration.
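To make that loop shape concrete, here is a sketch (my own, using SSE intrinsics and a scalar `compress_in_register` as a stand-in for the proposed intrinsic): compress in a register, do a full-width store including the junk lanes, and advance the output pointer by `popcount(mask)`. The output buffer only needs a few lanes of slack past the last real element.

```cpp
#include <immintrin.h>  // SSE2 intrinsics; assumes an x86 build
#include <cstddef>
#include <cstdint>

// Scalar stand-in for the proposed compress-without-store intrinsic, for
// illustration only (a real lowering would use vpcompress, SVE COMPACT,
// RISC-V vector compress, or the pshufb LUT sketched above).
static __m128i compress_in_register(__m128i v, unsigned mask) {
  alignas(16) int32_t lanes[4];
  alignas(16) int32_t packed[4] = {0, 0, 0, 0};
  _mm_store_si128(reinterpret_cast<__m128i*>(lanes), v);
  int j = 0;
  for (int lane = 0; lane < 4; ++lane)
    if (mask & (1u << lane)) packed[j++] = lanes[lane];
  return _mm_load_si128(reinterpret_cast<const __m128i*>(packed));
}

// Keep the strictly positive elements of `in`. `out` must have room for
// `n` elements plus 3 lanes of slack, because every iteration stores a
// full vector and the junk tail is overwritten by the next store.
size_t filter_positive(const int32_t* in, size_t n, int32_t* out) {
  size_t written = 0;
  const __m128i zero = _mm_setzero_si128();
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + i));
    // 4-bit mask of lanes with in[i + lane] > 0.
    unsigned mask = static_cast<unsigned>(
        _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpgt_epi32(v, zero))));
    __m128i packed = compress_in_register(v, mask);
    // Full-width store: no masked store needed, junk lanes are harmless.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + written), packed);
    written += static_cast<size_t>(__builtin_popcount(static_cast<int>(mask)));
  }
  for (; i < n; ++i)  // scalar tail
    if (in[i] > 0) out[written++] = in[i];
  return written;
}
```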
As a `compress` intrinsic is applicable to a lot more targets than just x86, I think it would also make sense to expose this in Clang, so code using C/C++ vector intrinsics can leverage this logic explicitly, as I guess it is not trivial to detect this pattern in general C/C++ code (see the loop below for the kind of code it hides in).
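For reference, this is the kind of plain scalar loop that is logically a compress + store but that, as far as I know, compilers do not reliably turn into a vector compress today; an explicit builtin would let authors opt in instead of relying on pattern detection (the code is just my own illustration).

```cpp
#include <cstddef>
#include <cstdint>

// A plain filtering loop: logically a per-vector compress + store, but
// written in a form that autovectorizers generally leave scalar.
size_t copy_nonzero(const int32_t* in, size_t n, int32_t* out) {
  size_t written = 0;
  for (size_t i = 0; i < n; ++i) {
    if (in[i] != 0) {
      out[written++] = in[i];  // compacting write, data-dependent index
    }
  }
  return written;
}
```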
What are your thoughts on this?