[RFC] "SIMD within a register" (SWAR) loop vectorization

Motivation

Let’s consider the following example, a dualcast function (it’s like memcpy, but with two copy destinations):
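A minimal sketch of such a function (the exact signature is an assumption for illustration, not taken from the original example):

```c
#include <stddef.h>
#include <stdint.h>

/* A minimal dualcast: copy each source byte to two destinations.
   (Signature is assumed for illustration.) */
void dualcast(uint8_t *dst1, uint8_t *dst2, const uint8_t *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst1[i] = src[i];
        dst2[i] = src[i];
    }
}
```

All three widened memory operations in the loop body (one load, two stores) are 8 bits wide.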

There are a number of problems with LLVM codegen here in comparison with GCC. One of them was fixed recently (https://github.com/llvm/llvm-project/pull/68322), but the most significant one is the missing loop vectorization, because the target doesn’t have vector registers (this concrete example uses RISC-V rv64gc, but the point stands for any target without vector registers). However, because all potentially-widened operations (the load and the two stores) are 8 bits wide, this loop can be vectorized with VF=8 using only 64-bit scalar registers. This technique is known as SWAR (SIMD within a register): https://en.wikipedia.org/wiki/SWAR.

After further investigation I’ve discovered that GCC is able to handle more interesting cases, like bitwise operators and even add/sub: Compiler Explorer (addition of 8 8-bit integers is done with 6 64-bit arithmetic instructions). The idea behind packed integer addition/subtraction is to prevent carries from propagating between bytes. To do this, the MSB of each element and all the other bits are handled separately:

H = 0x8080808080808080 // mask with each element's MSB set to one
SWAR add z = x + y:
    z = ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H)
SWAR sub z = x - y:
    z = ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H)

(References:
TAOCP Volume 4A Part 1, Chapter 7.1.3, “Tweaking several bytes at once”
https://www.chessprogramming.org/SIMD_and_SWAR_Techniques)
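These identities can be checked directly in C. The sketch below packs eight 8-bit lanes into a uint64_t; the helper names are mine, not from any patch:

```c
#include <stdint.h>

#define H 0x8080808080808080ULL /* each lane's MSB */

/* SWAR add: sum the low 7 bits of each lane (no cross-lane carry is
   possible, since 0x7F + 0x7F fits in 8 bits), then recombine the
   MSBs with xor. */
static uint64_t swar_add8(uint64_t x, uint64_t y) {
    return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
}

/* SWAR sub: setting the MSB of every x lane guarantees that no lane
   borrows from its neighbour; xor then restores the correct MSBs. */
static uint64_t swar_sub8(uint64_t x, uint64_t y) {
    return ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H);
}
```

Each lane behaves exactly like modular 8-bit arithmetic; e.g. swar_add8(0xFF, 0x01) overflows lane 0 to 0x00 without disturbing lane 1.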

Finally, GCC also supports integer reductions in SWAR vectorization: Compiler Explorer. So we can conclude that, for targets without vector registers, LLVM has an optimization gap relative to GCC when it comes to SWAR vectorization.

Proposal

I’ve done a draft implementation of SWAR support in LoopVectorizer: [LV] Implement SWAR loop vectorization by skachkov-sc · Pull Request #69306 · llvm/llvm-project · GitHub. It extends the VPlan recipes with two new ones: VPSWARRecipe (an analogue of VPWidenRecipe) and VPSWARMemoryInstructionRecipe (an analogue of VPWidenMemoryInstructionRecipe).

VPSWARMemoryInstructionRecipe has the following restrictions:

  1. Masking is not supported (basically, this means the loop body must consist of a single basic block)
  2. Memory accesses must be consecutive and in direct order (corresponding to the InstWidening::CM_Widen decision)

VPSWARRecipe supports the following binary operations:

  1. Bitwise operators (and, or, xor)
  2. Addition/Subtraction (add, sub)
  3. Shifts (shl, lshr) with a constant shift amount (2nd operand) - this case isn’t supported in GCC, but it seems very easy to implement (just do an ordinary shl/lshr and then apply a mask to zero out the bits propagated from neighbouring elements)

To minimize the amount of changes in VPTransformState, I’ve decided to keep the ingredients of SWAR recipes as vector types, bitcast them to a scalar type, perform the operation on scalars, and bitcast the result back. The advantage of this approach is essentially free support for reductions (the same llvm.vp.reduce.* intrinsics can be used without any change); however, we rely on InstCombine to clean up the redundant bitcast pairs.

Another problem I encountered is the alignment of SWAR-ed load/store instructions: the preferred alignment (DataLayout::getPrefTypeAlign) for wider integer types is usually higher, so additional run-time checks are needed (analysis of the GCC examples shows the same kind of checks); this is done in the generateAlignChecks() method. A separate cost method (getSWARInstructionCost) was also introduced, because some operations (e.g. add/sub) need more complex cost calculations.

Final words

This optimization was tested on some internal workloads and, as expected, showed big improvements on tasks with these kinds of loops: for example, the already-mentioned dualcast() function shows a 9.1x performance improvement (on a RISC-V rv64gc machine). The results on SPECint2006 are mostly unchanged, but there is a 0.64% instruction-count reduction in 401.bzip2. Another discussion topic is the impact on compile time: currently LoopVectorizer is skipped entirely for targets without vector registers, so the benefit of SWAR vectorization versus the compile-time increase needs further investigation.

I would be glad to hear any thoughts or suggestions about the proposal and PR. Feedback would be very much appreciated!


I’m interested in seeing where this goes.

Even without vectorization, we’ve not been good at canonicalizing to SWAR-like patterns for irregular bitfield arithmetic - not sure if your work has much overlap (afaict you’re only looking at regular loop vectorization and packing the same uniform type), or will hit similar issues to this: