[RFC] "SIMD within a register" (SWAR) loop vectorization

Motivation

Let’s consider the following example, a dualcast function (it’s like memcpy, but with two copy destinations):
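A minimal sketch of such a function (the exact signature is an assumption for illustration, not taken from the original example):

```c
#include <stddef.h>
#include <stdint.h>

/* A minimal dualcast: copy each source byte to two destinations.
   (Signature is assumed for illustration.) */
void dualcast(uint8_t *dst1, uint8_t *dst2, const uint8_t *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst1[i] = src[i];
        dst2[i] = src[i];
    }
}
```

All three widened memory operations in the loop body (one load, two stores) are 8 bits wide.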

There are a number of problems with LLVM codegen here in comparison with GCC. One of them was fixed recently (https://github.com/llvm/llvm-project/pull/68322), but the most significant one is the missing loop vectorization, because the target doesn’t have vector registers (this concrete example uses RISC-V rv64gc, but the point stands for any target without vector registers). However, because all potentially-widened operations (the load and the two stores) are 8 bits wide, this loop can be vectorized with VF=8 using only 64-bit scalar registers. This technique is known as SWAR (SIMD within a register): https://en.wikipedia.org/wiki/SWAR.

After further investigation I’ve discovered that GCC is able to handle more interesting cases, like bitwise operators and even add/sub: Compiler Explorer (addition of 8 8-bit integers is done with 6 64-bit arithmetic instructions). The idea behind packed integer addition/subtraction is to prevent carries from propagating between bytes. To do this, the MSB of each element and all the other bits are handled separately:

H = 0x8080808080808080 // mask with each element's MSB set to one
SWAR add z = x + y:
    z = ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H)
SWAR sub z = x - y:
    z = ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H)

(References:
TAOCP Volume 4A Part 1, Chapter 7.1.3, “Tweaking several bytes at once”
https://www.chessprogramming.org/SIMD_and_SWAR_Techniques)
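These identities can be checked directly in C. The sketch below packs eight 8-bit lanes into a uint64_t; the helper names are mine, not from any patch:

```c
#include <stdint.h>

#define H 0x8080808080808080ULL /* each lane's MSB */

/* SWAR add: sum the low 7 bits of each lane (no cross-lane carry is
   possible, since 0x7F + 0x7F fits in 8 bits), then recombine the
   MSBs with xor. */
static uint64_t swar_add8(uint64_t x, uint64_t y) {
    return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
}

/* SWAR sub: setting the MSB of every x lane guarantees that no lane
   borrows from its neighbour; xor then restores the correct MSBs. */
static uint64_t swar_sub8(uint64_t x, uint64_t y) {
    return ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H);
}
```

Each lane behaves exactly like modular 8-bit arithmetic; e.g. swar_add8(0xFF, 0x01) overflows lane 0 to 0x00 without disturbing lane 1.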

Finally, GCC also supports integer reductions in SWAR vectorization: Compiler Explorer. So we can conclude that, for targets without vector registers, LLVM has an optimization gap relative to GCC when it comes to SWAR vectorization.

Proposal

I’ve done a draft implementation of SWAR support in LoopVectorizer: [LV] Implement SWAR loop vectorization by skachkov-sc · Pull Request #69306 · llvm/llvm-project · GitHub. It extends the VPlan recipes with two new ones: VPSWARRecipe (an analogue of VPWidenRecipe) and VPSWARMemoryInstructionRecipe (an analogue of VPWidenMemoryInstructionRecipe).

VPSWARMemoryInstructionRecipe has the following restrictions:

  1. Masking is not supported (basically, this means the loop body must consist of a single basic block)
  2. Memory accesses must be consecutive and in direct order (corresponding to the InstWidening::CM_Widen decision)

VPSWARRecipe supports the following binary operations:

  1. Bitwise operators (and, or, xor)
  2. Addition/Subtraction (add, sub)
  3. Shifts (shl, lshr) with a constant shift amount (2nd operand) - this case isn’t supported in GCC, but it seems very easy to implement (just do an ordinary shl/lshr and then apply a mask to zero out the bits propagated from neighbouring elements)

To minimize the amount of changes in VPTransformState, I’ve decided to keep the ingredients of SWAR recipes as vector types, bitcast them to a scalar type, perform the operation on scalars, and bitcast the result back. The advantage of this approach is essentially free support for reductions (the same llvm.vp.reduce.* intrinsics can be used without any change); however, we rely on InstCombine to clean up the redundant bitcast pairs.

Another problem I encountered is the alignment of SWAR-ed load/store instructions: the preferred alignment (DataLayout::getPrefTypeAlign) for wider integer types is usually higher, so additional run-time checks are needed (analysis of the GCC examples shows the same kind of checks); this is done in the generateAlignChecks() method. A separate cost method (getSWARInstructionCost) was also introduced, because some operations (e.g. add/sub) need more complex cost calculations.

Final words

This optimization was tested on some internal workloads and, as expected, showed big improvements on tasks with these kinds of loops: for example, the already-mentioned dualcast() function shows a 9.1x performance improvement (on a RISC-V rv64gc machine). The results on SPECint2006 are mostly unchanged, but there is a 0.64% instruction-count reduction in 401.bzip2. Another discussion topic is the impact on compile time: currently LoopVectorizer is skipped entirely for targets without vector registers, so the benefit of SWAR vectorization versus the compile-time increase needs further investigation.

I would be glad to hear any thoughts or suggestions about the proposal and PR. Feedback would be very much appreciated!


I’m interested in seeing where this goes.

Even without vectorization, we’ve not been good at canonicalizing to SWAR-like patterns for irregular bitfield arithmetic - not sure if your work has much overlap (afaict you’re only looking at regular loop vectorization and packing the same uniform type), or will hit similar issues to this: