Modeling GPU vector registers, again (with my implementation)

It seems to me that LLVM's sub-register support is not a good fit for the following hardware architecture.

All instructions of the hardware are vector instructions. Every register contains
four 32-bit floating-point sub-registers, called r0.x, r0.y, r0.z and r0.w.

Most instructions write more than one element, in this way:

mul r0.xyw, r1, r2
add r0.z, r3, r4
sub r5, r0, r1

Notice that the four elements of r0 are written by two different instructions.

My question is: how should I model these sub-registers? If I treat each component
as a register and do the register allocation individually, it seems very
difficult to merge the scalar operations back into one vector operation.

// each %reg is a sub-register
// r1, r2, r3, r4 here are virtual register numbers

mul %reg1024, r1, r2 // x
mul %reg1025, r1, r2 // y
mul %reg1026, r1, r2 // z

add %reg1027, r3, r4 // w

sub %reg1028, %reg1024, r1
sub %reg1029, %reg1025, r1
sub %reg1030, %reg1026, r1
sub %reg1031, %reg1027, r1

So I decided to model each 4-element register as one Register in the *.td file.

Here are the details.

Since all four elements of a vector register occupy the same 'alloca',
during the conversion of shader assembly to LLVM IR I check whether a vector
register is written (to different elements) by different instructions. When
the second write happens, I generate a shufflevector to multiplex the
existing value and the new value, and store the result of the shufflevector.
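
In IRBuilder terms, the merge step looks roughly like the sketch below; the
helper and its parameters are illustrative only, not lifted from my converter:

#include "llvm/IR/IRBuilder.h"

// Illustrative helper: store 'NewVal' into the <4 x float> slot 'RegSlot',
// writing only the elements selected by the 4-bit 'WriteMask'.
static void emitMaskedStore(llvm::IRBuilder<> &Builder, llvm::Value *RegSlot,
                            llvm::Value *NewVal, unsigned WriteMask) {
  if (WriteMask == 0xF) { // full write, no merge needed
    Builder.CreateStore(NewVal, RegSlot);
    return;
  }
  // Partial write: load the current value and multiplex element-wise.
  llvm::Value *Old = Builder.CreateLoad(NewVal->getType(), RegSlot);
  int Mask[4];
  for (int I = 0; I != 4; ++I)
    Mask[I] = (WriteMask & (1u << I)) ? I + 4 // element comes from NewVal
                                      : I;    // element kept from Old
  llvm::Value *Merged = Builder.CreateShuffleVector(Old, NewVal, Mask);
  Builder.CreateStore(Merged, RegSlot);
}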

Input assembly language:
mul r0.xy, r1, r2
add r0.zw, r3, r4
sub r5, r0, r1

is converted to LLVM IR:

%r0 = alloca <4 x float>
%mul_1 = fmul <4 x float> %r1, %r2
store <4 x float> %mul_1, <4 x float>* %r0

%add_1 = fadd <4 x float> %r3, %r4
; a store does not immediately happen here
%load_1 = load <4 x float>, <4 x float>* %r0

; select the first two elements from the existing value,
; the last two elements from the newly generated value
%merge_1 = shufflevector <4 x float> %load_1,
<4 x float> %add_1,
<4 x i32> < i32 0, i32 1, i32 6, i32 7 >

; store the multiplexed value
store <4 x float> %merge_1, <4 x float>* %r0

After mem2reg:

%mul_1 = fmul <4 x float> %r1, %r2
%add_1 = fadd <4 x float> %r3, %r4
%merge_1 = shufflevector <4 x float> %mul_1,
<4 x float> %add_1,
<4 x i32> < i32 0, i32 1, i32 6, i32 7 >

After instruction selection:

MUL %reg1024, %reg1025, %reg1026
ADD %reg1027, %reg1028, %reg1029
MERGE %reg1030, %reg1024, "xy", %reg1027, "zw"

The 'shufflevector' is selected into a MERGE instruction by the default LLVM
instruction selector. The hardware doesn't have this instruction. I have a
pre-register-allocation FunctionPass to remember:

The physical register allocated to the destination register of MERGE
(%reg1030) should replace the physical registers allocated to the
destination registers of MUL (%reg1024) and ADD (%reg1027).

In this way I ensure that MUL and ADD write to the same physical register. The
replacement is done in another FunctionPass, after register allocation.

MUL and ADD have an 'OptionalDefOperand' writemask. By default the writemask is
"xyzw" (all elements are written).

// 0xF == all elements are written by default
def WRITEMASK : OptionalDefOperand<OtherVT, (ops i32imm), (ops (i32 0xF))>
{...}

def MUL : MyInst<(outs REG4X32:$dst),
                 (ins REG4X32:$src0, REG4X32:$src1, WRITEMASK:$wm),

In the said post-register-allocation FunctionPass, in addition to replacing the
destination registers as described above, the writemask ($wm) of each
instruction is also replaced with the corresponding writemask operand of MERGE. So:

MUL %R0, %R1, %R2, "xyzw"
ADD %R5, %R3, %R4, "xyzw"
MERGE %R6, %R0, "xy", %R5, "zw"

==>

MUL %R6, %R1, %R2, "xy" // "xy" comes from MERGE operand 2
ADD %R6, %R3, %R4, "zw"
// MERGE %R6, %R0, "xy", %R5, "zw" <== REMOVED
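
Sketched as C++, the post-register-allocation rewrite might look like the
following; MyTgt::MERGE, the operand layout (dst, src0, mask0, src1, mask1),
the position of $wm as the last operand of MUL/ADD, and the assumption that
each MERGE input is defined in the same block are all simplifications of mine:

#include "llvm/ADT/DenseMap.h"
#include "llvm/CodeGen/MachineFunctionPass.h"

struct MergeRewritePass : llvm::MachineFunctionPass {
  static char ID;
  MergeRewritePass() : llvm::MachineFunctionPass(ID) {}

  bool runOnMachineFunction(llvm::MachineFunction &MF) override {
    bool Changed = false;
    for (llvm::MachineBasicBlock &MBB : MF) {
      // Most recent instruction defining each physical register.
      llvm::DenseMap<llvm::Register, llvm::MachineInstr *> LastDef;
      for (auto MII = MBB.begin(), E = MBB.end(); MII != E;) {
        llvm::MachineInstr &MI = *MII++;
        if (MI.getOpcode() != MyTgt::MERGE) {
          if (MI.getNumOperands() && MI.getOperand(0).isReg() &&
              MI.getOperand(0).isDef())
            LastDef[MI.getOperand(0).getReg()] = &MI;
          continue;
        }
        // MERGE $dst, $src0, $mask0, $src1, $mask1
        llvm::Register Dst = MI.getOperand(0).getReg();
        for (unsigned I = 1; I + 1 < MI.getNumOperands(); I += 2) {
          if (llvm::MachineInstr *Def =
                  LastDef.lookup(MI.getOperand(I).getReg())) {
            Def->getOperand(0).setReg(Dst); // MUL/ADD now writes MERGE's dst
            Def->getOperand(Def->getNumOperands() - 1)
                .setImm(MI.getOperand(I + 1).getImm()); // copy the writemask
          }
        }
        MI.eraseFromParent(); // the hardware has no MERGE
        Changed = true;
      }
    }
    return Changed;
  }
};
char MergeRewritePass::ID = 0;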

Final machine code:

MUL r6.xy, r1, r2
ADD r6.zw, r3, r4
SUB r8, r6, r1

I don't feel very comfortable with these two rather ad-hoc FunctionPasses.

Alex.


Well, how many possible permutations are there? Is it possible to model each case as a separate physical register?

Evan

Evan Cheng-2 wrote:

Well, how many possible permutations are there? Is it possible to
model each case as a separate physical register?

Evan

I don't think so. There are 4x4x4x4 = 256 permutations. For example:

* xyzw: default
* zxyw
* yyyy: splat

Even if I can model each of these 256 cases as a separate physical register,
how can I model the use of r0.xyzw in the following example:

// dp4 = dot product 4-element
dp4 r0.x, r1, r2
dp4 r0.y, r3, r4
dp4 r0.z, r5, r6
dp4 r0.w, r7, r8
sub r5, r0.xyzw, r6

Alex,
I hit the same problem: there is no direct support for swizzles in the manner
that you would normally code them, and in my case I have 6^4 permutations on
src registers and 24 combinations in the dst registers. The way that I ended
up handling this was to have different register classes for 1-, 2-, 3- and
4-component vectors. This made the generic cases very simple but still left
swizzling fairly difficult.
  In order to get swizzling to work you only need to handle three
SDNodes (insert_vector_elt, extract_vector_elt and build_vector) while
expanding the rest. I custom lowered those three nodes to a
target-specific node with an extra integer constant per register that
encodes the swizzle mask in 32 bits. The correct swizzles can then
be generated in the asm printer by decoding that integer constant. This
does require extra moves, but your example would end up
being something like the following:

dp4 r100, r1, r2
mov r0.x, r100 (float4 => float1 extract_vector_elt)
dp4 r101, r4, r5
mov r3.x, r101 (float4 => float1 extract_vector_elt)
iadd r6.xy__, r0.x000, r3.0x00 (float1 + float1 => float2 build_vector)
dp4 r7.x, r8, r9
<as above>
dp4 r10.x, r11, r12
<as above>
iadd r13.xy__, r7.x000, r10.0x00 (float1 + float1 => float2 build_vector)
iadd r14, r13.xy00, r6.00xy (float2 + float2 => float4 build_vector)
sub r15, r14, r9

It's not as compact and neat, but it works, and the move instructions will
get optimized away by the lower-level GPU compiler.
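
As a rough C++ sketch of that custom lowering step; MyTargetLowering,
MyISD::MOV_SWIZZLE and the lane encoding are made-up names for illustration:

#include "llvm/CodeGen/SelectionDAG.h"

llvm::SDValue
MyTargetLowering::LowerEXTRACT_VECTOR_ELT(llvm::SDValue Op,
                                          llvm::SelectionDAG &DAG) const {
  llvm::SDLoc DL(Op);
  llvm::SDValue Vec = Op.getOperand(0);
  uint64_t Lane =
      llvm::cast<llvm::ConstantSDNode>(Op.getOperand(1))->getZExtValue();
  // Encode the selected lane; the asm printer later decodes this constant
  // into a textual swizzle such as ".x000".
  uint32_t Swizzle = Lane & 0xFF;
  return DAG.getNode(MyISD::MOV_SWIZZLE, DL, Op.getValueType(), Vec,
                     DAG.getTargetConstant(Swizzle, DL, llvm::MVT::i32));
}

This is reached by marking the operation Custom, e.g. with
setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v4f32, Custom) in the
target's TargetLowering constructor.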

Hope this helps,
Micah


Villmow, Micah:

This problem argues for why SDNode should be target polymorphic. If it were, a target-specific node would be largely unnecessary. (By a target-specific node, I mean extending the ISD enumeration for your target.) A target-polymorphic SDNode would still capture all of the behaviors and attributes of insert_vector_elt, extract_vector_elt and build_vector, but would also allow you to add behaviors and attributes of your own, which is mostly the point of object-oriented programming. Assuming you don't need to do extra DAGCombine work, you would get that for free from the parent class.

Unfortunately, that would mean a lot of work at this juncture and a heavy overhaul of SelectionDAGNodes.h and associated SelectionDAG source. Node allocation, in particular, would become more complicated (but not unsolvable.)

Were anyone going to tackle this problem, the solution would have to be largely incremental, i.e., the source can't be overhauled all at once, but should permit incremental transition of SDNodes to a behavioral interface style.
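
To make the idea concrete, something along these lines; this is purely
hypothetical and nothing like it exists in SelectionDAGNodes.h today:

#include "llvm/CodeGen/SelectionDAGNodes.h"

// A target could inherit a generic node's semantics and attach its own
// attributes, instead of minting a new ISD opcode.
class MySwizzledBuildVectorSDNode : public llvm::BuildVectorSDNode {
  uint32_t SwizzleEncoding; // target-specific attribute

public:
  uint32_t getSwizzleEncoding() const { return SwizzleEncoding; }
  // Generic DAGCombine and legalization would keep treating this node
  // as an ordinary build_vector through the base-class interface.
};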

-scooter

This is a very good use case for vector masks in LLVM. Expressing this as
two masked operations and a merge:

*** Warning, pseudo-LLVM code ***
mul r0, r1, r2, [1101] ; [xy_w]
add r6, r3, r4, [0010] ; [__z_]
*** The assumption here is that masked elements are undefined, so we need a merge ***
select r0, r0, r6, [1101] ; Select 1's from r0, 0's from r6, merge
sub r5, r0, r1, [1111] ; Or have no mask == full mask
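
Element-wise, the masked select above means nothing more than this (plain
C++ for illustration):

// Per-element meaning of 'select r0, r0, r6, [1101]': a 1 bit keeps the
// lane from a, a 0 bit takes it from b.
void maskedSelect(float dst[4], const float a[4], const float b[4],
                  const bool mask[4]) {
  for (int i = 0; i < 4; ++i)
    dst[i] = mask[i] ? a[i] : b[i];
}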

The registers are just vector registers then. They don't have component
pieces. Regalloc will have no problem with them.

The MachineInstrs for your architecture would have to preserve the mask
semantics. In the AsmPrinter for your architecture, it would be a simple
matter to dump out the mask as the field specifier on a register name.

The masks would allow you to get rid of the shufflevector stuff. Since you
don't have a hardware merge instruction, you could keep your pre- and
post-regalloc passes to rewrite things. Alternatively, a very simple
post-regalloc peephole pass could examine the masks of the merge and rewrite
the registers in the defs, with no pre-regalloc pass needed to remember things.
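
The legality check for that peephole is tiny; a sketch, with mask values as
4-bit lane sets matching the pseudo-code above:

// The merge can be folded into its two feeding defs only if the lane
// masks are disjoint and together cover the whole register.
bool canFoldMerge(unsigned Mask0, unsigned Mask1) {
  return (Mask0 & Mask1) == 0 && (Mask0 | Mask1) == 0xF;
}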

Alas, we do not have masks in LLVM just yet. But I'm getting to the point
where I'm ready to restart that discussion. :)

This also won't directly handle the more general case of swizzling:

r0.wyzx = ...

But a "regular" masked operation followed by a shufflevector should do it.

                                      -Dave