How to describe the RegisterInfo?

Hello Everyone,

I am trying to write a new LLVM backend target for Intel GPUs.
I would like to start by targeting OpenCL.
I am not very familiar with the LLVM backend infrastructure, though,
and I have some problems describing the RegisterInfo.

An Intel GPU launches many hardware threads to run a GPGPU workload.
Each hardware thread has 128 registers (r0-r127), each 32 bytes in size.

Each hardware thread may run in SIMD 8/16/32 mode, which maps to
8/16/32 OpenCL work-items. The SIMD width is chosen at
compile time (normally according to register pressure; a bigger SIMD width means higher register pressure).
Note that each instruction has its own exec-width, which may differ from the program's SIMD width.
Normally we allocate contiguous registers for a divergent value.
For example, in a program compiled as SIMD 8, a divergent float/i32 value needs
4 bytes * 8 = 32 bytes. A 'short' value
only needs 2 bytes * 8 = 16 bytes, which is half of a 32-byte register.

We may also allocate registers for 'uniform' values. A uniform value only needs a type-sized register,
without multiplying by the SIMD width. A uniform float/i32 value only needs 4 bytes of a physical register,
so a 32-byte register can hold up to 8 different uniform float/i32 values.
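
The size arithmetic above can be written down as a small C++ sketch. The function and constant names are made up for illustration and are not part of LLVM or of any Intel tool; the only facts taken from the post are the 32-byte register size and the "element size times SIMD width" rule.

```cpp
#include <cassert>

// Size of one general register in bytes, as stated in the post.
const unsigned kRegBytes = 32;

// Bytes needed by a divergent value: one element per SIMD lane.
inline unsigned divergentBytes(unsigned elemBytes, unsigned simdWidth) {
  assert(elemBytes != 0 && simdWidth != 0);
  return elemBytes * simdWidth;
}

// Number of uniform values of a given element size that fit in one register.
inline unsigned uniformsPerReg(unsigned elemBytes) {
  assert(elemBytes != 0 && kRegBytes % elemBytes == 0);
  return kRegBytes / elemBytes;
}
```

For SIMD 8, `divergentBytes(4, 8)` gives the full 32-byte register for a float/i32, `divergentBytes(2, 8)` gives the 16 bytes of a short, and `uniformsPerReg(4)` gives the 8 uniform float/i32 slots per register.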

Sometimes we also need to access a register in a strided way. For a bitcast from i64 to v2i32,
we need to access the i64 register with a horizontal stride of 2.
In the example below, the i64 value is held in r10 and r11. L/H stand for the low/high 32 bits.
The SIMD width of the program is 8, so we have 8 L/H pairs.
r10: L H L H L H L H
r11: L H L H L H L H
The two instructions below extract the low and high 32-bit parts.
mov(8 | M0) r12.0<1>, r10.0<8,4,2>:D
mov(8 | M0) r13.0<1>, r10.1<8,4,2>:D
(The format of a register region is RegNum.regSubNum<vertStride, width, horzStride>:type.)
(Note that regSubNum is measured in units of the register type here.)
Afterwards r12/r13 contain the two components of the result vector.
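
To make the region notation concrete, here is a hedged C++ sketch that computes which bytes of the register file each lane of such a `mov` reads. It assumes the usual reading of a region: lanes are laid out in rows of `width` elements, rows start `vertStride` elements apart, and elements within a row are `horzStride` apart, all in units of the element type. The `Region` struct and `regionByteOffsets` are illustrative names, not an existing API.

```cpp
#include <cassert>
#include <vector>

// One source region, RegNum.regSubNum<vertStride, width, horzStride>:type.
// Strides and subRegNum are in units of the element type.
struct Region {
  unsigned regNum;
  unsigned subRegNum;
  unsigned vertStride;
  unsigned width;
  unsigned horzStride;
  unsigned typeBytes;
};

// Byte offset (from the start of r0) read by each of the execSize lanes,
// assuming 32-byte registers.
inline std::vector<unsigned> regionByteOffsets(const Region &r, unsigned execSize) {
  assert(r.width != 0 && execSize % r.width == 0);
  std::vector<unsigned> offsets;
  for (unsigned lane = 0; lane < execSize; ++lane) {
    unsigned row = lane / r.width;
    unsigned col = lane % r.width;
    unsigned elem = r.subRegNum + row * r.vertStride + col * r.horzStride;
    offsets.push_back(r.regNum * 32 + elem * r.typeBytes);
  }
  return offsets;
}
```

With `Region{10, 0, 8, 4, 2, 4}` and an exec size of 8, the lanes read every second dword starting at the first dword of r10 and continuing into r11, which are the L halves. Changing `subRegNum` to 1 selects the H halves instead.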
You can refer to the link below for more details on Intel GPU assembly and register usage:

https://software.intel.com/en-us/articles/introduction-to-gen-assembly

I notice that the hardware encoding of a register is 16 bits. That is not enough to encode all the
register region parameters (regNum, type, hstride, vstride, width, ...) in RegisterInfo.td, and I am not sure
where the stride/type/width information for a physical register should reasonably live.
Maybe some other .cpp file is more suitable than the RegisterInfo.td file? I ask because I need to change the register
region parameters in the bitcast instruction (from qword with hstride 1 to dword with hstride 2).
At which stage should such bitcast logic be done? After register allocation?

The detailed hardware spec is located at:
https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol07-3d_media_gpgpu_3.pdf
Page 921 describes the detailed instruction encoding format.
It needs (regFile, regNum, subRegNum, width, type, addrMode, hStride, vStride) to describe a register.

I have attached the first version of my RegisterInfo.td.
I also have a question about the attached file: do I have to define different SubRegIndex records
like the ones below to make TableGen work correctly?

foreach Index = 0-15 in {
  def subd#Index : SubRegIndex<32, !shl(Index, 5)>; // used as SubRegIndex when declaring gpr_d_simd8
  def subw#Index : SubRegIndex<16, !shl(Index, 4)>; // used as SubRegIndex when declaring gpr_w_simd8
}

If anything I said is unclear, just reply to this mail. Thanks for any help!

Thanks!
Ruiling

IntelGPURegisterInfo.td (5.77 KB)

As a GPU backend maintainer, I strongly discourage trying to model the total register bank of the GPU in LLVM. Just model one thread. This will make things much, much easier.

—escha

Hi Escha,

Great to have your comment! Do you have any specific reason for advising against this?
I am not sure whether I understand your point correctly. By "just model one thread",
do you mean "only consider ONE of the 8/16 work-item lanes that run in lock-step"?

In my case, that might mean I only need to define r0~r127 as i32 registers (each r# is just big enough for a SIMD8 i32).
Then the register allocator never needs to allocate sub-registers and just operates on whole registers, right?

Yes, that looks really easy for divergent registers. But I think I would then lose the ability
to allocate uniform registers. Am I right? Is there any way to allocate uniform registers
as well as divergent registers?

If I understand right, on this arch, ‘uniform’ refers to values that only take one lane of register file instead of SIMD-width lanes, and they share the same region of the register file as non-uniform values. This is in contrast to e.g. AMDGPU where SGPRs (scalar GPRs) and VGPRs are separate register files.

If this understanding is correct, you may be able to define uniform and non-uniform registers separately, but make sure that one aliases the other, e.g. so that (if your SIMD width is 16) VGPR 20 overlaps SGPR 320, 321, ..., 335. So you can have 128 vector registers, 16*128 uniforms, or a mix of the two.
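
The aliasing relation in that example can be sketched in C++ as plain index arithmetic. This is only an illustration of the numbering used above (vector register v overlapping uniform slots v*width through v*width + width - 1); the function names are hypothetical and nothing here is an LLVM API.

```cpp
#include <cassert>

// First and last uniform slot covered by vector register `vreg`
// when each vector register holds `simdWidth` uniform-sized slots.
inline unsigned firstUniformAlias(unsigned vreg, unsigned simdWidth) {
  return vreg * simdWidth;
}

inline unsigned lastUniformAlias(unsigned vreg, unsigned simdWidth) {
  assert(simdWidth != 0);
  return vreg * simdWidth + simdWidth - 1;
}

// True if uniform slot `ureg` lives inside vector register `vreg`.
inline bool aliases(unsigned vreg, unsigned ureg, unsigned simdWidth) {
  assert(simdWidth != 0);
  return ureg / simdWidth == vreg;
}
```

For a SIMD width of 16, `firstUniformAlias(20, 16)` and `lastUniformAlias(20, 16)` give 320 and 335, matching the example.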

(Maybe some of the AMDGPU maintainers have thoughts?)

—escha

Yes, the arch is just as you said. It is something like the AMD GPU, but the Intel GPU does not have separate register files for scalar and vector values.
In fact, my idea of defining the register tuples was borrowed from SIRegisterInfo.td in the AMDGPU backend.

But it seems that the AMD GPU mainly supports i32/i64 register types, while the Intel GPU also supports byte/short register types.
So I have to start by defining 'byte' registers, and then build up the registers of other types through RegisterTuples.

I thought RegisterTuples is a way of expressing register aliases in the RegisterInfo.td file, but I am not sure whether I understand it correctly. My first attempt is below (to keep things simple, I removed some WORD/QWORD register classes):

let Namespace = "IntelGPU" in {

foreach Index = 0-15 in {
  def sub#Index : SubRegIndex<32, !shl(Index, 5)>;
}
}

class IntelGPUReg<string n, bits<13> regIdx> : Register<n> {
  bits<2> HStride;
  bits<1> regFile;

  let Namespace = "IntelGPU";
  let HWEncoding{12-0} = regIdx;
  let HWEncoding{15} = regFile;
}
// here I define the whole 4096 byte registers
foreach Index = 0-4095 in {
  def Rb#Index : IntelGPUReg<"Rb"#Index, Index> {
    let regFile = 0;
  }
}

// b-->byte w-->word d-->dword q-->qword
// the set of uniform byte registers
def gpr_b : RegisterClass<"IntelGPU", [i8], 8,
                          (sequence "Rb%u", 0, 4095)> {
  let AllocationPriority = 1;
}

def gpr_d : RegisterTuples<[sub0, sub1, sub2, sub3],
                           [(add (decimate gpr_b, 4)),
                            (add (decimate (shl gpr_b, 1), 4)),
                            (add (decimate (shl gpr_b, 2), 4)),
                            (add (decimate (shl gpr_b, 3), 4))]>;

// SIMD byte uses a stride-2 register, as stride 1 does not support useful ALU instructions
def gpr_b_simd8 : RegisterTuples<[sub0, sub1, sub2, sub3, sub4, sub5, sub6, sub7],
                                 [(add (decimate gpr_b, 16)),
                                  (add (decimate (shl gpr_b, 2), 16)),
                                  (add (decimate (shl gpr_b, 4), 16)),
                                  (add (decimate (shl gpr_b, 6), 16)),
                                  (add (decimate (shl gpr_b, 8), 16)),
                                  (add (decimate (shl gpr_b, 10), 16)),
                                  (add (decimate (shl gpr_b, 12), 16)),
                                  (add (decimate (shl gpr_b, 14), 16))]>;

def gpr_d_simd8 : RegisterTuples<[sub0, sub1, sub2, sub3, sub4, sub5, sub6, sub7],
                                 [(add (decimate gpr_d, 8)),
                                  (add (decimate (shl gpr_d, 1), 8)),
                                  (add (decimate (shl gpr_d, 2), 8)),
                                  (add (decimate (shl gpr_d, 3), 8)),
                                  (add (decimate (shl gpr_d, 4), 8)),
                                  (add (decimate (shl gpr_d, 5), 8)),
                                  (add (decimate (shl gpr_d, 6), 8)),
                                  (add (decimate (shl gpr_d, 7), 8))]>;
def RegD_Uniform : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d)>;
def RegD_SIMD8 : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d_simd8)>;

This makes it easy for me to define the register alias information, but it does not work!
TableGen exits and tells me: "error:Ran out of lanemask bits to represent subregister sub1_then_sub1"
Does anybody know what is wrong here?

- Ruiling

Hi,

I would recommend encoding some of the register region parameters as part
of the instruction rather than using the register encoding, because
something like 'width' seems more like a property of the instruction
than of the register to me.
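
As a rough picture of this split, the C++ sketch below separates the fields that name storage (candidates for the register description) from the fields that describe how one instruction reads that storage. All type and function names are hypothetical. The rewrite rule for the i64 to v2i32 case is an assumption derived from the `<8,4,2>:D` example earlier in the thread: reading the dword halves of a qword region doubles both strides and scales the sub-register number to dword units.

```cpp
#include <cassert>

// Names a piece of storage; roughly what a register description could carry.
struct PhysRegRef {
  unsigned regFile;
  unsigned regNum;
  unsigned subRegNum; // in units of the element type
};

// Describes how a single instruction operand walks that storage.
struct RegionParams {
  unsigned vertStride;
  unsigned width;
  unsigned horzStride;
  unsigned typeBytes;
};

struct SrcOperand {
  PhysRegRef reg;
  RegionParams region;
};

// Re-express a qword operand as the dword at position `half`
// (0 = low, 1 = high) of every qword element.
inline SrcOperand qwordHalfAsDword(const SrcOperand &q, unsigned half) {
  assert(q.region.typeBytes == 8 && half < 2);
  SrcOperand d = q;
  d.region.typeBytes = 4;
  d.region.vertStride = q.region.vertStride * 2;
  d.region.horzStride = q.region.horzStride * 2;
  d.reg.subRegNum = q.reg.subRegNum * 2 + half;
  return d;
}
```

In this sketch the bitcast never touches `regNum`, so it could be handled as an operand rewrite rather than as a different physical register.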

-Tom

Lanemasks are used in several places in the compiler to describe live/dead subregister parts. That is, if you take your largest register (which may be a tuple), how many different subregisters can you reach from it? I would expect that in your example you can reach 8 gpr_d registers from a gpr_d_simd8 through sub0-sub7, and from each gpr_d you can reach 4 gpr_b registers through sub0-sub3. This should be fine with 32 bits/lanes. I am not sure if that is the problem here, but I think you should use different subregister indices for the byte access (bsub0-bsub3) than you used for the higher-level tuples.

You could also experiment with increasing the limit in TableGen and changing the LaneBitmask typedef. However, this has possible implications for the memory use and performance of the register allocator, so it would be good to find a way to avoid that.
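
The lane budget described here can be estimated with a few lines of C++. This is a simplified count under two assumptions: the lane mask is 32 bits wide, and each distinct leaf sub-register reachable from the largest tuple needs its own bit. The names are illustrative, and the result should be read as a lower bound, since TableGen may hand out more bits than this when composite indices are not shared.

```cpp
#include <cassert>

// Assumed lane mask width, per the 32 bits/lanes mentioned above.
const unsigned kLaneMaskBits = 32;

// Leaf sub-registers reachable from one SIMD tuple whose elements are
// elemBytes wide and are themselves built from leafBytes-wide registers.
inline unsigned leafLanes(unsigned simdWidth, unsigned elemBytes, unsigned leafBytes) {
  assert(leafBytes != 0 && elemBytes % leafBytes == 0);
  return simdWidth * (elemBytes / leafBytes);
}

inline bool fitsLaneMask(unsigned lanes) { return lanes <= kLaneMaskBits; }
```

A SIMD8 dword tuple built from byte registers gives `leafLanes(8, 4, 1)`, which is exactly 32. A SIMD8 qword tuple built from bytes would need 64 and cannot fit, while building it from 16-bit registers brings the simple count back to 32.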

- Matthias

Hi Matthias,

Thanks for your explanation. It really helps me! I tried it and confirmed that
a 32-bit lanemask works for gpr_d_simd8 to reach 8 gpr_d registers through
subd0-subd7 and then 4 gpr_b registers through sub0-sub3.
Based on this, the new RegisterInfo.td looks like the one below. As there is only
a 32-bit lanemask, I chose to define Rw# (word registers) instead of Rb#.
I think that with word registers as the base, I can at least describe a SIMD8 qword register.
But it does not work once I add the gpr_q_simd8 register.
Following your advice, w0-w3 are used as the subregister indices for the low-level
word access, and subd0-subd7 as the subregister indices for the second level,
the dwords.

let Namespace = "IntelGPU" in {

foreach Index = 0-3 in {
  def w#Index : SubRegIndex<16, !shl(Index, 4)>;
}
foreach Index = 0-7 in {
// def subw#Index : SubRegIndex<16, !shl(Index, 4)>;
  def subd#Index : SubRegIndex<32, !shl(Index, 5)>;
// def subq#Index : SubRegIndex<64, !shl(Index, 6)>;
}
}

class IntelGPUReg<string n, bits<13> regIdx> : Register<n> {
  bits<2> HStride;
  bits<1> regFile;

  let Namespace = "IntelGPU";
  let HWEncoding{12-0} = regIdx;
  let HWEncoding{15} = regFile;
}
foreach Index = 0-2047 in {
  def Rw#Index : IntelGPUReg <"Rw"#Index, !shl(Index, 1)> {
    let regFile = 0;
  }
}

// b-->byte w-->word d-->dword q-->qword

def gpr_w : RegisterClass<"IntelGPU", [i16], 16,
                          (sequence "Rw%u", 0, 2047)> {
  let AllocationPriority = 1;
}

def gpr_d : RegisterTuples<[w0, w1],
                           [(add (decimate gpr_w, 2)),
                            (add (decimate (shl gpr_w, 1), 2))]>;

def gpr_q : RegisterTuples<[w0, w1, w2, w3],
                           [(add (decimate gpr_w, 4)),
                            (add (decimate (shl gpr_w, 1), 4)),
                            (add (decimate (shl gpr_w, 2), 4)),
                            (add (decimate (shl gpr_w, 3), 4))]>;

//def gpr_w_simd8 : RegisterTuples<[subw0, subw1, subw2, subw3, subw4, subw5, subw6, subw7],
// [(add (decimate gpr_w, 8)),
// (add (decimate (shl gpr_w, 1), 8)),
// (add (decimate (shl gpr_w, 2), 8)),
// (add (decimate (shl gpr_w, 3), 8)),
// (add (decimate (shl gpr_w, 4), 8)),
// (add (decimate (shl gpr_w, 5), 8)),
// (add (decimate (shl gpr_w, 6), 8)),
// (add (decimate (shl gpr_w, 7), 8))]>;

def gpr_d_simd8 : RegisterTuples<[subd0, subd1, subd2, subd3, subd4, subd5, subd6, subd7],
                                [(add (decimate gpr_d, 8)),
                                 (add (decimate (shl gpr_d, 1), 8)),
                                 (add (decimate (shl gpr_d, 2), 8)),
                                 (add (decimate (shl gpr_d, 3), 8)),
                                 (add (decimate (shl gpr_d, 4), 8)),
                                 (add (decimate (shl gpr_d, 5), 8)),
                                 (add (decimate (shl gpr_d, 6), 8)),
                                 (add (decimate (shl gpr_d, 7), 8))]>;

The issue shows up with the definition below. Using subd0-subd7 causes
"llvm/utils/TableGen/CodeGenRegisters.cpp:1146: void llvm::CodeGenRegBank::computeComposites(): Assertion `Idx3 && "Sub-register doesn't have an index"' failed".
If I change it to subq0-subq7, it reports "error:Ran out of lanemask bits to represent subregister subq4_then_w3".
Am I defining the SubRegIndex wrongly, or am I misunderstanding something?
Basically, I should use different SubRegIndex records when declaring
gpr_w_simd8/gpr_d_simd8/gpr_q_simd8, as the subregs are of different sizes,
right?

def gpr_q_simd8 : RegisterTuples<[subd0, subd1, subd2, subd3, subd4, subd5, subd6, subd7],
                                [(add (decimate gpr_q, 8)),
                                 (add (decimate (shl gpr_q, 1), 8)),
                                 (add (decimate (shl gpr_q, 2), 8)),
                                 (add (decimate (shl gpr_q, 3), 8)),
                                 (add (decimate (shl gpr_q, 4), 8)),
                                 (add (decimate (shl gpr_q, 5), 8)),
                                 (add (decimate (shl gpr_q, 6), 8)),
                                 (add (decimate (shl gpr_q, 7), 8))]>;

def RegD_Uniform : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d)>;
def RegD_SIMD8 : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d_simd8)>;
def RegQ_Uniform : RegisterClass<"IntelGPU", [i64, f64], 64, (add gpr_q)>;
def RegQ_SIMD8 : RegisterClass<"IntelGPU", [i64, f64], 64, (add gpr_q_simd8)>;

- Ruiling

Thanks for your suggestion. I agree that some region parameters need to be
part of the instruction descriptor. But it is a little hard for me to tell
which parameters should go into the instruction descriptor and which should
be declared in RegisterInfo.td. My current idea is to describe
uniform/non-uniform registers in RegisterInfo.td, while the other register
region parameters (like stride etc.) are left to the instruction descriptor.
The SIMD width of the compiled program determines the width of a non-uniform
register (normally 8 or 16 lanes), so I think it should be included in
RegisterInfo.td. If a value is non-uniform, I would then assign a non-uniform
register class to it. I am not sure whether this can be done easily in LLVM,
and I do not know whether there is any other way to do it besides declaring
uniform/non-uniform registers in RegisterInfo.td. Please share your ideas on
how to allocate non-uniform registers if it is not handled in RegisterInfo.td.

- Ruiling

I did some simple debugging of the subq0~subq7 case by adding some logging
to CodeGenRegBank::computeSubRegLaneMasks():
  for (auto &Idx : SubRegIndices) {
    if (Idx.getComposites().empty()) {
      // added logging
      std::cout << "SubRegIndex " << Idx.getName() << " " << Bit << std::endl;
      ...

It looks like the following subreg lane masks were generated:
SubRegIndex w0 0
SubRegIndex w1 1
SubRegIndex w2 2
SubRegIndex w3 3
SubRegIndex subd7_then_w0 4
SubRegIndex subd7_then_w1 5
SubRegIndex subd6_then_w0 6
SubRegIndex subd6_then_w1 7
SubRegIndex subd5_then_w0 8
SubRegIndex subd5_then_w1 9
SubRegIndex subd4_then_w0 10
SubRegIndex subd4_then_w1 11
SubRegIndex subd3_then_w0 12
SubRegIndex subd3_then_w1 13
SubRegIndex subd2_then_w0 14
SubRegIndex subd2_then_w1 15
SubRegIndex subd1_then_w0 16
SubRegIndex subd1_then_w1 17
My question is: can subd1_then_w0 share the same lane mask as w2? And the same
question applies to subd1_then_w1 and w3.
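
To show why the question arises, the C++ sketch below computes the bit ranges implied by the SubRegIndex declarations above (w#j is 16 bits at offset j << 4, subd#i is 32 bits at offset i << 5) and composes them by adding offsets. The names are illustrative. The sketch only shows that the declared bit ranges coincide; it does not say whether TableGen will treat the two indices as the same lane.

```cpp
#include <cassert>

// Bit range covered by a sub-register index: SubRegIndex<size, offset>.
struct BitRange {
  unsigned offset;
  unsigned size;
};

inline BitRange wIndex(unsigned j) { return BitRange{j << 4, 16}; }
inline BitRange subdIndex(unsigned i) { return BitRange{i << 5, 32}; }

// Range of `inner` when it is taken inside the sub-register named by `outer`.
inline BitRange compose(BitRange outer, BitRange inner) {
  assert(inner.offset + inner.size <= outer.size);
  return BitRange{outer.offset + inner.offset, inner.size};
}

inline bool sameBits(BitRange a, BitRange b) {
  return a.offset == b.offset && a.size == b.size;
}
```

Here `compose(subdIndex(1), wIndex(0))` covers bits 32 to 47, the same range as `wIndex(2)`, and `compose(subdIndex(1), wIndex(1))` matches `wIndex(3)`.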

- Ruiling