How to allocate consecutive registers?


The GPU target I am working on requires the 'value' and 'address' operands of a memory store instruction to be in consecutive registers. Does anybody have a suggestion?

  • Ruiling

Hi Ruiling,

Make the store instruction take only one operand: a tuple register.
You can find examples of tuple registers in the ARM backend.
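The tuple-register idea can be sketched in TableGen roughly as follows. This is a minimal illustration for a made-up "Demo" target, not code from the ARM backend: the names `DemoReg`, `R0` to `R7`, `GPR32`, `sub0`, `sub1`, `GPRPairs` and `GPR64` are all hypothetical. It builds one tuple for every pair of consecutive registers, so a store that takes a single `GPR64` operand is guaranteed to get its two halves in adjacent registers.

```tablegen
include "llvm/Target/Target.td"

// Hypothetical 32-bit registers R0..R7 for an illustrative "Demo" target.
class DemoReg<string n, bits<16> enc> : Register<n> {
  let Namespace = "Demo";
  let HWEncoding = enc;
}
foreach i = 0-7 in
  def R#i : DemoReg<"r"#i, i>;

def GPR32 : RegisterClass<"Demo", [i32], 32, (sequence "R%u", 0, 7)>;

// Sub-register indices naming the low and high half of a pair.
let Namespace = "Demo" in {
  def sub0 : SubRegIndex<32>;
  def sub1 : SubRegIndex<32, 32>;
}

// Zip two lists element by element: (R0,R1), (R1,R2), ..., (R6,R7).
// 'trunc' keeps the first 7 registers, 'shl' drops the first one.
def GPRPairs : RegisterTuples<[sub0, sub1],
                              [(add (trunc GPR32, 7)),
                               (add (shl GPR32, 1))]>;

// The store instruction would take one operand of this class
// instead of two separate GPR32 operands.
def GPR64 : RegisterClass<"Demo", [i64], 64, (add GPRPairs)>;
```

If the hardware only accepts aligned pairs, `decimate` can be applied to both lists so that only every second starting register is used.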


There are other CPUs with similar restrictions. You could look at how they handle it. An example which springs to mind is ARM A32 LDRD and STRD (load/store two consecutive registers). I think some other architectures do the same for operations which return two results, such as div/mod or NxN->2N multiply.

The difficult bit will be if there are loads with the same property. I
don't think you can easily encode the fact that one half of a register
is read and the other written.


It seems the ARM target uses REG_SEQUENCE to form a register tuple and lets the store instruction accept that tuple.
Did I understand that correctly? What if the address is 64-bit while the value is 32-bit? Is there a simple way to handle that? REG_SEQUENCE looks like it only accepts sub-registers of the same type.

But the real difficulty for me is that I have already run out of lane mask bits.
I gave a brief introduction to Intel GPU registers in the thread:

In a later trial, I hit the problem of running out of lane mask bits.

After that I chose to define all register tuples using only Rw0 to Rw2047. With subw0 to subw31, the largest class I could reach was RegQ_SIMD8.
Some pieces of the definitions are listed below:

foreach Index = 0-31 in {
  def subw#Index : SubRegIndex<16, !shl(Index, 4)>;
}

class IntelGPUReg<string n, bits<13> regIdx> : Register<n> {
  bits<1> regFile;
  let Namespace = "IntelGPU";
  let HWEncoding{12-0} = regIdx;
  let HWEncoding{15} = regFile;
}
foreach Index = 0-2047 in {
  def Rw#Index : IntelGPUReg<"Rw"#Index, !shl(Index, 1)> {
    let regFile = 0;
  }
}
// b->byte w->word d->dword q->qword
def gpr_w : RegisterClass<"IntelGPU", [i16], 16,
                          (sequence "Rw%u", 0, 2047)> {
  let AllocationPriority = 1;
}

def gpr_q_simd8 : RegisterTuples<[subw0, subw1, subw2, subw3, subw4, subw5, subw6, subw7,
                                  subw8, subw9, subw10, subw11, subw12, subw13, subw14, subw15,
                                  subw16, subw17, subw18, subw19, subw20, subw21, subw22, subw23,
                                  subw24, subw25, subw26, subw27, subw28, subw29, subw30, subw31],
                                 [(add (decimate gpr_w, 16)),
                                  (add (decimate (shl gpr_w, 1), 16)),
                                  (add (decimate (shl gpr_w, 2), 16)),
                                  (add (decimate (shl gpr_w, 3), 16)),
                                  (add (decimate (shl gpr_w, 4), 16)),
                                  (add (decimate (shl gpr_w, 5), 16)),
                                  (add (decimate (shl gpr_w, 6), 16)),

                                  (add (decimate (shl gpr_w, 30), 16)),
                                  (add (decimate (shl gpr_w, 31), 16))]>;

def RegQ_SIMD8 : RegisterClass<"IntelGPU", [i64, f64], 64, (add gpr_q_simd8)>;
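To make the pressure on the lane mask concrete, here is the bit count implied by the listing, assuming that each `subw` index covers its own non-overlapping 16-bit slice and therefore needs its own lane mask bit, and reading SIMD8 as 8 channels of 64 bits each. The SIMD16 row is a hypothetical next step for comparison; it is not defined anywhere in this thread.

```latex
\begin{align*}
\text{RegQ\_SIMD8:}\quad
  & 8 \times 64\ \text{bits} = 512\ \text{bits} = 32 \times 16\ \text{bits}
  \;\Rightarrow\; 32\ \text{lane mask bits (subw0 to subw31)} \\
\text{hypothetical qword SIMD16:}\quad
  & 16 \times 64\ \text{bits} = 1024\ \text{bits} = 64 \times 16\ \text{bits}
  \;\Rightarrow\; 64\ \text{lane mask bits}
\end{align*}
```

Under these assumptions, every doubling of the tuple width doubles the number of lane mask bits needed as long as the tuples are built from 16-bit leaf sub-registers.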

If I introduce larger register tuples, I need more lane mask bits.
Maybe I need to find some other way, or greatly increase the number of lane mask bits.
For now this is hard for me, as I am not very familiar with the LLVM register allocator. Any suggestions?
If I have not stated the problem clearly, please feel free to send me an email.

  • Ruiling