I’m currently trying to add new custom “tensor” instructions to open RISC-V cores such as Vortex. I have already implemented them in Verilog, but I wonder which files I should modify in the LLVM project to define instructions that take a group of scalar registers into account for dependency checking. For example, the custom strided load instruction that I want to add has the following specs:
- R-format
- Address in rs1
- Stride in rs2
- rd specifies the start register (we need to load 4 values in consecutive registers)
- Custom opcode
- funct7 specifies the size of the register group, i.e. how many consecutive registers are loaded starting from rd
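To make the layout above concrete, here is a minimal Python sketch of how such an R-format word would be packed. The field positions are the standard RISC-V R-type ones; the opcode value (the custom-0 major opcode, 0x0B) and funct3 = 0 are assumptions for illustration, since the specs above do not fix them, and `registers_written` is a hypothetical helper.

```python
# Assumed values: the specs above only say "custom opcode" and do not mention funct3.
CUSTOM_0_OPCODE = 0b0001011  # RISC-V custom-0 major opcode (assumption)
ASSUMED_FUNCT3 = 0           # hypothetical, not given in the specs


def encode_r_type(opcode, rd, funct3, rs1, rs2, funct7):
    """Pack the six R-type fields into a 32-bit instruction word."""
    fields = [("opcode", opcode, 7), ("rd", rd, 5), ("funct3", funct3, 3),
              ("rs1", rs1, 5), ("rs2", rs2, 5), ("funct7", funct7, 7)]
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)


def registers_written(rd, group_size):
    """Registers implicitly defined by the load: rd, rd+1, ..., rd+group_size-1."""
    if rd + group_size > 32:
        raise ValueError("register group runs past x31")
    return list(range(rd, rd + group_size))


# Strided load: address in x10, stride in x11, four values into x12..x15.
word = encode_r_type(CUSTOM_0_OPCODE, rd=12, funct3=ASSUMED_FUNCT3,
                     rs1=10, rs2=11, funct7=4)
print(f"{word:#010x}", registers_written(12, 4))
```

The second helper is the part the compiler has to learn: the instruction names only rd, but it defines every register in the group.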
Currently, from what I have seen in the backend docs and the llvm/lib/Target/RISCV directory, I think the way to do it is to declare the different register groups in RISCVRegisterInfo.td and the intrinsics in RISCVInstrinsics.td. Furthermore, I would like to be able to infer the size of the register group implicitly at a higher level, by writing the dimensions of the tensor in a “Fragment template” like the one in the CUDA WMMA API documentation. To me this feels like connecting the high-level implementation (potentially the frontend?) with the backend TableGen code.
Finally, to see a possible implementation I have also checked the llvm/lib/Target/NVPTX directory, and I think I have a good grasp of the idea. However, I would like to know whether there is anything else I should change that is not a TableGen file, or anything I should modify in the frontend code. Thanks in advance.
I think it’s preferable to add MC support first before worrying about codegen. MC support consists of the instruction definition, its textual (assembly) representation, and its encoding, but not the instruction selection patterns. Test these parts before moving on to codegen, which specifies how to lower from LLVM IR to your instructions.
Having MC support first means that you can test your instructions with inline assembly. It also requires much less work than adding codegen support.
Yeah, you probably need a separate register class, similar to what RVV does for each LMUL. In your case it’s akin to the VRM4 register class.
As for codegen, having custom intrinsics in LLVM IR is a good way to start (ultimately you probably want to teach ISel how to recognize specific patterns and generate your instructions). You probably also want a custom RISCVISD node as an intermediate that is translated directly from the IR intrinsics before being ISel-ed into your machine instruction. Having a custom ISD node also makes it easier to write the ISel pattern.
Well, actually I don’t want to use vector registers. Could I internally assign groups of registers to manage dependencies among the intrinsic calls and then just keep the first one when lowering to RISC-V?
Also, I do not see a straightforward way of detecting the dependencies with the existing code. Perhaps I’ll need my own analysis and transformation passes?
I said something similar to how RVV groups registers, not that you should use vector registers.
Basically, this is how it works, using LMUL=4 (grouping 4 registers at a time) as an example: we first have the normal vector registers like V1, V2, V3, etc. Then there are special RegisterWithSubRegs registers: for example, V1M4 is a register whose subregisters are V1 ~ V4. Similarly, V2M4 is a register whose subregisters are V2 ~ V5 (you can enumerate all of these easily with TableGen, so you don’t have to do it manually).
Lastly, there is a register class VRM4 that contains all of V1M4, V2M4, V3M4, etc. An instruction’s register operand uses one of these special register classes (VRM4) instead of the normal vector register class. The idea is that when RA allocates a register for this operand, say V2M4, it effectively allocates four registers (V2 ~ V5) to it.
What I suggested in my previous comment is doing this trick, but on scalar registers.
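As a rough model of that trick applied to scalar registers, the Python sketch below enumerates groups of four consecutive registers and checks whether two groups share a subregister, which is the overlap information the register allocator needs for dependency tracking. The group names (`X2G4` and so on) and both helpers are hypothetical, and the sketch does not filter out reserved registers such as x0, which a real register class would probably have to do.

```python
GROUP_SIZE = 4   # consecutive registers per group, as in the LMUL=4 example
NUM_GPRS = 32    # RISC-V integer registers x0..x31


def make_groups(size=GROUP_SIZE, num_regs=NUM_GPRS):
    """Map each group name to the tuple of scalar subregisters it covers."""
    return {
        f"X{start}G{size}": tuple(f"X{start + i}" for i in range(size))
        for start in range(num_regs - size + 1)
    }


def overlaps(groups, a, b):
    """True if groups a and b share at least one scalar subregister."""
    return bool(set(groups[a]) & set(groups[b]))


groups = make_groups()
print(groups["X2G4"])                    # subregisters covered by one group
print(overlaps(groups, "X2G4", "X5G4"))  # both contain X5
print(overlaps(groups, "X2G4", "X6G4"))  # disjoint groups
```

Allocating one group name to an operand then stands for allocating all four scalar registers, so no separate analysis pass is needed just to know that two grouped operands conflict.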
I’m not 100% sure what you meant by dependencies here, but I think the trick I described earlier can do it.
Oh okay I see, sorry for the confusion
and thanks for the answer
Just in case this is useful for someone, this was the final implementation (ignore the test btw, it’s not functional and it just accidentally slipped into the PR)