[RFC] Implementing a general purpose 64-bit target (RISC-V 64-bit) with i64 as the only legal integer type

# Purpose of this RFC
This RFC describes the challenges of modelling the 64-bit RISC-V target (RV64)
and details the two most obvious implementation choices:
1) Having i64 as the only legal integer type
2) Introducing i32 subregisters

I've worked on implementing both approaches and fleshed out a pretty complete
implementation of 1), which is my preferred option. With this RFC, I would
welcome further feedback and insight, as well as suggestions or comments on
the target-independent modifications (e.g. TargetInstrInfo hooks) I suggest as
worthwhile.

# Background: RV64
The RISC-V instruction set is structured as a set of bases (RV32I, RV32E,
RV64I, RV128I) with a series of optional extensions (e.g. M for
multiply/divide, A for atomics, F+D for single+double precision floating
point). It's important to note that RV64I is not just RV32I with some
additional instructions, it's a completely different base where operations
work on 64-bit rather than 32-bit values. RV64I also introduces 10 new
instructions: ld/sd (64-bit load/store), addiw, slliw, srliw, sraiw, addw,
subw, sllw, srlw, sraw. The `*W` instructions all produce a sign-extended
result and take the lower 32-bits of their operands as inputs. Unlike MIPS64,
there is no requirement that inputs to these `*W` are sign-extended in order
to avoid unpredictable behaviour.
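
As a rough illustration of the `*W` semantics described above, the following C++ sketch models ADDW and SLLW on 64-bit register values. The helper names (`sext32`, `addw`, `sllw`, `check_w_semantics`) are made up for this example and are not taken from the backend; the narrowing casts assume the usual two's-complement wrap-around.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low 32 bits of a 64-bit register value.
// (Assumes two's-complement wrap-around on the narrowing cast.)
constexpr int64_t sext32(uint64_t x) {
  return static_cast<int64_t>(static_cast<int32_t>(static_cast<uint32_t>(x)));
}

// ADDW: add the low 32 bits of both operands, sign-extend the 32-bit sum.
constexpr int64_t addw(uint64_t rs1, uint64_t rs2) {
  return sext32(static_cast<uint32_t>(rs1) + static_cast<uint32_t>(rs2));
}

// SLLW: shift the low 32 bits of rs1 left by the low 5 bits of rs2,
// then sign-extend the 32-bit result.
constexpr int64_t sllw(uint64_t rs1, uint64_t rs2) {
  return sext32(static_cast<uint32_t>(rs1) << (rs2 & 31));
}

void check_w_semantics() {
  // The upper 32 bits of the inputs are ignored, so they need not be
  // sign-extended (or otherwise well-formed) beforehand.
  assert(addw(0xdeadbeef00000001ULL, 0x1234567800000002ULL) == 3);
  // 0x7fffffff + 1 wraps to 0x80000000, which sign-extends to a negative i64.
  assert(addw(0x7fffffffULL, 1) == -2147483648LL);
  // Only the low 5 bits of the shift amount are used: 33 behaves like 1.
  assert(sllw(1, 33) == 2);
}
```

The first assertion is the point of the comparison with MIPS64: garbage in the upper half of either input does not affect the result.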

# Background: RISC-V backend implementation.
Other backends aiming to support both 32-bit and 64-bit architecture variants
handle this by defining two versions of each instruction with overlapping
encodings, with one marked as isCodeGenOnly. This leads to unwanted
duplication, both in terms of tablegen descriptions and throughout the C++
implementation of the backend (e.g. any code checking for RISCV::ADD would
also want to check for RISCV::ADD64). Fortunately we can avoid this thanks to
the work Krzysztof Parzyszek contributed to support variable-sized register
classes <http://lists.llvm.org/pipermail/llvm-dev/2016-September/105027.html>.
The in-tree RISC-V backend exploits this, parameterising the base instruction
definitions by XLEN (the size of the general purpose registers).

# Option 1: Have i64 as the only legal type
## Approach
Every register class in RISCVRegisterInfo.td is parameterised by XLenVT, which
is i32 for RV32 and i64 for RV64. No subregisters are defined, meaning i32 is
not a legal type. Patterns for the `*W` instructions tend to look something
like:

    def : Pat<(sext_inreg (add GPR:$rs1, GPR:$rs2), i32),
              (ADDW GPR:$rs1, GPR:$rs2)>;

Essentially all patterns for RV32I are also valid for RV64I.
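
To see why a pattern of this shape is sound, here is a small C++ sketch (not backend code) checking that a full 64-bit add followed by `sext_inreg ..., i32` matches a model of ADDW. All function names are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Model of (sext_inreg x, i32): keep the low 32 bits and sign-extend them.
constexpr int64_t sext_inreg_i32(uint64_t x) {
  return static_cast<int64_t>(static_cast<int32_t>(static_cast<uint32_t>(x)));
}

// Model of the generic i64 add node.
constexpr uint64_t add64(uint64_t a, uint64_t b) { return a + b; }

// Model of ADDW: 32-bit add of the low halves, sign-extended to 64 bits.
constexpr int64_t addw(uint64_t rs1, uint64_t rs2) {
  return static_cast<int32_t>(static_cast<uint32_t>(rs1) +
                              static_cast<uint32_t>(rs2));
}

// True when the left-hand side of the pattern and ADDW agree.
constexpr bool pattern_holds(uint64_t a, uint64_t b) {
  return sext_inreg_i32(add64(a, b)) == addw(a, b);
}

void check_pattern() {
  assert(pattern_holds(1, 2));
  assert(pattern_holds(0x7fffffffULL, 1));                      // 32-bit overflow
  assert(pattern_holds(0xffffffff00000000ULL, 0x1ffffffffULL)); // dirty upper bits
}
```

The equivalence holds because the low 32 bits of a 64-bit sum depend only on the low 32 bits of the inputs.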

## Changes needed
* Introduction of new patterns, RV64I-specific immediate materialisation

* A number of SelectionDAG nodes generated from LLVM intrinsics take i32
arguments and the DAG legalizer doesn't currently know how to legalize them.
Promoting these arguments is trivial but requires additions to
LegalizeIntegerTypes.cpp. So far I've had to do this for
frameaddr/returnaddr/prefetch, but there are likely more.

* The shift amount type is i64. If the shift amount operand is smaller than
this, SelectionDAGBuilder will zero-extend it (changed from any-extend in
rL125457). i32->i64 zero-extension is more expensive than sign-extension, but
it's unnecessary anyway as only the lower 6 bits are used. Introduce
TargetLowering::getExtendForShiftAmount which is called during
SelectionDAGBuilder::visitShift.

* When promoting setcc operands, DAGTypeLegalizer::PromoteSetCCOperands makes
the arbitrary choice to zero-extend. It is cheaper to sign-extend from i32 to
i64, so introduce TargetLowering::isSExtCheaperThanZExt(FromTy, ToTy). For now
this is only used through PromoteSetCCOperands, but perhaps there are other
cases where it would be useful?

* When 32-bit srl is legalized, the dag combiner will try to reduce the bits
in the mask in: (srl (and val, 0xffffffff), imm) based on the knowledge of the
lower bits that will be shifted out. This means a tablegen pattern matching
0xffffffff won't work. Custom selection code in RISCVDAGToDAGISel can recognize
when this has happened and produce SRLIW.

* New i64 versions of the target-specific intrinsics added to aid the lowering
of part-word atomicrmw must be defined.

* RV64F (single-precision floating point) requires a little extra work due to
the fact i32 is not a legal type. When call lowering happens post-legalisation
(e.g. when an intrinsic was inserted during legalisation), a bitcast from f32
to i32 can't be introduced. There's a similar challenge for RV32D. Introduce
target-specific DAG nodes that perform bitcast+sext for f32->i64 and
trunc+bitcast for i64->f32. Custom-lower ISD::BITCAST to ensure these nodes
are selected.
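
Several of the items above rest on small arithmetic facts that can be checked outside the compiler. The C++ sketch below models them on plain integers; every function name is hypothetical, `float` is assumed to be IEEE-754 binary32, and none of this is the actual SelectionDAG code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Two ways of widening an i32 value into a 64-bit register.
constexpr uint64_t zext32(uint32_t x) { return x; }
constexpr uint64_t sext32(uint32_t x) {
  return static_cast<uint64_t>(static_cast<int64_t>(static_cast<int32_t>(x)));
}

// RV64 SLL only reads the low 6 bits of the shift-amount register.
constexpr uint64_t sll(uint64_t rs1, uint64_t rs2) { return rs1 << (rs2 & 63); }

// Shift amounts: zero- and sign-extending the i32 amount give the same shift.
constexpr bool shift_ext_irrelevant(uint64_t v, uint32_t amt) {
  return sll(v, zext32(amt)) == sll(v, sext32(amt));
}

// setcc: an unsigned i32 compare survives sign-extension of both operands.
constexpr bool ult_agrees(uint32_t a, uint32_t b) {
  return (a < b) == (sext32(a) < sext32(b));
}

// SRLIW: logical right shift of the low 32 bits, result sign-extended.
constexpr uint64_t srliw(uint64_t rs1, unsigned shamt) {
  return sext32(static_cast<uint32_t>(rs1) >> (shamt & 31));
}

// srl: 0xffffffff with the low `imm` bits cleared, as the combiner may leave it.
constexpr uint64_t shrunk_mask(unsigned imm) {
  return 0xffffffffULL & ~((1ULL << imm) - 1);
}

// For 1 <= imm <= 31, (srl (and val, shrunk_mask), imm) still equals SRLIW.
constexpr bool srl_matches_srliw(uint64_t val, unsigned imm) {
  return ((val & shrunk_mask(imm)) >> imm) == srliw(val, imm);
}

// f32 -> i64: bitcast to 32 bits, then sign-extend into the 64-bit register.
inline uint64_t f32_to_i64(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  return sext32(bits);
}

// i64 -> f32: truncate to the low 32 bits, then bitcast.
inline float i64_to_f32(uint64_t x) {
  const uint32_t bits = static_cast<uint32_t>(x);
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}
```

For example, `shrunk_mask(4)` is `0xfffffff0`, which is why a pattern that looks for the literal `0xffffffff` no longer matches, and `f32_to_i64(-1.5f)` yields `0xffffffffbfc00000`, which `i64_to_f32` maps back to `-1.5f`.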

## Questions
Does anyone have any reservations about this approach of having i64 as the
only legal type?

Some of the target hooks could perhaps be replaced with more heroics in the
backend. What are people's feelings here?

# Option 2: Model 32-bit subregs
## Approach
Define 32-bit subregisters for the GPRs that can be used in patterns and
instruction definitions. The following node types are potentially useful:
* `EXTRACT_SUBREG`: Supports getting the lower 32-bits of a 64-bit register
* `INSERT_SUBREG`: Assumes only the lower bits are modified. Can be used with
`IMPLICIT_DEF` to indicate that the upper bits are undefined. You can't
directly represent sign-extension, but you can do what Mips64 does and define
extra patterns to catch redundant sign-extension after one of the `*W`
instructions.
* `SUBREG_TO_REG`: a constant argument asserts the value of the bits left in
the upper portion of the register. This is perfect for zero-extension, and not
much good for the sign-extension RISC-V performs.

You end up with patterns like:

    def : Pat<(anyext GPR32:$reg),
              (SUBREG_TO_REG (i64 0), GPR32:$reg, sub_32)>;
    def : Pat<(trunc GPR:$reg), (EXTRACT_SUBREG GPR:$reg, sub_32)>;

    def : Pat<(add GPR32:$src, GPR32:$src2),
              (ADDW GPR32:$src, GPR32:$src2)>;

    def : Pat<(add GPR32:$rs1, simm12_i32:$imm12),
              (ADDIW GPR32:$rs1, simm12_i32:$imm12)>;

## Changes needed
* 32-bit subregisters must be defined. Some register classes need GPR32
versions, e.g. GPR, GPRNoX0, GPRC.

* The RISCVAsmParser and RISCVDisassembler must be modified to support the new
register classes used for the 32-bit subregs.

* The calling convention implementation must handle promotion of i32
arguments/returns to i64.

* The `*W` instructions must be defined using GPR32.

* New `Operand<i32>` types must be defined and used in the `*W` instructions.

* When defining a variable-sized register class you specify a DefaultMode.
This must be set to i64 to avoid breaking RV32 compilation.

* This gives enough to define working support for the `*W` operations, but to
enable codegen for the other integer instructions requires either duplication
or smarts. To write patterns using i32 you need to define a new variant of the
instruction. TableGen changes might remove the need for this. Even with such
support, it's not particularly desirable to write a bunch of new patterns for
instructions other than the `*W` ones.

I'm sure solutions are possible, but given that the i64-only approach
seems to work very well, I'm not sure it's worth pushing further.

# Conclusion
Taking full advantage of support for variable-sized register classes and
sticking with i64 as the only legal integer type seems very workable and is
definitely my preference based on the work I've done. I'd be really interested
if anyone has any particular concerns or advice, or feedback on the suggested
new target hooks.

Best,

Alex Bradbury, lowRISC CIC

Having i64 as the only legal integer type seems fine for a target that doesn't have architectural names for the 32-bit sub-registers. (For targets where the 32-bit registers have different names, you would run into issues with constructs like inline asm.)

The target-independent changes you've listed seem minor, and custom-lowering float<->int bitcasts is done on a lot of targets.

-Eli

Only having i64 seems cleaner to me. Of course you can still have i32 in the code up until legalisation.

I think the only real downside is you can end up with 64 bit arithmetic on things that are actually 32 bit, followed by a sext? That can be cleaned up to a *w instruction in most cases, and already is.

Example:

----------- ops.c

int add(int a, int b){return a+b;}
int sub(int a, int b){return a-b;}
int mul(int a, int b){return a*b;}
int div(int a, int b){return a/b;}

unsigned addu(unsigned a, unsigned b){return a+b;}
unsigned subu(unsigned a, unsigned b){return a-b;}
unsigned mulu(unsigned a, unsigned b){return a*b;}
unsigned divu(unsigned a, unsigned b){return a/b;}

Now rebased to ToT, as of now.

All that mess in divu is the same as is generated from:

long foo(){
return 0x00000000ffffffffl;
}

0000000000000000 :
0: 00000537 lui a0,0x0
4: 0005059b sext.w a1,a0
8: 1582 slli a1,a1,0x20
c: 1502 slli a0,a0,0x20
e: 9101 srli a0,a0,0x20
10: 8d4d or a0,a0,a1
12: 8082 ret

For sure that’s not the best way to generate that constant!

Definitely not. That pattern was a placeholder just to produce
something correct. The list of changes described in the RFC describes
the work implemented to end up with mostly reasonable-looking codegen.
I'm hoping to start posting these to phabricator later today.

That constant takes 3 instructions with smarter 64-bit immediate
materialisation. For zext i32 -> i64 you'd prefer to perform two
shifts, unless you can CSE the mask.

Best,

Alex

li a0,-1
srli a0,a0,0x20

… works for me. Both 16 bit instructions. And similar for any other sequence of 0s in hi bits followed by 1s in lo.
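
A quick way to convince yourself of this is to model the two instructions on a 64-bit integer. This is only a sketch; `li_minus_one`, `srli` and `low_ones_mask` are names invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// li rd, -1: all 64 bits set.
constexpr uint64_t li_minus_one() { return ~0ULL; }

// srli rd, rs1, shamt: logical right shift, shamt taken modulo 64 on RV64.
constexpr uint64_t srli(uint64_t rs1, unsigned shamt) {
  return rs1 >> (shamt & 63);
}

// A run of k ones in the low bits (1 <= k <= 64) with zeros above:
//   li   a0, -1
//   srli a0, a0, 64 - k
constexpr uint64_t low_ones_mask(unsigned k) {
  return srli(li_minus_one(), 64 - k);
}

void check_low_ones() {
  assert(low_ones_mask(32) == 0x00000000ffffffffULL); // the constant from foo()
  assert(low_ones_mask(1) == 1);
  assert(low_ones_mask(12) == 0xfffULL);
}
```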

And indeed, yes, the divu() as a whole would be better as:

slli a0,a0,0x20
slli a1,a1,0x20
srli a0,a0,0x20
srli a1,a1,0x20
divu a0,a0,a1
sext.w a0,a0
ret

(scheduled for a dual-issue machine. Would be different for a machine with macro-op fusion)
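
The sequence can be sanity-checked with a small C++ model of the instructions involved. The function names are illustrative only, and the result is compared against what a 32-bit unsigned divide should produce once sign-extended into the return register.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t slli(uint64_t r, unsigned s) { return r << (s & 63); }
constexpr uint64_t srli(uint64_t r, unsigned s) { return r >> (s & 63); }

// RV64 DIVU: unsigned 64-bit divide; dividing by zero yields all ones.
constexpr uint64_t divu(uint64_t a, uint64_t b) {
  return b == 0 ? ~0ULL : a / b;
}

// sext.w: sign-extend the low 32 bits.
constexpr uint64_t sext_w(uint64_t r) {
  return static_cast<uint64_t>(
      static_cast<int64_t>(static_cast<int32_t>(static_cast<uint32_t>(r))));
}

// The suggested lowering of: unsigned divu(unsigned a, unsigned b)
inline uint64_t divu32_sequence(uint64_t a0, uint64_t a1) {
  a0 = slli(a0, 32);   // slli a0,a0,0x20
  a1 = slli(a1, 32);   // slli a1,a1,0x20
  a0 = srli(a0, 32);   // srli a0,a0,0x20  (a0 is now zero-extended)
  a1 = srli(a1, 32);   // srli a1,a1,0x20  (a1 is now zero-extended)
  a0 = divu(a0, a1);   // divu a0,a0,a1
  return sext_w(a0);   // sext.w a0,a0
}

void check_divu32() {
  // Upper bits of the incoming registers do not matter.
  assert(divu32_sequence(0xdeadbeeffffffffeULL, 2) == 0x7fffffffULL);
  // A quotient with bit 31 set comes back sign-extended.
  assert(divu32_sequence(0xffffffffULL, 1) == 0xffffffffffffffffULL);
}
```

The shift pairs are the two-shift zero-extension mentioned earlier in the thread; they replace the materialised `0xffffffff` mask entirely.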

Really looking forward to 64 bit in upstream!

Hi,

I too think option #1 is workable. Similar to what Bruce mentioned, in our downstream riscv64-unknown-linux-gnu toolchain, we also found a few cases involving truncations where i32 is currently poorly handled, but nothing that seems impossible to fix.

Kind regards,
Roger

Message from Bruce Hoult via llvm-dev <llvm-dev@lists.llvm.org> on Thu, 4 Oct 2018 at 9:48:

Hi Alex,

I don't have anything to add with respect to the base instruction
sets, but you asked me to comment on possible interactions with the
vector extension. For context, like many SIMD instruction sets, RVV
supports sub-GPR data widths. That is, the vector unit can operate on
elements that are 8, 16, 32, or 64 bits wide (the last one only on
RV64I or RV32IFD) and vectors of i8, i16, etc. intuitively should be
legal.

That raises questions about whether the corresponding scalar types
should be legal too, not just because code will be inserting and
extracting vector elements, but also because RVV directly supports
scalar operations of the same element widths too. For example, if you
have a vector register holding 8 bit elements, you can also use it to
do scalar 8 bit integer operations (taking your inputs from lane 0 and
broadcasting the result back into all lanes). So in the sense of what
operations are supported in hardware, arguably all of i8, i16, i32 and
i64 could be legal on RV64IV. In reality that's probably unacceptable
because it would make lots of purely scalar code enable the vector
unit and move scalars back and forth between vector registers and
GPRs, instead of just legalizing the smaller integers in GPRs. There are
also other complications relating to how the element width of the
vector registers is determined (this part of the spec is still in flux
currently).

I have not done any significant work on these sub-XLEN vector element
widths yet, partly because of the aforementioned details still being
in flux. Still, considering the severe disadvantages of "i8, i16, i32,
i64 legal", I am leaning towards keeping XLenVT as the only legal
integer type in RVV. The ensuing legalization of smaller scalars is a
bit of an obstacle if some scalar operations should happen in the
vector register file instead of in GPRs, but that seems like the
least bad option. It's not great if we have to reverse engineer from
legalization output that something was e.g. an 8 bit ISD::MULHU to be
able to emit an 8 bit "scalar vmulhu" instruction, but it beats the
above alternative.
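
To make the "reverse engineering" concrete, here is a C++ sketch of what an 8 bit MULHU computes next to one plausible shape it could take after promotion to a 64-bit GPR. The promoted form is an assumption for illustration, not actual legalizer output, and both function names are invented.

```cpp
#include <cassert>
#include <cstdint>

// Reference semantics: high 8 bits of the unsigned 8x8 -> 16 bit product.
constexpr uint8_t mulhu8(uint8_t a, uint8_t b) {
  return static_cast<uint8_t>(
      (static_cast<unsigned>(a) * static_cast<unsigned>(b)) >> 8);
}

// Assumed promoted form in XLEN-wide registers: mask both inputs to 8 bits
// (zero-extension), do a full multiply, then shift the product right by 8.
constexpr uint64_t mulhu8_promoted(uint64_t a, uint64_t b) {
  return ((a & 0xff) * (b & 0xff)) >> 8;
}

void check_mulhu8() {
  assert(mulhu8(255, 255) == 254);
  // Same answer from the promoted form, even with junk above bit 7.
  assert(mulhu8_promoted(0xabcdffULL, 0x1ffULL) == 254);
}
```

A backend wanting the vector-unit instruction would have to recognise the and/mul/srl shape and recover that it was originally an 8 bit operation.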

So in summary, I don't know for sure but it looks like we'll want to
stick with "only i32 is legal for RV32, only i64 is legal for RV64"
despite RVV bringing in vectors of i8, i16, etc. and even native
support for the corresponding scalars.

Cheers,
Robin