[RFC] Half-Precision Support in the Arm Backends

Hi,

I am working on C/C++ language support for the Armv8.2-A half-precision
instructions. I’ve added support for _Float16 as a new source language type to
Clang. _Float16 is a C11 extension type for which arithmetic is well defined, as
opposed to e.g. __fp16 which is a storage-only type. I then fixed up the
AArch64 backend, which was mostly straightforward: this involved making
operations on f16 legal when FullFP16 is supported, thus avoiding promotions to
f32. This enables generation of AArch64 FP16 instructions from C/C++. For
AArch64, this work is finished and does not show problems in our testing; Solid
Sands provided us with beta versions of their FP16 extension to SuperTest,
their C/C++ language conformance test suite. However, as more testing can
always be done, and there are not many code bases using _Float16, I
would be interested in more testing/feedback.
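
To illustrate the difference between an arithmetic half type and a storage-only
one, here is a minimal portable C sketch. It deliberately does not use _Float16
or __fp16 themselves, so it builds on any host; instead a hypothetical helper,
round_to_f16, rounds a float to the nearest half-precision value, and the two
functions model where that rounding happens. The assumptions are: float is IEEE
754 single precision, rounding is to nearest with ties to even, all values stay
within the normal half-precision range, and the target evaluates _Float16
operations directly in half precision (as it would with FullFP16).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not an LLVM or libc function): round a float to the
   nearest value representable in IEEE 754 binary16, ties to even.
   Assumption: x is finite and within the normal half-precision range. */
static float round_to_f16(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint32_t lsb = (bits >> 13) & 1u;  /* lowest mantissa bit that is kept */
    bits += 0x0FFFu + lsb;             /* round to nearest, ties to even */
    bits &= 0xFFFFE000u;               /* drop the 13 bits binary16 lacks */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Models an arithmetic half type: every operation rounds to half precision. */
static float sum_arithmetic_half(float a, float b, float c) {
    return round_to_f16(round_to_f16(a + b) + c);
}

/* Models a storage-only half type: operands are promoted to float, the
   arithmetic is done in float, and rounding happens only on the store. */
static float sum_storage_only_half(float a, float b, float c) {
    return round_to_f16((a + b) + c);
}
```

With a = 2048, b = 1 and c = 1, sum_arithmetic_half returns 2048, because 2049
is not representable in half precision and the tie rounds to even, whereas
sum_storage_only_half returns 2050.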

This RFC is thus a quick status update on the AArch64 implementation, but is
mainly about the AArch32 implementation in the ARM backend, which is a lot more
interesting than AArch64 for a number of reasons. Most importantly because
there is no soft-float ABI for AArch64 and it has half-precision H-registers,
which is all very different for AArch32. So it’s the different combinations, like
soft float, softfp (FP support, but arguments passed in integer registers),
hard float, hard float with FP16, and hard float with FullFP16, that make things
interesting.

My AArch32 implementation in the ARM backend is nearly complete and I am
working on fixing a handful of regression tests (the WIP diff can be found
here: https://reviews.llvm.org/D38315). My approach to handle f16 types should
not lead to any codegen differences for existing tests, but the way half types
are handled and legalized is totally different in some cases and from that
point of view the changes could be considered intrusive. Thus, this is a heads
up, and below I will discuss the approach and some implementation decisions,
for which feedback is welcome of course.

Half-Precision RegisterClass

Technically, there are no f16 load/store instructions, yes, but we can use NEON vld1 and vst1 to get something roughly equivalent, right?

You probably want to custom-lower BITCAST instructions; the generic sequence emitted by the legalizer is pretty inefficient in most cases.

Overall, I think your approach makes sense.

-Eli

Thanks a lot for the suggestions! I will look into using vld1/vst1, sounds good.

I am custom lowering the bitcasts; that’s now the only place where FP_TO_FP16
and FP16_TO_FP nodes are created, to avoid inefficient code generation. I will
double check whether I can achieve the same without using these nodes (because I
would really like to get rid of them completely).

Cheers,

Sjoerd.

>> Custom Lowering
>> -------------------------
>> Making f16 legal and not having native load/store instructions available
>> (no FullFP16 support) means custom lowering loads/stores:
>> 1) Since we don't have FP16 load/store instructions available, we create
>>    integer half-word loads. I unfortunately need the FP16_TO_FP node here,
>>    because that "models" creating an integer value, which is what we need
>>    to create a "truncating i16" integer load instruction. Instead of using
>>    FP16_TO_FP, I have tried BITCASTs, but this can lead to code generation
>>    to stack loads/stores, which I don't want.
>> 2) Custom lowering f16 stores is very similar, and creates truncating
>>    half-word integer stores.
>
>Technically, there are no f16 load/store instructions, yes, but we can
>use NEON vld1 and vst1 to get something roughly equivalent, right?
>
>You probably want to custom-lower BITCAST instructions; the generic 
>sequence emitted by the legalizer is pretty inefficient in most cases.
>
>---
>
>Overall, I think your approach makes sense.
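
The load lowering described in the quoted text above can be modelled in
portable C as follows. This is only a sketch of the semantics, not LLVM code:
fp16_bits_to_float and load_f16 are hypothetical names, the first standing in
for a conversion from an integer that carries the half bit pattern (the role the
FP16_TO_FP node plays), the second for an integer half-word load followed by
that conversion. It assumes float is IEEE 754 single precision; the result is
returned as float because float can hold every half value exactly.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert IEEE 754 binary16 bits, held in an integer,
   to the float with the same value. */
static float fp16_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {                /* infinity or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {             /* normal: rebias exponent 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {             /* signed zero */
        bits = sign;
    } else {                           /* subnormal: normalise the mantissa */
        uint32_t e = 113;
        while ((man & 0x400u) == 0) {
            man <<= 1;
            e--;
        }
        bits = sign | (e << 23) | ((man & 0x3FFu) << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical model of the lowered f16 load: a 16-bit integer load
   followed by the conversion above. */
static float load_f16(const void *p) {
    uint16_t h;
    memcpy(&h, p, sizeof h);           /* integer half-word load */
    return fp16_bits_to_float(h);
}
```

For example, loading the half-word 0x3C00 yields 1.0 and 0xC000 yields -2.0.
The store direction would be the mirror image: convert to the 16-bit pattern,
then do a truncating half-word integer store.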


I would like to revive this thread, as I am struggling a lot with the FP16
implementation in the ARM backend. My implementation in
https://reviews.llvm.org/D38315 is finished (except one case), but a more
robust alternative implementation was suggested. One can indeed argue that my
current implementation is a bit fragile, because it involves manually patching
up the ISel DAGs for a few cases. The suggestion was to look into CCState and
adjusting the calling convention lowering, inspired by a recent discussion
on the list here: http://lists.llvm.org/pipermail/llvm-dev/2018-January/120098.html.
The benefit of this approach is that I would get most of the legalization for
free, which is the fragile bit in my approach at the moment.
Anyway, it’s become a long story, but (almost) in chronological order this is
what I’ve considered. Any hints, tips, suggestions welcome.

Goal of my exercise:

Hi Sjoerd,

For ISel, I think having a separate register class will give you less of a headache. I am wondering if you could get away with not touching the instruction descriptions at all, instead defining external patterns for the FullFP16 case, like so:

def VCVTBHS: ASuI<0b11101, 0b11, 0b0010, 0b01, 0, (outs SPR:$Sd), (ins SPR:$Sm),
                  IIC_fpCVTSH, "vcvtb", ".f32.f16\t$Sd, $Sm",
                  []>,
             Requires<[HasFP16]>,
             Sched<[WriteFPCVT]>;

def : FP16Pat<(f16_to_fp GPR:$a),
              (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>;

def : FullFP16Pat<(f32 (fpextend HPR:$Sm)),
                  (VCVTBHS (COPY_TO_REGCLASS HPR:$Sm, SPR))>;

I’m not sure of the COPY_TO_REGCLASS semantics, but I would (dangerously) assume that when it comes to copying the values between registers, it will be noticed that HPR and SPR actually alias each other, so no copy is needed. I hope this approach would allow for a clean separation of the FP16 and FullFP16 implementations and remove the need to manually type cast each register access.

cheers,
sam

Hi Sam,

Thanks for the suggestions! I can confirm that this works, so this is indeed the
most elegant way to separate the FP16 and FullFP16 rules.
This was the last piece of the puzzle. I can now abandon my old approach,
and will go for the CCState and this tablegen separation approach; thus we get
most of the legalization for free and it is more robust than custom lowering
loads/stores (and some other corner cases).

Thanks,

Sjoerd.