Inlining mathematical functions
Hi, I want suggestions/comments on my work.
Summary
The Arm A64 instruction set has instructions dedicated to some mathematical functions. I want to improve performance by inlining mathematical functions using these instructions. My initial work was posted as D142859 [PoC][AArch64] Inline math function (SVE sin/cos). I want comments on this feature and its implementation.
Motivation
Inlining mathematical functions like sin, exp, etc. can improve performance in the following ways (especially when the functions are called in a loop):
- Utilize optimal dedicated instructions based on the target architecture features (SVE, NEON) (this can also be achieved with dedicated math libraries)
- Vectorize loops that contain mathematical function calls (this can also be achieved with vectorized math libraries and compiler support, e.g. D134719)
- Eliminate function call overhead
- Schedule instructions in the caller and the callee collectively
- Better software pipelining (in the future)
- Increase the number of candidate fission points for loop fission (in the future)
Implementation
My initial work was posted as D142859 [PoC][AArch64] Inline math function (SVE sin/cos) for reference purposes. The patch is based on our downstream compiler, and the porting is still incomplete.
A new codegen IR pass replaces generic intrinsics like llvm.sin.* with AArch64-specific intrinsics like llvm.aarch64.sin.*. In addition, constant tables are added to the IR if the expanded instruction sequence needs constants. In SelectionDAG, the intrinsics are expanded into instructions.
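As a rough sketch, the rewrite performed by the pass would look like this. The generic intrinsic name and the afn flag are standard LLVM IR; the target-specific intrinsic name follows the patch description, and its exact signature is an assumption on my part:

```llvm
; before the pass: generic intrinsic carrying the afn fast-math flag
%r = call afn <vscale x 2 x double> @llvm.sin.nxv2f64(<vscale x 2 x double> %x)

; after the pass: AArch64-specific intrinsic (signature assumed), which
; SelectionDAG later expands into the ftsmul/ftmad/ftssel sequence
%r = call <vscale x 2 x double> @llvm.aarch64.sin.nxv2f64(<vscale x 2 x double> %x)
```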
The posted patch supports only the SVE versions of sin and cos. I want to support other mathematical functions, and possibly the NEON versions, if the first patch is accepted.
Support in GlobalISel will be a future work.
A simple for (i = 0; i < 1024; i++) b[i] = sin(a[i]); loop will be expanded into the following code. ftsmul, ftmad, and ftssel are instructions dedicated to trigonometric functions.
.text
.file "loop.c"
.globl foo
.p2align 2
.type foo,@function
foo:
adrp x8, .llvm.sin.nxv2f64.tbl
add x8, x8, :lo12:.llvm.sin.nxv2f64.tbl
ptrue p0.d
mov z7.d, #0
ld1rd { z0.d }, p0/z, [x8, #8]
ld1rd { z1.d }, p0/z, [x8, #48]
ld1rd { z2.d }, p0/z, [x8, #24]
ld1rd { z3.d }, p0/z, [x8, #32]
ld1rd { z4.d }, p0/z, [x8, #40]
ld1rd { z5.d }, p0/z, [x8]
ld1rd { z6.d }, p0/z, [x8, #56]
mov x8, xzr
.LBB0_1:
ld1d { z16.d }, p0/z, [x1, x8, lsl #3]
mov z17.d, z1.d
mov z18.d, z2.d
mov z19.d, z3.d
mov z21.d, z4.d
facgt p1.d, p0/z, z16.d, z5.d
fmad z17.d, p0/m, z16.d, z0.d
fsub z20.d, z17.d, z0.d
fmsb z18.d, p0/m, z20.d, z16.d
fmsb z19.d, p0/m, z20.d, z18.d
mov z18.d, z7.d
fmsb z21.d, p0/m, z20.d, z19.d
ftsmul z19.d, z21.d, z17.d
ftssel z17.d, z21.d, z17.d
ftmad z18.d, z18.d, z19.d, #7
ftmad z18.d, z18.d, z19.d, #6
ftmad z18.d, z18.d, z19.d, #5
ftmad z18.d, z18.d, z19.d, #4
ftmad z18.d, z18.d, z19.d, #3
ftmad z18.d, z18.d, z19.d, #2
ftmad z18.d, z18.d, z19.d, #1
ftmad z18.d, z18.d, z19.d, #0
fmul z17.d, z18.d, z17.d
sel z16.d, p1, z6.d, z17.d
st1d { z16.d }, p0, [x0, x8, lsl #3]
add x8, x8, #8
cmp x8, #1024
b.ne .LBB0_1
ret
.Lfunc_end0:
.size foo, .Lfunc_end0-foo
.type .llvm.sin.nxv2f64.tbl,@object // @.llvm.sin.nxv2f64.tbl
.section .rodata,"a",@progbits
.p2align 4, 0x0
.llvm.sin.nxv2f64.tbl:
.xword 4839422400168542208 // 0x43291508581d4000
.xword 4843621399236968448 // 0x4338000000000000
.xword 4843621399236968449 // 0x4338000000000001
.xword 4609753056853098496 // 0x3ff921fb50000000
.xword 4490388670355865600 // 0x3e5110b460000000
.xword 4364452196894661639 // 0x3c91a62633145c07
.xword 4603909380684499074 // 0x3fe45f306dc9c882
.xword 9221120237041090560 // 0x7ff8000000000000
.size .llvm.sin.nxv2f64.tbl, 64
Applicable conditions
This optimization will be applied under the following conditions.
- The target is AArch64.
- The optimization level is not 0 and the optimization-for-size level is 0. (in Clang options: none of -O0, -Os, or -Oz)
- The IR contains any of the llvm.{sin,cos,tan,exp,...}.* intrinsics. (in Clang options: -fbuiltin -fno-rounding-math -fno-trapping-math -fno-math-errno)
- The fast-math flag afn is attached to the intrinsic. (in Clang options: -fapprox-func)
- An instruction pattern more efficient than libm exists for the function under the given architecture features.
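To make the afn condition concrete: with -fapprox-func, Clang attaches the afn fast-math flag to the call site in the IR, so the intrinsic (scalar form shown) looks like:

```llvm
%r = call afn double @llvm.sin.f64(double %x)
```

Without afn, the pass would leave the call alone, since the inline expansion is an approximation.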
Questions
I’m not yet familiar with the code base, so I’d like suggestions/comments on the following points.
- My implementation selects target intrinsics at the IR level and expands them in SelectionDAG. Is this approach reasonable?
- Are the applicable conditions reasonable? Or should I add a new option?
- Any other suggestions/comments?