# Inlining mathematical functions

Hi, I want suggestions/comments on my work.

## Summary

The Arm A64 instruction set has instructions dedicated to some mathematical functions. I want to improve performance by inlining mathematical functions using these instructions. My primitive work was posted as D142859 ([PoC][AArch64] Inline math function (SVE sin/cos)). I want comments on this feature and its implementation.

## Motivation

Inlining mathematical functions like `sin`, `exp`, etc. can improve performance in the following ways (especially when they are called in a loop):

- Utilize optimal dedicated instructions based on the target architecture features (SVE, NEON) (this can also be achieved by dedicated math libraries)
- Vectorize loops which include mathematical function calls (this can also be achieved by vectorized math libraries and compiler support, e.g. D134719)
- Eliminate function call overhead
- Schedule instructions in the caller and callee collectively
- Better software pipelining (in the future)
- Increase the candidate fission points in loop fission (in the future)

## Implementation

My primitive work was posted as D142859 ([PoC][AArch64] Inline math function (SVE sin/cos)) for reference. The patch is based on our downstream compiler, and the porting is still incomplete.

In a new codegen IR pass, general intrinsics like `llvm.sin.*` are replaced with AArch64-specific intrinsics like `llvm.aarch64.sin.*`. In addition, constant tables are added to the IR if the expanded instructions need constants. In SelectionDAG, the intrinsics are expanded to instructions.
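As a rough sketch of what the pass does (the scalable-vector signatures here are my assumptions; the intrinsic and table names follow the posted patch):

```llvm
; before: generic intrinsic carrying the afn fast-math flag
%r = call afn <vscale x 2 x double> @llvm.sin.nxv2f64(<vscale x 2 x double> %x)

; after: AArch64-specific intrinsic; the pass also emits the constant
; table (e.g. @.llvm.sin.nxv2f64.tbl) into the module, which the
; SelectionDAG expansion of the intrinsic then references
%r = call afn <vscale x 2 x double> @llvm.aarch64.sin.nxv2f64(<vscale x 2 x double> %x)
```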

The posted patch supports only the SVE versions of `sin` and `cos`. I want to support other mathematical functions, and possibly NEON versions, if the first patch is accepted.

Support in GlobalISel will be future work.

A simple `for (i=0; i<1024; i++) b[i] = sin(a[i]);` loop will be expanded to the following code. `ftsmul`, `ftmad`, and `ftssel` are instructions dedicated to trigonometric functions.

```
    .text
    .file   "loop.c"
    .globl  foo
    .p2align    2
    .type   foo,@function
foo:
    adrp    x8, .llvm.sin.nxv2f64.tbl
    add     x8, x8, :lo12:.llvm.sin.nxv2f64.tbl
    ptrue   p0.d
    mov     z7.d, #0
    ld1rd   { z0.d }, p0/z, [x8, #8]
    ld1rd   { z1.d }, p0/z, [x8, #48]
    ld1rd   { z2.d }, p0/z, [x8, #24]
    ld1rd   { z3.d }, p0/z, [x8, #32]
    ld1rd   { z4.d }, p0/z, [x8, #40]
    ld1rd   { z5.d }, p0/z, [x8]
    ld1rd   { z6.d }, p0/z, [x8, #56]
    mov     x8, xzr
.LBB0_1:
    ld1d    { z16.d }, p0/z, [x1, x8, lsl #3]
    mov     z17.d, z1.d
    mov     z18.d, z2.d
    mov     z19.d, z3.d
    mov     z21.d, z4.d
    facgt   p1.d, p0/z, z16.d, z5.d
    fmad    z17.d, p0/m, z16.d, z0.d
    fsub    z20.d, z17.d, z0.d
    fmsb    z18.d, p0/m, z20.d, z16.d
    fmsb    z19.d, p0/m, z20.d, z18.d
    mov     z18.d, z7.d
    fmsb    z21.d, p0/m, z20.d, z19.d
    ftsmul  z19.d, z21.d, z17.d
    ftssel  z17.d, z21.d, z17.d
    ftmad   z18.d, z18.d, z19.d, #7
    ftmad   z18.d, z18.d, z19.d, #6
    ftmad   z18.d, z18.d, z19.d, #5
    ftmad   z18.d, z18.d, z19.d, #4
    ftmad   z18.d, z18.d, z19.d, #3
    ftmad   z18.d, z18.d, z19.d, #2
    ftmad   z18.d, z18.d, z19.d, #1
    ftmad   z18.d, z18.d, z19.d, #0
    fmul    z17.d, z18.d, z17.d
    sel     z16.d, p1, z6.d, z17.d
    st1d    { z16.d }, p0, [x0, x8, lsl #3]
    add     x8, x8, #8
    cmp     x8, #1024
    b.ne    .LBB0_1
    ret
.Lfunc_end0:
    .size   foo, .Lfunc_end0-foo
    .type   .llvm.sin.nxv2f64.tbl,@object   // @.llvm.sin.nxv2f64.tbl
    .section    .rodata,"a",@progbits
    .p2align    4, 0x0
.llvm.sin.nxv2f64.tbl:
    .xword  4839422400168542208     // 0x43291508581d4000
    .xword  4843621399236968448     // 0x4338000000000000
    .xword  4843621399236968449     // 0x4338000000000001
    .xword  4609753056853098496     // 0x3ff921fb50000000
    .xword  4490388670355865600     // 0x3e5110b460000000
    .xword  4364452196894661639     // 0x3c91a62633145c07
    .xword  4603909380684499074     // 0x3fe45f306dc9c882
    .xword  9221120237041090560     // 0x7ff8000000000000
    .size   .llvm.sin.nxv2f64.tbl, 64
```

## Applicable conditions

This optimization will be applied under the following conditions.

- The target is AArch64.
- The optimization size level is not 0 (in Clang options, not `-O0`, `-Os`, or `-Oz`).
- The IR has any of the `llvm.{sin,cos,tan,exp,...}.*` intrinsics (in Clang options, `-fbuiltin -fno-rounding-math -fno-trapping-math -fno-math-errno`).
- The fast-math flag `afn` is attached to the intrinsic (in Clang options, `-fapprox-func`).
- An instruction pattern more efficient than `libm` exists for the function under the given architecture features.
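For the `afn` condition, what matters is the flag on the intrinsic call in the IR. A minimal example (scalar form) of a call the pass would accept:

```llvm
; emitted by Clang under -fno-math-errno -fapprox-func;
; without the afn flag, the call would be left to the usual libm lowering
%r = call afn double @llvm.sin.f64(double %x)
```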

## Questions

I’m not yet familiar with the code base, so I want suggestions/comments on the following issues.

- My implementation selects target intrinsics at the IR level and expands them in SelectionDAG. Is this reasonable?
- Are the applicable conditions reasonable, or should I add a new option?
- Any other suggestions/comments?