Inlining mathematical functions

Hi, I want suggestions/comments on my work.

Summary

The Arm A64 instruction set has instructions dedicated to some mathematical functions. I want to improve performance by inlining mathematical functions using these instructions. My preliminary work is posted as D142859, "[PoC][AArch64] Inline math function (SVE sin/cos)". I would like comments on both the feature and the implementation.

Motivation

Inlining mathematical functions such as sin and exp can improve performance in the following ways, especially when they are called in a loop.

  • Use the best dedicated instructions for the target architecture features (SVE, NEON)
    (can also be achieved by dedicated math libraries)
  • Vectorize loops that contain mathematical function calls
    (can also be achieved by vectorized math libraries with compiler support, e.g. D134719)
  • Eliminate function call overhead
  • Schedule the caller's and callee's instructions together
  • Better software pipelining (in the future)
  • More candidate fission points for loop fission (in the future)

Implementation

My preliminary work is posted as D142859, "[PoC][AArch64] Inline math function (SVE sin/cos)", for reference. The patch is based on our downstream compiler, and the port is still incomplete.

A new codegen IR pass replaces general intrinsics such as llvm.sin.* with AArch64-specific intrinsics such as llvm.aarch64.sin.*. If the expanded instructions need constants, the pass also adds constant tables to the IR. SelectionDAG then expands the AArch64-specific intrinsics into instructions.

The posted patch supports only the SVE versions of sin and cos. If the first patch is accepted, I want to support other mathematical functions and possibly NEON versions.

GlobalISel support is future work.

A function containing a simple loop, for (i=0; i<1024; i++) b[i] = sin(a[i]);, is expanded to the following code. ftsmul, ftmad, and ftssel are instructions dedicated to trigonometric functions.

	.text
	.file	"loop.c"
	.globl	foo
	.p2align	2
	.type	foo,@function
foo:
	adrp	x8, .llvm.sin.nxv2f64.tbl
	add	x8, x8, :lo12:.llvm.sin.nxv2f64.tbl
	ptrue	p0.d
	mov	z7.d, #0
	ld1rd	{ z0.d }, p0/z, [x8, #8]
	ld1rd	{ z1.d }, p0/z, [x8, #48]
	ld1rd	{ z2.d }, p0/z, [x8, #24]
	ld1rd	{ z3.d }, p0/z, [x8, #32]
	ld1rd	{ z4.d }, p0/z, [x8, #40]
	ld1rd	{ z5.d }, p0/z, [x8]
	ld1rd	{ z6.d }, p0/z, [x8, #56]
	mov	x8, xzr
.LBB0_1:
	ld1d	{ z16.d }, p0/z, [x1, x8, lsl #3]
	mov	z17.d, z1.d
	mov	z18.d, z2.d
	mov	z19.d, z3.d
	mov	z21.d, z4.d
	facgt	p1.d, p0/z, z16.d, z5.d
	fmad	z17.d, p0/m, z16.d, z0.d
	fsub	z20.d, z17.d, z0.d
	fmsb	z18.d, p0/m, z20.d, z16.d
	fmsb	z19.d, p0/m, z20.d, z18.d
	mov	z18.d, z7.d
	fmsb	z21.d, p0/m, z20.d, z19.d
	ftsmul	z19.d, z21.d, z17.d
	ftssel	z17.d, z21.d, z17.d
	ftmad	z18.d, z18.d, z19.d, #7
	ftmad	z18.d, z18.d, z19.d, #6
	ftmad	z18.d, z18.d, z19.d, #5
	ftmad	z18.d, z18.d, z19.d, #4
	ftmad	z18.d, z18.d, z19.d, #3
	ftmad	z18.d, z18.d, z19.d, #2
	ftmad	z18.d, z18.d, z19.d, #1
	ftmad	z18.d, z18.d, z19.d, #0
	fmul	z17.d, z18.d, z17.d
	sel	z16.d, p1, z6.d, z17.d
	st1d	{ z16.d }, p0, [x0, x8, lsl #3]
	add	x8, x8, #8
	cmp	x8, #1024
	b.ne	.LBB0_1
	ret
.Lfunc_end0:
	.size	foo, .Lfunc_end0-foo
	.type	.llvm.sin.nxv2f64.tbl,@object   // @.llvm.sin.nxv2f64.tbl
	.section	.rodata,"a",@progbits
	.p2align	4, 0x0
.llvm.sin.nxv2f64.tbl:
	.xword	4839422400168542208             // 0x43291508581d4000
	.xword	4843621399236968448             // 0x4338000000000000
	.xword	4843621399236968449             // 0x4338000000000001
	.xword	4609753056853098496             // 0x3ff921fb50000000
	.xword	4490388670355865600             // 0x3e5110b460000000
	.xword	4364452196894661639             // 0x3c91a62633145c07
	.xword	4603909380684499074             // 0x3fe45f306dc9c882
	.xword	9221120237041090560             // 0x7ff8000000000000
	.size	.llvm.sin.nxv2f64.tbl, 64
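
To make the first half of the loop body easier to follow, here is a scalar C sketch of what the fmad/fsub/fmsb sequence and the facgt/sel pair appear to compute, using the constants from the table above. This is my reading of the listing, not code from the patch. The names (kBias, reduce_sin_arg, reduced_arg, and so on) are made up for illustration, and the sketch uses separate multiplies and subtracts where the assembly uses fused operations, so the last bits may differ for large inputs.

```c
#include <stdint.h>
#include <string.h>

/* Entries of .llvm.sin.nxv2f64.tbl used by the sin loop, with the load
   address each one comes from. The kXxx names are illustrative only. */
static const uint64_t kLimit     = 0x43291508581d4000ULL; /* [x8]      range limit */
static const uint64_t kBias      = 0x4338000000000000ULL; /* [x8, #8]  1.5 * 2^52  */
static const uint64_t kPio2Hi    = 0x3ff921fb50000000ULL; /* [x8, #24] pi/2, high  */
static const uint64_t kPio2Mid   = 0x3e5110b460000000ULL; /* [x8, #32] pi/2, mid   */
static const uint64_t kPio2Lo    = 0x3c91a62633145c07ULL; /* [x8, #40] pi/2, low   */
static const uint64_t kTwoOverPi = 0x3fe45f306dc9c882ULL; /* [x8, #48] 2/pi        */

static double from_bits(uint64_t b) { double d; memcpy(&d, &b, sizeof d); return d; }
static uint64_t to_bits(double d) { uint64_t b; memcpy(&b, &d, sizeof b); return b; }

typedef struct {
    double   r;            /* x - q * (pi/2), where q = round(x * 2/pi) */
    unsigned quadrant;     /* q mod 4 */
    int      out_of_range; /* 1 if |x| exceeds the table limit */
} reduced_arg;

reduced_arg reduce_sin_arg(double x) {
    reduced_arg out;
    const double bias = from_bits(kBias);

    /* fmad: adding 1.5 * 2^52 leaves no room for fraction bits, so the sum
       is rounded to an integer and q lands in the low mantissa bits. */
    double biased = x * from_bits(kTwoOverPi) + bias;

    /* fsub: remove the bias again to get q as an ordinary double. */
    double q = biased - bias;

    /* Three fmsb steps: subtract q * (pi/2) piece by piece so that the
       leading products stay exact for moderate q. */
    double r = x - q * from_bits(kPio2Hi);
    r = r - q * from_bits(kPio2Mid);
    r = r - q * from_bits(kPio2Lo);

    out.r = r;
    /* The two low bits of the biased value hold q mod 4. */
    out.quadrant = (unsigned)(to_bits(biased) & 3u);
    /* facgt + sel: lanes with |x| above the limit get the NaN at [x8, #56]. */
    out.out_of_range = (x < 0 ? -x : x) > from_bits(kLimit);
    return out;
}
```

For example, reduce_sin_arg(1.0) gives quadrant 1 and r close to 1 - pi/2. In the listing, the biased value (z17) rather than q itself is passed to ftsmul and ftssel, which fits with those instructions reading the quadrant from the low bits of that operand.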

Applicable conditions

This optimization will be applied under the following conditions.

  1. The target is AArch64.
  2. Optimization is enabled and the code is not being optimized for size.
    (in Clang options, not -O0, -Os, or -Oz)
  3. The IR contains any of the llvm.{sin,cos,tan,exp,...}.* intrinsics.
    (in Clang options, -fbuiltin -fno-rounding-math -fno-trapping-math -fno-math-errno)
  4. The fast-math flag afn is attached to the intrinsic.
    (in Clang options, -fapprox-func)
  5. An instruction pattern more efficient than libm exists for the function under the given architecture features.
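
Read together, the five conditions form a single conjunction. The sketch below spells that out in C. It is only an illustration: call_site_info, its fields, and should_inline_math are hypothetical names, not LLVM APIs, and condition 2 is modeled after its Clang parenthetical (optimization on, no size optimization).

```c
#include <stdbool.h>

/* Hypothetical summary of one math call site. None of these names exist
   in LLVM; they only mirror the numbered conditions above. */
typedef struct {
    bool target_is_aarch64;      /* condition 1 */
    int  opt_level;              /* condition 2: 0 means -O0 */
    int  size_level;             /* condition 2: 1 for -Os, 2 for -Oz, else 0 */
    bool is_math_intrinsic;      /* condition 3: llvm.sin.*, llvm.cos.*, ... */
    bool has_afn_flag;           /* condition 4: fast-math flag afn */
    bool has_profitable_pattern; /* condition 5: beats libm on this target */
} call_site_info;

/* Returns true only when every condition holds. */
bool should_inline_math(const call_site_info *c) {
    return c->target_is_aarch64
        && c->opt_level > 0
        && c->size_level == 0
        && c->is_math_intrinsic
        && c->has_afn_flag
        && c->has_profitable_pattern;
}
```

With this reading, a call compiled at -O2 with -fapprox-func qualifies, while the same call at -Os or without the afn flag does not.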

Questions

I’m not yet familiar with the code base. I would like suggestions and comments on the following issues.

  • My implementation selects target intrinsics at the IR level and expands them in SelectionDAG. Is that reasonable?
  • Are the applicable conditions reasonable? Or should I add a new option?
  • Any other suggestions/comments?