Performance analysis for TSVC

yus3710-fj · December 4, 2023, 7:57am

Hello.

As I said before, I’ve investigated the performance of TSVC on A64FX (Fujitsu AArch64 CPU), and now I’d like to share the information.
I’m afraid I’ve not completely finished the research yet, but I think it is worth sharing with you.

HLFIR lowering

There was a weird performance issue on HLFIR lowering.
The assembly code of the innermost loop is the same as FIR lowering but the performance is reduced by 10% when HLFIR lowering is enabled.

I found that the performance depends on the order of files in linking.

$ cat second.f90
function second()
  real second
  call cpu_time(second)
end function

$ flang-new mains.f loops.f second.f90 -Ofast -flang-deprecated-no-hlfir
$ ./a.out | grep s4117
 s4117   10     2.301718    1.0241E+01    1.0241E+01                      121
 s4117  100     2.187275    1.0042E+02    1.0042E+02   7.5978E-08         121
 s4117 1000     1.901932    1.0004E+03    1.0004E+03   2.4403E-07         121
$ flang-new mains.f loops.f second.f90 -Ofast
$ ./a.out | grep s4117
 s4117   10     2.664087    1.0241E+01    1.0241E+01                      121
 s4117  100     2.319328    1.0042E+02    1.0042E+02   7.5978E-08         121
 s4117 1000     2.351143    1.0004E+03    1.0004E+03   2.4403E-07         121
$ flang-new loops.f mains.f second.f90 -Ofast
$ ./a.out | grep s4117
 s4117   10     2.217472    1.0241E+01    1.0241E+01                      121
 s4117  100     2.333374    1.0042E+02    1.0042E+02   7.5978E-08         121
 s4117 1000     1.900425    1.0004E+03    1.0004E+03   2.4403E-07         121

Some other loops have the same issue.
I don’t know why, but I’d like to conclude there is no problem on HLFIR lowering.

vs gfortran

I also measured the performance of Gfortran.
According to the result below, Flang is 11% slower than Gfortran as a whole.
In particular, it seems that vectorization makes a big difference on their performance.

func	Gfortran [s]	Flang with FIR [s]	Flang with HLFIR [s]
ALL	1354.27	1525.36	1523.19
s111	5.572203	5.066218	5.052499
s112	10.112344(V)	4.050852(V)	3.947439(V)
s113	2.262772(V)	4.245930	4.716603
:
vdotr	4.834167(V)	3.709595(V)	3.713380(V)
vbor	10.649109(V)	10.333374(V)	10.352844(V)

Version
- Gfortran: 11.2.0
- Flang: main(1c1227846425883a3d39ff56700660236a97152c)
Option: -Ofast
- -falias-analysis is also specified for Flang
“(V)” says the loop is vectorized
- checked with -fopt-info-vec-optimized for Gfortran and -Rpass=vector for Flang

Vectorization

I checked whether vectorization works well for TSVC (Test Suite for Vectorizing Compilers).
There are 135 loops in TSVC and Flang can vectorize 52 loops while Gfortran can vectorize 58 loops.
I think the gap should be filled, and now I’m focusing on the loops which are vectorized if they are written in C.

github.com/llvm/llvm-project

[LICM] TSVC s113: not vectorized because LICM doesn't work

opened 12:19AM - 04 Dec 23 UTC

yus3710-fj

flang:ir vectorization

Flang can't vectorize the loop in `s113` of [TSVC](https://www.netlib.org/benchm…ark/vectors) while Clang can vectorize the loop written in C. ```fortran ! Fortran version do 1 nl = 1,ntimes do 10 i = 2,n a(i) = a(1) + b(i) 10 continue call dummy(ld,n,a,b,c,d,e,aa,bb,cc,1.) 1 continue ``` ```c // C version for (int nl = 0; nl < ntimes; nl++) { for (int i = 1; i < n; i++) { a[i] = a[0] + b[i]; } dummy(a, b, c, d, e, aa, bb, cc, 0.); } ``` ```console $ flang-new -v -Ofast s113.f -S -Rpass=licm\|vector -falias-analysis flang-new version 18.0.0 (https://github.com/llvm/llvm-project.git 1c1227846425883a3d39ff56700660236a97152c) Target: aarch64-unknown-linux-gnu Thread model: posix InstalledDir: /path/to/install/bin Found candidate GCC installation: /path/to/lib/gcc/aarch64-unknown-linux-gnu/11.2.0 Selected GCC installation: /path/to/lib/gcc/aarch64-unknown-linux-gnu/11.2.0 Candidate multilib: .;@m64 Selected multilib: .;@m64 "/path/to/install/bin/flang-new" -fc1 -triple aarch64-unknown-linux-gnu -S -fcolor-diagnostics -mrelocation-model pic -pic-level 2 -pic-is-pie -ffast-math -target-cpu generic -target-feature +neon -target-feature +v8a -fstack-arrays -fversion-loops-for-stride -falias-analysis -Rpass=vector -O3 -o s113.s -x f95-cpp-input s113.f $ clang -Ofast s113.c -Rpass=licm\|vector /path/to/s113.c:16:11: remark: hoisting load [-Rpass=licm] 16 | a[i] = a[0] + b[i]; | ^ /path/to/s113.c:15:3: remark: vectorized loop (vectorization width: 4, interleaved count: 2) [-Rpass=loop-vectorize] 15 | for (int i = 1; i < LEN; i++) { | ^ ``` It can be reproduced with the following C code which is the same program as the above C code essentially. ```c // C version for (int nl = 0; nl < ntimes; nl++) { for (int i = 2; i <= n; i++) { a[i-1] = a[0] + b[i-1]; } dummy(a, b, c, d, e, aa, bb, cc, 0.); } ``` Actually, Flang generates LLVM IR like this C code. LICM is necessary for vectorization because LoopAccessAnalysis can't analyze `a[0]` correctly. It seems that LICM doesn't work due to the linear expression in indices of arrays.

github.com/llvm/llvm-project

[Flang] TSVC s314: `fcmp` doesn't have fast-math flags

opened 12:21AM - 04 Dec 23 UTC

yus3710-fj

performance flang:ir

Flang can't vectorize the loop in `s314` of [TSVC](https://www.netlib.org/benchm…ark/vectors) while Clang can vectorize the loop written in C. ```fortran ! Fortran version do 1 nl = 1,ntimes x = a(1) do 10 i = 2,n if(a(i) .gt. x) x = a(i) 10 continue call dummy(ld,n,a,b,c,d,e,aa,bb,cc,x) 1 continue ``` ```c // C version for (int nl = 0; nl < ntimes; nl++) { x = a[0]; for (int i = 1; i < n; i++) { if (a[i] > x) { x = a[i]; } } dummy(a, b, c, d, e, aa, bb, cc, x); } ``` ```console $ flang-new -v -Ofast s314.f -S -Rpass=vector flang-new version 18.0.0 (https://github.com/llvm/llvm-project.git 1c1227846425883a3d39ff56700660236a97152c) Target: aarch64-unknown-linux-gnu Thread model: posix InstalledDir: /path/to/install/bin Found candidate GCC installation: /path/to/lib/gcc/aarch64-unknown-linux-gnu/11.2.0 Selected GCC installation: /path/to/lib/gcc/aarch64-unknown-linux-gnu/11.2.0 Candidate multilib: .;@m64 Selected multilib: .;@m64 "/path/to/install/bin/flang-new" -fc1 -triple aarch64-unknown-linux-gnu -S -fcolor-diagnostics -mrelocation-model pic -pic-level 2 -pic-is-pie -ffast-math -target-cpu generic -target-feature +neon -target-feature +v8a -fstack-arrays -fversion-loops-for-stride -Rpass=vector -O3 -o s314.s -x f95-cpp-input s314.f $ clang -Ofast s314.c -Rpass=vector /path/to/s314.c:17:3: remark: vectorized loop (vectorization width: 4, interleaved count: 2) [-Rpass=loop-vectorize] 17 | for (int i = 0; i < LEN; i++) { | ^ ``` `fcmp` should have fast-math flags to be recognized as max reduction but Flang doesn't support. ``` .lr.ph: ; preds = %.lr.ph.preheader, %26 %indvars.iv = phi i64 [ 2, %.lr.ph.preheader ], [ %indvars.iv.next, %26 ] %22 = phi float [ %18, %.lr.ph.preheader ], [ %27, %26 ] %gep = getelementptr float, ptr getelementptr ([1000 x float], ptr @_QMmodEa, i64 -1, i64 999), i64 %indvars.iv, !dbg !21 %23 = load float, ptr %gep, align 4, !dbg !21, !tbaa !15 %24 = fcmp ogt float %23, %22, !dbg !21 br i1 %24, label %25, label %26, !dbg !21 25: ; preds = %.lr.ph store float %23, ptr %12, align 4, !dbg !23, !tbaa !15 br label %26, !dbg !21 26: ; preds = %25, %.lr.ph %27 = phi float [ %23, %25 ], [ %22, %.lr.ph ] %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !24 %exitcond.not = icmp eq i64 %indvars.iv, %21, !dbg !21 br i1 %exitcond.not, label %._crit_edge, label %.lr.ph, !dbg !21 ```

github.com/llvm/llvm-project

[Flang] TSVC s314, s3111: not vectorized because the loops are not recognized as reduction loops

opened 12:28AM - 04 Dec 23 UTC

yus3710-fj

flang:ir vectorization

Flang can't vectorize the loop in `s314` and `s3111` of [TSVC](https://www.netli…b.org/benchmark/vectors) while Clang can vectorize the loop written in C. ```fortran ! Fortran version do 1 nl = 1,ntimes x = a(1) do 10 i = 2,n if(a(i) .gt. x) x = a(i) 10 continue call dummy(ld,n,a,b,c,d,e,aa,bb,cc,x) 1 continue ``` ```c // C version for (int nl = 0; nl < ntimes; nl++) { x = a[0]; for (int i = 1; i < n; i++) { if (a[i] > x) { x = a[i]; } } dummy(a, b, c, d, e, aa, bb, cc, x); } ``` ```console $ flang-new -v -Ofast s314.f -S -Rpass=vector flang-new version 18.0.0 (https://github.com/llvm/llvm-project.git 1c1227846425883a3d39ff56700660236a97152c) Target: aarch64-unknown-linux-gnu Thread model: posix InstalledDir: /path/to/install/bin Found candidate GCC installation: /path/to/lib/gcc/aarch64-unknown-linux-gnu/11.2.0 Selected GCC installation: /path/to/lib/gcc/aarch64-unknown-linux-gnu/11.2.0 Candidate multilib: .;@m64 Selected multilib: .;@m64 "/path/to/install/bin/flang-new" -fc1 -triple aarch64-unknown-linux-gnu -S -fcolor-diagnostics -mrelocation-model pic -pic-level 2 -pic-is-pie -ffast-math -target-cpu generic -target-feature +neon -target-feature +v8a -fstack-arrays -fversion-loops-for-stride -Rpass=vector -O3 -o s314.s -x f95-cpp-input s314.f $ clang -Ofast s314.c -Rpass=vector /path/to/s314.c:17:3: remark: vectorized loop (vectorization width: 4, interleaved count: 2) [-Rpass=loop-vectorize] 17 | for (int i = 0; i < LEN; i++) { | ^ ``` One reason is fast-math flag (#74263), but even if fast-math flag is set, the loop isn't vectorized. There seems to be a redundant store in the IR from Flang and that prevents from the pattern match. * IR from Flang (manually added fast-math flag) ``` .lr.ph: ; preds = %.lr.ph.preheader, %26 %indvars.iv = phi i64 [ 2, %.lr.ph.preheader ], [ %indvars.iv.next, %26 ] %22 = phi float [ %18, %.lr.ph.preheader ], [ %27, %26 ] %gep = getelementptr float, ptr getelementptr ([1000 x float], ptr @_QMmodEa, i64 -1, i64 999), i64 %indvars.iv, !dbg !21 %23 = load float, ptr %gep, align 4, !dbg !21, !tbaa !15 %24 = fcmp fast ogt float %23, %22, !dbg !21 br i1 %24, label %25, label %26, !dbg !21 25: ; preds = %.lr.ph store float %23, ptr %12, align 4, !dbg !23, !tbaa !15 br label %26, !dbg !21 26: ; preds = %25, %.lr.ph %27 = phi float [ %23, %25 ], [ %22, %.lr.ph ] %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !24 %exitcond.not = icmp eq i64 %indvars.iv, %21, !dbg !21 br i1 %exitcond.not, label %._crit_edge, label %.lr.ph, !dbg !21 ``` * IR from Clang ``` for.body4: ; preds = %for.body, %for.body4 %indvars.iv = phi i64 [ 0, %for.body ], [ %indvars.iv.next, %for.body4 ] %x.120 = phi float [ %0, %for.body ], [ %x.2, %for.body4 ] %arrayidx = getelementptr inbounds [32000 x float], ptr @a, i64 0, i64 %indvars.iv %1 = load float, ptr %arrayidx, align 4, !tbaa !5 %cmp5 = fcmp fast ogt float %1, %x.120 %x.2 = select i1 %cmp5, float %1, float %x.120 %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 %exitcond.not = icmp eq i64 %indvars.iv.next, 32000 br i1 %exitcond.not, label %for.cond.cleanup3, label %for.body4, !llvm.loop !11 ```

kiranchandramohan · December 4, 2023, 9:51am

Thanks @yus3710-fj for investigating the performance of TSVC with Flang.

Did you include the alias-tags pass (that propagates more alias info to LLVM) when you conducted the investigation?

github.com/llvm/llvm-project

[flang] (Re-)Enable alias tags pass by default

llvm:main ← tblah:ecclescake/reland-aa-tags-pass

opened 08:45PM - 03 Dec 23 UTC

tblah

+60 -21

Enable by default for optimization levels higher than 0 (same behavior as clang)…. For simplicity, only forward the flag to the frontend driver when it contradicts what is implied by the optimization level. This was first landed in https://github.com/llvm/llvm-project/pull/73111 but was later reverted due to a performance regression. That regression was fixed by https://github.com/llvm/llvm-project/pull/74065.

sjoerdmeijer · December 4, 2023, 5:26pm

I have done a similar exercise recently. I was interested in the C version and AArch64 codegen and raised a bunch of issues:

github.com/llvm/llvm-project

[AArch64] VLA slower than VLS (tsvc, s173)

opened 11:45AM - 07 Nov 23 UTC

sjoerdmeijer

backend:AArch64 vectorization

We are behind a lot compared to GCC. Compile this input with `-O3 -mcpu=neoverse…-v2 -ffast-math`: ``` __attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000], aa[256][256],bb[256][256],cc[256][256],tt[256][256]; int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float); float s173() { int k = 32000/2; for (int nl = 0; nl < 10*100000; nl++) { for (int i = 0; i < 32000/2; i++) { a[i+k] = a[i] + b[i]; } dummy(a, b, c, d, e, aa, bb, cc, 0.); } } ``` Clang's codegen: ``` .LBB0_3: // Parent Loop BB0_2 Depth=1 add x9, x19, x8, lsl #2 add x10, x20, x8, lsl #2 ld1w { z0.s }, p0/z, [x19, x8, lsl #2] ld1w { z2.s }, p0/z, [x20, x8, lsl #2] add x8, x8, x21 ld1w { z1.s }, p0/z, [x9, x28, lsl #2] ld1w { z3.s }, p0/z, [x10, x28, lsl #2] add x10, x9, x26 cmp x8, x22 fadd z0.s, z2.s, z0.s fadd z1.s, z3.s, z1.s st1w { z0.s }, p0, [x9, x23, lsl #2] st1w { z1.s }, p0, [x10, x28, lsl #2] b.ne .LBB0_3 ``` vs. GCC's codegen: ``` .L3: ldr q31, [x20, x0] ldr q30, [x19, x0] fadd v31.4s, v31.4s, v30.4s str q31, [x21, x0] add x0, x0, 16 cmp x0, x28 bne .L3 ``` See also: https://godbolt.org/z/9zs65h3aq Might be caused by the same underlying issue as: https://github.com/llvm/llvm-project/issues/71524

github.com/llvm/llvm-project

[AArch64] VLA slower than VLS (tsvc, s1111)

opened 11:40AM - 07 Nov 23 UTC

sjoerdmeijer

backend:AArch64 vectorization

We are about 25% behind with Clang compared to GCC12 on Grace for kernel s1111 i…n TSVC. The difference seems to be related to Clang generating a VLA loop, and GCC a simpler VLS loop. Compile this input with `-O3 -mcpu=neoverse-v2 -ffast-math`: ``` __attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000], aa[256][256],bb[256][256],cc[256][256],tt[256][256]; int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float); float s1111() { for (int nl = 0; nl < 2*100000; nl++) { for (int i = 0; i < 32000/2; i++) { a[2*i] = c[i] * b[i] + d[i] * b[i] + c[i] * c[i] + d[i] * b[i] + d[i] * c[i]; } dummy(a, b, c, d, e, aa, bb, cc, 0.); } } ``` Clang's codegen: ``` .LBB0_3: // Parent Loop BB0_2 Depth=1 ld1w { z4.s }, p0/z, [x19, x8, lsl #2] ld1w { z2.s }, p0/z, [x21, x8, lsl #2] ld1w { z3.s }, p0/z, [x20, x8, lsl #2] add x8, x8, x23 cmp x28, x8 fadd z5.s, z4.s, z4.s fmul z5.s, z3.s, z5.s fadd z3.s, z3.s, z2.s fadd z3.s, z3.s, z4.s lsl z4.d, z0.d, #1 add z0.d, z0.d, z6.d fmad z2.s, p0/m, z3.s, z5.s lsl z3.d, z1.d, #1 add z1.d, z1.d, z6.d uunpklo z5.d, z2.s uunpkhi z2.d, z2.s st1w { z5.d }, p1, [x22, z4.d, lsl #2] st1w { z2.d }, p1, [x22, z3.d, lsl #2] b.ne .LBB0_3 ``` vs. GCC's codegen: ``` .L3: ldr q29, [x25, x0] mov x7, x4 add x6, x4, 16 add x5, x4, 24 add x4, x4, 32 ldr q30, [x24, x0] ldr q28, [x23, x0] add x0, x0, 16 mov v31.16b, v29.16b fmla v31.4s, v30.4s, v27.4s fadd v30.4s, v30.4s, v29.4s fmul v31.4s, v31.4s, v28.4s fmla v31.4s, v30.4s, v29.4s str s31, [x7], 8 st1 {v31.s}[1], [x7] st1 {v31.s}[2], [x6] st1 {v31.s}[3], [x5] cmp x0, x26 bne .L3 ``` See also: https://godbolt.org/z/fnfa9eb3E TODO: Root cause analysis.

github.com/llvm/llvm-project

[AArch64] VLA slower than VLS (tsvc, s176)

opened 11:29AM - 07 Nov 23 UTC

sjoerdmeijer

backend:AArch64 vectorization

Clang generates a VLA style vector loop, and GCC a VLS vector loop. It looks lik…e we are about 50% slower as a result. Compile this input with `-O3 -mcpu=neoverse-v2 -ffast-math`: ``` __attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000], aa[256][256],bb[256][256],cc[256][256],tt[256][256]; int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float); float s176() { int m = 32000/2; for (int nl = 0; nl < 4*(100000/32000); nl++) { for (int j = 0; j < (32000/2); j++) { for (int i = 0; i < m; i++) { a[i] += b[i+m-j-1] * c[j]; } } dummy(a, b, c, d, e, aa, bb, cc, 0.); } } ``` Clang's codegen: ``` .LBB0_5: // Parent Loop BB0_2 Depth=1 ld1w { z2.s }, p0/z, [x12] ld1b { z3.b }, p1/z, [x12, x28] ld1w { z4.s }, p0/z, [x11] ld1b { z5.b }, p1/z, [x11, x28] add x12, x12, x23 subs x10, x10, x22 fmad z2.s, p0/m, z1.s, z4.s fmad z3.s, p0/m, z1.s, z5.s st1w { z2.s }, p0, [x11] st1b { z3.b }, p1, [x11, x28] add x11, x11, x23 b.ne .LBB0_5 ``` vs. GCC's codegen: ``` .L3: ldr q29, [x8, x0] ldr q31, [x19, x0] ldr q30, [x9, x0] fmla v31.4s, v28.4s, v29.4s fmla v31.4s, v27.4s, v30.4s str q31, [x19, x0] add x0, x0, 16 cmp x0, x22 bne .L3 ``` See also: https://godbolt.org/z/64nhv1o6z TODO: Root cause analysis.

github.com/llvm/llvm-project

[AArch64] Missed doubly loop fmla vectorisation (tsvc, s235)

opened 11:25AM - 07 Nov 23 UTC

sjoerdmeijer

backend:AArch64 vectorization

GCC12 vectorises the statements in both the outer and inner loop. Clang doesn't …do any vectorisation. As a result, we are about 90% behind for kernel s235 in TSVC. Compile this input with `-O3 -mcpu=neoverse-v2 -ffast-math`: ``` __attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000], aa[256][256],bb[256][256],cc[256][256],tt[256][256]; int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float); float s235() { for (int nl = 0; nl < 200*(100000/256); nl++) { for (int i = 0; i < 256; i++) { a[i] += b[i] * c[i]; for (int j = 1; j < 256; j++) { aa[j][i] = aa[j-1][i] + bb[j][i] * a[i]; } } dummy(a, b, c, d, e, aa, bb, cc, 0.); } return aa[1][2]; } ``` Clang's scalar codegen: ``` .LBB0_2: // Parent Loop BB0_1 Depth=1 ldr s0, [x21, x8, lsl #2] ldr s1, [x22, x8, lsl #2] ldr s2, [x23, x8, lsl #2] mov w11, #255 // =0xff mov x12, x9 mov x13, x10 fmadd s0, s1, s0, s2 ldr s1, [x20, x8, lsl #2] str s0, [x23, x8, lsl #2] .LBB0_3: // Parent Loop BB0_1 Depth=1 ldr s2, [x13, #1024] subs x11, x11, #3 fmadd s1, s2, s0, s1 ldr s2, [x13, #2048] str s1, [x12, #1024] fmadd s1, s2, s0, s1 ldr s2, [x13, #3072] add x13, x13, #3072 str s1, [x12, #2048] fmadd s1, s2, s0, s1 str s1, [x12, #3072] add x12, x12, #3072 b.ne .LBB0_3 add x8, x8, #1 add x10, x10, #4 add x9, x9, #4 cmp x8, #256 b.ne .LBB0_2 ``` vs. GCC's vector code: ``` .L4: add x10, x22, x11 sub x9, x8, #1024 ldr q29, [x21, x11] mov x0, 0 ldr q30, [x2, x11] ldr q31, [x28, x11] fmla v29.4s, v30.4s, v31.4s str q29, [x21, x11] .L3: ldr q30, [x10, x0] ldr q31, [x9, x0] fmla v31.4s, v30.4s, v29.4s str q31, [x8, x0] add x0, x0, 1024 cmp x0, x19 bne .L3 add x11, x11, 16 add x8, x8, 16 cmp x11, 1024 bne .L4 ``` See also: https://godbolt.org/z/5fG1bffqz TODO: Root cause analysis.

github.com/llvm/llvm-project

[AArch64] Missed if-conversion and vectorisation opportunity (tsvc, s124)

opened 11:22AM - 07 Nov 23 UTC

sjoerdmeijer

backend:AArch64 vectorization

We are generating a lot of code with Clang for a loop that contains an if-then s…tatement resulting in predicated instructions, which don't seem to be necessary looking at GCC's codegen. For this kernel s124 in TSVC, we are about 60% behind. Compile this input with `-O3 -mcpu=neoverse-v2 -ffast-math`: ``` __attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000], aa[256][256],bb[256][256],cc[256][256],tt[256][256]; int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float); float s124(struct args_t * func_args) { int j; for (int nl = 0; nl < 100000; nl++) { j = -1; for (int i = 0; i < 32000; i++) { if (b[i] > (float)0.) { j++; a[j] = b[i] + d[i] * e[i]; } else { j++; a[j] = c[i] + d[i] * e[i]; } } dummy(a, b, c, d, e, aa, bb, cc, 0.); } } ``` Clang's codegen: ``` .LBB0_3: // Parent Loop BB0_2 Depth=1 ld1w { z0.s }, p2/z, [x12, x9, lsl #2] ld1w { z1.s }, p2/z, [x25, x9, lsl #2] ld1w { z2.s }, p2/z, [x24, x9, lsl #2] asr x10, x8, #30 add x8, x8, x20 add x11, x23, x10 ld1w { z3.s }, p2/z, [x26, x9, lsl #2] ld1w { z4.s }, p2/z, [x22, x9, lsl #2] ld1w { z5.s }, p2/z, [x19, x9, lsl #2] fcmgt p0.s, p2/z, z0.s, #0.0 fcmgt p1.s, p2/z, z1.s, #0.0 sel z0.s, p0, z0.s, z2.s sel z1.s, p1, z1.s, z3.s ld1w { z2.s }, p2/z, [x21, x9, lsl #2] ld1w { z3.s }, p2/z, [x27, x9, lsl #2] add x9, x9, x15 cmp x28, x9 fmla z0.s, p2/m, z4.s, z2.s fmla z1.s, p2/m, z5.s, z3.s st1b { z0.b }, p3, [x23, x10] st1w { z1.s }, p2, [x11, x14, lsl #2] b.ne .LBB0_3 ``` vs. GCC's codegen: ``` .L2: ldr q30, [x23, x0] ldr q27, [x22, x0] ldr q28, [x21, x0] ldr q31, [x20, x0] fcmgt v29.4s, v30.4s, 0 fmla v30.4s, v27.4s, v28.4s fmla v31.4s, v27.4s, v28.4s bit v31.16b, v30.16b, v29.16b str q31, [x19, x0] add x0, x0, 16 cmp x0, x24 bne .L2 ``` See also: https://godbolt.org/z/nb6xYxxKo

github.com/llvm/llvm-project

[AArch64] Missed fmla vectorisation opportunity (tsvc, s2275)

opened 11:14AM - 07 Nov 23 UTC

sjoerdmeijer

backend:AArch64 vectorization

We are a lot behind (300%) for kernel s2275 in TSVC compared to GCC12. Compil…e this input with `-O3 -mcpu=neoverse-v2 -ffast-math`: ``` __attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000], aa[256][256],bb[256][256],cc[256][256],tt[256][256]; int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float); float s2275(struct args_t * func_args) { for (int nl = 0; nl < 100*(100000/256); nl++) { for (int i = 0; i < 256; i++) { for (int j = 0; j < 256; j++) { aa[j][i] = aa[j][i] + bb[j][i] * cc[j][i]; } a[i] = b[i] + c[i] * d[i]; } dummy(a, b, c, d, e, aa, bb, cc, 0.); } } ``` Clang's codegen: ``` .LBB0_2: // Parent Loop BB0_1 Depth=1 mov x12, xzr .LBB0_3: // Parent Loop BB0_1 Depth=1 add x14, x10, x12 add x13, x9, x12 ldr s2, [x14] ldr s3, [x14, #1024] add x14, x11, x12 ldr s0, [x13] ldr s1, [x13, #1024] add x12, x12, #2048 ldr s4, [x14] ldr s5, [x14, #1024] cmp x12, #64, lsl #12 // =262144 fmadd s0, s4, s2, s0 fmadd s1, s5, s3, s1 str s0, [x13] str s1, [x13, #1024] b.ne .LBB0_3 ldr s0, [x22, x8, lsl #2] ldr s1, [x23, x8, lsl #2] ldr s2, [x24, x8, lsl #2] add x11, x11, #4 add x10, x10, #4 add x9, x9, #4 fmadd s0, s2, s1, s0 str s0, [x25, x8, lsl #2] add x8, x8, #1 cmp x8, #256 b.ne .LBB0_2 ``` vs. GCC's codegen: ``` .L6: mov x0, 0 .L3: ldr q29, [x10, x0] ldr q30, [x9, x0] ldr q31, [x8, x0] fmla v31.4s, v29.4s, v30.4s str q31, [x8, x0] add x0, x0, 1024 cmp x0, 262144 bne .L3 ldr q29, [x27, x11] add x8, x8, 16 add x10, x10, 16 add x9, x9, 16 ldr q30, [x26, x11] ldr q31, [x25, x11] fmla v31.4s, v29.4s, v30.4s str q31, [x19, x11] add x11, x11, 16 cmp x11, 1024 bne .L6 ``` See also: https://godbolt.org/z/8E3fexn5o TODO: Root cause analysis.

github.com/llvm/llvm-project

[AArch64] Missed fadd vectorisation opportunity (tsvc, s231)

opened 11:11AM - 07 Nov 23 UTC

sjoerdmeijer

backend:AArch64 vectorization

Looks like we are 1400% (?!) behind for kernel s231 in TSVC compared to GCC. C…ompile this code with `-O3 -mcpu=neoverse-v2 -ffast-math`: ``` __attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000], aa[256][256],bb[256][256],cc[256][256],tt[256][256]; int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float); float s231() { for (int nl = 0; nl < 100*(100000/256); nl++) { for (int i = 0; i < 256; ++i) { for (int j = 1; j < 256; j++) { aa[j][i] = aa[j - 1][i] + bb[j][i]; } } dummy(a, b, c, d, e, aa, bb, cc, 0.); } } ``` Clang's codegen: ``` .LBB44_3: // Parent Loop BB44_1 Depth=1 // Parent Loop BB44_2 Depth=2 // => This Inner Loop Header: Depth=3 add x12, x21, x10 add x13, x20, x10 subs x11, x11, #5 add x10, x10, x19 ldr s1, [x12, #1024] fadd s0, s1, s0 ldr s1, [x12, #2048] str s0, [x13, #1024] fadd s0, s1, s0 ldr s1, [x12, #3072] str s0, [x13, #2048] fadd s0, s1, s0 ldr s1, [x12, #4096] str s0, [x13, #3072] fadd s0, s1, s0 ldr s1, [x12, #5120] str s0, [x13, #4096] fadd s0, s1, s0 str s0, [x13, #5120] b.ne .LBB44_3 ``` vs. GCC's codegen: ``` .L521: ldr q0, [x8, x0] ldr q1, [x2, x0] fadd v0.4s, v0.4s, v1.4s str q0, [x1, x0] add x0, x0, 16 cmp x0, 1024 bne .L521 ``` See also: https://godbolt.org/z/jr9WKW95v TODO: root cause analysis.

github.com/llvm/llvm-project

[AArch64] Missed vectorisation opportunity (tsvc, s172)

opened 11:08AM - 07 Nov 23 UTC

sjoerdmeijer

backend:AArch64 vectorization

We are not vectorising kernel s172 from TSVS and are 3x behind compared to GCC a…s a result. Compile this input with `-O3 -ffast-math -mcpu=neoverse-v2`: ``` __attribute__((aligned(64))) float x[32000]; __attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000], aa[256][256],bb[256][256],cc[256][256],tt[256][256]; int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float); float s172(int xa, int xb) { int n1 = xa; int n3 = xb; for (int nl = 0; nl < 100000; nl++) { for (int i = n1-1; i < 32000; i += n3) { a[i] += b[i]; } dummy(a, b, c, d, e, aa, bb, cc, 0.); } } ``` Clang's codegen: ``` .LBB0_3: // Parent Loop BB0_2 Depth=1 ldr s0, [x19, x8, lsl #2] ldr s1, [x20, x8, lsl #2] fadd s0, s1, s0 str s0, [x20, x8, lsl #2] add x8, x8, x22 cmp x8, x23 b.lt .LBB0_3 ``` GCC's codegen: ``` whilelo p7.s, wzr, w28 .L5: ld1w z31.s, p7/z, [x19, x0, lsl 2] ld1w z30.s, p7/z, [x27, x0, lsl 2] fadd z31.s, z31.s, z30.s st1w z31.s, p7, [x19, x0, lsl 2] add x0, x0, x2 whilelo p7.s, w0, w28 b.any .L5 ``` See also: https://godbolt.org/z/W8eEPKqET TODO: root cause analysis.

github.com/llvm/llvm-project

[AArch64] Missed vectorisation opportunity (tsvc, s122)

opened 11:02AM - 07 Nov 23 UTC

sjoerdmeijer

backend:AArch64 vectorization

We are not vectorising kernel s122 from TSVC whereas GCC is vectorising it. As a… result we are about 2x slower with Clang. Compile this input with `-O3 -ffast-math -mcpu=neoverse-v2`: ``` __attribute__((aligned(64))) float x[32000]; __attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000], aa[256][256],bb[256][256],cc[256][256],tt[256][256]; int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float); float s122(int xa, int xb) { int n1 = xa; int n3 = xb; int j, k; for (int nl = 0; nl < 100000; nl++) { j = 1; k = 0; for (int i = n1-1; i < 32000; i += n3) { k += j; a[i] += b[32000 - k]; } dummy(a, b, c, d, e, aa, bb, cc, 0.); } } ``` Clang's codegen: ``` .LBB0_3: // Parent Loop BB0_2 Depth=1 ldr s0, [x8], #-4 ldr s1, [x19, x9, lsl #2] fadd s0, s1, s0 str s0, [x19, x9, lsl #2] add x9, x9, x21 cmp x9, x22 b.lt .LBB0_3 ``` GCC's codegen: ``` .L6: ldr q30, [x1], -16 ldr q29, [x0] mov v31.16b, v30.16b tbl v30.16b, {v30.16b - v31.16b}, v28.16b fadd v31.4s, v29.4s, v30.4s str q31, [x0], 16 cmp x22, x0 bne .L6 ``` See also: https://godbolt.org/z/7zzrPazM7 TODO: root cause analysis.

github.com/llvm/llvm-project

[AArch64] suboptimal vectorisation (tsvc, s128)

opened 10:58AM - 07 Nov 23 UTC

sjoerdmeijer

backend:AArch64 vectorization

Clang tip of tree is 120% behind GCC for kernel s128 in TSVC. Compile this input… with `-O3 -ffast-math -mcpu=neoverse-v2`: ``` __attribute__((aligned(64))) float x[32000]; __attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000], aa[256][256],bb[256][256],cc[256][256],tt[256][256]; int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float); float s128(struct args_t * func_args) { int j, k; for (int nl = 0; nl < 2*100000; nl++) { j = -1; for (int i = 0; i < 32000/2; i++) { k = j + 1; a[i] = b[k] - d[i]; j = k + 1; b[k] = a[i] + c[k]; } dummy(a, b, c, d, e, aa, bb, cc, 1.); } } ``` GCC's codegen: ``` .L3: ld2 {v28.4s - v29.4s}, [x0] mov x6, x0 add x5, x0, 16 add x4, x0, 24 add x0, x0, 32 ldr q27, [x8], 16 ld2 {v30.4s - v31.4s}, [x7], 32 fsub v28.4s, v28.4s, v27.4s fadd v30.4s, v28.4s, v30.4s str q28, [x9], 16 str s30, [x6], 8 st1 {v30.s}[1], [x6] st1 {v30.s}[2], [x5] st1 {v30.s}[3], [x4] cmp x7, x23 bne .L3 ``` Clang's codegen: ``` .LBB0_2: // Parent Loop BB0_1 Depth=1 mov z2.d, z0.d add z2.d, z2.d, #1 // =0x1 adr z3.d, [z7.d, z2.d, lsl #2] fmov x9, d3 ld2w { z3.s, z4.s }, p0/z, [x9] ld1w { z5.s }, p0/z, [x21, x8, lsl #2] fmov x9, d2 fsub z3.s, z3.s, z5.s st1w { z3.s }, p0, [x22, x8, lsl #2] add x8, x8, x28 ld2w { z4.s, z5.s }, p0/z, [x20, x9, lsl #2] add x9, x19, #4 cmp x23, x8 fadd z2.s, z4.s, z3.s uunpklo z3.d, z2.s uunpkhi z2.d, z2.s st1w { z3.d }, p1, [x9, z0.d, lsl #2] add z0.d, z0.d, z6.d st1w { z2.d }, p1, [x9, z1.d, lsl #2] add z1.d, z1.d, z6.d b.ne .LBB0_2 ``` See also: https://godbolt.org/z/154McGMve Todo: root cause analysis

github.com/llvm/llvm-project

[AArch64] Missed vectorisation opportunity (tsvc, s112)

opened 10:54AM - 07 Nov 23 UTC

sjoerdmeijer

backend:AArch64 vectorization

With Clang top of tree, we are about 35% behind on our AArch64 platform compared… to GCC12 for the s122 kernel from TSVC: GCC vectorises the kernel, Clang doesn't. Clang seems to think it's not worthwhile vectorising this input with -O3 -ffast-math -mcpu=neoverse-v2: ``` __attribute__((aligned(64))) float x[32000]; __attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000], aa[256][256],bb[256][256],cc[256][256],tt[256][256]; int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float); float s112() { for (int nl = 0; nl < 3*100000; nl++) { for (int i = 32000 - 2; i >= 0; i--) { a[i+1] = a[i] + b[i]; } dummy(a, b, c, d, e, aa, bb, cc, 0.); } } ``` Clang's codegen: ``` .LBB0_3: // Parent Loop BB0_2 Depth=1 add x10, x19, x8 ldr s0, [x9, #-8]! sub x8, x8, #8 ldur s1, [x10, #-4] fadd s2, s1, s0 ldur s0, [x9, #-4] ldr s1, [x19, x8] fadd s0, s1, s0 stp s0, s2, [x9] cbnz x8, .LBB0_3 ``` whereas GCC generates: ``` .L4: ldr q30, [x27, x0] ldr q28, [x20, x0] mov v31.16b, v30.16b mov v29.16b, v28.16b tbl v30.16b, {v30.16b - v31.16b}, v27.16b tbl v28.16b, {v28.16b - v29.16b}, v27.16b fadd v30.4s, v30.4s, v28.4s mov v31.16b, v30.16b tbl v30.16b, {v30.16b - v31.16b}, v27.16b str q30, [x19, x0] sub x0, x0, #16 cmp x0, x28 bne .L4 ``` See https://godbolt.org/z/6cfK9sbj5 for the reproducer (and this codegen).

sjoerdmeijer · December 4, 2023, 5:28pm

For most issues a root cause analysis is missing, and I probably need to organise this in a better way but I raised the issues to create some awareness in case folks were interested.

yus3710-fj · December 5, 2023, 3:11am

Thank you for the information, @kiranchandramohan.

I didn’t include the alias-tags pass, so I measured again with -falias-analysis enabled and modified the above data.
The gap got smaller (14%->11%), but I can’t ignore it.

kiranchandramohan · December 11, 2023, 5:50pm

@yus3710-fj I guess the pending issues are [LICM] TSVC s113: not vectorized because LICM doesn't work · Issue #74262 · llvm/llvm-project · GitHub and [Flang] TSVC s314, s3111: not vectorized because the loops are not recognized as reduction loops · Issue #74264 · llvm/llvm-project · GitHub. I was kind of thinking that these are failing to vectorise due to llvm issues, but it seems you are suggesting that flang might need to change the code-generated so that llvm can vectorise the loops.

yus3710-fj · December 12, 2023, 6:17am

Thank you for your comment, and I apologize for my lack of explanation in the last call.

Those loops are not vectorized because the code generated by Flang isn’t suitable for LLVM optimization.
I think there are 2 ways to resolve these issues. One is to fix LLVM optimization for the code generated by Flang, and the other is to fix Flang to generate suitable code for LLVM optimization.
The former seems to be the better of the two, but I’m not sure LLVM can be fixed indeed since I’ve not finished the investigation yet.

yus3710-fj · February 1, 2024, 4:01am

I’ve found some more issues about vectorization.

github.com/llvm/llvm-project

[Flang] TSVC s243: the loop should be canonicalized for alias analysis in LLVM

opened 04:57AM - 22 Jan 24 UTC

yus3710-fj

loopoptim flang:ir

Flang can't vectorize the loop in `s243` of [TSVC](https://www.netlib.org/benchm…ark/vectors) while Clang can vectorize the loop written in C. ```fortran ! Fortran version module mod integer ld, nloops parameter (ld=1000,nloops=135) real a(ld), b(ld), c(ld), d(ld), e(ld) real aa(ld,ld), bb(ld,ld), cc(ld,ld) interface subroutine dummy(ld,n,a,b,c,d,e,aa,bb,cc,x) integer ld, n real a(ld), b(ld), c(ld), d(ld), e(ld) real aa(ld,ld), bb(ld,ld), cc(ld,ld) real, value :: x end subroutine end interface end module subroutine s243 (n) use mod integer n, i call init(ld,n,a,b,c,d,e,aa,bb,cc,'s243 ') do 10 i = 1,n-1 a(i) = b(i) + c(i) * d(i) b(i) = a(i) + d(i) * e(i) a(i) = b(i) + a(i+1) * d(i) 10 continue call dummy(ld,n,a,b,c,d,e,aa,bb,cc,1.) end ``` ```c // C version #define LEN 32000 #define LEN2 256 float a[LEN], b[LEN], c[LEN], d[LEN], e[LEN]; float aa[LEN2][LEN2], bb[LEN2][LEN2], cc[LEN2][LEN2]; int s243() { init( "s243 "); for (int i = 0; i < LEN-1; i++) { a[i] = b[i] + c[i ] * d[i]; b[i] = a[i] + d[i ] * e[i]; a[i] = b[i] + a[i+1] * d[i]; } dummy(a, b, c, d, e, aa, bb, cc, 0.); return 0; } ``` ```console $ flang-new -v -Ofast s243.f -S -Rpass=vector flang-new version 18.0.0git (https://github.com/llvm/llvm-project.git 0e93d04001e45f39cabf0ffb5093512a7f622cc0) Target: aarch64-unknown-linux-gnu Thread model: posix InstalledDir: /path/to/install/bin Found candidate GCC installation: /path/to/lib/gcc/aarch64-unknown-linux-gnu/11.2.0 Selected GCC installation: /path/to/lib/gcc/aarch64-unknown-linux-gnu/11.2.0 Candidate multilib: .;@m64 Selected multilib: .;@m64 "/path/to/install/bin/flang-new" -fc1 -triple aarch64-unknown-linux-gnu -S -fcolor-diagnostics -mrelocation-model pic -pic-level 2 -pic-is-pie -ffast-math -target-cpu generic -target-feature +neon -target-feature +v8a -fstack-arrays -fversion-loops-for-stride -Rpass=vector -O3 -o s243.s -x f95-cpp-input s243.f $ clang -Ofast s243.c -S -Rpass=vector s243.c:15:3: remark: vectorized loop (vectorization width: 4, interleaved count: 1) [-Rpass=loop-vectorize] 15 | for (int i = 0; i < LEN-1; i++) { | ^ ``` The Fortran code isn't vectorized because DSE doesn't remove the first store to `a(i)`. DSE uses BasicAA, and BasicAA can't say that `a(i)` and `a(i+1)` don't alias each other. So DSE doesn't work. Actually, Flang generates LLVM IR like the following C code that makes alias analysis harder. (FYI: BasicAA avoids complicated analyses which can affect the compilation time.) ```c for (int i = 1; i <= n-1; i++) { a[i-1] = b[i-1] + c[i-1] * d[i-1]; b[i-1] = a[i-1] + d[i-1] * e[i-1]; a[i-1] = b[i-1] + a[i] * d[i-1]; } ``` Conversely, the following Fortran code is easy to be vectorized. ```fortran do 10 i = 2,n a(i-1) = b(i-1) + c(i-1) * d(i-1) b(i-1) = a(i-1) + d(i-1) * e(i-1) a(i-1) = b(i-1) + a(i) * d(i-1) 10 continue ``` ```console $ flang-new -Ofast s243_2.f -S -Rpass=vector path/to/s243_2.f:14:7: remark: vectorized loop (vectorization width: 4, interleaved count: 1) [-Rpass=loop-vectorize] ```

github.com/llvm/llvm-project

[Flang] TSVC s314: needs fast-math flags in function attributes for vectorization

opened 05:42AM - 24 Jan 24 UTC

yus3710-fj

performance flang:ir

Flang can't vectorize the loop in `s314` of [TSVC](https://www.netlib.org/benchm…ark/vectors) while Clang can vectorize the loop written in C. ```fortran ! Fortran version do 1 nl = 1,ntimes x = a(1) do 10 i = 2,n if(a(i) .gt. x) x = a(i) 10 continue call dummy(ld,n,a,b,c,d,e,aa,bb,cc,x) 1 continue ``` ```c // C version for (int nl = 0; nl < ntimes; nl++) { x = a[0]; for (int i = 1; i < n; i++) { if (a[i] > x) { x = a[i]; } } dummy(a, b, c, d, e, aa, bb, cc, x); } ``` ```console $ flang-new -v -Ofast s314.f -S -Rpass=vector flang-new version 18.0.0 (https://github.com/llvm/llvm-project.git 2759e47067ea286f6302adcfe93b653cfaf6f2eb) Target: aarch64-unknown-linux-gnu Thread model: posix InstalledDir: /path/to/install/bin Found candidate GCC installation: /path/to/lib/gcc/aarch64-unknown-linux-gnu/11.2.0 Selected GCC installation: /path/to/lib/gcc/aarch64-unknown-linux-gnu/11.2.0 Candidate multilib: .;@m64 Selected multilib: .;@m64 "/path/to/install/bin/flang-new" -fc1 -triple aarch64-unknown-linux-gnu -S -fcolor-diagnostics -mrelocation-model pic -pic-level 2 -pic-is-pie -ffast-math -target-cpu generic -target-feature +neon -target-feature +v8a -fstack-arrays -fversion-loops-for-stride -Rpass=vector -O3 -o s314.s -x f95-cpp-input s314.f $ clang -Ofast s314.c -S -Rpass=vector /path/to/s314.c:17:3: remark: vectorized loop (vectorization width: 4, interleaved count: 2) [-Rpass=loop-vectorize] 17 | for (int i = 0; i < LEN; i++) { | ^ ``` I thought fast-math flags are only needed for `fcmp` (#74263) , but it was insufficient (or wrong). The following function should return `true` to recognize the loop as the max reduction. https://github.com/llvm/llvm-project/blob/c41472dbafd0dcacd943a95a9a099c1942d50394/llvm/lib/Analysis/IVDescriptors.cpp#L805-L814 The max/min reduction is assumed to be the code like `select(fcmp(...))`. This function returns `true` for `FCmpInst`, but not for `SelectInst` which can't have fast-math flags. Clang generates fast-math flags in function attiributes and that makes the function return `true` for `SelectInst`. ```llvm define dso_local noundef i32 @s314() local_unnamed_addr #0 { : } : attributes #0 = { nounwind uwtable "approx-func-fp-math"="true" "denormal-fp-math"="preserve-sign,preserve-sign" "min-legal-vector-width"="0" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="true" } ``` `"no-nans-fp-math"="true"` and `"no-signed-zeros-fp-math"="true"` are necessary for vectorization.

github.com/llvm/llvm-project

[Flang] TSVC s118: not vectorized because LICM doesn't work

opened 05:53AM - 24 Jan 24 UTC

yus3710-fj

flang:ir vectorization

Flang can't vectorize the loop in `s118` of [TSVC](https://www.netlib.org/benchm…ark/vectors) while Clang can vectorize the loop written in C. ```fortran ! Fortran version module mod integer ld, nloops parameter (ld=1000,nloops=135) real a(ld), b(ld), c(ld), d(ld), e(ld) real aa(ld,ld), bb(ld,ld), cc(ld,ld) interface subroutine dummy(ld,n,a,b,c,d,e,aa,bb,cc,x) integer ld, n real a(ld), b(ld), c(ld), d(ld), e(ld) real aa(ld,ld), bb(ld,ld), cc(ld,ld) real, value :: x end subroutine end interface end module subroutine s118 (n) use mod integer n, i, j call init(ld,n,a,b,c,d,e,aa,bb,cc,'s118 ') do 10 i = 2,n do 20 j = 1,i-1 a(i) = a(i) + bb(i,j) * a(i-j) 20 continue 10 continue call dummy(ld,n,a,b,c,d,e,aa,bb,cc,1.) end ``` ```c // C version #define LEN 32000 #define LEN2 256 float a[LEN], b[LEN], c[LEN], d[LEN], e[LEN]; float aa[LEN2][LEN2], bb[LEN2][LEN2], cc[LEN2][LEN2]; int s118() { init( "s118 "); for (int i = 1; i < LEN2; i++) { for (int j = 0; j <= i - 1; j++) { a[i] += bb[j][i] * a[i-j-1]; } } dummy(a, b, c, d, e, aa, bb, cc, 0.); return 0; } ``` ```console $ flang-new -v -Ofast s118.f -S -Rpass=vector flang-new version 18.0.0 (https://github.com/llvm/llvm-project.git 2759e47067ea286f6302adcfe93b653cfaf6f2eb) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /path/to/install/bin Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/12 Selected GCC installation: /usr/lib/gcc/x86_64-linux-gnu/12 Candidate multilib: .;@m64 Selected multilib: .;@m64 "/path/to/install/bin/flang-new" -fc1 -triple x86_64-unknown-linux-gnu -emit-obj -fcolor-diagnostics -mrelocation-model pic -pic-level 2 -pic-is-pie -ffast-math -target-cpu x86-64 -fstack-arrays -fversion-loops-for-stride -mframe-pointer=none -O3 -o /tmp/s118-5868cd.o -x f95-cpp-input s118.f $ clang -Ofast s118.c -S -Rpass=vector /path/to/s118.c:16:4: remark: vectorized loop (vectorization width: 4, interleaved count: 2) [-Rpass=loop-vectorize] 16 | for (int j = 0; j <= i - 1; j++) { | ^ ``` Hoisting the store outside the loop is necessary for vectorization, but it doesn't work because BasicAA says `a(i)` and `a(i-j)` may alias each other. It's similar to #74262 but BasicAA won't do complicated analyses, so I suspect it's difficult to fix BasicAA. ```llvm 25: ; preds = %.lr.ph, %25 %indvars.iv = phi i64 [ 1, %.lr.ph ], [ %indvars.iv.next, %25 ], !dbg !30 %26 = phi float [ %.promoted, %.lr.ph ], [ %32, %25 ], !dbg !30 %27 = mul nuw nsw i64 %indvars.iv, 1000, !dbg !30 %gep13 = getelementptr float, ptr %invariant.gep, i64 %27, !dbg !30 %28 = load float, ptr %gep13, align 4, !dbg !30, !tbaa !31 %29 = sub nuw nsw i64 %indvars.iv21, %indvars.iv, !dbg !30 %gep12 = getelementptr float, ptr getelementptr ([1000 x float], ptr @_QMmodEa, i64 -1, i64 999), i64 %29, !dbg !30 %30 = load float, ptr %gep12, align 4, !dbg !30, !tbaa !27 %31 = fmul fast float %30, %28, !dbg !30 %32 = fadd fast float %31, %26, !dbg !30 store float %32, ptr %24, align 4, !dbg !30, !tbaa !27 ;; this can be hoisted %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !33 %exitcond.not = icmp eq i64 %indvars.iv.next, %indvars.iv21, !dbg !26 br i1 %exitcond.not, label %._crit_edge, label %25, !dbg !26 ```

yus3710-fj · February 1, 2024, 4:04am

Please let me explain the current status of the investigation.

The issues of s314 and s3111 (#74263, #74264, #79257) will be resolved thanks to Tom and Alex.
The other issues are found to be related to BasicAA in LLVM.

It seems that the issue of s113 (#74262) should be resolved by fixing the range analysis in BasicAA.
On the other hand, however, the other issues seem to be caused by the difference in LLVM IR generated by Clang and Flang. For example, the subtraction from loop indices appears in LLVM IR from Flang and that influences BasicAA.

.lr.ph:                                           ; preds = %.lr.ph.preheader, %.lr.ph
  %indvars.iv = phi i64 [ 1, %.lr.ph.preheader ], [ %indvars.iv.next, %.lr.ph ]
  %21 = add nsw i64 %indvars.iv, -1 ;; this makes alias analysis harder
  %22 = getelementptr float, ptr @_QMmodEb, i64 %21
  %23 = load float, ptr %22, align 4, !tbaa !12
  %24 = getelementptr float, ptr @_QMmodEc, i64 %21
  %25 = load float, ptr %24, align 4, !tbaa !15
  %26 = getelementptr float, ptr @_QMmodEd, i64 %21
  %27 = load float, ptr %26, align 4, !tbaa !17
   :
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %sext = shl i64 %indvars.iv.next, 32
  %35 = ashr exact i64 %sext, 32
  %gep = getelementptr float, ptr getelementptr ([1000 x float], ptr @_QMmodEa, i64 -1, i64 999), i64 %35
  %36 = load float, ptr %gep, align 4, !tbaa !19
  %37 = fmul fast float %36, %27
  %38 = fadd fast float %37, %34
  store float %38, ptr %30, align 4, !tbaa !19
  %exitcond.not = icmp eq i64 %indvars.iv, %20
  br i1 %exitcond.not, label %._crit_edge.loopexit, label %.lr.ph

We might have to introduce a complicated analysis into BasicAA, but BasicAA hates such a complicated analysis due to the increase in the compilation time.
So, I think Flang should be fixed to generate LLVM IR like Clang. (e.g. introducing loop canonicalization into Flang)

In addition, the issue of s113 can also be resolved by loop canonicalization.
Therefore, I’m not sure which should be fixed, Flang or LLVM.
(IMHO, I think it’s not worth fixing BasicAA since loop canonicalization is necessary anyway.)

Any ideas/comments are welcome.

tblah · February 1, 2024, 10:21am

Thank you for your investigations.

There was discussion about improving flang address calculation for array indices here [RFC] Changes to fircg.xarray_coor CodeGen to allow better hoisting

IIRC the latest is that we want to try adding the no-signed-wrap flag to these calculations so that LLVM is more free to re-arrange these calculations and hoist more of it out of the loop. Unfortunately, I got pulled onto different work and haven’t had time to try this. I will get back to it but it won’t be in the near future, so feel free to pick it up if it has a higher priority for you (but check first if this does apply to your example because I was looking at a different benchmark).

The extra subtractions in flang’s calculations are required because arrays in fortran can have different starting indices and these ranges need to be adapted to match llvm 0-based indices. The hope is that with more information, LLVM could hoist the subtractions out of loops: resulting in simpler address calculations inside of the loops.

Another option would be to re-order the mathematical operations generated by flang to calculate the address so that they are easier for LLVM to optimize. This was rejected because commenters felt that LLVM should be able to do this without help.

There is more information about NSW (no signed wrap) in the LLVM language reference entries for integer arithmetic operations e.g. LLVM Language Reference Manual — LLVM 19.0.0git documentation

Support is already in upstream MLIR dialects: [RFC] Integer overflow flags support in `arith` dialect
[mlir][LLVM] Add nsw and nuw flags by tblah · Pull Request #74508 · llvm/llvm-project · GitHub

I made a start here: [flang][CodeGen] add nsw to address calculations by tblah · Pull Request #74709 · llvm/llvm-project · GitHub, the main thing still to do is adding nsw to loop index calculations.

yus3710-fj · February 2, 2024, 6:57am

Thank you for your comment, @tblah.

IIRC the latest is that we want to try adding the no-signed-wrap flag to these calculations so that LLVM is more free to re-arrange these calculations and hoist more of it out of the loop. Unfortunately, I got pulled onto different work and haven’t had time to try this. I will get back to it but it won’t be in the near future, so feel free to pick it up if it has a higher priority for you (but check first if this does apply to your example because I was looking at a different benchmark).

I tried checking if nsw flags work for the issues, but I’m afraid most of them were not resolved. (I found that only adding nsw flags didn’t help vectorization, and I think I should investigate further.)

Actually, I suspect the issue addressed in your RFC looks similar to the issues of TSVC but not the same. When the corresponding subscripts of all arrays in the loop are the same, alias analysis would work even if there are extra subtractions in address calculations. On the other hand, arrays in TSVC are different in their subscripts and it makes alias analysis harder. In addition, the order of the address calculation doesn’t seem to help the analysis.

The extra subtractions in flang’s calculations are required because arrays in fortran can have different starting indices and these ranges need to be adapted to match llvm 0-based indices. The hope is that with more information, LLVM could hoist the subtractions out of loops: resulting in simpler address calculations inside of the loops.

I agree that the extra subtractions are required for Fortran codes and LLVM should optimize the address calculation. However, I’m not sure LLVM has the potential to optimize the address calculations in the loops of TSVC.

yus3710-fj · February 5, 2024, 7:16am

I tried perfoming loop canonicalization for FIR manually and found that loop canonicalization wasn’t a cure-all for alias analysis.
I apologize that my opinion changes again and again. It seems to be premature for me to make a conclusion.

yus3710-fj · February 29, 2024, 8:36am

Now the causes of the remaining vectorization issues are determined.

s113
- the range analysis in BasicAA is weak (#74262)
- no nsw flag on the increment of do-variable
s118
- BasicAA doesn’t support a specific instructions pattern (#79258)
s243
- no nsw flag on the calculations of indicies (#78934)

I’m considering the solution for nsw flags at first.

IIUC, nsw flags allow us to assume that the result of the operation with the flag never overflows in terms of optimization. Conversely, nsw flags can be set on operations which are found not to overflow.
I found IntegerRangeAnalysis in MLIR but I’m not sure I can realize the above function with this analysis at the moment because it needs more interface methods for some FIR operations (e.g. fir::DoLoopOp::getSingleInductionVar() from LoopLikeOpInterface).
If you have good ideas, the comments are welcome.

yus3710-fj · October 3, 2024, 4:56am

I found that I missed some loops which are vectorized by Clang but not by Flang.

There are 3 issues; one is related to integer overflow, and the others are related to strided accesses. One of commenters told me that some of these issues could be resolved by the polyhedral model, but I’m not sure whether it can be realized because I’m not familiar with it.

integer overflow
- [Flang] TSVC s115: compiler doesn't vectorize the loop considering an initial value of do-variable might overflow · Issue #110609 · llvm/llvm-project · GitHub
strided accesses
- [Flang][LAA] TSVC s2101, s233: not vectorized because the extents of arrays are not constant · Issue #110611 · llvm/llvm-project · GitHub
- [Flang] TSVC s233: loop interchange is necessary for vectorization · Issue #110612 · llvm/llvm-project · GitHub

In addition, vectorization with masks didn’t seem to work when I rewrote explicit-shape arrays to deferred-shape arrays.

github.com/llvm/llvm-project

[Flang][LICM] deferred-shape arrays are not vectorized in some cases

opened 01:17AM - 01 Oct 24 UTC

yus3710-fj

loopoptim flang:ir

Flang can't vectorize some loops in [TSVC](https://www.netlib.org/benchmark/vect…ors) if arrays are `ALLOCATABLE`. For example, Flang can't vectorize the loop in `s271` of TSVC if I rewrite explicit-shape arrays to deferred-shape arrays. ```fortran ! s271_allocatable.f90 subroutine s271 (ld,n,a,b,c) implicit none integer ld, n, i real, allocatable :: a(:), b(:), c(:) ! added ALLOCATABLE attribute call init(ld,n,a,b,c,'s271 ') do i=1,n if (b(i) .gt. 0.) a(i) = a(i) + b(i) * c(i) end do call dummy(ld,n,a,b,c,1.) end subroutine s271 ``` ```console $ flang-new -v -O3 -flang-experimental-integer-overflow s271_allocatable.f90 -S -Rpass=vector -mcpu=a64fx flang-new version 20.0.0git (https://github.com/llvm/llvm-project.git 2c770675ce36402b51a320ae26f369690c138dc1) Target: aarch64-unknown-linux-gnu Thread model: posix InstalledDir: /path/to/build/bin Build config: +assertions Found candidate GCC installation: /usr/lib/gcc/aarch64-redhat-linux/11 Selected GCC installation: /usr/lib/gcc/aarch64-redhat-linux/11 Candidate multilib: .;@m64 Selected multilib: .;@m64 "/path/to/build/bin/flang-new" -fc1 -triple aarch64-unknown-linux-gnu -S -fcolor-diagnostics -mrelocation-model pic -pic-level 2 -pic-is-pie -target-cpu a64fx -target-feature +outline-atomics -target-feature +v8.2a -target-feature +aes -target-feature +complxnum -target-feature +crc -target-feature +fp-armv8 -target-feature +fullfp16 -target-feature +lse -target-feature +neon -target-feature +perfmon -target-feature +ras -target-feature +rdm -target-feature +sha2 -target-feature +sve -fversion-loops-for-stride -flang-experimental-integer-overflow -Rpass=vector -resource-dir /path/to/build/lib/clang/20 -mframe-pointer=non-leaf -O3 -o /dev/null -x f95-cpp-input s271_allocatable.f90 ``` The base addresses and the lower bounds of arrays aren't recognized as loop-invariant. ```llvm 11: ; preds = %.lr.ph, %25 %indvars.iv = phi i64 [ 1, %.lr.ph ], [ %indvars.iv.next, %25 ] ;; i %12 = sub nsw i64 %indvars.iv, %.unpack322.unpack.unpack ;; i - lbound(b,1) %13 = getelementptr float, ptr %.unpack266.pre, i64 %12 %14 = load float, ptr %13, align 4, !tbaa !12 %15 = fcmp fast ogt float %14, 0.000000e+00 ;; b(i) > 0 br i1 %15, label %16, label %25 16: ; preds = %11 %.unpack329 = load ptr, ptr %2, align 8, !tbaa !4 ;; a %.unpack343.unpack.unpack = load i64, ptr %.elt342, align 8, !tbaa !4 ;; lbound(a,1) %17 = sub nsw i64 %indvars.iv, %.unpack343.unpack.unpack ;; i - lbound(a,1) %18 = getelementptr float, ptr %.unpack329, i64 %17 %19 = load float, ptr %18, align 4, !tbaa !14 ;; a(i) %.unpack350 = load ptr, ptr %4, align 8, !tbaa !4 ;; c %.unpack364.unpack.unpack = load i64, ptr %.elt363, align 8, !tbaa !4 ;; lbound(c,1) %20 = sub nsw i64 %indvars.iv, %.unpack364.unpack.unpack ;; i - lbound(c,1) %21 = getelementptr float, ptr %.unpack350, i64 %20 %22 = load float, ptr %21, align 4, !tbaa !16 ;; c(i) %23 = fmul fast float %22, %14 %24 = fadd fast float %23, %19 store float %24, ptr %18, align 4, !tbaa !14 br label %25 25: ; preds = %16, %11 %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 %exitcond.not = icmp eq i64 %indvars.iv.next, %10 br i1 %exitcond.not, label %._crit_edge.loopexit, label %11 ``` If I move `%.unpack329`, `%.unpack343.unpack.unpack`, `%.unpack350` and `%.unpack364.unpack.unpack` outside the loop manually, the loop is vectorized.

Topic		Replies	Views
[RFC] Enabling the HLFIR lowering by default Flang	12	1554	November 19, 2023
Status of Flang's Optimization Flang	11	1366	December 4, 2023
[RFC] Split ConvertExpr.cpp Flang	6	318	June 26, 2023
MLIR LLVM-IR dialect -- status and lowering questions Flang	8	152	December 10, 2019
[RFC][HLFIR] Optimized Bufferization for elemental array updates Flang	8	462	August 4, 2023

HLFIR lowering

vs gfortran

Vectorization

Related topics