Performance analysis for TSVC

yus3710-fj · October 3, 2024, 4:56am

I found that I had missed some loops that are vectorized by Clang but not by Flang.

There are 3 issues: one is related to integer overflow, and the other two are related to strided accesses. One of the commenters told me that some of these issues could be resolved by the polyhedral model, but I’m not sure whether that is feasible because I’m not familiar with it.

integer overflow
- [Flang] TSVC s115: compiler doesn't vectorize the loop considering an initial value of do-variable might overflow · Issue #110609 · llvm/llvm-project · GitHub
strided accesses
- [Flang][LAA] TSVC s2101, s233: not vectorized because the extents of arrays are not constant · Issue #110611 · llvm/llvm-project · GitHub
- [Flang] TSVC s233: loop interchange is necessary for vectorization · Issue #110612 · llvm/llvm-project · GitHub
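For readers unfamiliar with why loop interchange matters for strided accesses, here is a minimal sketch in C. It is a generic illustration, not the TSVC s233 kernel, and the extent `N` and all function names are hypothetical. C arrays are row-major, so the last index is the contiguous one; in Fortran, which is column-major, the roles of the indices are reversed.

```c
#include <assert.h>
#include <stddef.h>

#define N 8  /* hypothetical extent, chosen only for this sketch */

/* Inner loop over i: consecutive iterations touch a[i][j] and a[i+1][j],
   which are N floats apart in row-major C (a strided access). */
void add_strided(float a[N][N], float b[N][N]) {
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] += b[i][j];
}

/* Same computation after loop interchange: the inner loop over j touches
   adjacent floats (unit stride). The swap is legal here because no
   iteration reads a value written by another iteration. */
void add_interchanged(float a[N][N], float b[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] += b[i][j];
}

/* Self-check: both loop orders must give identical results. */
int interchange_matches(void) {
    float x[N][N], y[N][N], b[N][N];
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            x[i][j] = y[i][j] = (float)(i + j);
            b[i][j] = (float)(i * N + j);
        }
    add_strided(x, b);
    add_interchanged(y, b);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            assert(x[i][j] == y[i][j]);
    return 1;
}
```

Whether a given compiler performs this interchange automatically depends on its pass pipeline and on what it can prove about the array extents, which is the subject of the issues listed above.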

In addition, vectorization with masks didn’t seem to work when I rewrote explicit-shape arrays to deferred-shape arrays.

github.com/llvm/llvm-project

[Flang][LICM] deferred-shape arrays are not vectorized in some cases

opened 01:17AM - 01 Oct 24 UTC

yus3710-fj

loopoptim flang:ir

Flang can't vectorize some loops in [TSVC](https://www.netlib.org/benchmark/vect…ors) if arrays are `ALLOCATABLE`. For example, Flang can't vectorize the loop in `s271` of TSVC if I rewrite explicit-shape arrays to deferred-shape arrays.

```fortran
! s271_allocatable.f90
subroutine s271 (ld,n,a,b,c)
  implicit none
  integer ld, n, i
  real, allocatable :: a(:), b(:), c(:)  ! added ALLOCATABLE attribute
  call init(ld,n,a,b,c,'s271 ')
  do i=1,n
    if (b(i) .gt. 0.) a(i) = a(i) + b(i) * c(i)
  end do
  call dummy(ld,n,a,b,c,1.)
end subroutine s271
```

```console
$ flang-new -v -O3 -flang-experimental-integer-overflow s271_allocatable.f90 -S -Rpass=vector -mcpu=a64fx
flang-new version 20.0.0git (https://github.com/llvm/llvm-project.git 2c770675ce36402b51a320ae26f369690c138dc1)
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /path/to/build/bin
Build config: +assertions
Found candidate GCC installation: /usr/lib/gcc/aarch64-redhat-linux/11
Selected GCC installation: /usr/lib/gcc/aarch64-redhat-linux/11
Candidate multilib: .;@m64
Selected multilib: .;@m64
 "/path/to/build/bin/flang-new" -fc1 -triple aarch64-unknown-linux-gnu -S -fcolor-diagnostics -mrelocation-model pic -pic-level 2 -pic-is-pie -target-cpu a64fx -target-feature +outline-atomics -target-feature +v8.2a -target-feature +aes -target-feature +complxnum -target-feature +crc -target-feature +fp-armv8 -target-feature +fullfp16 -target-feature +lse -target-feature +neon -target-feature +perfmon -target-feature +ras -target-feature +rdm -target-feature +sha2 -target-feature +sve -fversion-loops-for-stride -flang-experimental-integer-overflow -Rpass=vector -resource-dir /path/to/build/lib/clang/20 -mframe-pointer=non-leaf -O3 -o /dev/null -x f95-cpp-input s271_allocatable.f90
```

The base addresses and the lower bounds of arrays aren't recognized as loop-invariant.

```llvm
11:                                               ; preds = %.lr.ph, %25
  %indvars.iv = phi i64 [ 1, %.lr.ph ], [ %indvars.iv.next, %25 ]  ;; i
  %12 = sub nsw i64 %indvars.iv, %.unpack322.unpack.unpack  ;; i - lbound(b,1)
  %13 = getelementptr float, ptr %.unpack266.pre, i64 %12
  %14 = load float, ptr %13, align 4, !tbaa !12
  %15 = fcmp fast ogt float %14, 0.000000e+00  ;; b(i) > 0
  br i1 %15, label %16, label %25

16:                                               ; preds = %11
  %.unpack329 = load ptr, ptr %2, align 8, !tbaa !4  ;; a
  %.unpack343.unpack.unpack = load i64, ptr %.elt342, align 8, !tbaa !4  ;; lbound(a,1)
  %17 = sub nsw i64 %indvars.iv, %.unpack343.unpack.unpack  ;; i - lbound(a,1)
  %18 = getelementptr float, ptr %.unpack329, i64 %17
  %19 = load float, ptr %18, align 4, !tbaa !14  ;; a(i)
  %.unpack350 = load ptr, ptr %4, align 8, !tbaa !4  ;; c
  %.unpack364.unpack.unpack = load i64, ptr %.elt363, align 8, !tbaa !4  ;; lbound(c,1)
  %20 = sub nsw i64 %indvars.iv, %.unpack364.unpack.unpack  ;; i - lbound(c,1)
  %21 = getelementptr float, ptr %.unpack350, i64 %20
  %22 = load float, ptr %21, align 4, !tbaa !16  ;; c(i)
  %23 = fmul fast float %22, %14
  %24 = fadd fast float %23, %19
  store float %24, ptr %18, align 4, !tbaa !14
  br label %25

25:                                               ; preds = %16, %11
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, %10
  br i1 %exitcond.not, label %._crit_edge.loopexit, label %11
```

If I move `%.unpack329`, `%.unpack343.unpack.unpack`, `%.unpack350` and `%.unpack364.unpack.unpack` outside the loop manually, the loop is vectorized.
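The manual fix described in the issue can be modelled at source level with a small C sketch. This is only an illustration under stated assumptions: `desc_t` is a hypothetical stand-in for an array descriptor and is not Flang's real descriptor layout, and the function names are invented for this example. The first function mirrors the shape of the IR (the fields for `a` and `c` are read inside the conditional block on every iteration), and the second mirrors the hand-hoisted version.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an array descriptor (not Flang's real layout). */
typedef struct {
    float *base;       /* base address of the data */
    ptrdiff_t lbound;  /* lower bound of dimension 1 */
} desc_t;

/* Mirrors the IR: b's fields are read before the loop, while a's and c's
   fields are read again inside the conditional block on each iteration. */
void s271_reload(const desc_t *a, const desc_t *b, const desc_t *c, ptrdiff_t n) {
    float *pb = b->base;
    ptrdiff_t lb = b->lbound;
    for (ptrdiff_t i = 1; i <= n; i++) {
        float bi = pb[i - lb];
        if (bi > 0.0f) {
            float *pa = a->base;       /* read again on each taken branch */
            ptrdiff_t la = a->lbound;
            float *pc = c->base;
            ptrdiff_t lc = c->lbound;
            pa[i - la] += bi * pc[i - lc];
        }
    }
}

/* Mirrors the manual fix: every base address and lower bound is read once,
   before the loop. This is valid only because nothing inside the loop
   modifies the descriptors. */
void s271_hoisted(const desc_t *a, const desc_t *b, const desc_t *c, ptrdiff_t n) {
    float *pa = a->base, *pb = b->base, *pc = c->base;
    ptrdiff_t la = a->lbound, lb = b->lbound, lc = c->lbound;
    for (ptrdiff_t i = 1; i <= n; i++) {
        if (pb[i - lb] > 0.0f)
            pa[i - la] += pb[i - lb] * pc[i - lc];
    }
}

/* Self-check: both variants must compute the same result on a small input. */
int hoisting_matches(void) {
    enum { LEN = 8 };
    float a1[LEN], a2[LEN], b[LEN], c[LEN];
    for (int k = 0; k < LEN; k++) {
        a1[k] = a2[k] = (float)k;
        b[k] = (k % 2) ? 1.0f : -1.0f;  /* mix of taken and skipped updates */
        c[k] = 2.0f;
    }
    desc_t da1 = { a1, 1 }, da2 = { a2, 1 }, db = { b, 1 }, dc = { c, 1 };
    s271_reload(&da1, &db, &dc, LEN);
    s271_hoisted(&da2, &db, &dc, LEN);
    for (int k = 0; k < LEN; k++)
        assert(a1[k] == a2[k]);
    return 1;
}
```

The sketch says nothing about how a C compiler optimizes either function; it only shows which loads the issue proposes to treat as loop-invariant.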
