Performance analysis for TSVC

Hello.

As I said before, I’ve investigated the performance of TSVC on A64FX (a Fujitsu AArch64 CPU), and now I’d like to share the results.
The investigation is not completely finished yet, but I think the results so far are worth sharing with you.

HLFIR lowering

There was a strange performance issue with HLFIR lowering.
The assembly code of the innermost loop is identical to that produced by FIR lowering, but performance drops by 10% when HLFIR lowering is enabled.

I found that the performance depends on the order in which the files are passed to the linker.

$ cat second.f90
function second()
  real second
  call cpu_time(second)
end function

$ flang-new mains.f loops.f second.f90 -Ofast -flang-deprecated-no-hlfir
$ ./a.out | grep s4117
 s4117   10     2.301718    1.0241E+01    1.0241E+01                      121
 s4117  100     2.187275    1.0042E+02    1.0042E+02   7.5978E-08         121
 s4117 1000     1.901932    1.0004E+03    1.0004E+03   2.4403E-07         121
$ flang-new mains.f loops.f second.f90 -Ofast
$ ./a.out | grep s4117
 s4117   10     2.664087    1.0241E+01    1.0241E+01                      121
 s4117  100     2.319328    1.0042E+02    1.0042E+02   7.5978E-08         121
 s4117 1000     2.351143    1.0004E+03    1.0004E+03   2.4403E-07         121
$ flang-new loops.f mains.f second.f90 -Ofast
$ ./a.out | grep s4117
 s4117   10     2.217472    1.0241E+01    1.0241E+01                      121
 s4117  100     2.333374    1.0042E+02    1.0042E+02   7.5978E-08         121
 s4117 1000     1.900425    1.0004E+03    1.0004E+03   2.4403E-07         121

Some other loops show the same behavior.
I don’t know the root cause, but since the generated assembly is identical, I’d like to conclude that HLFIR lowering itself is not the problem.

vs gfortran

I also measured the performance of Gfortran.
According to the results below, Flang is 11% slower than Gfortran overall.
In particular, vectorization seems to make a big difference in their relative performance.

func    Gfortran [s]    Flang with FIR [s]    Flang with HLFIR [s]
ALL     1354.27         1525.36               1523.19
s111    5.572203        5.066218              5.052499
s112    10.112344(V)    4.050852(V)           3.947439(V)
s113    2.262772(V)     4.245930              4.716603
  :
vdotr   4.834167(V)     3.709595(V)           3.713380(V)
vbor    10.649109(V)    10.333374(V)          10.352844(V)
  • Version
    • Gfortran: 11.2.0
    • Flang: main(1c1227846425883a3d39ff56700660236a97152c)
  • Option: -Ofast
    • -falias-analysis is also specified for Flang
  • “(V)” means the loop is vectorized
    • checked with -fopt-info-vec-optimized for Gfortran and -Rpass=vector for Flang

Vectorization

I checked whether vectorization works well for TSVC (Test Suite for Vectorizing Compilers).
There are 135 loops in TSVC; Flang can vectorize 52 of them while Gfortran can vectorize 58.
I think the gap should be closed, and I’m now focusing on the loops that are vectorized when written in C but not when written in Fortran.
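For illustration, here is the kind of loop I mean, sketched in C in the style of TSVC’s s113 (a simplified version I wrote for this post, not the exact benchmark kernel):

```c
#include <stddef.h>

/* Simplified s113-style kernel: a[0] is loop-invariant because the loop
 * starts at i = 1, so once LICM hoists the load of a[0], the loop
 * vectorizes trivially. Clang vectorizes this C form; the Fortran
 * equivalent was not vectorized in my measurements. */
void s113_like(float *a, const float *b, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] = a[0] + b[i];   /* a[0] is read but never written here */
}
```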

Thanks @yus3710-fj for investigating the performance of TSVC with Flang.

Did you include the alias-tags pass (that propagates more alias info to LLVM) when you conducted the investigation?

I have done a similar exercise recently. I was interested in the C version and AArch64 codegen, and raised a bunch of issues.

For most issues a root-cause analysis is still missing, and I probably need to organise this better, but I raised the issues to create some awareness in case folks are interested.


Thank you for the information, @kiranchandramohan.

I didn’t include the alias-tags pass, so I measured again with -falias-analysis enabled and updated the data above.
The gap got smaller (14%->11%), but it is still too large to ignore.


@yus3710-fj I guess the pending issues are [LICM] TSVC s113: not vectorized because LICM doesn't work · Issue #74262 · llvm/llvm-project · GitHub and [Flang] TSVC s314, s3111: not vectorized because the loops are not recognized as reduction loops · Issue #74264 · llvm/llvm-project · GitHub. I was thinking that these fail to vectorise due to llvm issues, but it seems you are suggesting that flang might need to change the generated code so that llvm can vectorise the loops.

Thank you for your comment, and I apologize for my lack of explanation in the last call.

Those loops are not vectorized because the code generated by Flang isn’t amenable to LLVM’s optimizations.
I think there are two ways to resolve these issues: one is to fix the LLVM optimizations to handle the code generated by Flang, and the other is to fix Flang to generate code that LLVM can already optimize.
The former seems preferable, but I’m not yet sure whether LLVM can actually be fixed, since I haven’t finished the investigation.

I’ve found some more vectorization issues.

Let me explain the current status of the investigation.

The issues of s314 and s3111 (#74263, #74264, #79257) will be resolved thanks to Tom and Alex.
The other issues turned out to be related to BasicAA in LLVM.

It seems that the issue of s113 (#74262) could be resolved by fixing the range analysis in BasicAA.
However, the other issues seem to be caused by differences between the LLVM IR generated by Clang and by Flang. For example, a subtraction from the loop index appears in the LLVM IR from Flang, and that hinders BasicAA.

.lr.ph:                                           ; preds = %.lr.ph.preheader, %.lr.ph
  %indvars.iv = phi i64 [ 1, %.lr.ph.preheader ], [ %indvars.iv.next, %.lr.ph ]
  %21 = add nsw i64 %indvars.iv, -1 ;; this makes alias analysis harder
  %22 = getelementptr float, ptr @_QMmodEb, i64 %21
  %23 = load float, ptr %22, align 4, !tbaa !12
  %24 = getelementptr float, ptr @_QMmodEc, i64 %21
  %25 = load float, ptr %24, align 4, !tbaa !15
  %26 = getelementptr float, ptr @_QMmodEd, i64 %21
  %27 = load float, ptr %26, align 4, !tbaa !17
   :
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %sext = shl i64 %indvars.iv.next, 32
  %35 = ashr exact i64 %sext, 32
  %gep = getelementptr float, ptr getelementptr ([1000 x float], ptr @_QMmodEa, i64 -1, i64 999), i64 %35
  %36 = load float, ptr %gep, align 4, !tbaa !19
  %37 = fmul fast float %36, %27
  %38 = fadd fast float %37, %34
  store float %38, ptr %30, align 4, !tbaa !19
  %exitcond.not = icmp eq i64 %indvars.iv, %20
  br i1 %exitcond.not, label %._crit_edge.loopexit, label %.lr.ph

We might have to introduce a complicated analysis into BasicAA, but BasicAA deliberately avoids such complexity because of the impact on compilation time.
So I think Flang should be fixed to generate LLVM IR similar to Clang’s (e.g. by introducing loop canonicalization into Flang).

In addition, the issue of s113 can also be resolved by loop canonicalization.
Therefore, I’m not sure which should be fixed, Flang or LLVM.
(IMHO, it’s not worth fixing BasicAA, since loop canonicalization is necessary anyway.)
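To make the idea concrete, here is a rough sketch in C of what I mean by loop canonicalization (my own illustration; the real transformation would operate on FIR/LLVM IR, not source code):

```c
#include <stddef.h>

/* Naive lowering of a 1-based Fortran loop keeps an "i - 1" in every
 * address computation, which is the pattern that hinders BasicAA: */
void mul_1based(float *a, const float *b, const float *c, size_t n) {
    for (size_t i = 1; i <= n; i++)
        a[i - 1] = b[i - 1] * c[i - 1];
}

/* The canonical 0-based form folds the subtraction away entirely,
 * matching what Clang typically emits: */
void mul_0based(float *a, const float *b, const float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}
```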

Any ideas/comments are welcome.

Thank you for your investigations.

There was discussion about improving flang’s address calculations for array indices here: [RFC] Changes to fircg.xarray_coor CodeGen to allow better hoisting

IIRC the latest is that we want to try adding the no-signed-wrap flag to these calculations so that LLVM is more free to re-arrange these calculations and hoist more of it out of the loop. Unfortunately, I got pulled onto different work and haven’t had time to try this. I will get back to it but it won’t be in the near future, so feel free to pick it up if it has a higher priority for you (but check first if this does apply to your example because I was looking at a different benchmark).

The extra subtractions in flang’s calculations are required because arrays in Fortran can have arbitrary starting indices, and these ranges need to be adapted to LLVM’s 0-based indexing. The hope is that, with more information, LLVM could hoist the subtractions out of loops, resulting in simpler address calculations inside the loops.
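As a rough sketch of the hoisting we hope for (illustrative C I wrote for this post, not flang’s actual codegen; `lb`/`ub` stand for the Fortran array bounds):

```c
/* A Fortran array A(lb:ub) accessed as A(i) lowers to base + (i - lb).
 * Today the subtraction sits inside the loop: */
float sum_in_loop(const float *a, long lb, long ub) {
    float s = 0.0f;
    for (long i = lb; i <= ub; i++)
        s += a[i - lb];          /* "- lb" recomputed every iteration */
    return s;
}

/* With no-wrap guarantees, LLVM could fold "- lb" into the base pointer
 * once, outside the loop. (Note: forming a - lb is only valid C while
 * the pointer stays inside the same object; LLVM can reason about this
 * more freely at the IR level.) */
float sum_hoisted(const float *a, long lb, long ub) {
    const float *base = a - lb;  /* subtraction hoisted out of the loop */
    float s = 0.0f;
    for (long i = lb; i <= ub; i++)
        s += base[i];
    return s;
}
```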

Another option would be to re-order the mathematical operations generated by flang to calculate the address so that they are easier for LLVM to optimize. This was rejected because commenters felt that LLVM should be able to do this without help.

There is more information about NSW (no signed wrap) in the LLVM language reference entries for integer arithmetic operations e.g. LLVM Language Reference Manual — LLVM 19.0.0git documentation

Support is already in upstream MLIR dialects: [RFC] Integer overflow flags support in `arith` dialect
[mlir][LLVM] Add nsw and nuw flags by tblah · Pull Request #74508 · llvm/llvm-project · GitHub

I made a start here: [flang][CodeGen] add nsw to address calculations by tblah · Pull Request #74709 · llvm/llvm-project · GitHub, the main thing still to do is adding nsw to loop index calculations.


Thank you for your comment, @tblah.

IIRC the latest is that we want to try adding the no-signed-wrap flag to these calculations so that LLVM is more free to re-arrange these calculations and hoist more of it out of the loop. Unfortunately, I got pulled onto different work and haven’t had time to try this. I will get back to it but it won’t be in the near future, so feel free to pick it up if it has a higher priority for you (but check first if this does apply to your example because I was looking at a different benchmark).

I tried checking whether nsw flags resolve the issues, but I’m afraid most of them were not resolved. (Adding nsw flags alone didn’t enable vectorization, so I need to investigate further.)

Actually, I suspect the issue addressed in your RFC is similar to, but not the same as, the TSVC issues. When the corresponding subscripts of all the arrays in a loop are the same, alias analysis works even if there are extra subtractions in the address calculations. In TSVC, however, the arrays differ in their subscripts, and that makes alias analysis harder. In addition, reordering the address calculation doesn’t seem to help the analysis.

The extra subtractions in flang’s calculations are required because arrays in Fortran can have arbitrary starting indices, and these ranges need to be adapted to LLVM’s 0-based indexing. The hope is that, with more information, LLVM could hoist the subtractions out of loops, resulting in simpler address calculations inside the loops.

I agree that the extra subtractions are required for Fortran code and that LLVM should optimize the address calculations. However, I’m not sure LLVM is actually able to optimize the address calculations in the TSVC loops.

I tried performing loop canonicalization on the FIR manually and found that loop canonicalization isn’t a cure-all for alias analysis.
I apologize for changing my opinion repeatedly; it was premature of me to draw a conclusion.

The causes of the remaining vectorization issues have now been determined.

  • s113
    • the range analysis in BasicAA is weak (#74262)
    • no nsw flag on the increment of the do-variable
  • s118
    • BasicAA doesn’t support a specific instruction pattern (#79258)
  • s243
    • no nsw flag on the calculations of indices (#78934)

I’m considering a solution for the nsw flags first.

IIUC, an nsw flag allows the optimizer to assume that the result of the flagged operation never overflows. Conversely, nsw flags can be set on operations that are proven not to overflow.
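As a concrete (if toy) version of that idea, sketched in C rather than MLIR (`add_can_be_nsw` is a name I made up, not an existing API):

```c
#include <limits.h>
#include <stdbool.h>

/* Toy version of the range-based check: given a proven range [lo, hi]
 * for a do-variable and a positive constant step, "iv + step" can be
 * marked nsw iff it provably cannot overflow the signed type. */
bool add_can_be_nsw(long lo, long hi, long step) {
    if (step <= 0 || hi < lo)
        return false;             /* keep the sketch to the simple case */
    return hi <= LONG_MAX - step; /* hi + step does not wrap */
}
```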
I found IntegerRangeAnalysis in MLIR, but I’m not sure it can be used for this at the moment, because some FIR operations would need additional interface methods (e.g. fir::DoLoopOp::getSingleInductionVar() from LoopLikeOpInterface).
If you have good ideas, comments are welcome.


I found that I had missed some loops that are vectorized by Clang but not by Flang.

There are 3 issues: one is related to integer overflow, and the others are related to strided accesses. One of the commenters told me that some of these issues could be resolved by the polyhedral model, but I’m not sure whether that is feasible because I’m not familiar with it.

In addition, masked vectorization didn’t seem to work when I rewrote explicit-shape arrays as deferred-shape arrays.
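For reference, this is the kind of loop that requires vectorization with masks, because the store is conditional (a minimal example I wrote, not one of the TSVC kernels):

```c
#include <stddef.h>

/* Conditional store: only some vector lanes are active on each
 * iteration, so the vectorizer must emit a masked store (or
 * if-convert the loop). */
void masked_update(float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (b[i] > 0.0f)
            a[i] = b[i] * 2.0f;
}
```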