SNAP Performance analysis, more detailed than the presentation

Introduction

The SNAP application is used by Los Alamos National Laboratory as a proxy benchmark standing in for their real applications. More about SNAP can be found here

This was a “first pass, look for low-hanging fruit” analysis at this point in time, January 2022.

Configuration

Benchmarking was done on both AArch64 (an AWS Graviton 2, M6g instance) and a Lenovo P720 x86-64 workstation.

The SNAP application is compiled with OpenMP enabled, MPI disabled, and tests were run with a single thread, using export OMP_NUM_THREADS=1 for simplicity.

A run with gfortran on the same platform is used as the reference, with a relative value of 1.0, alongside the time in seconds. Lower numbers are better; for example, a relative value of 7.3 means the run took 7.3 times as long as the gfortran run on the same platform.

All measurements are done using the input file qasnap/mms_src/2d_mms_st.inp . To make the existing benchmark run long enough for some decent analysis, the input file was modified to increase the nx and ny values from 20 to 80. Also, as MPI is disabled, the value of npey is changed to 1. There is no particular reason for choosing this file, other than it is one that we’ve used to show SNAP is working, and it runs for a reasonable amount of time after the modification.
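For reference, the relevant part of the modified input looks roughly like this (SNAP reads its parameters from a NAMELIST-style block; the surrounding parameters are elided and the exact contents of 2d_mms_st.inp may differ - this sketch is only meant to show which values were changed):

&invar
  ...
  npey=1      ! MPI is disabled, so a single process in y
  nx=80       ! increased from 20
  ny=80       ! increased from 20
  ...
/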

All of the perf percentages etc. are recorded on an AWS machine, but the relative changes are very similar on both AArch64 and x86-64.

The benchmarks were run multiple times to confirm stable values; the exact time can vary a little in the third digit. The time recorded is the total execution time from the snap-output file. The times given are cumulative, so optimisation 2 also includes the improvement of optimisation 1, and so on.

When building with gfortran the executable is called gsnap; when building with flang-new the executable is called fsnap.

Baseline

The baseline values are in the table below:

Compiler       AArch64 time (s)  Relative  x86-64 time (s)  Relative
GFortran       2.47              1.0       2.43             1.0
Flang-new      18.1              7.3       12.1             5.0
Classic Flang  2.18              0.88      N/A              N/A

Using perf record ./gsnap to measure where the time is spent, we get the following top 6 functions:

  48.32%  gsnap    gsnap                 __dim3_sweep_module_MOD_dim3_sweep
  23.94%  gsnap    gsnap                 __mms_module_MOD_mms_src_1._omp_fn.0
   3.69%  gsnap    libc-2.31.so          __GI___printf_fp_l
   2.18%  gsnap    libc-2.31.so          __vfprintf_internal
   2.16%  gsnap    libc-2.31.so          hack_digit
   1.80%  gsnap    gsnap                 __expxs_module_MOD_expxs_slgg

The same for perf record ./fsnap :

  54.58%  fsnap    fsnap             _QMmms_modulePmms_src_1..omp_par
  19.26%  fsnap    fsnap             Fortran::runtime::DoTotalReduction<double, Fortran::runtime::RealSumAcc
  15.35%  fsnap    fsnap             _QMdim3_sweep_modulePdim3_sweep
   2.91%  fsnap    fsnap             _FortranASumReal8
   1.64%  fsnap    libc-2.31.so       _int_free
   0.76%  fsnap    libc-2.31.so       malloc

The conclusion here is that significantly more time is spent in the mms_src_1 parallel do section. On analysis, it spends nearly all of its time in the innermost loop:

Inner loop in mms_src_1

DO ll = 1, lma(l)
  qim(m,i,j,k,n,g) = qim(m,i,j,k,n,g) - ec(m,lm,n)*&
    slgg(mat(i,j,k),l,gp,g)*ref_fluxm(lm-1,i,j,k,g)
  lm = lm + 1
END DO

Looking closer at that part of the code, the address calculation for qim(m, i, j, k, n, g) is done on every iteration, and the same goes for the addresses of the elements of the other arrays. The gfortran version doesn’t do that; it simply adds an offset for each element of the array. Manually editing the MLIR generated by Flang and moving all but the final step out of the loop gives much better performance. The MLIR below was produced using flang-new -fc1 -emit-mlir

mms.mlir

%632 = fir.load %22 : !fir.ref<!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>>
%c0_103 = arith.constant 0 : index
%633:3 = fir.box_dims %632, %c0_103 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>, index) -> (index, index, index)
%c1_104 = arith.constant 1 : index
%634:3 = fir.box_dims %632, %c1_104 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>, index) -> (index, index, index)
%c2_105 = arith.constant 2 : index
%635:3 = fir.box_dims %632, %c2_105 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>, index) -> (index, index, index)
%c3_106 = arith.constant 3 : index
%636:3 = fir.box_dims %632, %c3_106 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>, index) -> (index, index, index)
%c4_107 = arith.constant 4 : index
%637:3 = fir.box_dims %632, %c4_107 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>, index) -> (index, index, index)
%c5_108 = arith.constant 5 : index
%638:3 = fir.box_dims %632, %c5_108 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>, index) -> (index, index, index)
%639 = fir.box_addr %632 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>) -> !fir.heap<!fir.array<?x?x?x?x?x?xf64>>
%640 = fir.convert %639 : (!fir.heap<!fir.array<?x?x?x?x?x?xf64>>) -> !fir.ref<!fir.array<?xf64>>
%c1_109 = arith.constant 1 : index
%c0_110 = arith.constant 0 : index
%641 = fir.load %89 : !fir.ref<i32>
%642 = fir.convert %641 : (i32) -> i64
%643 = fir.convert %642 : (i64) -> index
%644 = arith.subi %643, %633#0 : index
%645 = arith.muli %c1_109, %644 : index
%646 = arith.addi %645, %c0_110 : index
%647 = arith.muli %c1_109, %633#1 : index
%648 = fir.load %90 : !fir.ref<i32>
%649 = fir.convert %648 : (i32) -> i64
%650 = fir.convert %649 : (i64) -> index
%651 = arith.subi %650, %634#0 : index
%652 = arith.muli %647, %651 : index
%653 = arith.addi %652, %646 : index
%654 = arith.muli %647, %634#1 : index
%655 = fir.load %91 : !fir.ref<i32>
%656 = fir.convert %655 : (i32) -> i64
%657 = fir.convert %656 : (i64) -> index
%658 = arith.subi %657, %635#0 : index
%659 = arith.muli %654, %658 : index
%660 = arith.addi %659, %653 : index
%661 = arith.muli %654, %635#1 : index
%662 = fir.load %92 : !fir.ref<i32>
%663 = fir.convert %662 : (i32) -> i64
%664 = fir.convert %663 : (i64) -> index
%665 = arith.subi %664, %636#0 : index
%666 = arith.muli %661, %665 : index
%667 = arith.addi %666, %660 : index
%668 = arith.muli %661, %636#1 : index
%669 = fir.load %99 : !fir.ref<i32>
%670 = fir.convert %669 : (i32) -> i64
%671 = fir.convert %670 : (i64) -> index
%672 = arith.subi %671, %637#0 : index
%673 = arith.muli %668, %672 : index
%674 = arith.addi %673, %667 : index
%675 = arith.muli %668, %637#1 : index
%676 = fir.load %101 : !fir.ref<i32>
%677 = fir.convert %676 : (i32) -> i64
%678 = fir.convert %677 : (i64) -> index
%679 = arith.subi %678, %638#0 : index
%680 = arith.muli %675, %679 : index
%681 = arith.addi %680, %674 : index
%682 = fir.coordinate_of %640, %681 : (!fir.ref<!fir.array<?xf64>>, index) -> !fir.ref<f64>
%683 = fir.load %682 : !fir.ref<f64>

This is JUST the qim element address calculation; those three lines inside the loop turn into 230 lines of MLIR. That is not unreasonable in itself.
The real problem is that the next stages of compilation ought to simplify this, move it out of the loop, and convert it to a simpler form. Looking at the assembler output from this function, it still has several dozen instructions in the innermost loop. The three lines of Fortran result in around 120 lines of assembler on x86-64, and about 100 instructions on AArch64 (AArch64 is a little shorter thanks to the MADD instruction, which helps the sequences of arith.muli and arith.addi that occur several times in each address calculation).
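To make the intent concrete, at Fortran source level the transformation we want the compiler to perform is roughly the following (a hand-written sketch with hypothetical locals q and sl, both REAL(r_knd); the real change has to happen in FIR/LLVM-IR, and it is only valid if qim does not alias ec, slgg or ref_fluxm - exactly the aliasing fact the compiler currently cannot establish):

! All of qim's and slgg's indices are invariant in this loop, so their
! address calculations (and the slgg load) can be done once, up front.
q  = qim(m,i,j,k,n,g)
sl = slgg(mat(i,j,k),l,gp,g)
DO ll = 1, lma(l)
  q  = q - ec(m,lm,n)*sl*ref_fluxm(lm-1,i,j,k,g)
  lm = lm + 1
END DO
qim(m,i,j,k,n,g) = q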

Optimisation 1: Move index/address calculations out of inner loop

At first, I tried various ways of optimizing the code: fir-opt with a variety of arguments, LLVM’s opt with a variety of arguments, as well as clang -O3. All of those made little or no difference to the overall time - taking a tenth off here or there - and inspecting the code made it clear that the address calculation was still inside the main loop.

I eventually took Kiran’s suggestion, edited the FIR (MLIR) output, moved the majority of the address calculations out of the innermost loop, and then compiled using tco and clang with default optimisation. This gives us these times:

Compiler   AArch64 time (s)  Relative  x86-64 time (s)  Relative
Flang-new  13.8              5.6       10.0             4.1

Now the perf report looks like this:

  30.62%  fsnap  fsnap         [.] Fortran::runtime::DoTotalReduction<double, Fortran::runtime::RealSumAcc
  27.33%  fsnap  fsnap         [.] _QMmms_modulePmms_src_1..omp_par
  24.27%  fsnap  fsnap         [.] _QMdim3_sweep_modulePdim3_sweep
   4.80%  fsnap  fsnap         [.] _FortranASumReal8
   2.92%  fsnap  libc-2.31.so  [.] _int_free
   1.20%  fsnap  libc-2.31.so  [.] malloc

Conclusion and next steps

Conclusion: The compiler ought to be able to move the address calculation for elements in multidimensional arrays out of the respective loops.

Next steps:

  • Small reproducer: multi.f90 & cmulti.cpp (a sketch of the Fortran side is shown after this list)
  • Ticket on fir-dev: #1466
  • Ticket summary: Figure out what needs to be done to make the compiler hoist the address calculations out of the loop. The current understanding is that LLVM passes aren’t optimizing the LLVM-IR generated because it doesn’t understand that the descriptors aren’t being changed by the code that writes to the content of the actual array. This is likely due to LLVM not having alias information.
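
I don’t have the reproducer in front of me, but a minimal reproducer for this issue only needs a multi-dimensional allocatable array indexed inside nested loops, something along these lines (hypothetical code, not the actual multi.f90):

program multi
  implicit none
  real(8), allocatable :: a(:,:,:,:)
  real(8) :: s
  integer :: i, j, k, l

  allocate(a(10,10,10,10))
  a = 1.0d0
  s = 0.0d0
  do l = 1, 10
    do k = 1, 10
      do j = 1, 10
        do i = 1, 10
          ! Each access re-derives the element address from the array
          ! descriptor; the invariant parts should be hoisted out of
          ! the inner loops.
          s = s + a(i,j,k,l)
        end do
      end do
    end do
  end do
  print *, s
end program multi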

Optimisation 2: Sums, reduction function

At this point, we can look at the “new biggest time user”, the Flang library function DoTotalReduction . Using GDB, I set a breakpoint at the address reported by perf and then looked at the call stack. There are a few dozen calls from random other places, but eventually it becomes clear that the majority of calls come from the dim3_sweep function, and writing a replacement for the several calls to SUM gives a substantial improvement. sum1d is called with 12 elements and sum2d is called with a 48-element array, since nang=12 for this test. With the source of the functions available, the compiler inlines the code of sum2d and sum1d

Sum functions

FUNCTION sum2d(arr)
  REAL(r_knd), DIMENSION(nang,4), INTENT(IN) :: arr
  REAL(r_knd) :: sum2d
  REAL(r_knd) :: res
  INTEGER :: i, j
  res = 0
  do i = 1, nang
    do j = 1, 4
      res = res + arr(i,j)
    end do
  end do
  sum2d = res
end function sum2d

FUNCTION sum1d(arr)
  REAL(r_knd), DIMENSION(nang), INTENT(IN) :: arr
  REAL(r_knd) :: sum1d
  REAL(r_knd) :: res
  INTEGER :: i
  res = 0
  do i = 1, nang
    res = res + arr(i)
  end do
  sum1d = res
end function sum1d
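
The call sites in dim3_sweep then change from the SUM intrinsic to the specialised helpers, along these lines (the variable names below are placeholders for illustration, not the actual SNAP code):

! Before: each SUM goes through _FortranASumReal8 / DoTotalReduction
total1 = SUM( vec )     ! vec is REAL(r_knd), DIMENSION(nang)
total2 = SUM( arr )     ! arr is REAL(r_knd), DIMENSION(nang,4)

! After: small, specialised functions that the compiler can inline
total1 = sum1d( vec )
total2 = sum2d( arr )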

The summary of the perf report now looks like this:

  33.39%  fsnap  fsnap         _QMmms_modulePmms_src_1..omp_par
  27.26%  fsnap  fsnap         _QMdim3_sweep_modulePdim3_sweep
  26.15%  fsnap  fsnap         _QMdim3_sweep_modulePsum1d
   2.21%  fsnap  libc-2.31.so  _int_free
   1.52%  fsnap  libc-2.31.so  malloc
   1.04%  fsnap  fsnap         Fortran::runtime::DoTotalReduction<double, Fortran::runtime::NumericExtr

This produces the following times:

Compiler   AArch64 time (s)  Relative  x86-64 time (s)  Relative
Flang-new  9.2               3.7       6.4              2.6

Conclusion and next steps

Conclusion: the compiler should inline (small) calls to FortranASum to avoid the overhead of the function calls.

Next steps:

  • Write small reproducer: sum.f90 (a sketch is shown after this list)
  • Ticket in fir-dev: #1490
  • Ticket summary: Write some sort of “replace calls to this library function with inline call” pass.
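
A reproducer for this only needs a SUM over a small array in a hot loop, roughly as below (a sketch, not necessarily what ends up in sum.f90):

program sumrepro
  implicit none
  integer, parameter :: nang = 12
  real(8) :: a(nang), s
  integer :: rep

  call random_number(a)
  s = 0.0d0
  do rep = 1, 10000000
    a(1) = a(1) + 1.0d-9   ! keep the sum from being hoisted out of the loop
    ! With flang-new each iteration calls _FortranASumReal8; for nang=12
    ! the call overhead dominates the actual work.
    s = s + sum(a)
  end do
  print *, s
end program sumrepro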

Optimisation 3: Move fir.allocmem/fir.freemem out of loops
While looking at the dim3_sweep FIR code from flang-new, it became clear that there are several calls to fir.allocmem and fir.freemem within the main loop. Moving those out of the loop by hand in the MLIR code for that function gives us a bit more gain.
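
In Fortran source terms, the effect of hoisting those fir.allocmem/fir.freemem pairs corresponds roughly to the difference below (a hand-written analogy with made-up names; the actual change was done in the MLIR for dim3_sweep, not in the Fortran source):

subroutine alloc_in_loop(nsteps, nang)
  integer, intent(in) :: nsteps, nang
  real(8), allocatable :: tmp(:)
  integer :: n

  ! Before: one allocate/deallocate (fir.allocmem/fir.freemem) per iteration
  do n = 1, nsteps
    allocate(tmp(nang))
    tmp = real(n, 8)           ! ... work using tmp ...
    deallocate(tmp)
  end do

  ! After: allocate once outside the loop, reuse the buffer, free once
  allocate(tmp(nang))
  do n = 1, nsteps
    tmp = real(n, 8)           ! ... same work ...
  end do
  deallocate(tmp)
end subroutine alloc_in_loop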

Now the perf report shows this:

49.74% fsnap fsnap _QMmms_modulePmms_src_1..omp_par
36.94% fsnap fsnap _QMdim3_sweep_modulePdim3_sweep
1.61% fsnap fsnap Fortran::runtime::DoTotalReduction<double, Fortran::runtime::NumericEx
1.43% fsnap fsnap _QMexpxs_modulePexpxs_slgg
1.40% fsnap fsnap Fortran::runtime::DoTotalReduction<double, Fortran::runtime::NumericEx
1.05% fsnap fsnap Fortran::decimal::BigRadixFloatingPointNumber<53, 16>::ConvertToDecima

And the time taken is:

Compiler   AArch64 time (s)  Relative  x86-64 time (s)  Relative
Flang-new  7.3               3.0       4.6              1.9

Conclusion and next steps
Conclusion: the compiler should move memory allocations out of inner loops. Ideally, if the size is constant(ish), using alloca would be better; the allocations are all small in this example.
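
At source level, the alloca case corresponds to using an automatic (stack) array instead of an allocatable when the extent is known on entry, e.g. (illustrative only):

subroutine stack_buffer(nang)
  integer, intent(in) :: nang
  real(8) :: tmp(nang)   ! automatic array: typically stack/alloca, no malloc/free
  tmp = 0.0d0
  ! ... use tmp ...
end subroutine stack_buffer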

Next steps:

  • Write small reproducer
  • Ticket in fir-dev: #1500
  • Ticket summary: A) hoist fir.allocmem out of loops where possible, and B) use alloca for small fir.allocmem allocations (the reverse of “Add memory allocation optimization pass to the tool pipeline” by schweitzpgi · Pull Request #1355 · flang-compiler/f18-llvm-project · GitHub)

Using clang -O3 on the hand-optimised code

Adding -O3 to the clang compilation of the LLVM-IR - only on the modified dim3_sweep.ll and mms.ll - gives a little more improvement:

Compiler   AArch64 time (s)  Relative  x86-64 time (s)  Relative
Flang-new  6.4               2.6       4.6              1.8

The AArch64 machine appears to get a bit more congested (or has smaller caches?), so the mms_src_1 function takes up a bigger portion of the overall time than in the x86-64 runs.

Summary table

This table shows all the steps in one place, as well as the percentage improvement of each step over the previous one.

Compiler Description AArch64 time (s) Relative Improvement x86-64 time (s) Relative Improvement
GFortran Baseline 2.47 1.0 2.43 1.0
Flang-new Baseline 18.1 7.3 0% 12.1 5.0 0%
Flang-new Hoist address calc 13.8 5.6 24% 10.0 4.1 17%
Flang-new Inline SUM 9.2 3.7 33% 6.4 2.6 36%
Flang-new Hoist allocations 7.3 3.0 21% 4.6 1.9 28%
Flang-new Use clang -O3 6.4 2.6 12% 4.6 1.8 4.3%

Nice analysis! Thanks for reporting this, that was interesting to read :slight_smile:

I spent some time looking at alias analysis and such, but in the end I couldn’t find a manageable way of hand-modifying the LLVM-IR to add alias info.

Instead, I hacked ScopedNoAliasAA to always say NoAlias. Compiling SNAP completely with the modified compiler didn’t work, but selectively compiling the “interesting” modules took us from about 12.2s to 5.7s (this includes the Sum modifications).

The point here is that we can make some significant improvements through alias analysis - we just need to identify where it is correct to say “this does not alias”, which is what currently doesn’t work :slight_smile:


Quick question. So when new flang generates code, it runs the “O3” pipeline in the middle end and backend (via opt/llc/clang) right? Can you share the IR we generate from MLIR?

Currently, flang-new (i.e. the driver) only runs the bare minimum to generate machine code (i.e. no optimisations).

At least inlining and address hoisting should happen if we pass -O3 to clang/opt, no? Would be interesting to check. Allocation hoisting is something on my todo list already.

The logic for generating the machine code (i.e. driving the LLVM) is implemented in Clang’s frontend driver and hence not shared with Flang. And that’s where pass pipelines are defined, e.g. -O3. So, even though Clang and opt have this stuff well defined, there’s just the bare minimum in Flang.

We could clone -O3 from Clang and see what happens. I guess that we would need to do this for both the middle end and the backend. However, AFAIK Mats tried generating code with opt -O3 (and clang -O3) and that didn’t help as much as we hoped (i.e. flang-new -S -emit-llvm file.f90 -o - | opt -O3 -o - | clang -O3 -o file.o).

So, when I was compiling my hand-modified LLVM-IR, I used clang with default optimisation, then I used -O3 to “see what difference it makes”.

The main thing with this was to show that “there’s a little more to get out of the compiler”.

And in reply to your second comment: yes, clang will inline and hoist some things, but it apparently doesn’t cope with something in flang’s generated IR when it comes to the alias analysis, and “gives up”. Plausibly because it simply takes too much depth or whatever the “limit” is - but that’s a guess, I’m still figuring things out in this part of compilers - my past experience has mostly been “call these functions in LLVM, and out comes code that works” :wink:

What are clang’s default optimizations? Or better, how did you use those?

From memory, I was just using clang - or maybe clang -O1 - it was a while ago, and I used a small script that I have since modified several times, so I’m not sure of the exact details, I’m afraid.

the latter is “ok”, the former is equivalent to -O0 and actually not optimizing anything. It is in fact trying to not modify much at all :wink:

The original attempt was to use a chain of opt -O3 and clang -O3, as Andrzej described. That gave ~2-3% improvement over the “no optimisation” that flang currently supports - because there’s little to inline and move around that improves the code.

The main problem being that this does not hoist the address/index calculations from the multi-dimensional arrays in mms.f90 out of the loops - and that’s a huge part of the overall time in this case.

[It also prevents unrolling of the loop and vectorizing, which further makes it less efficient, but hoisting on its own is a good improvement].

I’ll update with an example of using clang and/or opt on the mms.f90, with details of how much/little it did - without the forced NoAlias.

So, I repeated my experiment using clang to optimise the LLVM-IR from flang. Essentially flang-new -fc1 -emit-llvm -fopenmp something.f90 and then clang -O3 something.ll, followed by make to link.

Without NoAlias hack:

experiment Relative performance (lower = better)
Flang-new 100%
Clang -O3 mms.ll 96.75%
Clang -O3 dim3_sweep.ll 96.64%
sum1d + sum2d + Clang -O3 dim3_sweep.ll 60.00%

With NoAlias hack:

experiment Relative performance (lower = better)
make clean fsnap 100%
Clang -O3 mms.ll 76.45%
Clang -O3 dim3_sweep.ll 95.14%
sum1d + sum2d + Clang -O3 dim3_sweep.ll 46.59%

Only tested on my x86-64 desktop machine.


Also tried comparing opt and clang, both with -O3 (without the hack for noalias).

experiment Relative to the plain fsnap build (lower is better)
Using opt -O3 as optimiser 95.87%
Using clang -O3 as optimiser 94.83%

So, both of these give 4-5% improvement.

Only tested on my x86-64 desktop machine.

I ran the large benchmark workload in SNAP, with the following results - nothing revolutionary, “more of what we already knew”, basically, but it is a bit interesting to look at a slightly larger benchmark.

The command is ./fsnap ../qasnap/benchmark/inp output (or ./gsnap ...) - this takes many CPU hours to run, so don’t do this on your 4-core laptop, unless you intend to leave it over the weekend or something. :slight_smile: About 15 hours of CPU time for gsnap, and about 5.6 times longer for fsnap.

Note that this was run with 54 threads, on my desktop (x86-64) machine:

Phase  Fsnap + sum opts  Gsnap  Relative  Clang -O1 fsnap  Relative
Parallel Setup 0.0064 0.0000 581.08 0.0065 589.95
Input 0.0002 0.0003 0.73 0.0003 0.74
Setup 0.0695 0.0670 1.04 0.0696 1.04
Solve 97986.0000 17459.0000 5.61 123480.0000 7.07
Parameter Setup 36.1730 42.3380 0.85 38.7510 0.92
Outer Source 627.7300 469.6600 1.34 646.8800 1.38
Inner Iterations 97297.0000 16925.0000 5.75 122770.0000 7.25
Inner Source 336.8400 198.0900 1.70 312.9200 1.58
Transport Sweeps 96853.0000 16659.0000 5.81 122360.0000 7.34
Inner Misc Ops 107.5000 67.7920 1.59 95.5190 1.41
Solution Misc Ops 24.7930 22.0700 1.12 23.0390 1.04
Output 0.0000 0.0000 0.41 0.0000 0.46
Total Execution time 97997.0000 17462.0000 5.61 123490.0000 7.07

Above are all the individual timings from the different phases of the SNAP application. In some cases the setup is a much more significant component of the overall time, but in this case almost all the time is spent in the solve phase.

The “Relative” columns show how many times slower each run is compared to the gsnap times. Lower is better.

“Fsnap + sum opts” is the best I’ve been able to do - using the sum1d and sum2d functions from above, compiled with clang -O3 on dim3_sweep.ll. The “Clang -O1 fsnap” column uses clang to optimise the LLVM-IR, with the hard-coded “no alias” hack. With -O1 that’s not so aggressive, and it appears to come up with the same answer as gfortran.

The parallel setup is clearly much longer in Flang than in GFortran, but it’s tiny in both cases, so I don’t think it is anything to worry about.

One observation here is that the Flang generated code spends a lot of time in malloc (actually, not so much in malloc as in the kernel, updating page-tables, to supply back to malloc).

I debugged a little bit, and it’s allocating something like 6.3 MB (size of 0x3c00000) in octsweep - ~50 * 6.3MB is around 315 MB, quite a lot of memory to allocate and free regularly. I don’t know if there’s any easy solution to avoid that - haven’t really looked deeper into it than that.

Are you working on any patch to solve some of your findings? Just curious about it.

I have just started on the first patch - inlining of the SUM functionality for basic cases (the Fortran SUM intrinsic supports a wide range of ways to add up numbers; the advanced features will still make the function call).

This is the least complicated thing, so a good starting point.


There’s a school of thought that FIR should keep things at a high level for as long as possible. I don’t know how well that axiom is followed today, but it’s a worthy goal.

For lowering sum, are you thinking of calling specialized sum1d and sum2d routines? Or changing the existing sum (et al) to special case 1d and 2d cases?

Perhaps the runtime can be compiled to bitcode & supplied to llvm for llvm-driven inlining?

So many options, which to choose! Will you publish a short design spec before you get too far along with the coding?

I think that will be crucial in the long run. If it’s too costly we need at least good annotations for runtime functions.

For the moment, I’m just going to generate the inline form, for 1D and basic INTEGER & REAL cases only, as a way to determine whether this is at all a workable approach. And only for SUM right now, to keep it as simple and small as possible.

There are indeed many different ways to skin a feline creature, and all of them have a varying degree of benefit vs. drawback. I appreciate the idea of keeping FIR high level, and my initial idea was to add some sort of pass in the FIR.

Right now, I’d like to just do this experiment to see what we get and how complicated it gets - I’m still quite new to this project, and I’m definitely not that familiar with FIR code generation. Then run it on a few other applications for example. Going round and modifying source code (or FIR/MLIR/LLVM-IR) to optimize the code is definitely a hard way to determine the benefit of an optimization approach! :slight_smile:

I’m sure we’ll be going over the various approaches some time soon. Once I have something that can inline a simple SUM call, moving from doing it inside the intrinsic call to a FIR pass shouldn’t be too hard - and may be a good thing to implement both, and see which “seems better” - a fair bit depends on what sort of information is required to do the inlining, and what of that is available/missing at that point.

Long term, having (some parts of) the runtime library as FIR, MLIR or LLVM-IR sounds like a good solution - I have worked on a different project which used that approach. I only know that this was the general approach - I wasn’t really involved with the compiler parts of the overall project, but I understood that much! That would definitely be a much bigger change than what I have in mind… :slight_smile: