Introduction
SNAP is a proxy application used by Los Alamos National Laboratory as a benchmark stand-in for their real applications. More about SNAP can be found here.
This was a first-pass, “look for low-hanging fruit” analysis, carried out in January 2022.
Configuration
Benchmarking was done on both AArch64 (an AWS Graviton 2, M6g instance) and x86-64 (a Lenovo P720).
The SNAP application is compiled with OpenMP enabled and MPI disabled, and the tests are run with a single thread (export OMP_NUM_THREADS=1) for simplicity.
A run with gfortran on the same platform is used as the reference, with a relative value of 1.0; times are given in seconds. Lower numbers are better.
All measurements are done using the input file qasnap/mms_src/2d_mms_st.inp. To make the existing benchmark run long enough for a decent analysis, the input file was modified to increase the nx and ny values from 20 to 80. Also, as MPI is disabled, the value npey is changed to 1. There is no particular reason for choosing this file, other than that it is one we have used to show SNAP working, and it runs for a reasonable amount of time after the modification.
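For illustration, the changed entries in the input deck look roughly like this, assuming the deck's standard invar namelist; only the modified values are shown, and the rest of the file is left as distributed:

```
&invar
  npey=1
  nx=80
  ny=80
/
```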
All of the perf data (percentages etc.) was recorded on the AWS machine, but the relative changes are very similar on both AArch64 and x86-64.
The benchmarks were run multiple times to confirm stable values; the exact time can vary a little in the third digit. The time recorded is the total execution time from the snap-output file. The times given are cumulative, so optimisation 2 also includes the improvement of optimisation 1, and so on.
When building with gfortran the executable is called gsnap; when building with flang-new it is called fsnap.
Baseline
The baseline values are in the table below:
Compiler | AArch64 (s) | Relative | X86-64 (s) | Relative |
---|---|---|---|---|
GFortran | 2.47 | 1.0 | 2.43 | 1.0 |
Flang New | 18.1 | 7.3 | 12.1 | 5.0 |
Classic Flang | 2.18 | 0.88 | N/A | N/A |
Using perf record ./gsnap
to measure where the time is spent, we get the following top 6 functions:
48.32% gsnap gsnap __dim3_sweep_module_MOD_dim3_sweep
23.94% gsnap gsnap __mms_module_MOD_mms_src_1._omp_fn.0
 3.69% gsnap libc-2.31.so __GI___printf_fp_l
2.18% gsnap libc-2.31.so __vfprintf_internal
2.16% gsnap libc-2.31.so hack_digit
1.80% gsnap gsnap __expxs_module_MOD_expxs_slgg
The same for perf record ./fsnap
:
54.58% fsnap fsnap _QMmms_modulePmms_src_1..omp_par
19.26% fsnap fsnap Fortran::runtime::DoTotalReduction<double, Fortran::runtime::RealSumAcc
15.35% fsnap fsnap _QMdim3_sweep_modulePdim3_sweep
2.91% fsnap fsnap _FortranASumReal8
1.64% fsnap libc-2.31.so _int_free
0.76% fsnap libc-2.31.so malloc
The conclusion here is that significantly more time is spent in the mms_src_1 parallel do section. On analysis, nearly all of that time is spent in the innermost loop:
Inner loop in mms_src_1
DO ll = 1, lma(l)
  qim(m,i,j,k,n,g) = qim(m,i,j,k,n,g) - ec(m,lm,n)*&
    slgg(mat(i,j,k),l,gp,g)*ref_fluxm(lm-1,i,j,k,g)
  lm = lm + 1
END DO
Looking closer at that part of the code, the address calculation for qim(m, i, j, k, n, g) is done on every loop iteration - the same goes for the addresses of the elements of the other arrays. The gfortran version doesn't do that; it simply adds an offset for each element of the array. Manually editing the MLIR generated by Flang and moving all but the final step out of the loop gives much better performance. The MLIR below is what flang-new -fc1 -emit-mlir produces for the qim element access:
mms.mlir
%632 = fir.load %22 : !fir.ref<!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>>
%c0_103 = arith.constant 0 : index
%633:3 = fir.box_dims %632, %c0_103 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>, index) -> (index, index, index)
%c1_104 = arith.constant 1 : index
%634:3 = fir.box_dims %632, %c1_104 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>, index) -> (index, index, index)
%c2_105 = arith.constant 2 : index
%635:3 = fir.box_dims %632, %c2_105 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>, index) -> (index, index, index)
%c3_106 = arith.constant 3 : index
%636:3 = fir.box_dims %632, %c3_106 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>, index) -> (index, index, index)
%c4_107 = arith.constant 4 : index
%637:3 = fir.box_dims %632, %c4_107 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>, index) -> (index, index, index)
%c5_108 = arith.constant 5 : index
%638:3 = fir.box_dims %632, %c5_108 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>, index) -> (index, index, index)
%639 = fir.box_addr %632 : (!fir.box<!fir.heap<!fir.array<?x?x?x?x?x?xf64>>>) -> !fir.heap<!fir.array<?x?x?x?x?x?xf64>>
%640 = fir.convert %639 : (!fir.heap<!fir.array<?x?x?x?x?x?xf64>>) -> !fir.ref<!fir.array<?xf64>>
%c1_109 = arith.constant 1 : index
%c0_110 = arith.constant 0 : index
%641 = fir.load %89 : !fir.ref<i32>
%642 = fir.convert %641 : (i32) -> i64
%643 = fir.convert %642 : (i64) -> index
%644 = arith.subi %643, %633#0 : index
%645 = arith.muli %c1_109, %644 : index
%646 = arith.addi %645, %c0_110 : index
%647 = arith.muli %c1_109, %633#1 : index
%648 = fir.load %90 : !fir.ref<i32>
%649 = fir.convert %648 : (i32) -> i64
%650 = fir.convert %649 : (i64) -> index
%651 = arith.subi %650, %634#0 : index
%652 = arith.muli %647, %651 : index
%653 = arith.addi %652, %646 : index
%654 = arith.muli %647, %634#1 : index
%655 = fir.load %91 : !fir.ref<i32>
%656 = fir.convert %655 : (i32) -> i64
%657 = fir.convert %656 : (i64) -> index
%658 = arith.subi %657, %635#0 : index
%659 = arith.muli %654, %658 : index
%660 = arith.addi %659, %653 : index
%661 = arith.muli %654, %635#1 : index
%662 = fir.load %92 : !fir.ref<i32>
%663 = fir.convert %662 : (i32) -> i64
%664 = fir.convert %663 : (i64) -> index
%665 = arith.subi %664, %636#0 : index
%666 = arith.muli %661, %665 : index
%667 = arith.addi %666, %660 : index
%668 = arith.muli %661, %636#1 : index
%669 = fir.load %99 : !fir.ref<i32>
%670 = fir.convert %669 : (i32) -> i64
%671 = fir.convert %670 : (i64) -> index
%672 = arith.subi %671, %637#0 : index
%673 = arith.muli %668, %672 : index
%674 = arith.addi %673, %667 : index
%675 = arith.muli %668, %637#1 : index
%676 = fir.load %101 : !fir.ref<i32>
%677 = fir.convert %676 : (i32) -> i64
%678 = fir.convert %677 : (i64) -> index
%679 = arith.subi %678, %638#0 : index
%680 = arith.muli %675, %679 : index
%681 = arith.addi %680, %674 : index
%682 = fir.coordinate_of %640, %681 : (!fir.ref<!fir.array<?xf64>>, index) -> !fir.ref<f64>
%683 = fir.load %682 : !fir.ref<f64>
This is JUST the address calculation for the qim element; the three lines inside the loop turn into 230 lines of MLIR. That is not unreasonable in itself.
The real problem is that the next stages of compilation ought to simplify this, convert it to a simpler form, and move it out of the loop. Looking at the assembler output for this function, the innermost loop still contains several dozen instructions: the three lines of Fortran result in around 120 lines of assembler on x86-64 and about 100 instructions on AArch64 (AArch64 is a little shorter thanks to the MADD instruction, which combines the arith.muli and arith.addi pairs that occur several times in each address calculation).
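For comparison, a source-level sketch of what that hoisting amounts to is shown below. This is an illustration rather than the actual SNAP code, and the temporaries qtmp and stmp are names invented for the sketch: only lm changes inside the loop, so the address of qim(m,i,j,k,n,g) and the value of slgg(mat(i,j,k),l,gp,g) are loop-invariant and only need to be computed once.

```fortran
! Illustrative sketch only, not the actual SNAP source: hoist the
! loop-invariant qim element and slgg value out of the ll loop, and
! keep just the lm-dependent accesses inside it.
qtmp = qim(m,i,j,k,n,g)
stmp = slgg(mat(i,j,k),l,gp,g)
DO ll = 1, lma(l)
  qtmp = qtmp - ec(m,lm,n)*stmp*ref_fluxm(lm-1,i,j,k,g)
  lm = lm + 1
END DO
qim(m,i,j,k,n,g) = qtmp
```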
Optimisation 1: Move index/address calculations out of inner loop
At first, I tried compiling this with various ways of optimising the code - fir-opt with a variety of arguments, LLVM's opt with a variety of arguments, as well as clang -O3. All of those made little or no difference to the overall time, taking a tenth of a second off here or there; inspecting the code made it clear that the address calculation was still inside the main loop.
I eventually took Kiran's suggestion, edited the FIR (MLIR) output to move the majority of the address calculation out of the innermost loop, and then compiled it using tco and clang with default optimisation. This gives us these times:
Compiler | AArch64 (s) | Relative | X86-64 (s) | Relative |
---|---|---|---|---|
Flang-new | 13.8 | 5.6 | 10.0 | 4.1 |
Now the perf report
looks like this:
30.62% fsnap fsnap [.] Fortran::runtime::DoTotalReduction<double, Fortran::runtime::RealSumAcc
27.33% fsnap fsnap [.] _QMmms_modulePmms_src_1..omp_par
24.27% fsnap fsnap [.] _QMdim3_sweep_modulePdim3_sweep
 4.80% fsnap fsnap [.] _FortranASumReal8
 2.92% fsnap libc-2.31.so [.] _int_free
 1.20% fsnap libc-2.31.so [.] malloc
Conclusion and next steps
Conclusion: The compiler ought to be able to move the address calculation for elements in multidimensional arrays out of the respective loops.
Next step:
- Small reproducer: multi.f90 & cmulti.cpp (a hypothetical sketch of this kind of reproducer is shown after this list)
- Ticket on fir-dev: #1466
- Ticket summary: Figure out what needs to be done to make the compiler hoist the address calculations out of the loop. The current understanding is that LLVM passes aren’t optimizing the LLVM-IR generated because it doesn’t understand that the descriptors aren’t being changed by the code that writes to the content of the actual array. This is likely due to LLVM not having alias information.
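The actual multi.f90 is not reproduced here; the sketch below is a hypothetical stand-in showing the pattern it targets: a loop nest over an allocatable multi-dimensional array, where the descriptor-based address calculation should be hoisted out of the inner loops.

```fortran
! Hypothetical reproducer sketch (not the actual multi.f90): the
! address calculation for a(i,j,k,l) is derived from the array
! descriptor, which never changes inside the loop nest, so most of
! it could be hoisted out of the inner loops.
subroutine bump(a, n)
  implicit none
  integer, intent(in) :: n
  real(8), allocatable, intent(inout) :: a(:,:,:,:)
  integer :: i, j, k, l
  do l = 1, n
    do k = 1, n
      do j = 1, n
        do i = 1, n
          a(i,j,k,l) = a(i,j,k,l) + 1.0d0
        end do
      end do
    end do
  end do
end subroutine bump
```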
Optimisation 2: Sums, reduction function
At this point, we can look at the new biggest consumer of time, the Flang runtime function DoTotalReduction. Using GDB, I set a breakpoint at the address reported by perf and looked at the call stack. There are a few dozen calls from random other places, but it is clear that the majority of the calls come from the dim3_sweep function, and writing a replacement for its several calls to SUM gives a substantial improvement. sum1d is called with 12 elements and sum2d with a 48-element array, since nang=12 for this test. With the source of the functions available, the compiler inlines the code of sum2d and sum1d.
Sum functions
FUNCTION sum2d(arr)
  REAL(r_knd), DIMENSION(nang,4), INTENT(IN) :: arr
  REAL(r_knd) :: sum2d
  REAL(r_knd) :: res
  INTEGER :: i, j
  res = 0
  do i = 1, nang
    do j = 1, 4
      res = res + arr(i, j)
    end do
  end do
  sum2d = res
end function sum2d

FUNCTION sum1d(arr)
  REAL(r_knd), DIMENSION(nang), INTENT(IN) :: arr
  REAL(r_knd) :: sum1d
  REAL(r_knd) :: res
  INTEGER :: i
  res = 0
  do i = 1, nang
    res = res + arr(i)
  end do
  sum1d = res
end function sum1d
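For illustration, the change in dim3_sweep amounts to replacing intrinsic SUM calls on these small arrays with calls to the helpers above, roughly like this (a sketch with an invented variable name, not the actual SNAP code):

```fortran
! Sketch only - the variable name is invented for illustration.
! Before: each SUM lowers to a Flang runtime call
! (_FortranASumReal8 / DoTotalReduction) on a 12-element array.
total = SUM(angular_flux)
! After: a local helper the compiler can see and inline.
total = sum1d(angular_flux)
```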
The summary of the perf report
now looks like this:
33.39% fsnap fsnap _QMmms_modulePmms_src_1..omp_par
27.26% fsnap fsnap _QMdim3_sweep_modulePdim3_sweep
26.15% fsnap fsnap _QMdim3_sweep_modulePsum1d
 2.21% fsnap libc-2.31.so _int_free
 1.52% fsnap libc-2.31.so malloc
 1.04% fsnap fsnap Fortran::runtime::DoTotalReduction<double, Fortran::runtime::NumericExtr
This produces the following times:
Compiler | AArch64 (s) | Relative | X86-64 (s) | Relative |
---|---|---|---|---|
Flang-new | 9.2 | 3.7 | 6.4 | 2.6 |
Conclusion and next steps
Conclusion: the compiler should inline (small) calls to FortranASum to avoid the function-call overhead.
Next steps:
- Write small reproducer: sum.f90 (a hypothetical sketch follows after this list)
- Ticket in fir-dev: #1490
- Ticket summary: Write some sort of “replace calls to this library function with inline call” pass.
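The actual sum.f90 is not shown here; below is a hypothetical sketch of that kind of reproducer: a hot loop whose only work is SUM over a small fixed-size array, so the runtime-call overhead dominates.

```fortran
! Hypothetical sketch (not the actual sum.f90): repeated SUM over a
! small array, which currently becomes a Flang runtime call
! (_FortranASumReal8) on every iteration instead of inlined code.
program sum_repro
  implicit none
  integer, parameter :: n = 12
  integer :: i
  real(8) :: a(n), total
  call random_number(a)
  total = 0.0d0
  do i = 1, 10000000
    total = total + sum(a)
  end do
  print *, total
end program sum_repro
```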
Optimisation 3: Move fir.allocmem/fir.freemem out of loops
While looking at the dim3_sweep FIR code from flang-new, it became clear that there are several calls to fir.allocmem and fir.freemem within the main loop. Moving those out of the loop by hand in the MLIR for that function gives us a bit more gain.
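As a rough illustration of where such allocations can come from (an assumption about the cause, not an excerpt from dim3_sweep), an array expression inside a loop can make the compiler create a heap temporary on every iteration:

```fortran
! Sketch only - assumed cause, not the actual SNAP code. The RHS
! array expression may be evaluated into a compiler-created
! temporary; a naive lowering heap-allocates (fir.allocmem) and
! frees (fir.freemem) it on every trip through the loop, even
! though its size is small and fixed.
subroutine scale_columns(q, w, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(inout) :: q(12, n)
  real(8), intent(in) :: w(12)
  integer :: i
  do i = 1, n
    q(:, i) = q(:, i) + w * sum(q(:, i))
  end do
end subroutine scale_columns
```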
Now the perf report shows this:
49.74% fsnap fsnap _QMmms_modulePmms_src_1..omp_par
36.94% fsnap fsnap _QMdim3_sweep_modulePdim3_sweep
1.61% fsnap fsnap Fortran::runtime::DoTotalReduction<double, Fortran::runtime::NumericEx
1.43% fsnap fsnap _QMexpxs_modulePexpxs_slgg
1.40% fsnap fsnap Fortran::runtime::DoTotalReduction<double, Fortran::runtime::NumericEx
1.05% fsnap fsnap Fortran::decimal::BigRadixFloatingPointNumber<53, 16>::ConvertToDecima
And the time taken is:
Compiler | AArch64 (s) | Relative | X86-64 (s) | Relative |
---|---|---|---|---|
Flang-new | 7.3 | 3.0 | 4.6 | 1.9 |
Conclusion and next steps
Conclusion: memory allocations should be moved out of the inner loop. Ideally, if the size is constant(ish), using alloca would be better; the allocations are all small in this example.
Next steps:
- Write small reproducer
- Ticket in fir-dev: #1500
- Ticket summary: A) hoist fir.allocmem out of loops where possible, and B) use alloca for small fir.allocmem allocations (reverse of Add memory allocation optimization pass to the tool pipeline. by schweitzpgi · Pull Request #1355 · flang-compiler/f18-llvm-project · GitHub)
Using clang -O3 on the hand-optimised code
Adding -O3 to the clang compilation of the LLVM-IR - only for the modified dim3_sweep.ll and mms.ll - gives a little more improvement:
Compiler | AArch64 (s) | Relative | X86-64 (s) | Relative |
---|---|---|---|---|
Flang-new | 6.4 | 2.6 | 4.6 | 1.8 |
The AArch64 machine appears to get a bit more congested (or to have smaller caches?), so the mms_src_1 function takes up a bigger portion of the overall time than it does on x86-64.
Summary table
This table shows all the steps in one place, along with the percentage improvement from each step:
Compiler | Description | AArch64 (s) | Relative | Improvement | X86-64 (s) | Relative | Improvement |
---|---|---|---|---|---|---|---|
GFortran | Baseline | 2.47 | 1.0 | | 2.43 | 1.0 | |
Flang-new | Baseline | 18.1 | 7.3 | 0% | 12.1 | 5.0 | 0% |
Flang-new | Hoist address calc | 13.8 | 5.6 | 24% | 10.0 | 4.1 | 17% |
Flang-new | Inline SUM | 9.2 | 3.7 | 33% | 6.4 | 2.6 | 36% |
Flang-new | Hoist allocations | 7.3 | 3.0 | 21% | 4.6 | 1.9 | 28% |
Flang-new | Use clang -O3 | 6.4 | 2.6 | 12% | 4.6 | 1.8 | 4.3% |