Performance comparison with Simplified and Inlined intrinsics

I ran the SNAP application as a benchmark on my x86-64 build machine, using a modified 2d_mms_st.inp as input (nx = ny = 80 instead of 20). The source files were compiled with a release build of flang-new at -O3 in both cases. Measurements were taken with OMP=off and MPI=off.

I also have a small local fix, swapping two lines, for llvm/llvm-project issue #56921: [FLANG] Use after free in "mask" when using WHERE construct.

The Difference % column is the Optimised SUM time divided by the Standard time, formatted as a percentage, so values below 100% mean the optimised build is faster.
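As a quick sanity check of that definition, here is a minimal Python sketch that recomputes the percentage for a few rows. The helper name `difference_pct` is made up for illustration; the timings are copied from the results table in this post.

```python
def difference_pct(standard: float, optimised: float) -> str:
    """Optimised time divided by Standard time, formatted as a percentage."""
    return f"{optimised / standard:.2%}"


# (Standard, Optimised SUM) timings copied from the results table.
rows = {
    "Solve": (8.155600, 3.114100),
    "Inner Iterations": (8.061600, 3.020500),
    "Transport Sweeps": (8.017600, 2.977000),
    "Total Execution time": (12.059000, 6.935800),
}

for name, (standard, optimised) in rows.items():
    print(f"{name}: {difference_pct(standard, optimised)}")
```

Recomputing from the six-decimal timings reproduces the percentages for the large rows. Very small rows such as Parallel Setup will not match exactly, presumably because the table's percentages were computed from unrounded timings.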

The overall summary is a significant improvement in the total time it takes to run the benchmark, in line with earlier measurements.

I ran both versions a few times. Individual times vary up and down by a percent or two, but the optimised total is solidly around 57% of the Standard, and roughly 37% in the Inner Iterations (where you’d expect SUM to make a difference).

| Measure | Standard | Optimised SUM | Difference % |
|---|---|---|---|
| Parallel Setup | 0.000005 | 0.000005 | 103.98% |
| Input | 0.000233 | 0.000237 | 101.75% |
| Setup | 3.596800 | 3.526600 | 98.05% |
| Solve | 8.155600 | 3.114100 | 38.18% |
| Parameter Setup | 0.022620 | 0.022232 | 98.28% |
| Outer Source | 0.068606 | 0.068472 | 99.80% |
| Inner Iterations | 8.061600 | 3.020500 | 37.47% |
| Inner Source | 0.033129 | 0.032798 | 99.00% |
| Transport Sweeps | 8.017600 | 2.977000 | 37.13% |
| Inner Misc Ops | 0.010913 | 0.010674 | 97.81% |
| Solution Misc Ops | 0.002813 | 0.002879 | 102.35% |
| Output | 0.300240 | 0.288240 | 96.00% |
| Total Execution time | 12.059000 | 6.935800 | 57.52% |

The patch for making SUM inline is here. It only adds an inlineable, simplified version of the function; the MLIR and/or LLVM IR optimisation passes then inline it once it is available as MLIR.


Great work!

The LLD guys use ministat to show performance improvements.
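For readers without ministat to hand, a rough Python sketch of the same idea is to summarise each set of repeated timings and look at how far apart the means are relative to the run-to-run noise. This is not ministat's actual computation: the function names are invented here, the statistic is Welch's t, and the sample timings are hypothetical placeholders rather than measurements from this thread.

```python
import math
import statistics


def summarize(samples):
    """Basic per-configuration summary, similar in spirit to ministat's table."""
    return {
        "n": len(samples),
        "min": min(samples),
        "max": max(samples),
        "median": statistics.median(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),
    }


def welch_t(baseline, candidate):
    """Welch's t statistic; negative means the candidate is faster on average."""
    var_b = statistics.variance(baseline)
    var_c = statistics.variance(candidate)
    std_err = math.sqrt(var_b / len(baseline) + var_c / len(candidate))
    if std_err == 0.0:
        return 0.0  # no variation in either sample, nothing to compare against
    return (statistics.mean(candidate) - statistics.mean(baseline)) / std_err


# HYPOTHETICAL timings in seconds, for illustration only (not real results).
standard_runs = [12.0, 12.2, 11.9, 12.1]
optimised_runs = [6.9, 7.0, 6.8, 7.1]

print("standard :", summarize(standard_runs))
print("optimised:", summarize(optimised_runs))
print("welch t  :", welch_t(standard_runs, optimised_runs))
```

A large negative t value relative to the number of runs suggests the difference is well outside the noise; ministat itself goes further and reports whether the difference is significant at a chosen confidence level.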