Hello.
As I said before, I’ve investigated the performance of TSVC on A64FX (Fujitsu AArch64 CPU), and now I’d like to share the information.
I’m afraid I’ve not completely finished the research yet, but I think it is worth sharing with you.
HLFIR lowering
I saw a strange performance issue with HLFIR lowering.
The innermost loop compiles to the same assembly code as with FIR lowering, yet performance drops by about 10% when HLFIR lowering is enabled.
I found that the performance depends on the order in which the source files are passed to the compiler, that is, on the link order.
$ cat second.f90
function second()
real second
call cpu_time(second)
end function
$ flang-new mains.f loops.f second.f90 -Ofast -flang-deprecated-no-hlfir
$ ./a.out | grep s4117
s4117 10 2.301718 1.0241E+01 1.0241E+01 121
s4117 100 2.187275 1.0042E+02 1.0042E+02 7.5978E-08 121
s4117 1000 1.901932 1.0004E+03 1.0004E+03 2.4403E-07 121
$ flang-new mains.f loops.f second.f90 -Ofast
$ ./a.out | grep s4117
s4117 10 2.664087 1.0241E+01 1.0241E+01 121
s4117 100 2.319328 1.0042E+02 1.0042E+02 7.5978E-08 121
s4117 1000 2.351143 1.0004E+03 1.0004E+03 2.4403E-07 121
$ flang-new loops.f mains.f second.f90 -Ofast
$ ./a.out | grep s4117
s4117 10 2.217472 1.0241E+01 1.0241E+01 121
s4117 100 2.333374 1.0042E+02 1.0042E+02 7.5978E-08 121
s4117 1000 1.900425 1.0004E+03 1.0004E+03 2.4403E-07 121
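Some other loops show the same behavior.
I don’t know the cause yet, but since changing only the file order brings the HLFIR build back to the FIR timings, I’d like to conclude that HLFIR lowering itself is not the problem.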
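To make the comparison between the three builds easier to read, here is a minimal sketch that sums the s4117 timings from the runs above and normalizes them to the FIR build. It assumes that the third whitespace-separated field of each output line is the elapsed time in seconds; the names `RUNS` and `total_time` are only illustrative.

```python
# Output lines copied from the three runs above.
# Assumption: the third whitespace-separated field is the elapsed time in seconds.
RUNS = {
    "FIR, mains.f first": """
s4117 10 2.301718 1.0241E+01 1.0241E+01 121
s4117 100 2.187275 1.0042E+02 1.0042E+02 7.5978E-08 121
s4117 1000 1.901932 1.0004E+03 1.0004E+03 2.4403E-07 121
""",
    "HLFIR, mains.f first": """
s4117 10 2.664087 1.0241E+01 1.0241E+01 121
s4117 100 2.319328 1.0042E+02 1.0042E+02 7.5978E-08 121
s4117 1000 2.351143 1.0004E+03 1.0004E+03 2.4403E-07 121
""",
    "HLFIR, loops.f first": """
s4117 10 2.217472 1.0241E+01 1.0241E+01 121
s4117 100 2.333374 1.0042E+02 1.0042E+02 7.5978E-08 121
s4117 1000 1.900425 1.0004E+03 1.0004E+03 2.4403E-07 121
""",
}


def total_time(output: str) -> float:
    """Sum the third field (assumed elapsed seconds) over all non-empty lines."""
    return sum(float(line.split()[2]) for line in output.split("\n") if line.strip())


totals = {name: total_time(text) for name, text in RUNS.items()}
baseline = totals["FIR, mains.f first"]
relative = {name: t / baseline for name, t in totals.items()}

for name, ratio in relative.items():
    print(f"{name:22s} total = {totals[name]:.3f} s, {ratio:.3f}x the FIR build")
```

For this excerpt, the HLFIR build with mains.f first comes out clearly slower than the FIR build, while the HLFIR build with loops.f first lands within about 1% of it. These are single runs, so the numbers only illustrate the trend.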
vs gfortran
I also measured the performance of Gfortran.
According to the results below, Flang’s total run time is about 12% longer than Gfortran’s (about 1524 s vs. 1354 s; put the other way, Gfortran’s total is about 11% shorter).
In particular, whether a loop is vectorized seems to make a big difference in performance.
| func | Gfortran [s] | Flang with FIR [s] | Flang with HLFIR [s] |
|---|---|---|---|
| ALL | 1354.27 | 1525.36 | 1523.19 |
| s111 | 5.572203 | 5.066218 | 5.052499 |
| s112 | 10.112344(V) | 4.050852(V) | 3.947439(V) |
| s113 | 2.262772(V) | 4.245930 | 4.716603 |
| : | | | |
| vdotr | 4.834167(V) | 3.709595(V) | 3.713380(V) |
| vbor | 10.649109(V) | 10.333374(V) | 10.352844(V) |
- Version
  - Gfortran: 11.2.0
  - Flang: main (1c1227846425883a3d39ff56700660236a97152c)
- Option: -Ofast (-falias-analysis is also specified for Flang)
- “(V)” means the loop is vectorized
  - checked with -fopt-info-vec-optimized for Gfortran and -Rpass=vector for Flang
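The overall gap can be expressed in two ways depending on which total is used as the reference. The sketch below works both out from the "ALL" row of the table; the function names are only illustrative.

```python
# Total run times in seconds, copied from the "ALL" row of the table.
totals_s = {
    "Gfortran": 1354.27,
    "Flang (FIR)": 1525.36,
    "Flang (HLFIR)": 1523.19,
}


def extra_time_pct(t: float, ref: float) -> float:
    """How much longer t is than ref, as a percentage of ref."""
    return (t / ref - 1.0) * 100.0


def time_saved_pct(t: float, ref: float) -> float:
    """How much shorter ref is than t, as a percentage of t."""
    return (1.0 - ref / t) * 100.0


ref = totals_s["Gfortran"]
for name in ("Flang (FIR)", "Flang (HLFIR)"):
    t = totals_s[name]
    print(f"{name}: {extra_time_pct(t, ref):.1f}% longer than Gfortran; "
          f"Gfortran is {time_saved_pct(t, ref):.1f}% shorter")
```

Relative to Gfortran's total, both Flang builds take roughly 12.5% longer; relative to Flang's total, Gfortran is roughly 11% shorter.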
Vectorization
I checked whether vectorization works well for TSVC (Test Suite for Vectorizing Compilers).
TSVC contains 135 loops. Flang vectorizes 52 of them, while Gfortran vectorizes 58.
I think this gap should be closed, and I’m now focusing on the loops that are vectorized when they are written in C.
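As a rough illustration of how the candidates can be picked out, the sketch below records the “(V)” markers for the five loops shown in the table above and lists those that only Gfortran vectorizes. It also turns the 52 and 58 counts into coverage percentages. Only the rows visible in the table are included, and the dictionary layout is an assumption made for this example.

```python
# "(V)" markers for the rows visible in the table:
# (Gfortran, Flang with FIR, Flang with HLFIR)
VECTORIZED = {
    "s111": (False, False, False),
    "s112": (True, True, True),
    "s113": (True, False, False),
    "vdotr": (True, True, True),
    "vbor": (True, True, True),
}

# Loops that Gfortran vectorizes but neither Flang build does.
gfortran_only = sorted(
    name
    for name, (gfortran, flang_fir, flang_hlfir) in VECTORIZED.items()
    if gfortran and not (flang_fir or flang_hlfir)
)
print("Vectorized only by Gfortran:", gfortran_only)

# Coverage over the whole suite, using the counts reported in the text.
TOTAL_LOOPS = 135
coverage = {"Flang": 52 / TOTAL_LOOPS, "Gfortran": 58 / TOTAL_LOOPS}
for compiler, fraction in coverage.items():
    print(f"{compiler}: {fraction * 100:.1f}% of {TOTAL_LOOPS} loops vectorized")
```

Note that the difference of 6 loops is a net figure: the two compilers do not necessarily vectorize the same set of loops, so the per-loop comparison is what identifies the cases worth looking at.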