I am interested in speeding up spec2017/fotonik3d_r. Previous analysis showed that this needs alias information to be passed along to LLVM. I would like to add TBAA information to generated LLVM IR using alias analysis information from `fir::AliasAnalysis`.
The use of TBAA is a temporary measure until the full restrict patches are merged into LLVM.
I have written a prototype implementation of my scheme and the results look promising.
LLVM-Flang already outputs tbaa tags during codegen. These tags categorise memory accesses into descriptor (box) access and data access. Any load/store from/to a box is a descriptor access. All other accesses are considered data accesses. All box accesses may alias with eachother and all data accesses may alias with eachother; but no box access can alias with any data access (and vice-versa).
The generated TBAA trees look like this:

```
                                     |-> Descriptor Member
Flang Type TBAA Root -> Any Access --|
                                     |-> Any data access
```
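The alias rule this tree encodes can be modelled in a few lines. This is an illustrative sketch of scalar TBAA semantics (two access tags may alias iff one tag's node is an ancestor of, or equal to, the other's), not Flang's actual implementation; the class and function names are invented for the example.

```python
class Node:
    """A node in a toy TBAA tree (hypothetical model, not real LLVM metadata)."""
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

    def is_ancestor_or_self(self, other):
        # Walk from `other` up towards the root, looking for `self`.
        while other is not None:
            if other is self:
                return True
            other = other.parent
        return False

def may_alias(a, b):
    # Scalar TBAA rule: tags alias iff one is an ancestor of the other.
    return a.is_ancestor_or_self(b) or b.is_ancestor_or_self(a)

# The tree shown above:
root = Node("Flang Type TBAA Root")
any_access = Node("any access", root)
descriptor = Node("descriptor member", any_access)
any_data = Node("any data access", any_access)

print(may_alias(descriptor, descriptor))  # True: box accesses alias each other
print(may_alias(any_data, any_data))      # True: data accesses alias each other
print(may_alias(descriptor, any_data))    # False: box vs data never alias
```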
Classic Flang provides detailed alias information via TBAA trees. In particular, it was found that each function had to be given its own TBAA root so that alias analysis information cannot be defeated by inlining inside LLVM. For example:

```fortran
subroutine caller()
  global_x = 0
  call callee(global_x)
end subroutine
```

In Fortran, a non-pointer/target global cannot alias with a (non-pointer/target) dummy argument inside of `callee`. But if `callee` gets inlined, any of its instructions could be moved before `global_x = 0`, because the alias information would say that accesses to the dummy argument cannot alias with `global_x`. Putting each function in its own separate TBAA root causes all accesses inside of `callee` to alias with all accesses in `caller` [thanks @vzakhari for pointing this out].
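The effect of per-function roots can be seen in the same toy model. In LLVM TBAA, tags under *different* roots are conservatively assumed to may-alias, which is exactly what keeps the inlining example above safe. This is an illustrative sketch with invented names, not the real metadata:

```python
class Node:
    """A node in a toy TBAA tree (hypothetical model)."""
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

def ancestor_or_self(a, b):
    # True if `a` is `b` or an ancestor of `b`.
    while b is not None:
        if b is a:
            return True
        b = b.parent
    return False

def root_of(n):
    while n.parent is not None:
        n = n.parent
    return n

def may_alias(a, b):
    # Tags under different TBAA roots carry no mutual information,
    # so they are conservatively assumed to alias.
    if root_of(a) is not root_of(b):
        return True
    return ancestor_or_self(a, b) or ancestor_or_self(b, a)

# Per-function roots: caller's access to global_x and callee's access to
# its dummy argument live under different roots.
caller_root = Node("root: caller")
caller_global = Node("global_x", Node("any data access", caller_root))

callee_root = Node("root: callee")
callee_dummy = Node("dummy arg", Node("any data access", callee_root))

# After inlining, both tags end up in the same function, but they still
# conservatively alias, so the store to global_x cannot be reordered past
# accesses to the (inlined) dummy argument.
print(may_alias(caller_global, callee_dummy))  # True
```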
The existing LLVM-Flang TBAA tree will be duplicated per function.
I will use `fir::AliasAnalysis::getSource` to fetch the source of each address loaded from or stored to. If it is found to be a dummy argument, it will be placed in a subtree below "any data access", with each argument having its own node in that tree. If the argument has POINTER/TARGET attributes, it will remain as "any data access". This means that:

- Dummy arguments (without POINTER/TARGET attributes) do not alias with each other
- Dummy arguments with POINTER/TARGET attributes may alias with any other data access, including other dummy arguments
- Other kinds of data access (e.g. local allocations, global variables) may alias with any other data access, including dummy arguments
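The three bullets above fall out of the ancestor rule once each non-POINTER/TARGET dummy argument gets its own leaf under "any data access". A sketch in the same toy model (node names are illustrative; the real pass may structure the subtree differently):

```python
class Node:
    """A node in a toy TBAA tree (hypothetical model)."""
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

def ancestor_or_self(a, b):
    # True if `a` is `b` or an ancestor of `b`.
    while b is not None:
        if b is a:
            return True
        b = b.parent
    return False

def may_alias(a, b):
    return ancestor_or_self(a, b) or ancestor_or_self(b, a)

root = Node("per-function TBAA root")
any_access = Node("any access", root)
any_data = Node("any data access", any_access)
arg_a = Node("dummy arg a", any_data)  # own node under "any data access"
arg_b = Node("dummy arg b", any_data)

# POINTER/TARGET arguments, locals and globals stay on the generic node:
ptr_arg = any_data
local_var = any_data

print(may_alias(arg_a, arg_b))      # False: distinct dummy args don't alias
print(may_alias(arg_a, ptr_arg))    # True: POINTER/TARGET may alias anything
print(may_alias(local_var, arg_a))  # True: other data may alias dummy args
```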
I originally planned to also put local allocations and global variables in their own subtrees, in the same way as for dummy arguments. However, this led to no discernible performance benefit on spec2017, and to a miscompare in spec2017/wrf_r which I was unable to track down in the time available.
`fir::AliasAnalysis` does not produce good results in the presence of a lot of unstructured control flow (branches between different basic blocks). This makes it not particularly useful during CodeGen (where the existing TBAA tags are added) because by this time structured control flow (e.g. `fir.do_loop`) has already been lowered to unstructured control flow. This limitation occurs because values are traced backwards through the IR to find their source. If a value passes through a block argument and there are multiple predecessor blocks, it is hard to determine where that value first came from without performing full data flow analysis (which might be a good extension in the future, but would add considerable complexity).
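The limitation can be illustrated with a tiny backward-tracing sketch. Everything here (the `get_source` helper, the definition encoding) is invented for the example and only models the idea of giving up at block arguments with multiple predecessors:

```python
def get_source(value, defs, block_preds):
    """Trace `value` back to its source in a toy IR.

    `defs` maps a value to its definition: ("source", name),
    ("copy", operand), or ("blockarg", block).
    `block_preds` maps a block to the incoming values from its predecessors.
    """
    while True:
        kind, payload = defs[value]
        if kind == "source":
            return payload
        if kind == "copy":
            value = payload
            continue
        # Block argument: only traceable through a unique predecessor.
        preds = block_preds[payload]
        if len(preds) != 1:
            return None  # multiple predecessors: unknown without full dataflow
        value = preds[0]

defs = {
    "v0": ("source", "dummy arg x"),
    "v1": ("copy", "v0"),
    "v2": ("blockarg", "bb1"),
}
print(get_source("v1", defs, {}))                     # "dummy arg x"
print(get_source("v2", defs, {"bb1": ["v0", "v1"]}))  # None: two predecessors
```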
To get around this, I have added an alias analysis interface to `fir.store`, allowing this alias analysis to be performed in an earlier pass towards the end of the FIR optimization pipeline (before structured control flow is lowered).
Some box memory accesses in LLVM IR are implicit in FIR, so the existing TBAA tags for box accesses must remain part of CodeGen. This means TBAA tags will have to be added in two different places.
See attached files (too long to post here).
- Fortran: tbaa-example.f90 - Pastebin.com
- FIR: tbaa-example.mlir - Pastebin.com
- LLVM IR: tbaa-example.ll - Pastebin.com
In the FIR file one can see the TBAA tags added by my new pass. Loads and stores to addresses not handled by my pass (i.e. not data accesses to dummy arguments) have no TBAA tags at this stage.
In the LLVM file one can see the TBAA tags after CodeGen has run. All remaining accesses are tagged as either "any data access" or "descriptor member".
I have run SPEC2017 (rate), SPEC2006 (*), SPEC2000, Polyhedron and SNAP. All benchmarks pass. The gfortran test suite also passes. I measured on aarch64 using `-flang-experimental-hlfir -Ofast -fno-associative-math -flto -mcpu=native`.
- SPEC2017 sees a 35% improvement to fotonik3d, a 12% improvement to roms, and single digit improvements to the other benchmarks.
- SPEC2000 has the only other significant improvement: 30% for apsi. facerec, lucas and fma3d are very slightly slower (barely above measurement error). wupwise, mgrid and galgel might be slightly faster, but the difference is small.
- SPEC2006 sees little change to benchmark scores.
- I ran SNAP as featured in the llvm test suite repo, except increasing nx and ny to 200 (so that it ran for long enough to get a good measurement). Like many benchmarks, this appears to have improved but the difference is small enough that it may have been measurement error.
- Polyhedron mostly saw similar very small improvements. But linpk has a 64% performance regression!
- I have not yet investigated the linpk performance regression, but it might be because the box/data access disambiguation now has to use per-function TBAA trees and so provides less precise information for inlined functions.
(*) I haven’t gotten results for spec2006/wrf due to some issue unrelated to my pass.
- My existing prototype is very rough code and needs a lot of cleanup, documentation, unit testing, etc.
- I need to spend some time considering the performance regression in polyhedron/linpk
- I have not measured the impact on compilation time of this change. Anecdotally, flang-new is a bit slower but not catastrophically so. Once my implementation is of higher quality I would like to spend some time looking for easy compile-time performance wins (memoisation, multithreading).
- I have tested using all of the Fortran programs I have easily to hand, but I would be grateful if others tried their source with these changes once I publish the code.
- Understand whether there would be benefits to adding TBAA subtrees for local allocations and/or global variables in other benchmarks (I only checked spec2017)