[RFC] Propagate FIR Alias Analysis Information using TBAA

I am interested in speeding up spec2017/fotonik3d_r. Previous analysis [1] showed that this needs alias information to be passed along to LLVM. I would like to add TBAA information to generated LLVM IR using alias analysis information from fir::AliasAnalysis.

The use of TBAA is a temporary measure until the full restrict patches are merged into LLVM.

I have written a prototype implementation of my scheme and the results look promising.

[1] [Flang] Fix performance issue in 549.fotonik3d_r · Issue #58303 · llvm/llvm-project · GitHub

Prior work

LLVM-Flang

LLVM-Flang already outputs TBAA tags during codegen. These tags categorise memory accesses into descriptor (box) accesses and data accesses. Any load/store from/to a box is a descriptor access. All other accesses are considered data accesses. All box accesses may alias with each other and all data accesses may alias with each other, but no box access can alias with any data access (and vice versa).

The generated TBAA trees look like this:

                                    |-> Descriptor Member
Flang Type TBAA Root -> Any Access -|
                                    |-> Any data access
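
As a rough illustration of how such a tree answers alias queries, here is a minimal Python sketch. It assumes the simplified rule that two tags may alias only when one is the other or an ancestor of it. The dictionary and function names are hypothetical and are not part of Flang or LLVM.

```python
# Simplified model of the tree above: each node names its parent; None marks the root.
PARENT = {
    "Flang Type TBAA Root": None,
    "Any Access": "Flang Type TBAA Root",
    "Descriptor Member": "Any Access",
    "Any data access": "Any Access",
}

def ancestors(node):
    """Return the node followed by all of its ancestors up to the root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def may_alias(a, b):
    """Two tags may alias when one is the other, or an ancestor of it."""
    return a in ancestors(b) or b in ancestors(a)

box_vs_box = may_alias("Descriptor Member", "Descriptor Member")
data_vs_data = may_alias("Any data access", "Any data access")
box_vs_data = may_alias("Descriptor Member", "Any data access")
print(box_vs_box, data_vs_data, box_vs_data)  # True True False
```

The sibling nodes share only the "Any Access" ancestor, which is why box and data accesses are reported as distinct.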

Classic-Flang

Classic Flang provides detailed alias information via TBAA trees. In particular, it was found that each function had to be given its own TBAA root so that alias analysis information cannot be defeated by inlining inside LLVM. For example:

subroutine caller()
  global_x = 0
  call callee(global_x)
end subroutine

In Fortran, a global without the POINTER or TARGET attribute cannot alias with a dummy argument (also without POINTER/TARGET) inside of callee. But if callee gets inlined, any of its instructions could be moved before global_x = 0 because the alias information would say that global_x cannot alias with callee_arg0.

Putting each function under its own separate TBAA root means that all accesses inside of callee are treated as possibly aliasing with all accesses in caller.

[thanks @vzakhari for pointing this out].
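
The effect of per-function roots can be sketched with a small Python model. It assumes tags are written as root-to-leaf paths and that tags under different roots are conservatively treated as possibly aliasing. The root and node names, including the separate node for global_x, are illustrative only.

```python
def may_alias(a, b):
    """a and b are tag paths from root to leaf, written as tuples."""
    if a[0] != b[0]:
        # Different roots: unrelated type systems, so assume the accesses may alias.
        return True
    shorter = min(len(a), len(b))
    # Same root: the accesses may alias only if one path is a prefix of the other.
    return a[:shorter] == b[:shorter]

def tag(function, *path):
    """Build a tag path under a per-function root (names are illustrative)."""
    return (f"root for {function}", "Any Access", *path)

# One shared root: after inlining, the store to global_x in caller and the
# access through the dummy argument in callee look provably distinct.
shared_global = ("shared root", "Any Access", "Any data access", "global_x")
shared_dummy = ("shared root", "Any Access", "Any data access", "callee_arg0")
shared_result = may_alias(shared_global, shared_dummy)

# Per-function roots: the inlined accesses keep callee's root, so they are
# treated as possibly aliasing with everything in caller.
caller_global = tag("caller", "Any data access", "global_x")
callee_dummy = tag("callee", "Any data access", "callee_arg0")
per_function_result = may_alias(caller_global, callee_dummy)

print(shared_result, per_function_result)  # False True
```

With the shared root, the optimizer would be free to reorder the inlined accesses around the store, which is the problem described above.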

Design

The existing LLVM-Flang TBAA tree will be duplicated per function.

I will use fir::AliasAnalysis::getSource to fetch the source of each address loaded from or stored to. If it is found to be a dummy argument it will be placed in a subtree below Any data access, with each argument having its own node in that tree. If the argument has POINTER/TARGET attributes it will remain as any data access. This means that

  • Dummy arguments (without POINTER/TARGET attributes) do not alias with each other
  • Dummy arguments with POINTER/TARGET attributes may alias with any other data access, including other dummy arguments
  • Other kinds of data access (e.g. local allocations, global variables) may alias with any other data access, including dummy arguments.
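
The three rules above can be sketched as a tag-selection function in Python. This is a toy model, not the prototype's code: the Source class, the "dummy arg data" subtree name, and the path representation are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    kind: str                       # "dummy", "local" or "global" (hypothetical labels)
    name: str
    pointer_or_target: bool = False

def data_tag(function, source):
    """Pick a root-to-leaf tag path for a data access within one function."""
    any_data = (f"root for {function}", "Any Access", "Any data access")
    if source.kind == "dummy" and not source.pointer_or_target:
        # Each plain dummy argument gets its own leaf below "Any data access".
        return any_data + ("dummy arg data", source.name)
    # POINTER/TARGET dummies, locals and globals stay at "Any data access".
    return any_data

def may_alias(a, b):
    """Within one root, tags may alias only if one path is a prefix of the other."""
    if a[0] != b[0]:
        return True
    shorter = min(len(a), len(b))
    return a[:shorter] == b[:shorter]

arg_a = data_tag("f", Source("dummy", "a"))
arg_b = data_tag("f", Source("dummy", "b"))
arg_p = data_tag("f", Source("dummy", "p", pointer_or_target=True))
local = data_tag("f", Source("local", "tmp"))

print(may_alias(arg_a, arg_b))  # False
print(may_alias(arg_a, arg_p))  # True
print(may_alias(arg_a, local))  # True
```

Keeping locals and globals at "Any data access" is the conservative choice: they remain possibly aliasing with every dummy argument.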

I originally planned to also put local allocations and global variables in their own subtrees in the same way as for dummy arguments. However, this led to no discernible performance benefit in spec2017 and a miscompare in spec2017/wrf_r which I was unable to track down in the time available.

Implementation details of prototype

fir::AliasAnalysis does not produce good results in the presence of a lot of unstructured control flow (branches between different basic blocks). This makes it not particularly useful during CodeGen (where the existing TBAA tags are added) because by this time structured control flow (e.g. fir.do_loop) has already been lowered to unstructured control flow. This limitation occurs because values are traced backwards through the IR to find their source. If a value passes through a block argument and there are multiple predecessor blocks, it is hard to determine where that value first came from without performing full data flow analysis (which might be a good extension in the future, but would add considerable complexity).
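
To make the limitation concrete, here is a toy Python sketch of backwards source tracing. It is not the fir::AliasAnalysis algorithm. The def-use table, value names, and op kinds are hypothetical, and the tracer simply gives up when a block argument has more than one incoming value.

```python
# Toy def-use table; value names and op kinds are hypothetical.
DEFS = {
    "%a": ("dummy_arg",),
    "%b": ("dummy_arg",),
    "%a_elem": ("passthrough", "%a"),        # e.g. an address computation on %a
    "%single": ("block_arg", ["%a_elem"]),   # block with one predecessor
    "%merged": ("block_arg", ["%a", "%b"]),  # block with two predecessors
}

def get_source(value):
    """Walk definitions backwards until a dummy argument or a dead end is found."""
    while True:
        kind, *rest = DEFS[value]
        if kind == "dummy_arg":
            return value
        if kind == "passthrough":
            value = rest[0]
        elif kind == "block_arg":
            incoming = rest[0]
            if len(incoming) != 1:
                # Several predecessors: deciding the origin needs data flow analysis.
                return "unknown"
            value = incoming[0]

print(get_source("%a_elem"))  # %a
print(get_source("%single"))  # %a
print(get_source("%merged"))  # unknown
```

Structured loops keep most values out of block arguments, which is one reason tracing works better before control flow is lowered.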

To get around this, I have added an alias analysis interface to fir.load and fir.store to allow this alias analysis to be performed in an earlier pass towards the end of the FIR optimization pipeline (before structured control flow is lowered).

Some box memory accesses in LLVM IR are implicit in FIR, so the existing TBAA tagging of box accesses must remain in CodeGen. TBAA tags will therefore have to be added in two different places.

Example

See attached files (too long to post here).

In the FIR file one can see the TBAA tags added by my new pass. Loads and stores to addresses not handled by my pass (anything other than data accesses to dummy arguments) have no TBAA tags at this stage.

In the LLVM file one can see the TBAA tags after CodeGen has run. CodeGen ensures that all other accesses are tagged as either “any data access” or “descriptor member”.

Performance measurements

I have run SPEC2017 (rate), SPEC2006(*), SPEC2000, polyhedron and SNAP. All benchmarks pass. The gfortran test suite also passes. I measured on aarch64 using -flang-experimental-hlfir -Ofast -fno-associative-math -flto -mcpu=native.

  • SPEC2017 sees a 35% improvement to fotonik3d, a 12% improvement to roms and single-digit improvements to the other benchmarks.
  • SPEC2000 has the only other significant improvement: 30% for apsi. facerec, lucas and fma3d are very slightly slower (barely above measurement error). wupwise, mgrid and galgel might be slightly faster, but the difference is small.
  • SPEC2006 sees little change to benchmark scores.
  • I ran SNAP as featured in the llvm test suite repo, except increasing nx and ny to 200 (so that it ran for long enough to get a good measurement). Like many benchmarks, this appears to have improved but the difference is small enough that it may have been measurement error.
  • Polyhedron mostly saw similar very small improvements. But linpk has a 64% performance regression!
    • I have not yet investigated the linpk performance regression, but it might be because the box/data access disambiguation now has to use per-function TBAA trees and so provides less precise information for inlined functions.

(*) I haven’t gotten results for spec2006/wrf due to some issue unrelated to my pass.

To Do

  • My existing prototype is very rough code and needs a lot of cleanup, documentation, unit testing, etc.
  • I need to spend some time considering the performance regression in polyhedron/linpk
  • I have not measured the impact on compilation time of this change. Anecdotally, flang-new is a bit slower but not catastrophically so. Once my implementation is of higher quality I would like to spend some time looking for easy compile-time performance wins (memoisation, multithreading).
  • I have tested using all of the Fortran programs I have easily to hand, but I would be grateful if others tried their sources with these changes once I publish the code.
  • Understand if there would be benefits to adding TBAA subtrees for local allocations and/or global variables in other benchmarks (I only checked spec2017)

Thank you for working on this, Tom! Have you considered adding support for multiple TBAA tags on a single load/store/etc. as described in the last paragraph of (draft) Aliasing information in LLVM IR produced by Flang compiler - Google Docs? This should allow disambiguating descriptor/data access after LLVM inlining, and also use the classic flang TBAA trees within the functions.

I’m now quite sure that the slowdown to polyhedron/linpk is because the additional alias information allows the hot loop to vectorize. The loop is slower after vectorization (at least on graviton3).

LLVM’s implementation of TBAA does not currently support multiple tags. Without any known performance regressions from the current implementation, I would rather not try to add multiple TBAA tag support because it could take a very long time to review, by which time maybe full restrict will be ready.

Final patch in the series: [flang] Enable fir alias tags pass by default when optimizing for speed by tblah · Pull Request #68597 · llvm/llvm-project · GitHub

I’ve discovered that I made some mistakes in my performance measurements. I actually do need to use TBAA for global variables to get the fotonik3d speedup.

Adding TBAA for global variables does not lead to any miscompilations in spec2017, spec2006, spec2000, SNAP, polyhedron, or the gfortran test suite. Unfortunately it does lead to new performance regressions (on aarch64) in spec2017/exchange2, polyhedron/nf, and polyhedron/gas_dyn2.

The plan going forward is to abandon the patch which enables this by default until the regressions are fixed. But I would like to go forward with merging the pass so that it is easier to test for and work on these regressions.

I apologise for any confusion and for the delay.

TBAA tags are now enabled by default: https://github.com/llvm/llvm-project/pull/73111