Enabling loop-interchange

This RFC is part of our loop vectorization work, which we will present at the LLVM developers' conference next week, and is about enabling loop-interchange by default. Interchange enables more vectorization, and its absence is one of the root causes of us being significantly behind other compilers in some test cases.

This is about more than just enabling interchange, though: DependenceAnalysis (DA) is used for its legality checks. DA is also used by other loop optimisations, but none of those are enabled by default, so by enabling interchange we will have a first user of DA. @sebpop, @madhur13490 and I are working on this, and our concerted effort is about trying to get a foot in the door with loop optimisations in LLVM. In other words, we would like to test and answer the question of whether LLVM IR is suitable for performing loop optimizations at all (a bit more on this later).

Of all (disabled) loop optimizations, we think interchange is a good candidate to enable because i) it's very general and beneficial not only for vectorization but also for improved memory access behaviour, ii) it triggers quite a lot in the LLVM test-suite and other benchmarks including SPEC, and iii) it's a relatively straightforward loop transformation to start with.

Our approach to get interchange and DependenceAnalysis enabled:

  • We are aware of the bugs raised against both components, and we have started
    fixing them.
  • We will collect compile-time numbers (a bit more on this below),
  • And we intend to support both components and fix any issues that may come up
    later, so this isn’t supposed to be a fix-and-forget exercise, also because
    we might want to follow up with other loop optimizations.

Sebastian can elaborate more on this, but briefly, about DependenceAnalysis: to fix a known correctness issue, he is going to add MemorySSA to DA. MemorySSA then needs to be preserved in other loop optimisation passes, so there will be a little bit of churn for other loop optimisations related to MemorySSA; just a heads up that these changes are in the pipeline. The question of whether LLVM IR is the right abstraction for loop optimisations is related to recovering (multi-dimensional) array accesses: some type information is lost in LLVM IR, which is what delinearisation tries to recover. While there are certainly limitations to what delinearisation can achieve, we think it's robust enough for our initial use cases. A discussion for later, which we are open to, is how we could add or recover this type information in a more robust way.

Compile times will be interesting and important. A quick exercise with the compile-time tracker shows a 1.8% and 1.12% increase for lencod and mafft respectively (see the top results here). It looks like a bit more time is spent in BasicAA. Interchange isn't even triggering, so I don't know why that would be the case, but I am investigating this.

We are happy to receive any feedback and ideas, and are also happy to talk more about this at the LLVM dev conference next week.


I’m not particularly familiar with any of the components involved here (LoopInterchange, DependenceAnalysis, Delinearization, LoopCacheAnalysis), so just a few notes on what you wrote…

You probably don’t need MemorySSA for this. I think the only related handling MemorySSA has for this is a crude “guaranteed loop invariant” check, see llvm-project/llvm/lib/Analysis/MemorySSA.cpp at 3764d0ff15ef281974879002e27857a041bd5b9c · llvm/llvm-project · GitHub. A similar check exists in DSE.

Preserving MSSA in all LPM2 passes may be challenging and expensive.

Something I’d definitely require before DependenceAnalysis makes its way into the default pipeline is that the code in llvm-project/llvm/lib/Analysis/Delinearization.cpp at 3764d0ff15ef281974879002e27857a041bd5b9c · llvm/llvm-project · GitHub is adjusted in such a way that it no longer uses the GEP source element type to drive optimization heuristics. I hope this part of the delinearization implementation is not load-bearing for your plans.

You might want to check whether you can use BatchAA inside DA to cache queries.

You probably don’t need MemorySSA for this.

Maybe there's another way to check dependence chains in loops for array bases. I'm currently using MemorySSAWalker to identify the definitions dominating the two memory uses in the dependence test. That is, the use-def chains may collide, indicating a dependence between the two array bases, or they do not collide, in which case the arrays are in separate alias groups. Here's the patch:

Preserving MSSA in all LPM2 passes may be challenging and expensive.

Yes, the more challenging places are loop-fusion and loop-unroll-and-jam, which duplicate a lot of code.
Loop interchange does very little code movement, and it was easy to preserve MSSA:

DependenceAnalysis […] no longer uses the GEP source element type

There is only one use of GEP->getSourceElementType() in delinearization, and dozens in other passes. What's the recommended way to replace it? Do you have a patch that does that in some other part of the compiler? Can you also please explain why delinearization should not access that info? I see, for instance, this patch that translates to getSourceElementType:

The role of delinearization is to retrieve the information about array accesses that got lost in translation: we need the array type with the size of each dimension, and the access function for each subscript. The reason is that the data dependence test is very difficult to compute on linearized array access functions: we end up with systems of Diophantine equations (integer coefficients and integer results) in multiple variables (the MIV test is when a subscript contains Multiple Induction Variables). On the original multi-dimensional array accesses the dependence test is often much easier, as it works one subscript at a time, and very often the dependence equations to be solved contain a Single Induction Variable (the SIV test).

Delinearization is a simple pattern match to rebuild the array access info that was lost in translating to the low-level LLVM IR. With pattern matching we get more precise dependence info instead of running the MIV test, which most of the time gives up and returns the conservative "don't know" answer.
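As a concrete example of the SIV/MIV distinction (array shape invented for illustration): take an `M x N` array accessed as `A[i+1][j]` (write) and `A[i'][j']` (read). On the linearized form, one equation mixes both induction variables; after delinearization, the test splits into one single-variable equation per subscript:

```latex
% Linearized (single subscript): one Diophantine equation in several
% variables -- this is the hard MIV case.
(i + 1)\,N + j = i'\,N + j'
  \;\Longrightarrow\; (i + 1 - i')\,N + (j - j') = 0

% Delinearized: one equation per subscript, each with a single
% induction variable -- the easy SIV case.
i + 1 = i', \qquad j = j'
```

The SIV equations are solved immediately (dependence distance 1 in the outer dimension, 0 in the inner), whereas the MIV equation has spurious integer solutions unless the analysis also knows that \(0 \le j, j' < N\), which is exactly the bounds information delinearization recovers.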

It looks like a bit more time is spent in BasicAA. Interchange isn’t even triggering

use BatchAA inside DA to cache queries.

Loop interchange should not depend on AA when the code has no nested loops.

GCC’s pass manager has a “gate” function to determine whether the pass should run or not. Maybe we need something similar to avoid building all the required analyses.

The problem is not that delinearization is using getSourceElementType(), it’s what it does with it. You are not allowed to drive optimization heuristics based on the structural type information in a GEP. You can only make use of the offset calculation that type implies. The type information will go away entirely in the future, once the ongoing migration to ptradd representation completes.

A typical way to replace the type access is to use collectOffset(), which converts it into an explicit offset representation. But this does lose the explicit array bounds information from the type – and losing it is intentional.

LoopInterchange computes DependenceInfo itself, not via the pass manager, so it has full control over when (and whether) it gets computed.

You are not allowed to drive optimization heuristics based on the structural type information in a GEP. You can only make use of the offset calculation that type implies.

This info is not used to "drive optimization heuristics"; it is used to check the legality of loop transforms. Without array type information the dependence analysis will be less effective than today, and the ability to transform loops will be impaired.

The type information will go away entirely in the future, once the ongoing migration to ptradd representation completes.
A typical way to replace the type access is to use collectOffset(), which converts it into an explicit offset representation. But this does lose the explicit array bounds information from the type – and losing it is intentional.

So in the near future the GEP representation will be equivalent to how memory accesses are represented in GCC's RTL back-end: Regs and Memory (GNU Compiler Collection (GCC) Internals).

In the long run we could add more annotations to compensate for the lack of info in LLVM IR, or we could decide that LLVM IR is unsuitable for these loop transforms and that they need to be implemented on MLIR or some other IR where the array accesses are still available.

Adding to what Sebastian wrote, this does look problematic and load-bearing for our plans. getIndexExpressionsFromGEP() looks foundational to the delinearisation process, and if we can't access type info, that looks like a showstopper. So it seems we first need to address the more fundamental question of whether LLVM IR provides the right abstractions for loop transformations, and how to fix that.

If the goal is just to have a heuristic to derive whether a dimension exists, you can do it based on arithmetic: if you have "p + x*c1*c2 + y*c2", where "p" is a pointer, "x" and "y" are variables, and c1/c2 are invariant, then "c1" is the dimension.

My team has some code internally for doing this with SCEV we could share, if you’re interested.

Yes, DependenceInfo is not managed by the pass manager, and it is a corner case in that loop-interchange can avoid asking for this analysis before it is necessary.

For most analyses, though, the pass manager decides what will be run. There is no control over which analyses the pass manager runs, independent of whether the code contains interesting patterns or not.

If I declare that loop-interchange uses MSSA, the pass manager will build MSSA before the pass's run() is called.
Here's where all the analysis passes get pulled in by the pass manager: llvm-project/llvm/include/llvm/IR/PassManagerImpl.h at main · llvm/llvm-project · GitHub

PreservedAnalyses PA = PreservedAnalyses::all();

I traced this call when running

$ opt -passes='function(loop-mssa(loop-interchange))' test/Transforms/LoopInterchange/interchangeable.ll 

The following analyses appear in the execution trace for the above command:

PassInstrumentationAnalysis
LoopAnalysis
DominatorTreeAnalysis
AssumptionAnalysis
TargetIRAnalysis
MemorySSAAnalysis
AAManager
TargetLibraryAnalysis
BasicAA
ScopedNoAliasAA
TypeBasedAA
ScalarEvolutionAnalysis
TargetIRAnalysis

Yes, that’s exactly what the delinearization pass is doing.

Yes, thanks Eli. If you can share the code I will take a look.

I wonder if we are speaking about the same code as what LLVM currently has in Delinearization.cpp. I contributed this code when I was at QuIC, and we ended up using it internally in Polly's ARM vectorizer.

I guess the part I’m not sure about is, if you can do this just based on the math, how helpful is the version that digs into the GEP’s type? It’s a heuristic either way, so the question is just whether the GEP-based version produces better results.

Looking more carefully, I think it is the same code you’re talking about. The internal version has some improvements to catch more cases, though.

Yes, the code is a heuristic pattern match that may fail, and we do check that the delinearized form follows the little that we know about the loop trip counts:

This makes sure all the array indices fall within the array dimensions detected in delinearize.

Right… we can get the dimensions from the source type of the GEP, or we can go through SCEV; either way, we need to check that the indices fit in the detected dimensions.

So given that, is there a benefit to using the source type of the GEP? At least for basic cases, going through SCEV should produce the same result.

Hi, @sjoerdmeijer .

I also agree with enabling loop-interchange, and I can collaborate with you. So let me ask your thoughts on the cost model of loop-interchange. As we've discussed a little before on GitHub, the current cost model has room for improvement. In this case, the same loops are interchanged twice, resulting in a return to the original order. At this point, I think there are at least two things we should focus on:

  • One of the two profitability checks is wrong. I have not investigated the details, but it is likely that loop-interchange may make the wrong decision and generate code that is less efficient than the original. This could cause performance degradation.

  • It is a complete waste of time to interchange the same loops twice. Preventing this in some way could save compile time.

Do you have any thoughts or plans to tackle this issue? I'd be happy to discuss it.

Thanks.

Hi @kasuga-fj, I have just uploaded a patch to enable interchange by default: #124911. The very short summary, as I wrote there, is that I think we are getting close, but I am interested to hear other thoughts.

Regarding the cost model and interchanging the same loops twice: this hasn't come up as a problem for us, in terms of either correctness or excessive compile times, so I don't see it as a showstopper for the interchange enablement. However, I agree of course that it needs fixing, because as you say it doesn't make sense, and fixing it would save compile time.

In terms of next steps: we would like to get interchange enabled soon, if others agree. To achieve this, we have prioritised correctness, i.e. we have intentionally made interchange more conservative to avoid miscompiles (e.g. this and this change).

I see that as the first enablement step, and getting it running by default would be a major milestone. The next step after that is the optimisation work, which we will start soon, because we know interchange isn't yet triggering on our motivating examples; it will involve lifting some restrictions or adding functionality so that it does.

Thanks for your help so far, both for the patches and reviews! We are of course happy to collaborate further on this. We don't have any plans to look at the issue you mentioned, so it would be great if you want to pick it up; we are happy to help with reviews or discuss approaches in an issue or pull request.


Thanks for your reply.

I also agree with enabling interchange as soon as possible, as the list of problems is endless otherwise. I think it would definitely be a big step forward for us. As far as I know, the major correctness issues have been resolved on the LoopInterchange side. I don't know much about the DependenceAnalysis side, but I hope they are close to a fix. My only concern is possible performance degradation due to unprofitable transformations performed by interchange, but I haven't found any problems so far, so it should not be a major blocker.

Thanks also for your help with reviews and patches! As for next steps, I would like to work on improving the cost model and the legality checks that we are now doing conservatively. I hope we can continue to work together on this, and thank you in advance for your cooperation. Of course, if you have any other topics besides this, please feel free to send review requests or mentions my way. I'll be happy to discuss them.

Thanks!