I didn’t know the story of LoopUnswitch vs. SimpleLoopUnswitch. As you said, this RFC is indeed about SimpleDA, not NewDA. I don’t intend to implement a new algorithm, although some additional validation is definitely needed.
Most of my earlier comments were based on the assumption that regressions caused by removing DA functionality should be avoided. However, if such regressions can be justified in the context of default enablement (which I probably should have asked about first), then I don’t have a strong preference for creating a new pass, and I’m happy to focus on improving the existing one – especially to avoid the worst-case scenario where both the newly started pass and the improvements to the current pass end up stalling.
“Symbolic” might be the wrong word. Solving (integer) linear equations[1] is NP-complete, but well-studied. Heuristics solve common cases efficiently and can be aborted with complexity thresholds (like ScalarEvolution). In particular, LLVM already has at least two implementations:

- the Integer Set Library (isl), used by Polly
- the Presburger library (FPL) in MLIR
LAA is a special case for innermost loops (e.g., a load/store is only executed once per iteration to compute strides) and has vectorization-only assumptions (such as that “lexical forward dependencies” are fine). Generalization would lose these properties, or add complexity by adding special cases for innermost loops. This seems contrary to the goal of keeping it simple.
On the other hand, we will eventually also need a dependence analysis for outer-loop vectorization.
Q: For which two iteration values i1, i2 of a loop over i does A[f(i)] alias with itself (meaning a dependence with distance i2 - i1)? A: Iff f(i1) == f(i2). ↩︎
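As a concrete instance (my illustration, assuming an affine subscript):

```latex
f(i) = a\,i + b:\quad
f(i_1) = f(i_2)
  \iff a\,i_1 + b = a\,i_2 + b
  \iff a\,(i_2 - i_1) = 0
```

So for a ≠ 0 the only solution is distance i2 - i1 = 0 (no cross-iteration dependence), while a = 0 makes every pair of iterations dependent.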
I was referring to something like symbolic execution with the word “symbolic”. The word “hard” would also be confusing; what I intended to emphasize was soundness, not computational complexity. More specifically, it’s unclear what happens if an arithmetic operation during intermediate calculations can potentially overflow. Detecting overflow would be straightforward if all the operands of the arithmetic operation were constant integers, but it becomes more complicated when symbolic values (e.g., SCEVUnknown) are involved.
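To make the constant-only case concrete, here is a minimal sketch (my illustration, not DA’s actual code; subMulChecked is a hypothetical helper) using llvm::APInt’s overflow-reporting operations. Once an operand is a SCEVUnknown rather than a constant, no such local check is possible:

```cpp
#include "llvm/ADT/APInt.h"
#include <optional>

using llvm::APInt;

// Compute (A - B) * C for constant operands, giving up (std::nullopt,
// analogous to returning SCEVCouldNotCompute) as soon as any intermediate
// step overflows, instead of silently wrapping.
static std::optional<APInt> subMulChecked(const APInt &A, const APInt &B,
                                          const APInt &C) {
  bool Overflow = false;
  APInt Diff = A.ssub_ov(B, Overflow); // signed subtract, reports overflow
  if (Overflow)
    return std::nullopt;
  APInt Prod = Diff.smul_ov(C, Overflow); // signed multiply, reports overflow
  if (Overflow)
    return std::nullopt;
  return Prod;
}
```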
I believe each dependence test (e.g., xxxSIV, xxxRDIV) used by DA is a special case of the general integer linear equation solvers you mentioned, so these overflow issues should be solvable with appropriate methods. However, I think it’s premature to discuss this topic at this point, given the number of existing bugs in DA.
I had in mind that a few functions in LAA might be reusable for DA, e.g., isSafeDependenceDistance, which is almost equivalent to the Strong SIV test and has a more meticulous implementation than the one in DA. On the other hand, it’s also true that they are slightly different, so it might not be worth the effort to integrate them.
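For reference, the core of the Strong SIV case is small. A simplified sketch (illustrative only, neither DA’s nor LAA’s actual code, and ignoring the intermediate-overflow concerns discussed above): given subscripts a*i + c1 (source) and a*i + c2 (sink) in a common loop, a dependence exists iff a*i1 + c1 == a*i2 + c2 has an integer solution whose distance lies inside the iteration space.

```cpp
#include <cstdint>
#include <optional>

// Returns the dependence distance i2 - i1 if one exists within the loop,
// or std::nullopt if the two accesses are provably independent.
std::optional<int64_t> strongSIV(int64_t A, int64_t C1, int64_t C2,
                                 uint64_t TripCount) {
  if (A == 0) // degenerates to the ZIV test
    return C1 == C2 ? std::optional<int64_t>(0) : std::nullopt;
  int64_t Delta = C1 - C2;
  if (Delta % A != 0)
    return std::nullopt; // no integer solution: independent
  int64_t Distance = Delta / A;
  uint64_t AbsDistance =
      Distance < 0 ? -static_cast<uint64_t>(Distance) : Distance;
  if (AbsDistance >= TripCount)
    return std::nullopt; // solution falls outside the iteration space
  return Distance;
}
```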
Dependence analysis for outer-loop vectorization looks like a highly challenging task to me. I hope DA will serve as a good precedent for outer-loop dependence analysis.
Invariant SCEVUnknowns are just another parameter in the equation system. Variant unknowns, including non-affine expressions, need to be pessimized to “any value”.
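To illustrate what “another parameter” means (my example):

```cpp
// n is loop-invariant but statically unknown to the analysis (a SCEVUnknown).
void shift(long *A, long m, long n) {
  for (long i = 0; i < m; ++i)
    A[i + n] = A[i] + 1;
  // The dependence equation i1 + n == i2 keeps n as a symbolic parameter and
  // has the parametric solution i2 - i1 == n, i.e., a dependence with
  // symbolic distance n. If n were instead redefined inside the loop
  // (variant), the subscript would have to be pessimized to "any value".
}
```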
Both libraries make use of arbitrary-precision integers (isl_int/DynamicAPInt). Potential overflow is verified by solving equations of the form expr <= MAX_INT (or MAX_UINT); this can be skipped when nsw/nuw is present, or used to derive runtime-check assumptions on the unknowns in expr.
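As a toy instance (my numbers): for a subscript 2*i + 3 with 0 <= i < N and no nsw, the side-condition becomes a constraint on the parameter N,

```latex
\max_{0 \le i < N} (2i + 3) \le \mathrm{INT\_MAX}
  \iff 2(N - 1) + 3 \le \mathrm{INT\_MAX}
  \iff N \le \tfrac{1}{2}(\mathrm{INT\_MAX} - 1)
```

which can either be proven statically or emitted as a runtime assumption.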
If the conclusion is that DA needs to be replaced, there is no reason why the replacement needs to use the same algorithms. I am presenting options. An IntLP solver would make several problems go away:
- An existing, shared, mature IntLP implementation reduces the risk of bugs
- Integer overflow is checked in a general manner, not having to be re-thought at every step
- A single solver instead of multiple case-by-case algorithms
- More powerful due to being mathematically complete
- Eliminates the need for GetElementPtr/delinearization handling in the statically-sized array case (see the sketch below)
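A sketch of that last point (my example): with static sizes, the flattened subscript can be fed to the solver directly, with the array bounds as extra constraints, so the multi-dimensional form never needs to be recovered.

```cpp
int A[100][100];

void fill() {
  for (int i = 0; i < 100; ++i)
    for (int j = 0; j < 100; ++j)
      A[i][j] = 0; // flat address: &A[0][0] + 100*i + j
  // An IntLP solver can work on the linearized equation
  //   100*i1 + j1 == 100*i2 + j2   with 0 <= j1, j2 < 100
  // directly: the bounds force i1 == i2 and j1 == j2, so the GEP never has
  // to be delinearized back into a 2D access.
}
```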
There are drawbacks of course, most prominently the compile-time costs[1], partially due to the use of arbitrary-precision integers[2]. The conclusion of implementing a simpler DA might well be that it is not sufficiently powerful.
That can be reduced by complexity thresholds like SCEV also has. ↩︎
I would consider an implementation that falls back to a SCEVCouldNotCompute equivalent whenever a coefficient overflow happens (this is unrelated to integer overflows in the to-be-compiled program). ↩︎
To clarify: this RFC is effectively a proposal to somewhat simplify the current DA. Adding a new pass is just one possible way to do that, and I don’t intend to implement any new algorithms. Also, I’m now leaning towards removing several features from the existing DA.
Thanks for clarifying. Introducing an LP solver does sound compelling to me. But I don’t plan to work on such a big task at the moment (and I’m not confident I could complete it). Even if we end up re-implementing a new pass, I was imagining something closer to copying and pasting parts of the current DA with some functionality stripped out. Yeah, it might be less interesting, but I believe it would be more feasible.
Good point. Although I don’t have any numbers or data, I believe the current DA is somewhat overkill, and a simplified version would be sufficient for basic cases (though of course, that depends on how much we simplify it). I think we’ll figure out the right balance as we go.
Even though I don’t intend to introduce an existing LP solver for dependence analysis, I’m personally very interested in the topic. So this might be a bit off-topic, but let me ask: what is the difference between that and Polly? Is it the domain of the solver? In this case, the domain of the LP solver would be SCEV, right?
I wanted to summarise this thread (the latest developments).
@sebpop created this pull request to strip out a large chunk of code that is problematic in different ways. @kasuga-fj seems happy with this approach:
> To be clear: I am happy with this direction – it’s an alternative to NewDA that I proposed on Discourse.
To me, it looks like there’s some consensus to move forward in this way.
I’ve left one minor question on that merge request, and might have another one, but I was planning to approve that patch soon.
I have created this bucket issue for all DA issues, to give an overview of what we need to do:
Polly could be understood as a frontend for the Integer Set Library. In addition to dependence analysis, ISL also does schedule optimization and code generation (~SCEVExpander).
Polly also has PolyhedralInfo, which could be used by LLVM transformation passes for their analyses, but it is of limited usefulness because Polly selects its scope itself (the outermost loop that it considers worth optimizing, so any SSA value defined before that loop is considered invariant, even if it is surrounded by another top-level loop). It also adds assumption predicates for non-aliasing and absence of integer overflow that have to be ensured at runtime, like LoopVersioningLICM. So PolyhedralInfo only returns dependence information when there are no such assumptions, which in practice is rarely the case. Even a simple loop such as
```c
for (unsigned i = 0; i < n; ++i)
  B[i] = A[i + 2];
```
needs to ensure that A and B do not alias, and that n <= UINT_MAX - 2 because otherwise i + 2 overflows. Hence my pessimism that a simplified (but correct) DA might be insufficient.
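For illustration, a sketch of the versioned form (mine, in the spirit of LoopVersioningLICM rather than anything Polly actually emits) that would discharge both assumptions at runtime:

```cpp
#include <limits.h>

// Pointer comparisons between unrelated objects are not strictly portable,
// but this is the shape of the runtime checks.
void shiftCopy(int *A, int *B, unsigned n) {
  // Reads cover [A+2, A+n+2); writes cover [B, B+n).
  if (n <= UINT_MAX - 2 && (B + n <= A + 2 || A + n + 2 <= B)) {
    for (unsigned i = 0; i < n; ++i) // fast path: provably dependence-free
      B[i] = A[i + 2];
  } else {
    for (unsigned i = 0; i < n; ++i) // fallback: original, conservative loop
      B[i] = A[i + 2];
  }
}
```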
The equivalent of SCEV expressions in Polly is isl::pw_aff. SCEVExprs are converted to isl_pw_aff using a component called SCEVAffinator. The equivalent of SCEVExpander is IslExprBuilder.
Thanks for the explanation, that will be helpful to improve DA!
This example alone feels pessimistic enough. Without loop versioning, it might not be usable in any practical case. That said, I believe a simplified DA should still work reasonably well, because:
- When I previously evaluated the capabilities of loop-interchange using the llvm-test-suite, I found that it wasn’t applied in many cases – even with the current (not simplified) DA. (Though I’m not sure if the llvm-test-suite is an appropriate benchmark.)
- In many cases where loop-interchange was applied, the memory access patterns were simple (e.g., A[i][j]), and some loops had constant trip counts.
This brings up a question for me: while accurate dependence analysis is certainly valuable, is it really necessary for the existing loop transformations in upstream LLVM? At least for loop-interchange, I suspect that the profitability check tends to reject cases with complex memory access patterns, even when the dependence analysis is powerful enough to handle those accesses. On the other hand, such reasoning might not apply to passes that are primarily triggered by user pragmas (such as unroll-and-jam).
Anyway, I feel the current DA is a bit overkill. One of the motivations for simplification is to start with a minimal viable feature set and incrementally add things as needed while keeping it sound.
I hope the cost of loop-versioning for DA will be justified someday by the benefit of enabling more loop transformations.