Constant-fold a multi-dimensional vector reduction when the source and accumulator are both constant and splat. In this case, denote the value in the source as S, the value in the accumulator as A, and the reduction op as @:
result = splat(V) where V = A @ (S @ S … @ S)
where @ is applied N times, N being the product of the reduced dimensions. The suggested implementation uses APFloat and APInt (a sketch follows the arguments below). The arguments for doing it are:
Accuracy: doing this with APFloat is more accurate than lowering to a partial reduction and performing it at runtime.
Consistency: this would make the result more consistent than, say, doing the O(N) operations at runtime, where it can vary from hardware to hardware.
If we determine that partial reduction is the correct way to do this and we should not fold, then we are enforcing non-determinism: the reduction op does not specify ordering, so for a given reduction any ordering is valid and so there is no unique valid result. It is an inconsistent stance to take, to accept that there is no unique valid result but stipulate that the result we’d get by folding is not valid.
It’s a bit awkward to argue that we should fold, because usually the arguments are over choosing between “fast or accurate” or “fast or deterministic” but here it is “fast, more accurate and deterministic or ?”.
With floating-point math it is inevitably hard to get the same results on all hardware, with all compilers, etc. And reductions can lead to different results: the default behavior of any major OpenMP/OpenACC compiler is that omp parallel for reduction or acc parallel for reduction can return different values run to run. This is justifiable because the default expectation is performance; serializing a reduction is not the default. That justification is even easier to apply here: who would be unhappy that their program got faster and more accurate?
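For concreteness, here is a minimal sketch of what the splat-splat <add> case could look like with APFloat, evaluated by applying the combining op sequentially N times (the helper name is made up, and the actual PR may handle other combining kinds such as <mul> differently, e.g. via a pow-style helper):

#include "llvm/ADT/APFloat.h"
#include <cstdint>

// V = A + S + S + ... + S, with the addition applied N times, where N is the
// product of the reduced dimensions. Every step rounds to nearest-even, so this
// matches what a sequential runtime reduction of the splat would produce under
// the default rounding mode.
llvm::APFloat foldSplatAddReduction(llvm::APFloat acc, const llvm::APFloat &src,
                                    uint64_t n) {
  for (uint64_t i = 0; i < n; ++i)
    acc.add(src, llvm::APFloat::rmNearestTiesToEven);
  return acc;
}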
Thanks for the proposal, @Iman. This is interesting because AFAIK, we don’t currently do much folding in the vector dialect for ‘math-heavy’ operations, i.e., ops that do not merely rearrange vector elements. Before we get to multi_reduction, I think we should decide whether we want to constant fold more operations in this bucket: reduction, contract, matrix_multiply, outerproduct.
Before going into the multi reduction proposal, let’s consider the simpler case of (plain) reduction. The idea of folding it is interesting because it ends up producing a single scalar, thus we don’t have to worry about creating too many dense element attributes. For integer values, this seems to me like an obvious win. For floating point values, however, my concern would be whether the fold itself is correct. Assuming no rounding flags, this seems safe for some fp values like NaN/Inf or for identity values of combining operations. I’m not sure if we can safely fold the most general case of arbitrary fp inputs and arbitrary combining kinds, since the reduction order is currently unspecified. Even folding arbitrary splat values seems tricky with <mul> requiring us to commit to a certain pow function implementation like in your APFloat PR. Last, maybe it would also be good to only make the ‘cheap’ case a fold and the more complicated non-splat cases a canon pattern to save on compilation times.
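To illustrate the <mul> concern: an illustrative check of a sequential product against a library pow. The two may agree or disagree in the last bits depending on the pow implementation used, which is exactly the commitment a fold would have to bake in (the values and iteration count below are arbitrary):

#include <cmath>
#include <cstdio>

int main() {
  float s = 1.0f / 3;
  float seq = 1.0f;
  for (int i = 0; i < 25; ++i)
    seq *= s;                          // rounds after every single multiplication
  float viaPow = std::pow(s, 25.0f);   // library pow, often computed in higher precision
  std::printf("sequential product = %.10g\nstd::pow           = %.10g\n", seq, viaPow);
}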
I’d like to highlight that this reads to me very much like a bug report… Probably not the kind of precedent we’d be looking to replicate.
In the multi_reduction case, I think we’d need to take the same precautions as in reduction and, in addition, cap the constant sizes.
Would it make sense to perhaps split the PR in two (the integer one can be submitted post-review, as it seems everyone is happy with it), so that the discussion here can focus exclusively on floating point?
Thanks, @kuhar.
Re limiting the constant size: do we even need to do anything here (the case where source and destination are both splat)? Because the output would also be splat.
Last, maybe it would also be good to only make the ‘cheap’ case a fold and the more complicated non-splat cases a canon pattern to save on compilation times.
Totally. I am only suggesting, at least for now, to do the splat-splat case.
So we also don’t have any large constant issue for this case and the only remaining issue is whether the fold may be incorrect for float, because the reduction order is unspecified. And this is the part I don’t understand.
How can we be producing an incorrect result when it is unspecified what the correct result is? If the order were specified, then one could argue that a certain value is incorrect because it does not conform to that specification. Maybe later there will be an attribute, or another op, with a specification of how to do the reduction, and once there is, that spec should be respected. But currently there isn't one.
For the splat-splat case there are really only two possibilities: fold or partial-reduce. Neither of them is any more correct than the other given the spec, but one of them is much faster and more efficient.
And if there ever was a big breakage or strong reason where this turned out to be bad, we can just turn off the folding for floats. Or restrict it to where the result is exact. We won’t be stuck with this change, we can just undo it.
If this folding is observable at all, then it should only be done if explicitly allowed by floating-point control flags (a la “fast-math”).
If this folding is not observable at all, then it’s legal. So if you’re OK with less optimized compilation times, then a simple way to make this folding legal is to make it perform the original math expression instead of trying to rewrite it mathematically. Evaluate a + ... + a at compile time, don’t evaluate it as n * a.
Without even getting to multiplicative reductions being folded into a power, maybe we can already advance this discussion on the simpler case of additive reductions.
There are three problems with rewriting a + ... + a into n * a:
Unlike mathematical real numbers, floating point is not distributive: x * (y + z) != x * y + x * z in general. In particular, a + ... + a != n * a in general. This is demonstrated by the test program below.
Since n is an integer and a a floating-point value, forming the expression n * a relies on introducing casts. There are multiple ways to introduce them, resulting in more different results. Again demonstrated in the program below.
For very large integral values of n (larger than 1/epsilon, e.g. ~1e+7 for f32), two separate issues arise, one leading to minor differences and one leading to huge differences:
3.a. (Minor difference) The integral value n may not be exactly representable, so the cast of n to a floating-point type, which point 2 above makes necessary, itself introduces numerical differences.
3.b. (Huge difference) When the sum gets large enough that a is less than epsilon times sum, we have sum + a == sum. The sum a + ... + a thus stops increasing, while n * a keeps increasing linearly with n. At that point, the difference gets arbitrarily large. Demoed in the 2nd run below.
Demo:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  if (argc != 2) {
    fprintf(stderr, "pass an integer!\n");
    exit(EXIT_FAILURE);
  }
  int n = strtoll(argv[1], NULL, 10);
#define PRINT(X) printf(#X " = %.10g\n", (X))
  // Initialize with any constant `a` that has nonzero low mantissa bits, so
  // that additions and multiplications have to perform rounding.
  float a = 1.0f / 3;
  PRINT(a);
  float sum = 0.f;
  for (int i = 0; i < n; ++i)
    sum += a;
  PRINT(sum);
  PRINT(((double)n) * ((double)a));
  PRINT((float)(((double)n) * ((double)a)));
  PRINT(((float)n) * ((float)a));
  PRINT(n * a);
}
$ clang ~/fold.c -o ~/fold && ~/fold 10
a = 0.3333333433
sum = 3.333333015
((double)n) * ((double)a) = 3.333333433
(float)(((double)n) * ((double)a)) = 3.333333492
((float)n) * ((float)a) = 3.333333492
n * a = 3.333333492
$ clang ~/fold.c -o ~/fold && ~/fold 10000000
a = 0.3333333433
sum = 3090156.25
((double)n) * ((double)a) = 3333333.433
(float)(((double)n) * ((double)a)) = 3333333.5
((float)n) * ((float)a) = 3333333.5
n * a = 3333333.5
$ clang ~/fold.c -o ~/fold && ~/fold 100000000
a = 0.3333333433
sum = 8388608
((double)n) * ((double)a) = 33333334.33
(float)(((double)n) * ((double)a)) = 33333334
((float)n) * ((float)a) = 33333334
n * a = 33333334
So the fold can actually do the computation without rewriting a + … + a into n * a. It can just loop over the values and do the partial reduction. This may take a long (compilation) time, so it can be limited to N below a threshold. If this is how we do it, doing the partial reduction at compile time, then the results would be the same.
Right! That is my suggestion. I can't think of an objection to that, other than the potentially larger compilation times, but like you say, this could be limited to some threshold. If the values you happen to care about are not that huge, feel free to set the threshold just above them.
I don’t believe this is permissible either because, as it stands now, vector.reduction does not prohibit reassociation, so my understanding is that lowerings with arbitrary bracketing/association are allowed. This is less contrived than it appears because it’s natural to distribute reduction to parallel processors and have a higher-level reduction at the end. For this reason, I think for a fold not to be observable, we’d have to guarantee that we produce the same result as any of these valid lowerings.
For integer values, this should be always allowed; for certain fp values too (e.g., NaN, +0.0 for <add>, etc.), but I’m not convinced it’s valid in the general case, with or without a plain loop. The reason is that MLIR is not only an ML compiler and can also be used for languages/compilers that do care about strict fp correctness, and having observable folds is not a good default.
To allow for these more aggressive folds that we’d want to use in ML, I think that it’d make sense to make them permissible under fast math flags, or put them in off-by-default rewrite patterns that projects can selectively enable.
Good point! To summarize, this is falling victim to the current under-specification of FP math in MLIR, particularly the unspecified association order of reductions. We are going to have an MLIR-wide discussion about that sooner or later; the fact that it is causing friction here is a data point towards that. Until that's resolved, yes, non-default flags / non-default patterns may be the best that can be done.
AFAIK, provided that all input elements are identical, we only need to worry about add and mul. Put differently, all integer and most FP cases should be safe to fold. (*)(**).
+1
I hope I am not misinterpreting the discussion, but if “distribution” is the natural thing to do, then it sounds like the current default behavior is already producing an approximation of the original reduction. Just to be precise, this is the original calculation:
SUM_1 = a + ........................... + a
The distributed sum:
SUM_2 = (a + ... + a) + ... + (a + ... + a)
Since FP addition is not associative, we can't assume that SUM_1 == SUM_2. Hence, with distribution, we are already introducing an approximation (the quick check below illustrates this). To me, this PR simply introduces another approximation. Is this a fair assessment?
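A quick illustrative check of the point above (the values and split are arbitrary; the exact outputs will vary, but the two sums are not guaranteed to match):

#include <cstdio>

int main() {
  const int n = 10000000;
  const float a = 1.0f / 3;
  // SUM_1: one sequential chain.
  float sum1 = 0.f;
  for (int i = 0; i < n; ++i)
    sum1 += a;
  // SUM_2: two distributed partial sums, combined at the end.
  float lo = 0.f, hi = 0.f;
  for (int i = 0; i < n / 2; ++i)
    lo += a;
  for (int i = n / 2; i < n; ++i)
    hi += a;
  std::printf("SUM_1 (sequential)  = %.10g\n", sum1);
  std::printf("SUM_2 (distributed) = %.10g\n", lo + hi);
}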
-Andrzej
(*) We should still make sure that we don’t create huge vectors.
(**) I couldn’t find what the difference between e.g. minnumf and minimumf is, so I might be missing something intricate here. Having said that, I don’t see minnumf in the 'arith' dialect docs ('arith' Dialect - MLIR).
minnumf vs minimumf: the difference is in handling NaNs. minimumf propagates the NaN (if one is NaN result is NaN). minnumf doesn’t: if one input is NaN, it returns the other input.
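A small illustration of the two behaviours (assuming they mirror llvm.minnum / llvm.minimum; the propagating helper below is hand-written for the example, and only the NaN handling is shown):

#include <cmath>
#include <cstdio>

// minimumf-like: propagate NaN if either input is NaN.
static float minimumLike(float a, float b) {
  if (std::isnan(a) || std::isnan(b))
    return NAN;
  return a < b ? a : b;
}

int main() {
  float nan = NAN, x = 1.0f;
  // minnumf-like behaviour: std::fmin ignores the NaN and returns the other input.
  std::printf("fmin(NaN, 1.0)        = %g\n", std::fmin(nan, x));
  // minimumf-like behaviour: the NaN propagates to the result.
  std::printf("minimumLike(NaN, 1.0) = %g\n", minimumLike(nan, x));
}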
Also, we already fold arith::AddFOp. So, say you have a machine with a 3-input add; you may later want to change a chain of AddFOps computing a + … + a into (a + a + a) + (a + a + a) + … using your 3-input add instruction. By folding, you can't do that anymore, and the results would differ depending on whether the input is constant or not. By the same logic, shouldn't we also remove AddFOp::fold?
I think that you implicitly assume the following parenthesization: a + (a + (...)), since the only well-defined primitive for fp addition is a binary operator, IIUC. My point is that, in the dialect docs as of today, the exact parenthesization is not specified; therefore I consider any of them equally valid and not an approximation.
You’d have to make sure that the lowering from arith.addf to your 3-input add preserves the semantics of arith.addf. If it does not, it’s not the fold that’s bad, it’s the lowering. An escape hatch to allow such lowering is the fast math flags.
As a guiding principle, I think that for any op with implementation-defined behavior, we can only fold when we are certain that the fold causes no observable difference for any of the valid implementations. Otherwise we have to delegate the folding logic, if any, to those implementations.
I’ll try a different example: say we have an unordered_set dialect with set operations: unordered_set.insert and unordered_set.get_element. The insert operation adds a value to the set and returns a new set with an implementation-defined order, while get_element returns a set element at the given index.
In this scenario, it would be for the lowering to decide how exactly to implement the set: whether to use the hash value order, preserve the insertion order, or something else. At the level of the unordered_set dialect, it would be invalid to add a fold for get_element that checks if the index is 0 and returns the most recently inserted element, assuming it will be found at index 0 because the element order was not specified.
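To make the analogy concrete (C++ standing in for the hypothetical dialect): std::unordered_set gives no guarantee about which element an iteration visits first, so a "fold" that assumes the most recently inserted element sits at position 0 would be baking one implementation's behaviour into the abstraction:

#include <cstdio>
#include <unordered_set>

int main() {
  std::unordered_set<int> s;
  for (int v : {7, 42, 3, 19})
    s.insert(v);
  // The element at "index 0" is whatever the implementation's bucket layout puts
  // first; it may or may not be 19, the most recently inserted value.
  std::printf("first element = %d (implementation-defined)\n", *s.begin());
}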
I was wondering how this ambiguity is avoided in the LLVM vector intrinsics, and in fact it seems vector is intended to stay close to them:
The doc for reduction op in Vector mentions this ('vector' Dialect - MLIR)
If the intrinsic call has the ‘reassoc’ flag set, then the reduction will not preserve the associativity of an equivalent scalarized counterpart. Otherwise the reduction will be sequential, thus implying that the operation respects the associativity of a scalarized reduction. That is, the reduction begins with the start value and performs an fadd operation with consecutively increasing vector element indices.
llvm.vector.reduce, by default, assumes a sequential order. So if we wish to remain close to the corresponding LLVM intrinsics, shouldn't we just adopt the same? Default to sequential, and introduce reassoc as a way to allow a non-sequential order.
And even in that case, suppose we introduced reassoc to disambiguate, so that we have both: sequential by default, with the option to enable reassoc. Is it true that in both cases it is allowed to constant fold (sequentially)? Even reassoc does not forbid a sequential evaluation, so for constants the fold would still be permitted, with or without reassoc. Or, the other way around, say we disambiguate by defaulting to reassoc but permit an option to enforce sequential order; a (sequential) constant fold would still be permitted in both cases, right?
Thanks for bringing this up for discussion! In general, I agree with the points that @kuhar made. For me, the key aspect is to make sure we enable the functionality that @Iman is looking for while keeping the default case conservative.
Adding to or emphasizing what has been mentioned before:
We might have overlooked FP exceptions and signaling. It would be a lot of fun if the folding at compile time led to an FP exception crash, which might be handled differently by the runtime hardware. We need optionality for users who may want to take this risk.
Compilation time should not be underestimated. Depending on the target runtime hardware, performing a multi-GB matmul on a single CPU thread at compile time might never be worthwhile. We need optionality regardless of the folding applicability.
More accuracy is not necessarily better. We should match the accuracy of the selected FP mode/models. Without FP models in place in MLIR, we should be conservative by default (IEEE?). Enforcing a more accurate model after the fact is extremely hard. We could leverage the reassoc FMF but that seems somewhat orthogonal to the fact that the computation would happen at compile time.
I don’t think these ops have “implementation-defined” semantics but fp-model based semantics, similar to LLVM. For example, AFAIR a vector reduction intrinsic is lowered to a tree reduction or sequential reduction depending on the fp model.
Given this, I think we could take the following steps in the short term:
Enabling folding of “large” integer operations under a flag or function_ref check, disabled by default.
Bringing FMFs and rounding modes to all these vector operations.
Enabling folding of “large” FP operations with the appropriate FMF and rounding mode under a flag or function_ref check. Different folding strategies could be implemented based on the provided FMFs.
In the longer term, this should be integrated into an MLIR FP model. The folding behavior could even be modeled with an independent “folding mode” attribute or a dedicated FMF, as we considered for other FP behaviors. Unfortunately, it’s not clear to me when an FP model will be proposed for MLIR.
I feel that the following nuance might be the key to allowing you to get what you need with minimal friction:
Maybe you don’t need this constant-evaluation pattern to be a folding pattern, that is, to be a ::fold() method.
Being a folding pattern is what really heightens the requirements of unobservability here.
If it were just a pattern with a populateImansFoldingPattern() function that downstreams could choose to call or not, that would allow you to do whatever you want.
And then, if in some cases it can be proven to be unobservable and really useful to have as a folder, the decision to add it to ::fold() can always be revisited later.
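For what it's worth, a minimal sketch of that shape (everything except MLIR's own types and the populate function name mentioned above is hypothetical, and the match/rewrite body is elided):

#include "mlir/Dialect/Vector/IR/VectorOps.h"
#include "mlir/IR/PatternMatch.h"

namespace {
// Opt-in constant evaluation of vector.multi_reduction; not a ::fold(), so the
// unobservability requirement of folders does not apply.
struct ConstEvalMultiReduction
    : public mlir::OpRewritePattern<mlir::vector::MultiDimReductionOp> {
  using OpRewritePattern::OpRewritePattern;

  mlir::LogicalResult
  matchAndRewrite(mlir::vector::MultiDimReductionOp op,
                  mlir::PatternRewriter &rewriter) const override {
    // ... match a splat constant source/accumulator, evaluate with APFloat/APInt,
    // and replace the op with an arith.constant ...
    return mlir::failure(); // placeholder
  }
};
} // namespace

// Downstreams that want the behavior call this; nothing changes for anyone else.
void populateImansFoldingPattern(mlir::RewritePatternSet &patterns) {
  patterns.add<ConstEvalMultiReduction>(patterns.getContext());
}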
It is equally valid, but we can only fold if we guarantee that the result is the same as the actual implementation; otherwise we may end up with self-contradictory/unsafe code. For example, consider something like this (pseudocode):
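Hypothetically, something along these lines (a reconstructed sketch: x and cst stand for the x and const referenced below, and foo/reduce_add are illustrative names):

#include <cstring>

extern void foo(float v);                        // precondition: only call with v > 42.0f
extern float reduce_add(const float *v, int n);  // reduction with unspecified order

void caller(const float *x, const float *cst, int n) {
  // Suppose the compiler constant-folded reduce_add(cst, n) using one particular
  // association and proved the folded value is > 42. Even after checking that x is
  // bit-identical to cst, the runtime reduce_add(x, n) may associate differently
  // and land below 42, violating foo's precondition.
  if (std::memcmp(x, cst, sizeof(float) * n) == 0)
    foo(reduce_add(x, n));
}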
It’s possible that foo is only safe to call for values greater than 42 and even though we checked that x and const are bit-identical, we end up violating this precondition by evaluating reduction twice: once at runtime, once when folding.
+1 Provided that the underlying target is LLVM. Not sure about SPIR-V, @kuhar?
vector.reduction seems much higher-level than llvm.vector.reduce and not limited to very small vectors. But even for small vectors, the spir-v lowering uses an unspecified order via spirv.dot. When that’s not available, we expand to a tree reduction.