Yes, this is related to the linked post, at least in that they’re both stemming from the same observation/complaint.
I’ll try to elaborate on our usecase some, and hopefully I’ll strike a good balance between presenting a realistic example and avoiding extraneous details.
In our usage of MLIR, we’re generating convolution kernels that use an implicit matrix multiplication. I’m going to discuss where the signed operation emission comes up in the context of storing back the results of the multiplication, but the discussion should generalize to loading data as well.
Once the multiplication is performed, each GPU thread has a %v : vector<R x type>
of values that correspond to some section of the result matrix (with R
being a constant).
To simplify the usecase, we’ll be storing %v
to %out : memref<N x H x W x C x type>
, and we’ll have each thread’s %v
be a contiguous slice of the C
dimension (with R
evenly dividing C
). Because of this simplifying assumption, we can emit a store of %v[%i]
to %out[%n0, %h0, %w0, %c0 + %i]
(in reality, we generate coordinate-update code that converts the movement from the next value in %v
to the non-negative offsets to (n0, h0, w0, c0), allowing for fully unrolled loops without full re-computation of the indices).
So, the problem becomes computing (%n0, %h0, %w0, %c0)
. We know that there is a fixed mapping from the a thread’s ID to coordinates (%i0, %j0)
that indicate which value in the computed matrix is stored in %v[0]
. However, since we support both NCHW and NHWC tensors, since we plan to support arbitrary or mixed layouts in the future, and since the matrix->tensor map can be quite complex (for example, due to needing to handle padding while storing the results of backwards convolutions, or when loading data), it’s unreasonable to similarly hard-code the map (%n0, %h0, %w0, %c0) = f(%i0, %j0)
.
Therefore, the map f
is stored as an affine map, and we translate from matrix to tensor coordinates using expandAffineMap()
to get the initial values corresponding to the starting coordinates of each thread’s data.
The simplest example of f
is something like
f = (d0, d1) -> (d1 floordiv HW, (d1 mod HW) floordiv W, d1 mod W, d0)
(with all the capital letters being the relevant constants).
We know that the arguments d0
and d1
to this map are non-negative, though LLVM doesn’t appear to. Therefore, if affine.apply
emits checks for handling negative inputs to floordiv
or mod
when lowering the computation above to the arithmetic dialect, the downstream compiler is less capable of optimizing expensive arithmetic to cheaper bitwise operations.
Looking more broadly, while our code currently contains extensive infrastructure to work around missing features in MLIR, we are hoping to be able to express our kernel generator in a style that more closely matches the intended use of MLIR and thus take advantage of the existing infrastructure instead of fighting against it. One such improvement would be ways to express to affine.apply
(and therefore affine.for
, affine.load
, and so on) that some values are non-negative (or being able to use unsigned operations, as in the previous proposal), enabling the emission of more optimal code.
My proposal for a range analysis is intended to introduce a general optimization to the lowering of the affine dialect to arithmetic, which, on top of enabling the optimizations discussed above, would allow for observations like “Oh, this loop goes from 0 to 4 but the iteration variable is divided by 8? That’s just 0 now.”. However, there may be other solutions to our problem, which I’d be happy to discuss.
(On that note, @jungpark , since you headed the previous discussion)
(and while the above issue could be solved with a custom affine visitor that assumes everything is non-negative, that’s limited-purpose, duplicative, and leaves us with extra maintenance burden)