I’m developing a NumPy-like application that uses MLIR to generate and optimize many of the kernels inside the library. I was hoping to use affine maps to implement some transformations on arrays (discussed later), but after reading the discussion in Semantics of affine_maps in memrefs? - #3 by bondhugula, I’m less sure about a reasonable direction to go, and would appreciate some feedback / direction help.

Consider an implementation of elementwise binary addition on 2-D arrays in NumPy, backed by MLIR. One might write some code that generates a loop nest like this:

```
#map = affine_map<(d0) -> (d0)>

func.func @add(%arg0: memref<?x?xf64>, %arg1: memref<?x?xf64>, %arg2: memref<?x?xf64>) attributes {llvm.emit_c_interface} {
  %c0 = arith.constant 0 : index
  %dim = memref.dim %arg0, %c0 : memref<?x?xf64>
  %c1 = arith.constant 1 : index
  %dim_0 = memref.dim %arg0, %c1 : memref<?x?xf64>
  affine.for %arg3 = #map(%c0) to #map(%dim) {
    affine.for %arg4 = #map(%c0) to #map(%dim_0) {
      %0 = affine.load %arg0[%arg3, %arg4] : memref<?x?xf64>
      %1 = affine.load %arg1[%arg3, %arg4] : memref<?x?xf64>
      %2 = arith.addf %0, %1 : f64
      affine.store %2, %arg2[%arg3, %arg4] : memref<?x?xf64>
    }
  }
  return
}
```

To keep the code generation relatively simple, it is nice to assume that the two input arrays to the binary operation, as well as the output array, all have the same shape. However, that often isn’t the case in NumPy due to things like slicing and broadcasting. One could write something like

```
x = np.ones((n, n))
y = 1.0 * x
```

which is usually implemented by broadcasting a view of the 1.0 scalar as a 2-D array – the scalar is promoted to an array only logically, and no extra memory is allocated.
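The no-copy behavior described above can be observed directly in NumPy: `broadcast_to` makes the logical promotion explicit, and the zero strides of the resulting view show that every logical element aliases the same storage.

```python
import numpy as np

n = 4
x = np.ones((n, n))

# Logically promote the scalar 1.0 to an (n, n) array. This is a view,
# not an allocation of n*n elements.
one = np.broadcast_to(np.float64(1.0), (n, n))

# Both strides are 0: all n*n logical indices map to the single
# underlying float64, so no extra memory backs the view.
print(one.shape)    # (4, 4)
print(one.strides)  # (0, 0)

# Elementwise ops then work as if `one` were a real (n, n) array.
y = one * x
print(np.array_equal(y, x))  # True
```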

So there are two constraints I’m trying to grapple with: 1) I want to keep the implementations of the functions that generate these MLIR fragments simple – ideally they don’t have to worry about transformations like the ones mentioned above – and 2) I want those transformations eventually reflected in the generated code, either at the point of generation or in passes applied afterwards.

My initial thinking was to make a pass over the generated MLIR fragments and construct an affine map for each of the input arrays based on the transformations applied to it. For example, if an array is broadcast / promoted over a dimension, I would like to add an affine map that intuitively removes the promoted dimension. However, based on reading the documentation, I don’t think this sort of thing is possible – affine layout maps don’t seem amenable to this kind of dimension remapping, since properties like injectivity are no longer maintained.
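The injectivity problem can be made concrete with a small sketch of the index arithmetic a broadcast layout would imply: the promoted dimension effectively gets a stride of 0, so distinct logical indices collapse onto the same physical offset. (The names `offset_of` and `strides` here are illustrative, not MLIR constructs.)

```python
def offset_of(index, strides):
    """Map a logical index tuple to a physical offset via strides,
    mirroring what a memref layout map computes."""
    return sum(i * s for i, s in zip(index, strides))

# A single scalar broadcast to a 4x4 view: both strides are 0.
strides = (0, 0)

# The resulting map is not injective - distinct logical indices hit
# the same physical offset, which is exactly the property that a
# memref affine layout map is not meant to give up.
print(offset_of((0, 0), strides))  # 0
print(offset_of((3, 2), strides))  # 0
```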

Another option here is to write a pass that “undoes” each transformation directly on the argument arrays and their uses inside the generated kernels, but I worry that this could be a little brittle: it would have to run after the more specific dialects are lowered into memref, at which point the assumptions those dialects made about the operand memrefs might no longer hold.

A final strategy is to instead account for these kinds of transformations when generating the kernel body itself – where, as in the case above, we already don’t generate code against a 2-D memref – and manually undo the transformations when generating the code that accesses each argument array. However, this approach feels like it greatly complicates the implementation of each NumPy operator, since every operator would need to handle the possible dimensionalities of each input array.
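As a rough illustration of what that last strategy entails, here is a minimal sketch of the per-operand index rewriting, assuming NumPy broadcasting rules. The helper `index_for_operand` is hypothetical – it stands in for whatever the code generator would do when emitting each load, and is not part of any MLIR API.

```python
def index_for_operand(loop_index, out_shape, op_shape):
    """Map an output loop index to an operand index under NumPy
    broadcasting: drop the leading dimensions the operand lacks, and
    clamp the index to 0 on any dimension where the operand has size 1."""
    ndiff = len(out_shape) - len(op_shape)
    trimmed = loop_index[ndiff:]
    return tuple(0 if d == 1 else i for i, d in zip(trimmed, op_shape))

# (4, 4) output, scalar-like (1, 1) operand: every access hits [0, 0].
print(index_for_operand((2, 3), (4, 4), (1, 1)))  # (0, 0)

# (4, 4) output, 1-D (4,) operand: the leading loop index is dropped.
print(index_for_operand((2, 3), (4, 4), (4,)))    # (3,)
```

The complication the post describes falls out of this directly: every operator's codegen has to thread this index rewriting through each operand access, for every combination of operand ranks.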

Does anyone have an opinion here? Perhaps I’m misunderstanding how affine maps with memrefs could be used.