Is it possible to add parameters to the indexing_maps of linalg.generic?

indexing_maps defines the affine maps from loop indices to memory access indices.
I would like the affine_map to depend on some variables (which might only be decided at runtime). I tried to pass parameters to indexing_maps as in the following (%FH_DIM, %FW_DIM bound to the FH, FW symbols), but it fails to compile.

Does the linalg.generic index mapping support adding variables to the affine_map? If not, is there any way to support this requirement? Thanks.

%c1 = arith.constant 1 : index
%c2 = arith.constant 2 : index
%FH_DIM = memref.dim %A, %c1 : memref<?x?x?x?xf32>
%FW_DIM = memref.dim %A, %c2 : memref<?x?x?x?xf32>

linalg.generic {
  indexing_maps = [
    affine_map<(i, j, k) -> (i, j, k)>,
    affine_map<(i, j, k)[FH, FW] -> ((k mod (FH * FW)) floordiv FW, (k mod (FH * FW)) mod FW, k floordiv (FH * FW), i)>[%FH_DIM, %FW_DIM] // fails to compile
  ],
  iterator_types = ["parallel", "parallel", "reduction"] }

Hi,

I can confirm that linalg.generic does not support indexing maps that take additional symbols / variables. That means variables such as FH, FW, etc. would need to be compile-time constants if you want to use them in an indexing map. In your example this is not the case since the memref dimensions are dynamic.

It may be possible to circumvent this limitation using reshape or subview operations. If not, you may want to use different abstractions for your code generation.

Additionally, linalg.generic does not support div and mod in the indexing maps.
Supporting them would be a non-trivial extension with significant implications, and it will not happen in the near future.

Instead, in most cases, the recommended way is to represent your problem in higher dimensions and use appropriate reshape operations where necessary.

If the div/mod you need are not achievable by changing the dimension of your problem, then you should use the affine dialect.
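
For example, here is an untested sketch of what I mean: the mod/floordiv indexing from your question, computed with affine.apply and an ordinary memref.load (a 2-D %A and a loop bound %K are assumed just for illustration; mod/floordiv by a symbol is a semi-affine expression, so not every affine analysis will handle it, but it can at least be expressed):

// Untested sketch: compute the row/column from the fused index k with
// affine.apply (symbols FH, FW are the dynamic sizes), then load normally.
#row = affine_map<(k)[FH, FW] -> ((k mod (FH * FW)) floordiv FW)>
#col = affine_map<(k)[FH, FW] -> ((k mod (FH * FW)) mod FW)>

func.func @mod_div_indexing(%A: memref<?x?xf32>, %FH: index, %FW: index, %K: index) {
  affine.for %k = 0 to %K {
    %r = affine.apply #row(%k)[%FH, %FW]
    %c = affine.apply #col(%k)[%FH, %FW]
    %v = memref.load %A[%r, %c] : memref<?x?xf32>
    // ... use %v ...
  }
  return
}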

@gysit @nicolasvasilache Thanks very much for your confirmation and suggestions.

My intention is to implement convolution with implicit GEMM, which requires fetching the matrices in a non-trivial order, using mod/div operations as the code above shows. I am new to MLIR/Linalg. Could you explain the implications of div/mod in indexing maps, and the fundamental reason that div/mod is not supported? Thanks.

For this specific use case, you don’t need that.
I think this has been done at least in the context of IREE a while back.
There are relatively simple extensions that we can land to make this easier for you but I’d like to understand the problem better.

Basically the main idea (starting from a conv1d), written in index notation for simplicity:
O[n, h, c] += I[n, h + kh, f] * K[kh, c, f]
Tile h and kh by 1:

for h .. step 1
  for kh .. step 1
    sO[n, c] += sI[n, f] * sK[c, f]

where sO is the subview of O of size [n, 1, c] rooted @[0, h, 0],
sI is the subview of I of size [n, 1, f] rooted @[0, h + kh, 0],
and sK is the subview of K of size [1, c, f] rooted @[kh, 0, 0].

Now sO[n, c] += sI[n, f] * sK[c, f] is some form of matmul (C = A.B^T).
Permute your layout dimensions however you like to get a different flavor.
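
Concretely, an untested sketch of that tiled form on tensors could look like the code below. The shapes and names are only illustrative (I is n x (h + kh) x f, K is kh x c x f, O is n x h x c, with the sizes passed in as indices), and the C = A.B^T contraction is spelled out as a linalg.generic rather than a named op:

#mA = affine_map<(n, c, f) -> (n, f)>
#mB = affine_map<(n, c, f) -> (c, f)>
#mC = affine_map<(n, c, f) -> (n, c)>

func.func @tiled_conv1d(%I: tensor<?x?x?xf32>, %K: tensor<?x?x?xf32>,
                        %O: tensor<?x?x?xf32>, %N: index, %H: index,
                        %KH: index, %C: index, %F: index) -> tensor<?x?x?xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %res = scf.for %h = %c0 to %H step %c1 iter_args(%O0 = %O) -> (tensor<?x?x?xf32>) {
    %acc = scf.for %kh = %c0 to %KH step %c1 iter_args(%O1 = %O0) -> (tensor<?x?x?xf32>) {
      %ih = arith.addi %h, %kh : index
      // Rank-reducing slices: sI is [n, f], sK is [c, f], sO is [n, c].
      %sI = tensor.extract_slice %I[0, %ih, 0] [%N, 1, %F] [1, 1, 1]
          : tensor<?x?x?xf32> to tensor<?x?xf32>
      %sK = tensor.extract_slice %K[%kh, 0, 0] [1, %C, %F] [1, 1, 1]
          : tensor<?x?x?xf32> to tensor<?x?xf32>
      %sO = tensor.extract_slice %O1[0, %h, 0] [%N, 1, %C] [1, 1, 1]
          : tensor<?x?x?xf32> to tensor<?x?xf32>
      // sO[n, c] += sI[n, f] * sK[c, f], i.e. C = A.B^T.
      %mm = linalg.generic
          {indexing_maps = [#mA, #mB, #mC],
           iterator_types = ["parallel", "parallel", "reduction"]}
          ins(%sI, %sK : tensor<?x?xf32>, tensor<?x?xf32>)
          outs(%sO : tensor<?x?xf32>) {
      ^bb0(%a: f32, %b: f32, %o: f32):
        %mul = arith.mulf %a, %b : f32
        %add = arith.addf %o, %mul : f32
        linalg.yield %add : f32
      } -> tensor<?x?xf32>
      %upd = tensor.insert_slice %mm into %O1[0, %h, 0] [%N, 1, %C] [1, 1, 1]
          : tensor<?x?xf32> into tensor<?x?x?xf32>
      scf.yield %upd : tensor<?x?x?xf32>
    }
    scf.yield %acc : tensor<?x?x?xf32>
  }
  return %res : tensor<?x?x?xf32>
}

The rank-reducing tensor.extract_slice / tensor.insert_slice ops are what turn the 3-D operands into 2-D matmul operands and write the tile back.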

This is assuming you really want to target matmul for some reason (e.g. you want wmma instructions).

But in practice, we do much better at the vector level because we can amortize the load and store of contiguous h / kh and get very high arithmetic intensity for any stride and dilation (see the vectorization impl here). This is described in more detail in our paper:

The main reason is that the logic of linalg.generic, as well as the subview/tensor_insert/tensor_extract operations, operates on subsets.
The image of a contiguous subset under functions with divs and mods can become non-contiguous, which creates all sorts of problems (for example, with FW = 8 the image of the contiguous range k in [6, 10) under k mod FW is {0, 1, 6, 7}, i.e. two disjoint intervals).
So the linalg.generic operation limits itself to a class of (hyper-)rectangular abstractions in the specification of its indexing maps.
When you operate on dense tensors, this means you cannot do mods, divs, or indirections.
Note that this is related to the “inside” of a linalg.generic; you can do whatever you want outside of it.

OTOH you can use different types than dense tensors and then you get e.g. the sparse work that @aartbik has been building.

There are different possibilities for extension and generalization but if the objective is just to allow divs and mods inside the linalg.generic, then affine is a better option that works today.

Taking a big detour from your original question, but since sparse tensors are mentioned: please note that there we separate the computational indexing from the preferred dimension order of the storage (much more important for sparse than for dense, since accessing a sparse storage format in the wrong dimension order typically incurs a much higher performance penalty). So in the kernel below, making i outermost would access Z along rows and U along columns (and vice versa for j outermost). As such, selecting a column-wise storage for U makes sense, since that enables efficient code generation, and it keeps the default row-wise layout for the dense tensor too.

alpha = pt.tensor(42.0)
U = pt.tensor([N, M], pt.format([pt.compressed, pt.compressed], [1, 0]))
Z = pt.tensor([M, N], pt.format([pt.dense,      pt.dense]))

Z[i, j] = alpha[0] * U[j, i]

More details can be found in our pre-print paper:

  • Aart J.C. Bik, Penporn Koanantakool, Tatiana Shpeisman, Nicolas Vasilache, Bixia Zheng, and Fredrik Kjolstad. Compiler Support for Sparse Tensor Computations in MLIR. Distributed as pre-print arXiv:2202.04305, February 7, 2022.

As Nicolas mentioned above, this kind of transformation was implemented in IREE a while back (here).

Thanks very much for your detailed explanations.
Actually, I have roughly read the relevant code in IREE (how it does tiling, distribution, and memory promotion), and it makes great progress on matmul performance in CUDA, as shown in this blog https://google.github.io/iree/blog/2021-10-15-cuda-backend/. I also found the code in ConvertConv2DToImg2ColPass.cpp pointed out by @MaheshRavishankar. It implements the standard im2col algorithm, which requires a lot of data movement.
I would like to implement implicit GEMM for conv2d in CUDA, for better performance.

Implicit GEMM is defined and well implemented in the open-source CUTLASS project, https://github.com/NVIDIA/cutlass/blob/master/media/docs/implicit_gemm_convolution.md, which shows great performance for conv2d. The following code from the above link roughly shows the algorithm; it computes the matrix indices in a piecewise-contiguous way, even in the innermost loop:

int GEMM_M = N * P * Q;
int GEMM_N = K;
int GEMM_K = C * R * S;
for (int gemm_i = 0; gemm_i < GEMM_M; ++gemm_i) {
  for (int gemm_j = 0; gemm_j < GEMM_N; ++gemm_j) {
    // Decompose gemm_i into (n, p, q); PQ denotes P * Q.
    int n = gemm_i / (PQ);
    int npq_residual = gemm_i % (PQ);
    int p = npq_residual / Q;
    int q = npq_residual % Q;
    Accumulator accum = 0;
    for (int gemm_k = 0; gemm_k < GEMM_K; ++gemm_k) {
      int k = gemm_j;
      // Decompose gemm_k into (c, r, s); RS denotes R * S.
      int c = gemm_k / (RS);
      int crs_residual = gemm_k % (RS);
      int r = crs_residual / S;
      int s = crs_residual % S;
      // f and g map (p, r) and (q, s) to input coordinates,
      // accounting for stride, padding, and dilation.
      int h = f(p, r);
      int w = g(q, s);
      ElementA a = tensor_A.at({n, h, w, c});
      ElementB b = tensor_B.at({k, r, s, c});
      accum += a * b;
    }
    C[gemm_i * K + gemm_j] = accum;
  }
}

For me, it’s not easy to use subview for the above matrix indexing, as it would need a different subview for every inner-loop iteration.
Following the approach the sparse tensor work shows might be one solution, but I am not familiar with it yet. And even if it works, I guess the tiling/distribution work based on linalg (like that shown in IREE for matmul) could not be reused, because of the (hyper-)rectangular limitation.
I would turn to affine for a try, roughly along the lines of the sketch below.
Thanks.
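
Here is an untested sketch of what I have in mind with affine; it only shows the gemm_i/gemm_k index decomposition from the CUTLASS pseudocode above (the f/g computation, the loads, and the accumulation are elided, and all names are mine):

#n_map = affine_map<(i)[PQ] -> (i floordiv PQ)>
#p_map = affine_map<(i)[PQ, Q] -> ((i mod PQ) floordiv Q)>
#q_map = affine_map<(i)[PQ, Q] -> ((i mod PQ) mod Q)>
#c_map = affine_map<(k)[RS] -> (k floordiv RS)>
#r_map = affine_map<(k)[RS, S] -> ((k mod RS) floordiv S)>
#s_map = affine_map<(k)[RS, S] -> ((k mod RS) mod S)>

func.func @implicit_gemm_indices(%GEMM_M: index, %GEMM_N: index, %GEMM_K: index,
                                 %PQ: index, %Q: index, %RS: index, %S: index) {
  affine.for %gemm_i = 0 to %GEMM_M {
    affine.for %gemm_j = 0 to %GEMM_N {
      // Decompose gemm_i into (n, p, q); PQ and Q come in as symbols.
      %n = affine.apply #n_map(%gemm_i)[%PQ]
      %p = affine.apply #p_map(%gemm_i)[%PQ, %Q]
      %q = affine.apply #q_map(%gemm_i)[%PQ, %Q]
      affine.for %gemm_k = 0 to %GEMM_K {
        // Decompose gemm_k into (c, r, s).
        %c = affine.apply #c_map(%gemm_k)[%RS]
        %r = affine.apply #r_map(%gemm_k)[%RS, %S]
        %s = affine.apply #s_map(%gemm_k)[%RS, %S]
        // ... compute h/w from (p, r) and (q, s), load the two operands,
        // multiply-accumulate, and store the result ...
      }
    }
  }
  return
}

Mod/floordiv by a symbol is only semi-affine, so I do not expect all affine transformations to handle it, but it at least expresses the indexing directly.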