[RFC] add linalg::IndexOnlyGenericOp

I’m planning to add linalg::IndexOnlyGenericOp, as described below. The difference from the existing linalg::GenericOp and linalg::IndexedGenericOp is that the body function only accepts indices and memrefs as parameters, and the user handles the load/store in the body function according to actual needs.

We have to use linalg::IndexOnlyGenericOp instead of GenericOp/IndexedGenericOp when:
1. The user has to manage the load/store. For example, when an atomic instruction is the body function, the parameters have to be pointers rather than loaded data, and no store is needed.
2. The user wants to explicitly optimize the memory load/store, for example, with vectorized loads.

Taking matmul as an example again:

func @fma(%m: index, %n: index, %k: index,
            %A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>)
    -> ()
  {
    // the user manages the load/store according to actual needs.
    // the user may choose not to load/store if there are atomic instructions here.
    %a = linalg.load %A[%m, %k] : memref<?x?xf32>
    %b = linalg.load %B[%k, %n] : memref<?x?xf32>
    %c = linalg.load %C[%m, %n] : memref<?x?xf32>
    %d = mulf %a, %b : f32
    %e = addf %c, %d : f32
    linalg.store %e, %C[%m, %n] : memref<?x?xf32>
  }
  #matmul_accesses = [
    (m, n, k) -> (m, k),
    (m, n, k) -> (k, n),
    (m, n, k) -> (m, n)
  ]
  #matmul_trait = {
    doc = "C(m, n) += A(m, k) * B(k, n)",
    fun = @fma,
    indexing_maps = #matmul_accesses,
    library_call = "linalg_matmul",
    n_views = [2, 1],
    iterator_types = ["parallel", "parallel", "reduction"]
  }
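
For completeness, a sketch of how the new op could be invoked with this trait, mirroring today’s linalg.generic invocation (the invocation syntax below is hypothetical; only the op name is part of the proposal). Instead of loaded scalars, the op would pass the induction variables and the three views to @fma:

  linalg.index_only_generic #matmul_trait %A, %B, %C :
    memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>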

Could you elaborate your use case a bit more? IMO, your proposal defeats one of the purposes of Linalg, which is to avoid analyzing access footprints by looking at individual load/store operations. Is it an intermediate lowering stage where you will no longer work with buffers using Linalg operations? How do you expect Linalg transformations (tiling, fusion) to process such operations? What benefit does Linalg give for your use case? To me, it looks like what you need is just a (parallel) loop nest.

Looking into more details of your example:
%a = linalg.load %A[%m, %k] - linalg.load does not exist (anymore); do you propose to reintroduce it? How is it different from std.load?
#matmul_accesses = [(m, n, k) -> (m, k), <...> - you should not be needing this if you don’t have implicit loads/stores;
indexing_maps = #matmul_accesses, - neither this;
n_views = [2, 1], - nor this.

+ @nicolasvasilache

+1 to what @ftynse said, thanks for surfacing.
Generally, the bar for introducing new special Linalg ops is very high: they have to play nicely with the design points described in the rationale.
In particular, linalg.generic carries the accesses in the op as attributes; introducing load/store in the body goes against this and risks losing the correspondence.

It is not completely clear to me what problems you would like addressed.
For the vector case, Linalg works with memrefs over vectors, you then get vector objects in the region.
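
A minimal sketch of what this looks like (written in the same generic-op syntax as the examples later in this thread; names are illustrative):

linalg.generic {args_in = 1 : i64, args_out = 1 : i64,
                indexing_maps = [(d0) -> (d0), (d0) -> (d0)],
                iterator_types = ["parallel"]} %in, %out {
^bb0(%a: vector<4xf32>, %b: vector<4xf32>):
  // the region receives vector<4xf32> values directly; the "vectorized
  // load" is implicit in the memref-of-vector element type
  %r = addf %a, %a : vector<4xf32>
  linalg.yield %r : vector<4xf32>
}: memref<?xvector<4xf32>>, memref<?xvector<4xf32>>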

For the particular case of atomics, we could think of adding an attribute that would generalize the conversion from linalg to loops, and remove the load/store if appropriate. But we’d need to see much more details to understand what you need.

Could you please share some more detailed use cases?

Thanks!

Thanks @ftynse and @nicolasvasilache, I’m trying to answer the questions here:

Could you please share some more detailed use cases?

The use case is porting the kLoop fusion codegen of XLA to MLIR (for dynamic-shape codegen with MLIR).

Is it an intermediate lowering stage where you will no longer work with buffers using Linalg operations? How do you expect Linalg transformations (tiling, fusion) to process such operations? What benefit does Linalg give for your use case? For me, it looks like what you need is just a (parallel) loop nest.

The benefit of using Linalg rather than directly using a parallel loop nest is to implement the fusion codegen. Tiling/fusion of Linalg is also needed in the whole process. Some existing work, which is more like the kLoop fusion of XLA, is already ongoing. However, we also need to port kInput fusion here. And the root op of the kInput pattern is often the ReduceOp, which requires atomic instructions.

For the particular case of atomics, we could think of adding an attribute that would generalize the conversion from linalg to loops, and remove the load/store if appropriate. But we’d need to see much more details to understand what you need.

I think it’s a good idea if we can generalize linalg.GenericOp with some kind of attribute. The attribute should be able to sufficiently describe the format of the body’s parameters. GenericOp, IndexedGenericOp, and the ‘IndexOnlyGenericOp’ under discussion could then share a unified form.
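
For example, a hypothetical attribute (the name and values below are purely illustrative) could select the body signature:

// body receives loaded values (today's GenericOp)
linalg.generic { body_args = "values", ... } ...
// body receives indices plus loaded values (today's IndexedGenericOp)
linalg.generic { body_args = "indices_values", ... } ...
// body receives indices plus views (the proposed IndexOnlyGenericOp)
linalg.generic { body_args = "indices_views", ... } ...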

I suppose by “detailed use cases” Nicolas meant something like: show us how you would want the IR to look before and after the transformation that you want to apply, and textually describe the transformation.

XLA has its own terminology for fusion, but I am pretty sure “kLoop” fusion means, well, fusing two loops. As in:

loop.for %i = %0 to %1 step %2 {
  "op1"() : () -> ()
}
loop.for %i = %0 to %1 step %2 {
  "op2"() : () -> ()
}

gets transformed to

loop.for %i = %0 to %1 step %2 {
  "op1"() : () -> ()
  "op2"() : () -> ()
}

This should be already possible with Linalg’s producer/consumer fusion approach, by first tiling the consumer with sizes equal to the shape of the consumed buffer, and then fusing the consumer into the producer.
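
Schematically, in the buffer world (traits elided, shapes 1-D for brevity; this is a sketch, not verified IR):

// before: producer writes %T, consumer reads it
linalg.generic #pointwise_trait %A, %T { ... } : memref<?xf32>, memref<?xf32>
linalg.generic #pointwise_trait %T, %C { ... } : memref<?xf32>, memref<?xf32>

// after tiling the consumer with a tile size equal to the full shape and
// fusing the producer into the tile loop (subview computation elided):
loop.for %i = %c0 to %dim step %dim {   // a single iteration covering %T
  linalg.generic #pointwise_trait %A_view, %T_view { ... }
  linalg.generic #pointwise_trait %T_view, %C_view { ... }
}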

Well, Linalg fusion relies on data access being expressed as attributes rather than load/store to decide whether the fusion is legal and what should be fused. If you have a way of deciding fusion legality and profitability by looking at loads and stores, you can do it on a loop nest.

AFAIU, kInput fusion is roughly what Linalg does today.

I don’t think generalizing across IndexedGeneric and just Generic is a good idea. Transformations apply differently to them (and there is an open bug on making the fusion work between IndexedGeneric and Generic). What I can see is a way to specify whether the implicit loads and stores should be lowered to std.load or to something else.
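
For instance (a purely hypothetical attribute, illustrative syntax), the linalg-to-loops conversion could be steered like this:

// default: implicit accesses materialize as std.load/std.store
linalg.generic { args_in = 2 : i64, args_out = 1 : i64, ... } ...
// hypothetical: the output's load/store pair lowers to a single atomic update
linalg.generic { args_in = 2 : i64, args_out = 1 : i64, output_access = "atomic", ... } ...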

@ftynse

I agree that “kInput fusion is roughly what Linalg does today”.
But the precondition is that linalg.GenericOp must be able to describe the root op of the fusion pattern.

The kInput of XLA roughly means a pattern like “add -> mul -> reduce”.
We can do this with Linalg fusion only when the GenericOp is able to describe the reduce computation,
which is usually a mixture of nested serial reduction, warp-shuffle reduction, and atomic reduction.

I’ll take a look into the possibility of lowering lhlo directly to loops, bypassing Linalg. But it seems to me that this is not your intention. Please let me know what you think.

@linearhit The affine loop fusion pass can already perform such fusion. If you lower to affine ops either from Linalg or from LHLO, you won’t have to reimplement it on the Loop dialect.

Regarding the lowering path of lhlo -> affine: in my understanding, not all the fusible lhlo ops can be expressed with affine.for and affine.if. Am I correct?

I don’t know what part is currently covered, but for reference, see this test case, or run something like this on the pattern you have:

$ tf-opt -lhlo-legalize-to-affine -affine-loop-fusion tensorflow/compiler/mlir/xla/tests/lhlo-legalize-to-affine.mlir

The discussion would be significantly more productive if you gave us a specific example that cannot be expressed with Linalg today. The example in your original post with mulf and addf works just fine.

I think you are conflating the “what” and the “how”. Linalg lets you express an (associative and commutative) reduction of multiple values obtained by indexing a buffer. That’s the “what”. This reduction can be implemented as (1) a serial loop with loads and stores; (2) a serial reduction loop that we now support; (3) a parallel reduction loop; (4) a parallel loop with atomic operations inside; (5+) an OpenMP reduction loop, a CFG with fork-join parallelism à la Tapir, etc. That’s the “how”. Linalg operations do not prescribe the “how”; different lowering flows can take different decisions. In particular, you may choose to implement it with atomics if the semantics of the operation is preserved (e.g. you have a corresponding atomic operation in the target dialect).
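
For instance, for a 1-D sum reduction, (1) and (4) could look like the following (schematic IR; the atomic op name is illustrative, not an existing std op):

// (1) a serial loop with loads and stores
loop.for %i = %c0 to %n step %c1 {
  %v = load %in[%i] : memref<?xf32>
  %acc = load %out[] : memref<f32>
  %s = addf %acc, %v : f32
  store %s, %out[] : memref<f32>
}

// (4) a parallel loop with an atomic update inside
loop.parallel %i = %c0 to %n step %c1 {
  %v = load %in[%i] : memref<?xf32>
  "some.atomic_addf"(%out, %v) : (memref<f32>, f32) -> ()
}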

I’m pretty sure there are some LHLO ops that cannot be expressed with Linalg either, up to the point that we may not want them to be in Linalg.

OK, I think it’s clear enough with your explanation of “what” and “how”.

For a pattern like “add -> mul -> reduce”, I may have to describe only the reduce computation in Linalg, then perform tiling and input fusion. And then, after lowering to affine or plain loops, the reduction loop is further split into multiple loops, each implemented as an atomic parallel loop or a serial reduction loop.
Linalg only describes the logical computation, not the backend implementation.

Overall this makes sense to me, but I’m still a little bit confused about the boundary between Linalg and affine loops.
It seems to me that “fusion on affine loops” can fully cover the functionality of “fusion on linalg.GenericOp”.
Then why would we do tiling/fusion on linalg.GenericOp?

@ftynse described the difference between the what and the how very well.
One could think of the current implementation of the linalg → loops conversion as 1 level of indirection through memory (i.e. inserting loads and stores). It seems what you would like is the ability to specify 0 levels of indirection and some additional semantics to forward operand / result values. Gather-scatter type semantics could correspond to 2 levels of indirection through memory, etc.
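
Concretely, for the matmul trait in the first post, today’s conversion produces roughly (schematic; bounds come from the view sizes):

loop.for %m = %c0 to %M step %c1 {
  loop.for %n = %c0 to %N step %c1 {
    loop.for %k = %c0 to %K step %c1 {
      // the inserted loads/stores are the 1 level of indirection
      %a = load %A[%m, %k] : memref<?x?xf32>
      %b = load %B[%k, %n] : memref<?x?xf32>
      %c = load %C[%m, %n] : memref<?x?xf32>
      %d = mulf %a, %b : f32
      %e = addf %c, %d : f32
      store %e, %C[%m, %n] : memref<?x?xf32>
    }
  }
}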

I am not sure which fusion you are talking about here: are you in the tensor world or in the buffer world?
The distinction is important regarding different phase-ordering issues related to buffer allocation and layout.
Still, I think you are talking about fusion of linalg on buffers, in which case I am not sure why you seem to want to express the result of fusion in the body of linalg regions. Linalg fusion on buffers can be thought of as a “tile and fuse” transformation on loops (in the limit, with tile size 1, you get classical loop fusion). This bears most resemblance to Halide.

I would reiterate my suggestion that you share a more detailed use case.
To be more concrete let’s try the following.

Could you please write what the code would look like in some loop form before and after fusion?
This is still quite unclear to me and details are important: depending on whether or not you go through memory or through SSA values, and depending on constructs you use, you may find some things are easier or harder to do (e.g. values may not escape region boundaries).

Feel free to use loopy pseudo-code to convey the semantics if you don’t see an easy way to write it in executable MLIR form.

This type of question is addressed in the Rationale that I mentioned in my first reply :slight_smile:

Thanks for the information @nicolasvasilache

For a typical pattern like this:
[figure omitted: a chain of elementwise ops (e.g. mul) feeding a column reduction]

A typical GPU implementation of a column reduction may be something like:

// pseudo-code: %h and %w already step by the tile sizes
loop.parallel %h = 0 to %sh step tile_h {
  loop.parallel %w = 0 to %sw step tile_w {
    loop.parallel %w_i = 0 to tile_w step 1 {
      alloc %acc                      // per-thread accumulator
      loop.for %h_i = 0 to tile_h step 1 {
        if (w_inbound && h_inbound) {
          %val = load %data[%h + %h_i, %w + %w_i]
          %acc = addf %acc, %val      // non-SSA update, pseudo-code
        }
      }
      %res_p = gep %result[%w + %w_i]
      atomic_add %res_p, %acc         // atomic update of the column result
    }
  }
}

and the ReduceOp will input-fuse the elementwise MulOp, in the memref buffer world.
The common thing about these patterns is that they are a series of elementwise ops (also including Slice/Reshape/Concat, etc.), ended with a non-elementwise op (typically ReduceOp).

My intuition was to describe the schedule of the ReduceOp with linalg.GenericOp, and then perform the tiling/fusion in the Linalg dialect. With your explanation, I agree that this violates the rationale of Linalg.

I can see a few different ways to do this in MLIR:

  1. HLO (fused) → LHLO → Affine Dialect → (automatic fusion on the Affine Dialect) → …
    One known problem with this solution is that we need to support dynamic shapes, and the fusion pass of the Affine Dialect currently does not support them.

  2. HLO (fused) → LHLO → Linalg Dialect → Affine Dialect → (loop transformation pass for the backend impl of ReduceOp) → (fusion on the Affine or Loop Dialect) → …
    For this solution, ReduceOp is described with linalg.GenericOp as follows (which just describes the computation of reduce, the “what” as @ftynse said):

      linalg.generic {args_in = 2 : i64, args_out = 1 : i64, indexing_maps = [(d0, d1) -> (d0, d1), (d0, d1) -> (), (d0, d1) -> (d1)], iterator_types = ["parallel", "reduction"]} %arg3, %arg2, %arg4 {
      ^bb0(%a: f32, %b: f32, %c: f32): // no predecessors
        %e = addf %c, %a : f32
        linalg.yield %e : f32
      }: memref<?x?xf32>, memref<f32>, memref<?xf32>
    

    After lowering to Affine, the two nested loops:
    (h, w) [parallel, reduce]
    are further split into four nested loops:
    (h_o, h_i, w_o, w_i) [parallel, parallel, reduce, reduce]
    and then transformed to:
    (h_o, h_i, w_o, w_i) [parallel, parallel, parallel(with atomic), reduce]
    (see the pseudo-code sketch after this list).

    After this I may get the expected schedule, then perform similar loop transformations on the elementwise ops, and then perform loop fusion on them.

    I’m currently studying the existing methods on the Affine dialect. It seems to me that this solution is too complicated.

  3. a mixed “workaround”:
    the elementwise ops go:
    HLO(fused) → LHLO → Linalg → (loop tiling/fusion on Linalg) → Loop
    the reduce op goes:
    HLO(fused) → LHLO → Loop
    and then, on the Loop Dialect, the ReduceOp directly fuses the loop bodies of the elementwise ops.

    This seems doable and simple, though somewhat strange.
    For the final fusion on the Loop Dialect, the loop bodies of the elementwise ops are directly fused into the ReduceOp loops, and the correctness is in fact guaranteed by the fusion decision at the HLO layer.
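
Here is the pseudo-code sketch for the loop splitting in option 2 (bounds and names are illustrative):

// (h, w) [parallel, reduce]
loop.parallel %h = 0 to %H step 1 {
  loop.for %w = 0 to %W step 1 { /* accumulate */ }
}
// (h_o, h_i, w_o, w_i) [parallel, parallel, reduce, reduce]
loop.parallel %h_o = 0 to %H step %th {
  loop.parallel %h_i = 0 to %th step 1 {
    loop.for %w_o = 0 to %W step %tw {
      loop.for %w_i = 0 to %tw step 1 { /* accumulate */ }
    }
  }
}
// (h_o, h_i, w_o, w_i) [parallel, parallel, parallel(with atomic), reduce]
loop.parallel %h_o = 0 to %H step %th {
  loop.parallel %h_i = 0 to %th step 1 {
    loop.parallel %w_o = 0 to %W step %tw {
      // serial accumulation into a private %acc ...
      loop.for %w_i = 0 to %tw step 1 { /* accumulate into %acc */ }
      // ... followed by a single atomic update of the result with %acc
    }
  }
}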

I’m still looking for a cleaner way to do this. Thanks in advance for any suggestions.

@linearhit this thread should also be relevant.

Your analysis of the 3 possible ways to attack the problem is accurate. In particular, point 3 is the way others are attacking this too at the moment (@MaheshRavishankar @herhut @antiagainst @asaadaldien @pifon). The previous thread should also be relevant regarding some ideas for better support that should also work for you.