linalg.generic on tensor seems to have weird semantics

Your understanding of the semantics of linalg on tensors is correct. The out argument is mandatory so that the shape of the output is known (and so that it can serve as the initial value for reductions). The output can simply be a linalg.init_tensor that only carries the shape.
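
For illustration, here is a minimal sketch of that idiom, reusing the maps and the element-wise body from the example further down (the #map and value names are assumptions, not taken from the original snippet):

%init = linalg.init_tensor [64, 64] : tensor<64x64xi64>
%res = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]}
    ins(%a, %b : tensor<64x64xi64>, tensor<64x64xi64>)
    outs(%init : tensor<64x64xi64>)
    {
  ^bb0(%aa: i64, %bb: i64, %cc: i64):
    %ee = addi %aa, %bb : i64
    linalg.yield %ee : i64
  } -> tensor<64x64xi64>

Here %init contributes nothing but its shape: its content is never read because the body ignores %cc.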

Here the result of the bufferization is not correct. I'm not an expert on bufferization, and I don't know if this function can be bufferized on its own, without seeing the caller that would need to hold the allocation.

I think comprehensive bufferization does the trick. However, you need to annotate the output argument as inplaceable:

func @test(%a : tensor<64x64xi64>, %b : tensor<64x64xi64>, %c : tensor<64x64xi64> {linalg.inplaceable = true}) -> tensor<64x64xi64> {
  %res = linalg.generic {doc = "C(a, b) = op(A(a, c), B(a, c))", indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]}
    ins(%a, %b : tensor<64x64xi64>, tensor<64x64xi64>)
    outs(%c: tensor<64x64xi64>)
    {
  ^bb0(%aa: i64, %bb: i64, %cc: i64):
    %ee = addi %aa, %bb : i64
    linalg.yield %ee : i64
  } -> tensor<64x64xi64>
  return %res : tensor<64x64xi64>
}

Then calling comprehensive bufferize (mlir-opt bufferize.mlir --linalg-comprehensive-module-bufferize) yields the following output:

func @test(%arg0: memref<64x64xi64, #map0>, %arg1: memref<64x64xi64, #map0>, %arg2: memref<64x64xi64, #map0>) {
    linalg.generic {doc = "C(a, b) = op(A(a, c), B(a, c))", indexing_maps = [#map1, #map1, #map1], iterator_types = ["parallel", "parallel"]} ins(%arg0, %arg1 : memref<64x64xi64, #map0>, memref<64x64xi64, #map0>) outs(%arg2 : memref<64x64xi64, #map0>) {
    ^bb0(%arg3: i64, %arg4: i64, %arg5: i64):  // no predecessors
      %0 = addi %arg3, %arg4 : i64
      linalg.yield %0 : i64
    }
    return
}

Now the output is written to %c.

PS: there are different bufferization implementations, and it may also be possible to extend your set of flags to get a similar result.

What Tobias said.

Some more details re. output tensor:

  • Linalg ops refuse to have semantics that create a tensor out of thin air: such abstractions "have an implicit alloc" inside the op that only materializes when bufferization happens, and that breaks scoping (an interesting effect of breaking scoping is that returning a memref from a function is currently always ill-defined in MLIR core). This in turn forces one to always allocate and later try to optimize the allocation away based on alias analysis. Comprehensive bufferization instead keeps tensor SSA values around and aggressively reuses the buffers that get introduced gradually.

  • You can think of Linalg ops as always updating the out tensor they are given. This is similar to LLVM's insert/extract of elements/values in/out of vectors/structs (see the small sketch after this list). The tensor case is just a bit more dynamic, but the mental model is the same.

  • The out tensor abstraction also makes it possible to reason about subsets of the tensor and their insert/extract without having to design new containing ops with special semantics that pull you into a deep rabbit hole.

  • In your particular use case, %c is ignored because you do not use the value in the Linalg basic block: the data is not actually read.
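
As a small illustration of the second bullet (all value names are hypothetical), inserting into a vector or a tensor never mutates the operand; it produces a new SSA value, exactly like a Linalg op "updating" its out tensor:

%v2 = vector.insert %x, %v[3] : f32 into vector<8xf32>     // %v is unchanged
%t2 = tensor.insert %x into %t[%i, %j] : tensor<64x64xf32> // %t is unchanged

(tensor.insert is the spelling in recent MLIR; the exact op names have moved between dialects over time.)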

I wouldn't say that the bufferization is incorrect in itself, but the issue I perceive is that inplaceable makes the function ABI / API unclear to me. On a simple example you could infer rules like "if you provide an inplace operand with the same shape as a result tensor, then the bufferization won't return a memref", but it isn't clear to me whether this generalizes.

It would be more uniform if the inplaceable version would always return a memref, possibly the same one, i.e. it would yield this instead:

func @test(%arg0: memref<64x64xi64, #map0>, %arg1: memref<64x64xi64, #map0>, %arg2: memref<64x64xi64, #map0>) -> memref<64x64xi64, #map0> {
    linalg.generic {doc = "C(a, b) = op(A(a, c), B(a, c))", indexing_maps = [#map1, #map1, #map1], iterator_types = ["parallel", "parallel"]} ins(%arg0, %arg1 : memref<64x64xi64, #map0>, memref<64x64xi64, #map0>) outs(%arg2 : memref<64x64xi64, #map0>) {
    ^bb0(%arg3: i64, %arg4: i64, %arg5: i64):  // no predecessors
      %0 = addi %arg3, %arg4 : i64
      linalg.yield %0 : i64
    }
    return %arg2 : memref<64x64xi64, #map0>
}

That way at least the lowered API is predictable.

What isn't carried in the API would still be the ownership, because with "inplaceable" you can't know if the returned memref is a newly allocated one (ownership semantics is a bit more tricky, though, as there are many approaches to it; refcounting could be one).

This would be easy to evolve if/when the ownership model and ABI expectations with the outside world become available. It is fair to view the inplaceable attribute on the top-level function as a leak of an internal implementation detail that is unfortunately needed at the moment. Note that in the IREE integration this does not appear, as IREE handles the top-level ABI and we can inject information in C++ where needed.
Starting from tensor-level MLIR + e.g. numpy, some convention needs to be followed.

When usable ownership and tensor ABI become available we'll happily use them :)

This is separable from the op-level “out tensor” semantics though (but there are analogies).


Yes absolutely: I was only referring here to the issue of the API around caller/callee and the associated lowering during bufferization (because I didn’t see any other issue in the original code snippets and the snippet @gysit provided).

OK, so if the linalg ops refuse to have semantics that create a tensor, I don't understand why linalg.generic needs a result. And actually I think the semantics are not clear, as the linalg-bufferize passes interpret that as a memref allocation. The outs operand should be enough to specify that the operation works in place, shouldn't it?
But I still wonder whether the linalg op could/should have both semantics, the "in-place" one and the "value" one, with the bufferization passes taking care of allocation. I feel that there are no tools to track whether a function returns a newly allocated memref, and that's why we don't want the "value" semantics for the linalg op on tensors, but I guess that is a problem orthogonal to the semantics we want to set on the linalg op? That should rather be the job of a bufferization pass that could handle the allocation/deallocation/buffer-reuse business and avoid memrefs being allocated/returned by a function?
I have some difficulty understanding the purpose of linalg.inplaceable, as I feel it is a mandatory annotation without which the comprehensive bufferize pass doesn't work. What is expected if the annotation is not there?

Cool, thanks for your questions! Here are a few more detailed answers.
Note that we also plan some ODM presentations about those topics once we have a good grasp on everything. Probably within the next 2 months.

Because 1) tensors are immutable and 2) it is needed to preserve SSA use-def chain properties.
In tensor world, the mechanisms are very similar to operations in the vector dialect, llvm.struct and llvm.vector for instance.
In fact, I have found that thinking of tensors like "very big vectors that have to go through memory to avoid catastrophic spills when we reach the register level" is a good first approximation. Here is a post with more details about how transformations operate on these abstractions. In the grander picture, these compiler abstractions are designed with a transformations-first model in mind.

Regarding linalg-bufferize and other passes, I am not fully up to speed with their evolution these days. Last I looked, I found too many surprising behaviors that I was not able to unify for my purpose (codegen transformations with strong inplace guarantees). These days the thinking is that there is a notion of a) graph-level bufferization that requires refcounting, does not obey scoping rules and is more conservative, and b) a high-performance version with inplace guarantees that works well in the presence of tiling, fusion and other transformations on tensor SSA values but is less broadly applicable.

Yes, that is one of the key insights that ComprehensiveBufferize uses, in addition to keeping SSA use-def chains as buffers are created. You can view this as the inverse design decision from “allocate conservatively and perform later optimization based on alias analysis”. The posts/RFCs linked above explain these in more detail.

This is a very good point/distinction: linalg ops have "inplaceable" semantics in addition to the "value" semantics. This means the op semantics define that an operand and a result "can be the same thing after bufferization": a shorthand for saying they have the same "runtime type" (i.e. the data structures and all sizes of everything are bitwise equal / mirror images of each other).

As to your question: we cannot use full "inplace" semantics because deciding whether a tensor can be bufferized inplace in a given program is a global property of the inplace decisions made for surrounding ops: it is easy to shoot oneself in the foot and write incorrect programs. Instead, the comprehensive bufferization process greedily decides what to bufferize in place.
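
To make the foot-gun concrete, here is a hedged two-op sketch (ops elided, names and types made up): if the first generic were forced to write into %t's buffer in place, the second generic would read clobbered data, because %t is still live below. Whether the first op may bufferize in place therefore depends on its surroundings, not on the op alone:

%0 = linalg.generic ... ins(%a : tensor<8xi64>) outs(%t : tensor<8xi64>) { ... } -> tensor<8xi64>
// %t is read again here, so %0 must not clobber the buffer %t ends up in.
%1 = linalg.generic ... ins(%t : tensor<8xi64>) outs(%u : tensor<8xi64>) { ... } -> tensor<8xi64>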

Yes, as I mentioned above, the 2 are orthogonal but still related. The decision on the design of linalg on tensors comes from the need for transformations on tensors.

Yes, this is the case in practice: linalg.inplaceable is an attribute that is not part of the op semantics and is only used for bufferization. The problem is the one I mentioned in reply to Mehdi's comment: the interface between MLIR and the outside world is not yet defined on tensors (at the ABI / API level). So for the entry (e.g. a C++ caller) and exit points (e.g. calls from MLIR into library calls) of non-MLIR parts of the program, we cannot make magic decisions, and this bufferization-specific annotation is needed to tell comprehensive bufferization what to do at these function boundaries. This is an implementation detail that leaks until the ABI on tensors is defined in MLIR. The alternatives are either a) external assumptions (like IREE and XLA have) or b) miscompiles.

Bonus point, re. controlling inplace / inplaceable decisions: in principle, comprehensive bufferize is modular enough that you could control what gets bufferized inplace by annotating the ops yourself and calling the pass to fill in the blanks and perform the actual allocations. I've been using that mode for debugging purposes in very limited cases. If that sounds like something that could be useful, it may be possible to harden it and make it available as a feature.

I find this way of presenting it a bit misleading (I don't understand how one can reconcile it with "tensors are immutable"): at the tensor level I really see linalg as creating a tensor (I wouldn't say "out of thin air", or you would say the same about every op in the tensor domain).
This is also misleading because it makes believe that there is more than an optimization going on in the buffer allocation process, while there shouldn't be, at least not in terms of semantics at the tensor level.
(And outs would really be better named "init_value" or something like that in the tensor domain!)

I don't think so, because it isn't an "out" but an "init value".
Even inplaceable can't guarantee that an "out" tensor can be used in place, because the annotation isn't on the use.
For example, just use the same tensor twice as an out by chaining computations (very crude sketch below):

func @test(%a : tensor<64x64xi64>, %b : tensor<64x64xi64>,
                   %c : tensor<64x64xi64> {linalg.inplaceable = true}) -> tensor<64x64xi64> {
  %res = linalg.generic {doc = "C(a, b) = op(A(a, c), B(a, c))", indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]}
    ins(%a, %b : tensor<64x64xi64>, tensor<64x64xi64>)
    outs(%c: tensor<64x64xi64>)
    { ... } -> tensor<64x64xi64>
  %res2 = linalg.generic {doc = "C(a, b) = op(A(a, c), B(a, c))", indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]}
    ins(%res, %b : tensor<64x64xi64>, tensor<64x64xi64>)
    outs(%c: tensor<64x64xi64>)
    { ... } -> tensor<64x64xi64>
  return %res2 : tensor<64x64xi64>
}

:)

It depends on the definition of "works in place".
To be more precise, out arguments convey the fact that the result "may bufferize in place on the operand if the bufferization of other things around it allows it". IREE has a similar concept of a TiedOperand.
This will turn into OpInterfaces (likely after some ongoing refactoring and once we know how to name things properly).


The focus should not be on the "creates a tensor" part but on the "out of thin air" part.

I don’t see any contradiction with “tensors are immutable” if the focus is on the “out of thin air” part.

Interesting, the "out of thin air" framing is precisely meant to convey that bufferization is exactly an optimization (as opposed to e.g. inferring the buffer size from the op + operand semantics).

Fair enough. I wouldn't necessarily call it "init_value" though, as it suggests "reading" that init_value in my mind. But if folks think it is less confusing, renaming it is definitely on the table.

The sentence you quoted seems universally applicable to any SSA value anywhere, doesn't it?
Couldn't I rewrite it as: you "may bufferize inplace [any tensor] if the bufferization of other things around it allows it"?
I mean, there is nothing that prevents a bufferization from handling the situation below, where %b could be reused in place during lowering:

func @test(%a : tensor<64x64xi64>,
                   %b : tensor<64x64xi64>  {linalg.inplaceable = true},
                   %c : tensor<64x64xi64>) -> tensor<64x64xi64> {
  %res = linalg.generic .... // element wise operation, fully parallel iteration.
    ins(%a, %b : tensor<64x64xi64>, tensor<64x64xi64>)
    outs(%c: tensor<64x64xi64>)
    { ... } -> tensor<64x64xi64>

In this sense the "outs" isn't specific here. Even for "what available allocation to use as an output here" (I'm not referring to "how much do I need for this output"), it isn't clear to me what it actually provides for the bufferization process beyond trivial examples. Consider:

func @test(%a : tensor<64x64xi64>  {linalg.inplaceable = true},
                   %b : tensor<64x64xi64>  {linalg.inplaceable = true},
                   %c : tensor<64x64xi64>  {linalg.inplaceable = true},
                   %d : tensor<64x64xi64>  {linalg.inplaceable = true}) -> tensor<64x64xi64> {
  // use %d
  ...
  // %d is dead now.
  %res = linalg.generic .... // some op
    ins(%a, %b : tensor<64x64xi64>, tensor<64x64xi64>)
    outs(%c: tensor<64x64xi64>)
    { ... } -> tensor<64x64xi64>

  // %a has no other use from here.

  %res2 = linalg.generic .... // some op
    ins(%res, %b : tensor<64x64xi64>, tensor<64x64xi64>)
    outs(%c: tensor<64x64xi64>)
    { ... } -> tensor<64x64xi64>
  // use of %c still below
  ...

(ignoring fusion opportunities here).

Optimally, during bufferization you can use the buffer that will be passed in for %d as the output buffer for the first linalg op, and the buffer passed in for %a as the output of the second linalg op. At no point can the out %c really be used here, since it is still live.
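
To spell this out, here is a hedged sketch of the bufferized form of that function (hand-written, not actual pass output; layout maps omitted):

func @test(%a: memref<64x64xi64>, %b: memref<64x64xi64>,
           %c: memref<64x64xi64>, %d: memref<64x64xi64>) {
  // ... last use of %d happens above, so its buffer is dead and reusable.
  linalg.generic ... ins(%a, %b : memref<64x64xi64>, memref<64x64xi64>)
    outs(%d : memref<64x64xi64>) { ... }   // %res lives in %d's old buffer
  linalg.generic ... ins(%d, %b : memref<64x64xi64>, memref<64x64xi64>)
    outs(%a : memref<64x64xi64>) { ... }   // %res2 lives in %a's buffer
  // ... %c is still read below, so it was never a legal output buffer here.
}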

Yeah, you're right: it is slightly annoying to find a perfect term here, because it "sometimes" reads the initial reduction value from it, but not always. These are the two aspects of this "out" mentioned in the very high-level intro: 'linalg' Dialect - MLIR.
(Sorry, I don't have a better suggestion for the name.)

The static case is indeed a possible future optimization; analyzing the IR to find runtime type equivalence will also work in certain cases.

There is value in encoding this information statically in the IR and not losing it. It has similar benefits to an inbounds annotation.

Quick question on this: how does the "outs" behave in the sparse case? Say sparse x sparse matmul. It seems like the number of nonzero elements (i.e. the sparsity structure) of the initial accumulators might be different from the result (requiring reallocation). So in what sense is the result calculated "in place"?

This might be a question for @aartbik as well.

Forgive me if this is a dumb question.

Quite the opposite, this is a very good question!

In a previous post, I showed that we need something like {linalg.inplaceable = true} (lacking more advanced bufferization optimizations) on a dense output in cases with sparse inputs, just to keep an updating kernel's complexity O(nnz) vs. O(n). For sparse outputs (something still under development but coming real soon), we will either have to go the route of only accepting building new sparse tensors "from scratch", or assign even broader semantics to the inplaceable annotation. I am still actively developing the sparse output implementation and would love to hear if people have strong opinions on what direction to take here.

+1. I don't think there is a general answer yet, but the intuition is that this can be relaxed and that the appropriate subset_insert on sparse tensors will contain the logic that updates a small part of the data structure.

For dense tensors this is an all-or-nothing property.
But as often, levels of indirection come to the rescue.

It is perfectly fine if the insert operation does a small extra alloc and hooks up some pointers / updates stuff internally.

@nicolasvasilache / @mehdi_amini awesome, thanks for your detailed answers.

I think my major confusion was about the semantics at the tensor level, which seemed to meld with how the bufferization(s) work, but now it's clearer.
I still wonder whether the out/init argument should be mandatory, as there isn't necessarily a reduction, right? For example, in the snippet I wrote in my first post, removing outs could be totally legal as there is no reduction.

They are necessary when the shapes are dynamic:

  %res = linalg.generic .... // some op
    ins(%a, %b : tensor<?x?xi64>, tensor<?x?xi64>)
    outs(%c: tensor<?x?xi64>)
    { ... } -> tensor<?x?xi64>

Without a reduction, the content of %c won't be used; in fact you'll frequently see init_tensor used for modeling this: only the shape of the tensor %c will be used to compute the resulting one.
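
A minimal sketch of that idiom (op spellings as in recent MLIR; the dim op has moved between dialects over time, so treat the exact names as an assumption):

%c0 = constant 0 : index
%c1 = constant 1 : index
%d0 = tensor.dim %a, %c0 : tensor<?x?xi64>   // runtime sizes taken from an input
%d1 = tensor.dim %a, %c1 : tensor<?x?xi64>
%init = linalg.init_tensor [%d0, %d1] : tensor<?x?xi64>
%res = linalg.generic ....  // same op as above
    ins(%a, %b : tensor<?x?xi64>, tensor<?x?xi64>)
    outs(%init : tensor<?x?xi64>)
    { ... } -> tensor<?x?xi64>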

Also, it helps to be uniform, so it gets passed in with static shapes as well, even though it is then indeed redundant.

+1 to what Mehdi said.
You can view linalg.generic as a kind of assembly op for codegen.
At this level of abstraction we are more concerned with ease of writing transformations.
At the higher level, you have e.g. MHLO.