[RFC] Promoting `linalg.init_tensor` to the `bufferize` dialect

The headaches I have heard about are related to behavior under parallel assumptions.
I view these as problems in the parallelism modeling on tensors, but I have not seen issues related to CSE.

The kind of issue I have heard about in IREE is related to hoisting an op that turns into an alloc above “parallel loops” (which requires some notion of privatization, whatever that means in this particular instance).

1 Like

This is not a problem; the issue arises when it is mixed with parallelization, where something like privatization would be needed and is not yet modeled.

1 Like

A tensor full of zeros would be a good abstraction for structured codegen.

In practice you may not care about the value (if it is bufferized to a buffer that is just written, i.e. the current linalg.init_tensor → memref.alloc) or you may care about it (i.e. linalg.init_tensor + linalg.fill → memref.alloc + linalg.fill).

Atm it is convenient to have a default that bufferizes to an alloc without filling (why would we want to always pay the price for the memset?).
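To make the two cases above concrete, roughly (a sketch of the intended lowerings; the buffer names are illustrative):

%t = linalg.init_tensor [%d0, %d1] : tensor<?x?xf32>
// value does not matter: bufferizes to a bare allocation
// %b = memref.alloc(%d0, %d1) : memref<?x?xf32>

%t2 = linalg.init_tensor [%d0, %d1] : tensor<?x?xf32>
%f = linalg.fill(%cst, %t2) : f32, tensor<?x?xf32> -> tensor<?x?xf32>
// value matters: bufferizes to an allocation plus the fill (the memset)
// %b2 = memref.alloc(%d0, %d1) : memref<?x?xf32>
// linalg.fill(%cst, %b2) : f32, memref<?x?xf32>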

A simplified example of the false sharing I am encountering is this:

%0 = linalg.init_tensor [%d0, %d1] : tensor<?x?xf32>
%1 = linalg.fill(%cst1, %0) : f32, tensor<?x?xf32> -> tensor<?x?xf32>
%2 = linalg.fill(%cst2, %0) : f32, tensor<?x?xf32> -> tensor<?x?xf32>

In the tensor domain the ops %1 and %2 have no dependence on each other (they both only depend on %0). If you straightforwardly allocate a buffer for the linalg.init_tensor and replace all its uses, then a WAW hazard is created for the two fills:

%0 = memref.alloc(%d0, %d1) : memref<?x?xf32>
linalg.fill(%cst1, %0) : f32, memref<?x?xf32>
linalg.fill(%cst2, %0): f32, memref<?x?xf32>

So this would be false sharing.

The bufferization, though, is smarter: it sees the two fills as writing to the same buffer and marks one of the fills as not inplace-able (it should create two allocations; when I tried it, it does, but something later on fails).

%0 = linalg.init_tensor [%d1, %d2] {__inplace_results_attr__ = ["false"]} : tensor<?x?xf32>
%1 = linalg.fill(%cst1, %0) {__inplace_results_attr__ = ["false"]} : f32, tensor<?x?xf32> -> tensor<?x?xf32>
%2 = linalg.fill(%cst2, %0) {__inplace_results_attr__ = ["true"]} : f32, tensor<?x?xf32> -> tensor<?x?xf32>

So it is un-CSE-ing on the fly based on the inplace-ability analysis. But for some more complex cases, without un-CSE-ing and converting to destination-passing style before bufferization, it results in allocations.

I don't think that is the case. It's not hoisting above parallel loops. It is hoisting above just tiled loops. So it has nothing to do with parallel loops or privatization. Even in sequential code, CSE == false sharing.

This was my main motivation to not use a tensor of zeros as well, but I am finding this reasoning fairly flimsy right now since we can probably eliminate dead stores in MLIR.

I agree with this point, but I think that is the missing part. An op can take a tensor shape as an input and create a tensor as a result. So it is “allocating”. At the time of bufferization, we can recognize this and actually do the allocation (effectively this is what IREE's version of bufferization did; not saying that one is better than the other, just that it's an alternative). This is easier to reason about for linalg.generic ops. As you say, it's harder for scf.for. At one level you can say that if an scf.for takes a tensor_shape as an iter_args operand and returns a tensor value, that implies that during bufferization a tensor needs to be allocated before the loop. It still does require changing scf.for semantics, though.

So overall I am still +1 on having a side-effecting tensor operation in bufferization to avoid these complications. It should not leak all the way up to the TOSA/MHLO → Linalg-on-tensors lowering.

So it seems like there are basically two use cases here that people have:

  1. “linalg frontend” - when initially creating linalg ops, independent of any transformation, in order to be maximally information-preserving, there is a desire to use a “shape” type as operands to certain linalg ops instead of “tensor” to avoid implying the existence of a backing data buffer for something that is just a shape (e.g. at a phi node for a control flow merge point, we would know just by looking at the type that we are just doing a select on the metadata if it is a “shape”)

  2. “destination-passing value-semantic program” - during linalg transformations, it is useful to not jump to memrefs directly, but to stay with SSA use-def chains and ops that take “tied” tensor operands to model in-place updates with enough predictability to be useful for codegen in practice. (this is a matter of tradeoffs – it is not some a priori optimal solution).

It seems like it could be useful to formalize this “destination-passing value-semantic program”, perhaps with a new type !bufferize.vmemref (v = value-semantic). This type would be structurally similar to a ranked memref (we might be able to remove the “address space” information and such; or perhaps not want to!), but be guaranteed to have value-semantics, and would support an op bufferize.vmemref.alloc to allocate an “uninitialized vmemref” (vmemref would always have a “backing buffer” / never be “hollow”). Converting a tensor/shape program to a vmemref program should be fairly straightforward (mostly 1:1 conversion with a type conversion), and would make explicit the transition converting e.g. %0 = linalg.fill(create_shape(d0, d1)) to %0 = linalg.fill(bufferize.vmemref.alloc(d0, d1)). Then ComprehensiveBufferize could be recast and cleanly scoped as converting vmemref programs to memref programs with minimal allocations by doing updates in place.
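For the fill example from earlier, the vmemref form might look something like this (purely a sketch; !bufferize.vmemref and bufferize.vmemref.alloc are the hypothetical names from this proposal and do not exist upstream, and linalg ops would need to be taught to accept the type):

%0 = bufferize.vmemref.alloc(%d0, %d1) : !bufferize.vmemref<?x?xf32>
%1 = linalg.fill(%cst1, %0) : f32, !bufferize.vmemref<?x?xf32> -> !bufferize.vmemref<?x?xf32>
%2 = linalg.fill(%cst2, %0) : f32, !bufferize.vmemref<?x?xf32> -> !bufferize.vmemref<?x?xf32>

ComprehensiveBufferize would then only have to turn !bufferize.vmemref into memref, deciding which of these updates can legally happen in place.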

This would involve duplicating a number of ops and making linalg ops support vmemref, but the end result would be a clean layer that could be used for transformations. The alternative here is acknowledging that for certain usage patterns, “tensor == vmemref” and that we won’t represent that in the type. (or we could use the tensor “encoding” field to make the distinction?).

What do folks think? If we can at least agree that “vmemref” (even if we spell it tensor) is a useful conceptual layer that we materialize at some point in the compilation pipeline and use for transformations, I think that will be a step forward towards resolving the layering. Then we can at least decouple it from the “linalg frontend” design discussion.

Note: “false sharing” is traditionally associated with “always bad cache effects”. IMO this introduces an early red herring in the discussion and turns attention to the wrong thing. But I see what you mean.

This is a perfectly valid example; these cases appear due to CSE and/or are valid input/test programs.

In this case, a few steps have been jumped to get to a manually written bufferized version. This specific bufferized form may or may not be semantics preserving, it depends on the rest of the program. It is semantics-preserving iff %1 is never read from at any point that does not dominate the production of %2 (without double negation: all reads of %1 must dominate %2).
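A minimal illustration of that condition (a sketch; tensor.extract stands in for any read of %1):

%0 = linalg.init_tensor [%d0, %d1] : tensor<?x?xf32>
%1 = linalg.fill(%cst1, %0) : f32, tensor<?x?xf32> -> tensor<?x?xf32>
// this read of %1 dominates the production of %2, so %1 and %2 may still share the buffer backing %0
%r = tensor.extract %1[%c0, %c0] : tensor<?x?xf32>
%2 = linalg.fill(%cst2, %0) : f32, tensor<?x?xf32> -> tensor<?x?xf32>
// a read of %1 placed after this point would not dominate %2; with a shared buffer it would observe
// %cst2 instead of %cst1, so a separate allocation would be required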

The SSA-based “willAlias” analysis captures those cases. Performing this “willAlias” analysis in the tensor domain is more powerful than a traditional alias analysis on pointers to memory.

After analysis and bufferization occur, if using the same buffer is legal then we have a case of “buffer reuse”; if not, then bufferization will introduce a proper alloc to avoid hazards. Note that buffer reuse is a “good thing”, whereas “false sharing” gives an impression that something is broken and needs to be fixed.

It is important to characterize the failure case precisely. I suspect/happen to know that in your specific use case something does not bufferize inplace as you’d expect.
Still, the generated code is valid under MLIR upstream sequential semantics.

Note that I am specifically interested in separating considerations related to “modeling of parallelism” and “embedded targets” that can’t alloc to break things down into manageable parts and get to the root of this.

It is really not “unCSE-ing”; the analysis determines in this case that one of the uses of %1 is a read and tells you that the values %1 and %0 cannot live in the same buffer.

This is where the transition from SSA values to buffers starts to happen. This is an alias analysis discussion and whatever CSE may or may not have happened x passes before is orthogonal: the program may be in that valid form in the first place.

And this is fine: in your particular more complex cases, it is likely a) the analysis is not powerful enough to detect that 2 memory regions don’t overlap and/or b) that the IR is not in a form that will bufferize inplace.

This touches a finer line of IR/transformations/analyses co-design.

A number of enabling simplifying assumptions were needed in IREE when iterating on execution model + bufferization and getting everything running end-to-end. I think the problem is that these assumptions do not pass the IREE/core boundary unscathed. The difficulty with simplifying assumptions is they are often not materialized in the IR but rather in C++ and are hard to revisit.

An exercise we can do is to produce a certain number of minimal test cases where your current expectation of “should bufferize inplace” is not met, see how these examples are produced by the IREE abstractions and how they would translate to / be produced by “upstream-only” abstractions. As we discover problematic cases we can evolve the analyses/transformations/IR design to better suit IREE’s needs.

I’d argue we should do this offline because this thread is getting stretched too thin. The additional wrench in the system is we lack the “local broadcasting” properties of f2f whiteboard design discussions and the fast iterations it enables, but such is life these days.

2 Likes

This is what init_tensor is used for in upstream MLIR; it has gradually evolved from the description in your original revision as the need arose from the “used as the init tensor of Linalg operations” parts of the world.

This is a big red flag for me and a sign that the IR/transformation/analysis co-design is veering off track.
scf.for does well in particular in the presence of yielded vectors in the sequential case.
Things work analogously with yielded tensors, in the sequential case.

I do not think a side-effecting op is necessary, the current semantics are pure.

Rather, my thinking is that if a side-effecting property helps simplify things and connect to external uses, then we can work with it. But from the way the discussion is branching off, it seems to me this is not the case.

I do not see a leak. In fact, init_tensor or an equivalent op is unnecessary if everything needed is passed at a function boundary and has a simple contract w.r.t. a “future buffer” (linalg.pad_tensor is the exception to this and should also be rehashed separately).

3 Likes

I have heard this in the context of traditional scalar compilers. A flavor of this happens even in scalar compilers when you move from an SSA representation to a non-SSA representation.

I didn't follow the discussion preceding this. But the example I tried was just this:

func @fill_after_fill(%arg0 : tensor<?x?xf32>, %d1 : index, %d2 : index)
    -> (tensor<?x?xf32>, tensor<?x?xf32>) {
  %cst1 = ...
  %cst2  = ...
  %0 = linalg.init_tensor [%d1, %d2]
  %1 = linalg.fill(%cst1, %0)
  %2 = linalg.fill(%cst2, %0)
  return %1, %2
}

This obviously doesn't bufferize in-place directly; there is more massaging needed here. The function signature as-is doesn't work, etc. Basically it needs to be in this form:

func @fill_after_fill(%arg0 : tensor<?x?xf32>, %d1 : index, %d2 : index,
    %arg3 : tensor<?x?xf32> {linalg.inplaceable = true},
    %arg4 : tensor<?x?xf32> {linalg.inplaceable = true})
    -> (tensor<?x?xf32>, tensor<?x?xf32>) {
  %cst1 = ...
  %cst2  = ...
  %1 = linalg.fill(%cst1, %arg3)
  %2 = linalg.fill(%cst2, %arg4)
  return %1, %2
}

This requires some changes to destination-passing style, and also un-CSE-ing of linalg.init_tensor. A flavor of this is what I am seeing in various places in the IREE → upstream bufferization usage.

Another way to think about this is that you un-CSE the init_tensor and then they don't use the same buffer. It's the same solution, just different ways of getting to it. Anyway, this might be splitting hairs.
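Concretely, the un-CSE'd form of the earlier snippet would be something like this (a sketch; each init_tensor now starts its own use-def chain, so the two fills end up in distinct buffers):

%0 = linalg.init_tensor [%d1, %d2] : tensor<?x?xf32>
%1 = linalg.init_tensor [%d1, %d2] : tensor<?x?xf32>
%2 = linalg.fill(%cst1, %0) : f32, tensor<?x?xf32> -> tensor<?x?xf32>
%3 = linalg.fill(%cst2, %1) : f32, tensor<?x?xf32> -> tensor<?x?xf32>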

I don't think this is accurate. Everything I have mentioned here uses only things in core (recasting the issues I am having hooking up to upstream bufferization into simpler code snippets that are “outside of IREE”). There are simplifying assumptions in upstream bufferization too (the need for destination-passing style, using init_tensor as an allocation). I don't have an issue with that. It's a contract that the upstream bufferization expects, and I have spent the majority of the last month trying to meet that contract. FWIW, IREE's version of the bufferization works in these cases because it does not allocate for an init_tensor by default. That's not to say the contract of the upstream bufferization is wrong; it is just a different contract.

Let's do that. Looking forward to when you can make it to Seattle (or better yet, me making it to Zurich :stuck_out_tongue: )

I don't totally disagree with this either. This is basically the need init_tensor filled. Over the course of this discussion I have gone back and forth about whether we need to drop init_tensor completely. It does make things more consistent w.r.t. types as they exist today. I think there is a case to start removing this training wheel though, if we need to move this out of Linalg.

Agreed here too. How far up the stack this needs to be exposed is TBD. All the way up to the user? I would rather start lower in the stack and expose it up the stack as the need arises. I am going for “bufferization has this contract; modify the code passed to bufferization to satisfy this contract”.

One thing is not clear to me: what is create_shape? That seems like what linalg.init_tensor is today. Are you proposing renaming it and moving it out of Linalg? That does make sense to me, but what's the type of create_shape's return value? That is the load-bearing part. So let's say the type of create_shape is abstract_tensor<?x?xf32> (to signify it is just the shape, not data).
The current use of linalg.init_tensor was essentially saying tensor == abstract_tensor. There are advantages to that. It makes type checking between “tied operands” more consistent. For example:

  1. For LinalgOps, the types of outs operands and results are the same. That would need to change to say the types of outs operands and result values are “compatible”, where compatible means the shape of the result tensor is the same as the shape of the abstract_tensor.
  2. For the scf.for op, the types of the iter_args and result values would also have to be changed to be compatible (as described above), not strictly equal.

This also implies that any operation that takes an abstract_tensor and returns a tensor has “allocation semantics”, i.e. during bufferization it allocates memory. The conversion to destination-passing style would then be the process of replacing abstract_tensor types with real tensor types.
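For example (a sketch using the hypothetical abstract_tensor type and create_shape op from this discussion, with made-up syntax):

// takes only a shape, produces a tensor: implies an allocation during bufferization
%shape = create_shape [%d1, %d2] : abstract_tensor<?x?xf32>
%0 = linalg.fill(%cst1, %shape) : f32, abstract_tensor<?x?xf32> -> tensor<?x?xf32>

// after conversion to destination-passing style the operand is a real tensor
// (e.g. an inplaceable function argument), so no allocation is implied anymore
%1 = linalg.fill(%cst1, %arg3) : f32, tensor<?x?xf32> -> tensor<?x?xf32>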

Sorry, I was just addressing point 2 and hand-waving the existence of an abstract_tensor / shape type for part 1. create_shape is just some placeholder op that creates a shape/abstract_tensor.

I expect that we will convert to “vmemref” form before doing any transformation that creates such a scf.for loop that would need this special handling. Frontends would never create such a scf.for loop AFAICT. I can’t think of any situation outside linalg transformations that would create this situation where the iter_args are just a shape along the forward edge but a tensor along the back edge (but above I’m assuming that linalg transformations happen exclusively on the “vmemref” abstraction where this isn’t a problem, so the whole discussion of “shape vs tensor for iter_args” is moot in my opinion – linalg transformations happen on “vmemref” and it doesn’t have this problem).

This thread is really interesting to me.
I was working on formalizing the semantics of a few tensor-related dialects and MemRef in MLIR, and found that the semantics of linalg.init_tensor was not clear to me.
https://github.com/aqjune/mlir-tv

In mlir-tv, reading an uninitialized element from linalg.init_tensor is defined as immediate UB.
It is defined so because linalg.init_tensor is lowered into memref.alloc, which is lowered into malloc() without initialization. Since reading uninitialized bytes from a malloc'ed object and using them is UB, and there is no undef/poison in the Tensor dialect, this was the only choice that satisfies the refinement property.
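A minimal instance of that definition (assuming the usual tensor.extract op as the reading operation):

%0 = linalg.init_tensor [%d1, %d2] : tensor<?x?xf32>
// no element of %0 has been written, so under the mlir-tv semantics this extract is immediate UB,
// mirroring a read of uninitialized malloc'ed memory after bufferization
%e = tensor.extract %0[%c0, %c0] : tensor<?x?xf32>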
At least the unit tests in MLIR (the LIT tests) were successfully validated w.r.t. this definition.

It would be great if its semantics were settled, and I’m willing to follow the decision.

2 Likes

Thanks! This is super useful information! I suspect that at some point we will want a full poison system for tensor ops.

1 Like

This is in practice how we use it today, for the same reasons you mention.

Nice! Is there something we could look at and see what validation means here? (i.e. is it a verifier check that none of the uses has read semantics?)

Thanks for sharing!

mlir-tv takes two MLIR functions and checks whether transforming the first function into the second incorrectly changes program behavior (e.g. a mismatch in return values / final memory).
An analogous tool in LLVM is Alive2 (link: https://alive2.llvm.org/). mlir-tv does not have an online version, though; to run mlir-tv, a user must download the source & build it.

To test whether mlir-tv’s linalg.init_tensor semantics is fine, transformations happening in MLIR’s LIT tests were validated.
This was done by (1) gathering (src, tgt) function pairs from MLIR’s lit test files (tgt is generated from src with transformations described in // RUN:), and (2) running mlir-tv with each pair as input.

However, since the project is in an experimental stage and supports only a portion of operations, it is still possible that the linalg.init_tensor semantics won’t work in unexplored cases.
We are trying to actively share any issues it finds. A recent one (not related to linalg.init_tensor) was about TOSA’s canonicalization of ‘x + (+0.0)’ into ‘x’.

3 Likes

A few things have happened on the bufferization side since December last year.

With the new One-Shot Bufferize, we heavily rely on tensor SSA use-def chains to drive buffer selection/reuse. linalg.init_tensor is a useful tool to indicate the start of a new chain, and it makes sense from a bufferization perspective to lower this op to a buffer allocation. It is treated as a “memory allocation in tensor land”, with the benefit that One-Shot Bufferize is able to analyze it (as opposed to ops that operate on memrefs). From that perspective, it makes sense to have this op in the bufferization dialect.

As mentioned in this thread, there are different uses of linalg.init_tensor in TOSA lowerings/etc. The purpose of these init_tensors is to carry the shape of a tensor as an SSA value. From that perspective, it does not make sense to have this op in the bufferization dialect. Their usage is unrelated to bufferization.

I prepared a change that adds a new alloc_tensor op to the bufferization dialect (⚙ D126003 [mlir][bufferization] Add bufferization.alloc_tensor op). The name of this op makes it clear that this op will always allocate after bufferization. (There is an almost identical op in the sparse compiler that will be able to reuse this op.) This new op is no longer side-effect-free. Since it bufferizes to a new allocation, it should not suddenly disappear/CSE/… because the user may have intentionally placed it with the expectation that a new buffer will be allocated. All uses of linalg.init_tensor that are related to bufferization are switched to this new op. All other uses are left as is.
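For the repeated-fill example from earlier in the thread, the input would then look roughly like this (a sketch; since alloc_tensor is not side-effect-free, CSE will not merge the two allocations back into one, so each fill keeps its own buffer):

%0 = bufferization.alloc_tensor(%d1, %d2) : tensor<?x?xf32>
%1 = bufferization.alloc_tensor(%d1, %d2) : tensor<?x?xf32>
%2 = linalg.fill(%cst1, %0) : f32, tensor<?x?xf32> -> tensor<?x?xf32>
%3 = linalg.fill(%cst2, %1) : f32, tensor<?x?xf32> -> tensor<?x?xf32>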

There has been a discussion in this thread about using tensor.zeros() or a similar op instead of init_tensor, with the expectation that the stores of “0” into the new buffer will be dead and eventually optimized away. This is a good alternative, but we do not have a pass to get rid of such dead stores yet, and it is not easy to implement. E.g., consider the case where the buffer is filled (overwritten) with data in a loop; we would need some kind of range analysis to make sure that the entire buffer is overwritten. In the general case, this may not even be possible because the loop structure and indexing into the buffer could be arbitrarily complex.
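To illustrate why (a sketch, with linalg.fill standing in for the tensor-of-zeros and %row assumed to be some tensor<1x?xf32> defined elsewhere; whether the zero-fill is dead depends on the loop bounds):

%init = linalg.init_tensor [%d0, %d1] : tensor<?x?xf32>
%zeros = linalg.fill(%c0f, %init) : f32, tensor<?x?xf32> -> tensor<?x?xf32>
%res = scf.for %i = %lb to %ub step %c1 iter_args(%t = %zeros) -> (tensor<?x?xf32>) {
  // overwrite one row per iteration
  %next = tensor.insert_slice %row into %t[%i, 0] [1, %d1] [1, 1]
      : tensor<1x?xf32> into tensor<?x?xf32>
  scf.yield %next : tensor<?x?xf32>
}
// the zero-fill is dead only if [%lb, %ub) covers [0, %d0), which a purely local
// pattern cannot decide without range analysis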

My proposal is to add alloc_tensor to the bufferization dialect and subsequently unify it with sparse_tensor.init. Then we have a nice story for bufferization and a predictable way for users to control in-place bufferization decisions (without the CSE/… issues described in this thread). Any concerns?

4 Likes

+1, this matches my mental model of what we want here – during bufferization (or perhaps just before), we want to be able to think in terms of buffers, even though we are still in tensors for the purpose of keeping SSA use-def chains. Having an op that is an “intended buffer allocation”, even though it is still a tensor op, makes sense at that level.