Reasoning about memref mutability

Thank you for the detailed explanation! It’s very helpful to understand the motivation behind the decision.
:+1:

Oh, that also makes sense in terms of C semantics! Thank you :slight_smile:
Then is it right to understand that after some memref operations (like tensor_load, clone, …), the memory buffer pointed to by the memref is marked as read-only dynamically?
Here, dynamically means that the memory buffer’s read-only property is determined at runtime, not statically at compile time.

1 Like

I don’t think “dynamic” is really what I’d use to qualify it, in the sense that the property is always kind of known statically (i.e. by an oracle), but an analysis may not be able to tell you.
Maybe an aspect of “dynamism” is what I would call “contextual”; for example, with some C-like pseudo-code:

void foo(int *p) {
  // what do you know about p here?
  // it is contextual to a particular call site, maybe p could be nullptr in some paths?
}

At first, I had the same point of view as you. But if we accept only a statically determined read-only property, it can’t explain the following example.

func @foo() -> memref<5xf32> {
  %idx = constant 0 : index
  %cst = constant 1.000000e+00 : f32
  %ret = memref.alloc() : memref<5xf32>
  memref.store %cst, %ret[%idx] : memref<5xf32>
  %temp = memref.clone %ret : memref<5xf32> to memref<5xf32>
  return %temp : memref<5xf32>
}

There exists a memref.clone operation, so the oracle would say that the source and destination memrefs (here %ret and %temp) point to a read-only memory block. So it is not allowed to store the %cst value into the %ret memref, and this example would be considered UB.
Does this example really exhibit UB?

Uh, I have no idea why memref.clone is defined as “mutating the source or result of the clone operation leads to undefined behavior”; the doc should be much more elaborate on this, or link to the bufferization implementation details.
Ideally such an op should be named unambiguously: memref.__bufferization_clone__ or something like that.

According to the doc yes. What’s the dynamic angle in this example?

Edit: note that the original request from @herhut was “mutating the source and result of clone after the clone operation has executed has undefined behavior”, which would imply that your IR does not expose UB since the store is before the clone. This would be UB:

func @foo() -> memref<5xf32> {
  %idx = constant 0 : index
  %cst = constant 1.000000e+00 : f32
  %ret = memref.alloc() : memref<5xf32>
  %temp = memref.clone %ret : memref<5xf32> to memref<5xf32>
  memref.store %cst, %ret[%idx] : memref<5xf32>
  return %temp : memref<5xf32>
}

Note that linalg.inplaceable was introduced to experiment with different strategies for dealing with this issue. For example, I had the following problem in the sparse compiler when lowering the following kernel, which adds two vectors, into sparse, bufferized code.

#SparseVector = #sparse_tensor.encoding<{
   dimLevelType = [ "compressed" ]
 }>

func @add(%arga: tensor<32xf32, #SparseVector>,
          %argx: tensor<32xf32>) -> tensor<32xf32> {
  %0 = linalg.generic #trait1
     ins(%arga: tensor<32xf32, #SparseVector>)
    outs(%argx: tensor<32xf32>) {
      ^bb(%a: f32, %x: f32):
        %0 = addf %x, %a : f32
        linalg.yield %0 : f32
  } -> tensor<32xf32>
  return %0 : tensor<32xf32>
}

Without any more information on the in/out tensor %argx, the following code must be generated.

    %0 = sparse_tensor.pointers %arg0, %c0 : tensor<32xf32, #sparse_tensor.encoding<{ dimLevelType = [ "compressed" ], pointerBitWidth = 0, indexBitWidth = 0 }>> to memref<?xindex>
    %1 = sparse_tensor.indices %arg0, %c0 : tensor<32xf32, #sparse_tensor.encoding<{ dimLevelType = [ "compressed" ], pointerBitWidth = 0, indexBitWidth = 0 }>> to memref<?xindex>
    %2 = sparse_tensor.values %arg0 : tensor<32xf32, #sparse_tensor.encoding<{ dimLevelType = [ "compressed" ], pointerBitWidth = 0, indexBitWidth = 0 }>> to memref<?xf32>
    %3 = memref.buffer_cast %arg1 : memref<32xf32>
    %4 = memref.alloc() : memref<32xf32>
    linalg.copy(%3, %4) : memref<32xf32>, memref<32xf32> 
    %5 = memref.load %0[%c0] : memref<?xindex>
    %6 = memref.load %0[%c1] : memref<?xindex>
    scf.for %arg2 = %5 to %6 step %c1 {
      %8 = memref.load %1[%arg2] : memref<?xindex>
      %9 = memref.load %4[%8] : memref<32xf32>
      %10 = memref.load %2[%arg2] : memref<?xf32>
      %11 = addf %9, %10 : f32
      memref.store %11, %4[%8] : memref<32xf32>
    }
    %7 = memref.tensor_load %4 : memref<32xf32>
    return %7 : tensor<32xf32>

Here, the linalg.copy introduces an O(N) operation into something that should be O(nnz) only. By adding the annotation:

func @add(%arga: tensor<32xf32, #SparseVector>,
          %argx: tensor<32xf32> {linalg.inplaceable = true}) -> tensor<32xf32> {
  ...
}

we get the following code, which keeps the update of the dense vector O(nnz).

    %0 = sparse_tensor.pointers %arg0, %c0 : tensor<32xf32, #sparse_tensor.encoding<{ dimLevelType = [ "compressed" ], pointerBitWidth = 0, indexBitWidth = 0 }>> to memref<?xindex>
    %1 = sparse_tensor.indices %arg0, %c0 : tensor<32xf32, #sparse_tensor.encoding<{ dimLevelType = [ "compressed" ], pointerBitWidth = 0, indexBitWidth = 0 }>> to memref<?xindex>
    %2 = sparse_tensor.values %arg0 : tensor<32xf32, #sparse_tensor.encoding<{ dimLevelType = [ "compressed" ], pointerBitWidth = 0, indexBitWidth = 0 }>> to memref<?xf32>
    %3 = memref.buffer_cast %arg1 : memref<32xf32>
    %4 = memref.load %0[%c0] : memref<?xindex>
    %5 = memref.load %0[%c1] : memref<?xindex>
    scf.for %arg2 = %4 to %5 step %c1 {
      %7 = memref.load %1[%arg2] : memref<?xindex>
      %8 = memref.load %3[%7] : memref<32xf32>
      %9 = memref.load %2[%arg2] : memref<?xf32>
      %10 = addf %8, %9 : f32
      memref.store %10, %3[%7] : memref<32xf32>
    }
    %6 = memref.tensor_load %3 : memref<32xf32>
    return %6 : tensor<32xf32>

Eventually, the fully bufferized version looks something like this:

func @add(%arg0: !llvm.ptr<i8>, %arg1: memref<32xf32> {linalg.inplaceable = true}) -> memref<32xf32> {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %0 = call @sparsePointers(%arg0, %c0) : (!llvm.ptr<i8>, index) -> memref<?xindex>
  %1 = call @sparseIndices(%arg0, %c0) : (!llvm.ptr<i8>, index) -> memref<?xindex>
  %2 = call @sparseValuesF32(%arg0) : (!llvm.ptr<i8>) -> memref<?xf32>
  %3 = memref.load %0[%c0] : memref<?xindex>
  %4 = memref.load %0[%c1] : memref<?xindex>
  br ^bb1(%3 : index)
^bb1(%5: index):  // 2 preds: ^bb0, ^bb2
  %6 = cmpi slt, %5, %4 : index
  cond_br %6, ^bb2, ^bb3
^bb2:  // pred: ^bb1
  %7 = memref.load %1[%5] : memref<?xindex>
  %8 = memref.load %arg1[%7] : memref<32xf32>
  %9 = memref.load %2[%5] : memref<?xf32>
  %10 = addf %8, %9 : f32
  memref.store %10, %arg1[%7] : memref<32xf32>
  %11 = addi %5, %c1 : index
  br ^bb1(%11 : index)
^bb3:  // pred: ^bb1
  return %arg1 : memref<32xf32>
}

This kind of representation looks quite funky to me in the first place though: why would a kernel operation even take %argx as an input in the tensor domain here?

Funky? How else would you express the tensor expression x(i) += a(i) in linalg (where arga denotes the parameter for the sparse vector a and argx the parameter for the updated dense vector x)?

You’re presenting some sort of pseudo-code in an unspecified language, so it is hard to answer precisely :slight_smile:
That said, it seems that you’re trying to implement some sort of “in-place update” of a variable, but the MLIR tensor is a value-based immutable type!

Basically, using a tensor value requires expressing the IR after SSA construction when starting from the form you’re presenting above. Look at how clang/llvm expresses:

int x = ...;
x += i;
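
For illustration, here is a rough sketch (in the standard-dialect style used elsewhere in this thread, not the actual clang/llvm output) of what that update looks like after SSA construction: nothing is mutated, a fresh value is defined instead.

%x0 = ... : i32           // value of x before the update
%x1 = addi %x0, %i : i32  // "x += i" yields a new SSA value; %x0 is untouched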

Well, not trying, succeeding, as the MLIR IR I show above runs correctly end-to-end and with great performance :wink:

What I would genuinely like to know, though, is what would be an acceptable way for you to go from the value-based, immutable tensors at the higher levels of IR to the in-place updates of mutable memory locations at the lower levels of IR. For dense computations, the occasional copy in/out may be acceptable, but for sparse computations this can negatively impact the asymptotic complexity of the kernel. We have been discussing the example I showed above for a while now, and you seem to dislike both the outs clause of linalg and the inplaceable annotation on the tensor as a solution. So, what would meet your bar for proper IR yet with proper performance?

This example made me realize that the sparse compiler needs to start using the more “modern” memref.copy instead of linalg.copy (other linalg ops were already replaced by their memref equivalents in prior revisions, but this one was still lingering). Revision D106038 makes this happen. The hope is that memref.copy will be optimized better in the future. Also, this change removes the dependence on the “linalg-to-loops” pass for dealing with just this single remaining op.

That’s not really an argument though, many people write undefined behavior in C++ and say “but it works, I tried it”.

Again the clever one-liner, but no answer. What part of the MLIR IR above is undefined? If you feel the need to be derogatory about my work, please be courteous and back it up a bit better. Right now I strongly prefer my “funky” but existing solution over your non-existing solution. But I remain open to better solutions, which is why I try to engage in this dialogue.

I believe the IR above clearly shows the difference between the standard path the sparse compiler follows to preserve the immutable, SSA behavior of the input, using the following bufferized version

x_out = alloc
for i = 1, N
   x_out(i) = x_in(i)
for i = where a is nonzero
   x_out(i) = x_out(i) + a(i)

and the much-preferred bufferized version below (in the sparse world, since it avoids the O(N) loop), which I allow when linalg.inplaceable = true is given.

for i = where a is nonzero
   x(i) = x(i) + a(i)

So my question again is, what MLIR IR that ultimately is able to map to this code would be acceptable to you?

I didn’t answer your question because when I point out an issue with semantics, you seem to ask me to provide a complete design of an entire compiler (at least all the bufferization part, which is very tricky): this does not look to me like an appropriate request in the first place.

I am not sure how to address such a question: I can give you IR that models a problem, but “ultimately is able to map to this code” is not so much about the IR as it is about the lowering and the optimization implementation. So, to loop back to my point above: answering your question completely would require a few weeks of work to provide you with the complete answer.

Also I’m not sure how to reconnect all of this to the topic here of memref and what’s UB?

We suggested having a separate bufferize dialect that contains the clone, buffer_cast, and tensor_load ops to make it very clear that these operations model intermediate state during partial bufferization. I am still open to that suggestion.

True. I will at least fix the documentation.

Also, @makesource: keep in mind that these operations only encode information for the bufferization passes. The memref.clone op does not exist at later stages. It can be lowered to memref.copy, at which point the immutability requirements no longer apply.
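
For illustration, the lowering could look roughly like this (a sketch only; the actual pass and the surrounding allocation/deallocation handling may differ):

%temp = memref.alloc() : memref<5xf32>
memref.copy %ret, %temp : memref<5xf32> to memref<5xf32>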

1 Like

We are looking into things like a use-range analysis and buffer reuse that could, in your example, remove the memref.copy operation. We do not yet have all the required interfaces in place to make this happen.

With that work, you would not need the in/out parameter but could simply have two in parameters and (eventually) an output shape, which for now has to be a tensor due to limitations in the modelling. I think that longer-term vision should put @mehdi_amini’s concerns to rest. Mutable tensors indeed would be funky, but that is not what is really happening here.

What we would still need is some annotation that tells us that a buffer passed to a function is owned by that function. inplaceable serves this purpose in the linalg bufferization. We have not explored cross-function lifetimes, because in all our settings that is a dynamic property.

4 Likes

That is great news Stephan! Yes, anything that removes the copy and changes the O(N) code back into O(nnz) would make me extremely happy!

Which is why I used “experiment” very deliberately in the first post here in the context of inplaceable annotations. I do not expect this to remain in the form I am using it now, but it has enabled me to already experiment with the kind of code that would be generated once the proper interfaces are in place (and also makes it easier to show current and desired code in discussions like this).

Thanks for constructively addressing my concerns on this!

Thank you for all the attention to this memref mutability issue.
Through this discussion, I got a better understanding of bufferization and the relevant restrictions on the memref dialect.

It seems like these restrictions are planned to be separated from the memref dialect to better represent the partial bufferization steps.
I also think that’s a good idea, in that it ensures this immutability only applies to that step (not to the whole memref dialect).

I just saw the fix to the documentation and I still find it quite twisted in how it reads. It also directly contradicts the statement above: “Clones the data in the input view into an implicitly defined output view.” Is this op misnamed? From the description, it’s more like a freeze_and_clone_read_only?

I agree; if anything, the documentation update was meant as a band-aid to at least make it correct. The output is implicitly defined, as opposed to explicitly allocated and copied into. It can be defined as an alias.

I did not have the time to move this into its own dialect but I will get that done. That will take a little longer, though.