Reasoning about memref mutability

Hello, I’m interested in understanding the behavior of memref mutability.
After reading the discussion about tensor_to_memref and its folding, I became curious about when it is okay to store a new value into a memref.

As far as I understand, the main point of that discussion is “we can introduce new writes only when we have visibility of the entire lifetime of a memref”. I was wondering about that point: how can we define more precisely what “we have visibility of the entire lifetime” means?

Here are some more detailed examples of simple memref operations. How can we decide whether the source program has undefined behavior?

  1. buffer_cast vs tensor_store
%0: tensor<4x?xf32>
%1 = memref.buffer_cast %0 : memref<4x?xf32, #map0, 42>
memref.store %100, %1[%c0, %c0] : memref<4x?xf32, #map0, 42> // this is UB according to the MLIR spec
%9 = dim %8, 1 : tensor<4x?xf32>
%10 = memref.alloc(%9) : memref<4x?xf32, #layout, memspace0>
memref.tensor_store %8, %10 : memref<4x?xf32, #layout, memspace0>
memref.store %100, %10[%c0, %c0] : memref<4x?xf32, #layout, memspace0> // storing a value to the result of tensor_store. Is this allowed or UB?
  2. clone vs copy
%1 = memref.alloc() : memref<2xf32>
%2 = memref.clone %1 : memref<2xf32> to memref<2xf32>
memref.store %100, %1[%c0] : memref<2xf32> // this is UB according to the MLIR spec
memref.store %100, %2[%c0] : memref<2xf32> // this is UB according to the MLIR spec
%1 = memref.alloc() : memref<2xf32>
memref.copy %2, %1 : memref<2xf32> to memref<2xf32>
memref.store %100, %1[%c0] : memref<2xf32> // is this UB?
memref.store %100, %2[%c0] : memref<2xf32> // is this UB?
memref.copy %3, %1 : memref<2xf32> to memref<2xf32> // copy another memref to %1. Is this UB?
  3. subview + store
%arg0: memref<4096x4096xf32>
%sv = memref.subview %arg0[%c0, %c0] [128, 32] [1, 1] : memref<4096x4096xf32> to memref<128x32xf32, #map0>
memref.store %100, %sv[%c0, %c0] : memref<128x32xf32, #map0> // is this allowed?
memref.store %100, %arg0[%c0, %c0] : memref<4096x4096xf32> // is this allowed?

Something important in this quote is “introduce new writes”: that means that any example needs to start from an IR without the write, and you need to ask about the legality of introducing a new write that didn’t exist in the original program. Standalone IR examples on their own may not be enough.
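For example (a made-up before/after pair just to illustrate the framing; the types and value names are invented):

// Original, valid IR:
%m = memref.buffer_cast %t : memref<4xf32>
%v = memref.load %m[%c0] : memref<4xf32>

// Hypothetical result of a transformation: is a pass allowed to introduce this store?
%m = memref.buffer_cast %t : memref<4xf32>
memref.store %f, %m[%c0] : memref<4xf32> // newly introduced write
%v = memref.load %m[%c0] : memref<4xf32>

The question is then about the legality of this rewrite, not about either snippet in isolation.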
Also UB may not be statically determined (think about use-after-free).

You’re not storing to the “result of tensor_store”; you’re storing into a memref here. I don’t see any issue with this; it just looks like the tensor_store is dead.

I don’t see any cause for UB here; can you elaborate on what you’re looking for?

Can you clarify these questions: are you asking whether you could introduce these stores if they weren’t there in the first place? It’d be easier for me if you presented these as transformations: starting from a known valid IR and considering a hypothetical transformation that introduces stores.

Thank you for the quick answer :slight_smile:

After reading your reply, I think I misunderstood the main point of the previous “tensor_to_memref and its folding” post. Now I understand the meaning of “introducing new writes”.

Because of that, I think my question was misdirected from what I intended.

What I’m concerned about today is “how to understand the memref memory model (in terms of mutability)”.

A memref is conceptually like a pointer in a C program (i.e. a memref points to a specific memory buffer and can alias other memrefs). So it’s natural to assume that modifying the memory pointed to by a memref is possible, just as with C pointers.

But in the MLIR documentation, it looks like there are some restrictions on modifying memrefs.

Restriction 1)

According to this document, all memref writes must dominate all reads during bufferization, i.e. it is impossible to mutate a memref that has already been read by any reader.

During Bufferization, we convert immutable value types (tensors) to mutable types (memref). This conversion is done in several steps and in all of these steps the IR has to fulfill SSA like properties. The usage of memref has to be in the following consecutive order: allocation, write-buffer, read-buffer. In this case, there are only buffer reads allowed after the initial full buffer write is done. In particular, there must be no partial write to a buffer after the initial write has been finished. However, partial writes in the initializing is allowed (fill buffer step by step in a loop e.g.). This means, all buffer writes needs to dominate all buffer reads.
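
If I read this correctly, a minimal example of the allowed ordering would look like this (my own sketch; shapes and values are invented):

%buf = memref.alloc(%n) : memref<?xf32>
scf.for %i = %c0 to %n step %c1 {
  memref.store %cst, %buf[%i] : memref<?xf32> // partial writes while initializing are fine
}
%v = memref.load %buf[%c0] : memref<?xf32> // reads only happen after the full write is done
// a memref.store to %buf after this load would violate the writes-dominate-reads ordering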

Restriction 2)

According to the MemRef dialect’s documentation, memrefs returned from some operations are marked as immutable (operations like clone, buffer_cast, tensor_load).
I think the store-after-clone examples above cause UB because of this restriction.

Note, that mutating the source or result of the clone operation leads to undefined behavior.

Note, that mutating the result of the buffer cast operation leads to undefined behavior.

If tensor load is used in the bufferization steps, mutating the source buffer after loading leads to undefined behavior.

And here’s my questions,

  1. Does the first restriction, which specifies the buffer read/write ordering, apply only during the bufferization steps? If so, why is this restriction needed only for bufferization? I’m wondering what the motivation for that decision is.
  2. Apart from the rules above, are there any additional cases where modifying a memref is specified as undefined behavior?

The reason for this restriction is that it simplifies gradual bufferization of programs. When partially bufferizing a program, we insert memref.tensor_load and memref.buffer_cast operations into the program. These operations bridge between the value world and the buffer world. This bridging only works if the buffers are used like values, i.e., they are produced and then never mutated. Otherwise, you could have a sequence like

%tensor = memref.tensor_load %buffer
"mutate"(%buffer)
%other = "use"(%tensor)

What would the expected semantics be? That %tensor has the contents of %buffer at the point where the tensor_load is issued? If so, a later lowering of tensor_load to buffers would need to take a copy, which is not what we intend. Disallowing this case makes this analysis easier.
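To make that concrete, preserving such “snapshot” semantics would force a later lowering to materialize something like this (a sketch only; the extra alloc/copy and the types are hypothetical, and this copy is exactly what we want to avoid):

%snapshot = memref.alloc() : memref<4xf32> // hypothetical copy that freezes the contents
memref.copy %buffer, %snapshot : memref<4xf32> to memref<4xf32>
"mutate"(%buffer)
%other = "use"(%snapshot)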

The memref.clone operation is inserted by the buffer lifetime management passes. It is meant to extend the lifetime of a buffer. This can either be by copying the buffer or, in a setting with reference counting, by increasing the reference counter. However, the two are only equivalent if the buffer is not mutated anymore.
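As a small sketch of why mutation breaks that equivalence (types and values invented):

%a = memref.alloc() : memref<2xf32>
%b = memref.clone %a : memref<2xf32> to memref<2xf32>
memref.store %v, %a[%c0] : memref<2xf32> // the disallowed write: a copy-based
// lowering of the clone leaves %b unchanged, while a reference-counting lowering
// makes %b observe the new value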

Outside gradual/partial bufferization or the lifetime management passes, no restrictions exist.


Yes, but in C you may have pointers that come from a constant region of data memory (rodata?) or a pointer to the code/text section, which is usually read-only as well. (This is in scope for your “Restriction 2)” case, I think?)


Thank you for the detailed explanation! It’s very helpful for understanding the motivation behind the decision.
:+1:

Oh, that also makes sense in terms of C semantics! Thank you :slight_smile:
Then is it right to understand that after some memref operations (like tensor_load, clone, …), the memory buffer pointed to by the memref is marked as read-only dynamically?
Here “dynamically” means the memory buffer’s read-only property is determined at runtime, not statically at compile time.


I don’t think “dynamic” is really what I’d use to qualify it, in the sense that the property is always kind of known statically (i.e. by an oracle), but an analysis may not be able to tell you.
Maybe an aspect of “dynamism” is what I would call “contextual”; for example, with some C-like pseudo-code:

void foo(int *p) {
  // what do you know about p here?
  // it is contextual to a particular call site, maybe p could be nullptr in some paths?
}

At first, I had the same point of view as you. But if we accept only a statically determined read-only property, it can’t explain the following example.

func @foo() -> memref<5xf32> {
  %idx = constant 0 : index
  %cst = constant 1.000000e+00 : f32
  %ret = memref.alloc() : memref<5xf32>
  memref.store %cst, %ret[%idx] : memref<5xf32>
  %temp = memref.clone %ret : memref<5xf32> to memref<5xf32>
  return %temp : memref<5xf32>
}

There exists a memref.clone operation, so the oracle would say the source and destination memrefs (here %ret and %temp) point to a read-only memory block. So it is not allowed to store the %cst value to the %ret memref, and this example is considered to be UB.
Does this example really exhibit UB?

Uh, I have no idea why memref.clone is defined as “mutating the source or result of the clone operation leads to undefined behavior”; the doc should be much, much more elaborate on this, or link to the bufferization implementation details.
Ideally such an op should be named unambiguously: memref.__bufferization_clone__ or something like that.

According to the doc, yes. What’s the dynamic angle in this example?

Edit: note that the original request from @herhut was “mutating the source and result of clone after the clone operation has executed has undefined behavior”, which would imply that your IR does not expose UB since the store is before the clone. This would be UB:

func @foo() -> memref<5xf32> {
  %idx = constant 0 : index
  %cst = constant 1.000000e+00 : f32
  %ret = memref.alloc() : memref<5xf32>
  %temp = memref.clone %ret : memref<5xf32> to memref<5xf32>
  memref.store %cst, %ret[%idx] : memref<5xf32>
  return %temp : memref<5xf32>
}

Note that the linalg.inplaceable attribute was introduced to experiment with different strategies for dealing with this issue. For example, I had the following problem in the sparse compiler when lowering the following kernel, which adds two vectors, into sparse, bufferized code.

#SparseVector = #sparse_tensor.encoding<{
   dimLevelType = [ "compressed" ]
 }>

func @add(%arga: tensor<32xf32, #SparseVector>,
          %argx: tensor<32xf32>) -> tensor<32xf32> {
  %0 = linalg.generic #trait1
     ins(%arga: tensor<32xf32, #SparseVector>)
    outs(%argx: tensor<32xf32>) {
      ^bb(%a: f32, %x: f32):
        %0 = addf %x, %a : f32
        linalg.yield %0 : f32
  } -> tensor<32xf32>
  return %0 : tensor<32xf32>
}

Without any more information on the in/out tensor argx, the following code must be generated.

    %0 = sparse_tensor.pointers %arg0, %c0 : tensor<32xf32, #sparse_tensor.encoding<{ dimLevelType = [ "compressed" ], pointerBitWidth = 0, indexBitWidth = 0 }>> to memref<?xindex>
    %1 = sparse_tensor.indices %arg0, %c0 : tensor<32xf32, #sparse_tensor.encoding<{ dimLevelType = [ "compressed" ], pointerBitWidth = 0, indexBitWidth = 0 }>> to memref<?xindex>
    %2 = sparse_tensor.values %arg0 : tensor<32xf32, #sparse_tensor.encoding<{ dimLevelType = [ "compressed" ], pointerBitWidth = 0, indexBitWidth = 0 }>> to memref<?xf32>
    %3 = memref.buffer_cast %arg1 : memref<32xf32>
    %4 = memref.alloc() : memref<32xf32>
    linalg.copy(%3, %4) : memref<32xf32>, memref<32xf32> 
    %5 = memref.load %0[%c0] : memref<?xindex>
    %6 = memref.load %0[%c1] : memref<?xindex>
    scf.for %arg2 = %5 to %6 step %c1 {
      %8 = memref.load %1[%arg2] : memref<?xindex>
      %9 = memref.load %4[%8] : memref<32xf32>
      %10 = memref.load %2[%arg2] : memref<?xf32>
      %11 = addf %9, %10 : f32
      memref.store %11, %4[%8] : memref<32xf32>
    }
    %7 = memref.tensor_load %4 : memref<32xf32>
    return %7 : tensor<32xf32>

Here, the linalg.copy introduces an O(N) operation into something that should be O(nnz) only. By adding the annotation:

func @add(%arga: tensor<32xf32, #SparseVector>,
          %argx: tensor<32xf32> {linalg.inplaceable = true}) -> tensor<32xf32> {
  ...
}

we get the following code, which keeps the update of the dense vector O(nnz).

    %0 = sparse_tensor.pointers %arg0, %c0 : tensor<32xf32, #sparse_tensor.encoding<{ dimLevelType = [ "compressed" ], pointerBitWidth = 0, indexBitWidth = 0 }>> to memref<?xindex>
    %1 = sparse_tensor.indices %arg0, %c0 : tensor<32xf32, #sparse_tensor.encoding<{ dimLevelType = [ "compressed" ], pointerBitWidth = 0, indexBitWidth = 0 }>> to memref<?xindex>
    %2 = sparse_tensor.values %arg0 : tensor<32xf32, #sparse_tensor.encoding<{ dimLevelType = [ "compressed" ], pointerBitWidth = 0, indexBitWidth = 0 }>> to memref<?xf32>
    %3 = memref.buffer_cast %arg1 : memref<32xf32>
    %4 = memref.load %0[%c0] : memref<?xindex>
    %5 = memref.load %0[%c1] : memref<?xindex>
    scf.for %arg2 = %4 to %5 step %c1 {
      %7 = memref.load %1[%arg2] : memref<?xindex>
      %8 = memref.load %3[%7] : memref<32xf32>
      %9 = memref.load %2[%arg2] : memref<?xf32>
      %10 = addf %8, %9 : f32
      memref.store %10, %3[%7] : memref<32xf32>
    }
    %6 = memref.tensor_load %3 : memref<32xf32>
    return %6 : tensor<32xf32>

Eventually, the fully bufferized version looks something like this:

func @add(%arg0: !llvm.ptr<i8>, %arg1: memref<32xf32> {linalg.inplaceable = true}) -> memref<32xf32> {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %0 = call @sparsePointers(%arg0, %c0) : (!llvm.ptr<i8>, index) -> memref<?xindex>
  %1 = call @sparseIndices(%arg0, %c0) : (!llvm.ptr<i8>, index) -> memref<?xindex>
  %2 = call @sparseValuesF32(%arg0) : (!llvm.ptr<i8>) -> memref<?xf32>
  %3 = memref.load %0[%c0] : memref<?xindex>
  %4 = memref.load %0[%c1] : memref<?xindex>
  br ^bb1(%3 : index)
^bb1(%5: index):  // 2 preds: ^bb0, ^bb2
  %6 = cmpi slt, %5, %4 : index
  cond_br %6, ^bb2, ^bb3
^bb2:  // pred: ^bb1
  %7 = memref.load %1[%5] : memref<?xindex>
  %8 = memref.load %arg1[%7] : memref<32xf32>
  %9 = memref.load %2[%5] : memref<?xf32>
  %10 = addf %8, %9 : f32
  memref.store %10, %arg1[%7] : memref<32xf32>
  %11 = addi %5, %c1 : index
  br ^bb1(%11 : index)
^bb3:  // pred: ^bb1
  return %arg1 : memref<32xf32>
}

This kind of representation looks quite funky to me in the first place though: why would a kernel operation even take %argx as an input in the tensor domain here?

Funky? How else would you express the tensor expression x(i) += a(i) in linalg (where arga denotes the parameter for the sparse vector a and argx the parameter for the updated dense vector x)?

You’re presenting some sort of pseudo-code in an unspecified language, so it is hard to answer precisely :slight_smile:
That said, it seems that you’re trying to implement some sort of “inplace update” of a variable, but an MLIR tensor is a value-based immutable type!

Basically, using a tensor value requires expressing the IR after SSA construction when starting from the form you’re presenting above. Look at how clang/LLVM expresses:

int x = ...;
x += i;
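
Roughly, after SSA construction this becomes something like the following (a sketch in MLIR-like notation; the names are invented):

%x0 = ... // the initial value of x
%x1 = addi %x0, %i : i32 // "x += i" produces a new value instead of mutating x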

Well, not trying, succeeding, as the MLIR IR I show above runs correctly end-to-end and with great performance :wink:

What I would genuinely like to know, though, is what would be an acceptable way for you to go from the value-based, immutable tensors at the higher levels of IR to the in-place updates of mutable memory locations at the lower levels of IR. For dense computations, the occasional copy in/out may be acceptable, but for sparse computations this can negatively impact the asymptotic complexity of the kernel. We have been discussing the example I showed above for a while now, and you seem to dislike both the outs-clause of linalg and the inplaceable annotation on the tensor as solutions. So, what would meet your bar for proper IR while still delivering proper performance?

This example made me realize that the sparse compiler needs to start using the more “modern” memref.copy instead of linalg.copy (other linalg ops were already replaced by their memref equivalents in prior revisions, but this one was still lingering). Revision D106038 makes this happen. The hope is that memref.copy will be optimized better in the future. Also, this change removes the dependence on the “linalg-to-loops” pass for dealing with just this single remaining op.
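Concretely, the copy in the generated code above changes along these lines (sketch):

// before:
linalg.copy(%3, %4) : memref<32xf32>, memref<32xf32>
// after:
memref.copy %3, %4 : memref<32xf32> to memref<32xf32>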

That’s not really an argument though, many people write undefined behavior in C++ and say “but it works, I tried it”.

Again the clever one-liner, but no answer. What part of the MLIR IR above is undefined? If you feel the need to be derogatory about my work, please be courteous and back it up a bit better. Right now I strongly prefer my “funky” but existing solution over your non-existing solution. But I remain open to better solutions, which is why I try to engage in this dialogue.

I believe the IR above clearly shows the difference between the standard path the sparse compiler follows to preserve the immutable, SSA behavior of the input, using the following bufferized version

x_out = alloc
for i = 1, N
   x_out(i) = x_in(i)
for i = where a is nonzero
   x_out(i) = x_out(i) + a(i)

and the much more preferred bufferized version below (in the sparse world, since it avoids the O(N) loop), which I allow when linalg.inplaceable = true is given.

for i = where a is nonzero
   x(i) = x(i) + a(i)

So my question, again, is: what MLIR IR that is ultimately able to map to this code would be acceptable to you?

I didn’t answer your question because, when I point out an issue with the semantics, you seem to ask me to provide a complete design of an entire compiler (at least all of the bufferization part, which is very tricky): this does not look to me like an appropriate request in the first place.

I am not sure how to address such a question: I can give you an IR that models a problem, but “ultimately is able to map to this code” is not so much about the IR as it is about the lowering and the optimization implementation. So, to loop back to my point above: answering your question completely would require a few weeks of work to provide you with a complete answer.

Also, I’m not sure how to reconnect all of this to the topic here, which is memref and what constitutes UB.

We suggested having a separate bufferize dialect that contains the clone, buffer_cast and tensor_load ops, to make it very clear that these operations model intermediate state during partial bufferization. I am still open to that suggestion.

True. I will at least fix the documentation.

Also, @makesource: keep in mind that these operations only encode information for the bufferization passes. memref.clone does not exist at later stages; it can be lowered to memref.copy, at which point the immutability requirements no longer apply.
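As an illustration (a sketch of the copy-based lowering, not the exact pass output; types invented):

// during partial bufferization:
%b = memref.clone %a : memref<2xf32> to memref<2xf32>

// after buffer lifetime management has run and the clone is lowered:
%b = memref.alloc() : memref<2xf32>
memref.copy %a, %b : memref<2xf32> to memref<2xf32>
// from here on, %a and %b are ordinary buffers with no extra immutability constraints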
