Tensor to memref conversion (a.k.a. bufferize) question

Hello,

When a function uses tensors, code generation for CPUs ultimately requires lowering all tensors to memrefs.

A good part of this conversion is realized by the *-bufferize options of mlir-opt. However, I can’t find a way to convert all tensors to memrefs. For instance, in the following example from the distribution, I can’t see how to convert the signature of the function @dynamic_tensor_from_elements:

mlir/test/Dialect/Standard/bufferize.mlir
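
To make the question concrete, the kind of signature rewrite I am after is (hand-written sketch, not tool output):

// what bufferization should turn
func @f(%arg0: tensor<10xf32>) -> tensor<10xf32>
// into
func @f(%arg0: memref<10xf32>) -> memref<10xf32>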

I know there are memory allocation choices involved, and also optimization problems. I just want to know which part is automated and which is not.

Can you help me, please?

Best regards,
Dumitru

ADDENDUM: I have found this recent topic, which points to a very nice presentation. However, the presentation mentions objects that I am unable to find, such as the conversion pattern BufferAssignmentFuncOpConverter (I ran grep -R over the sources of both mlir and tensorflow). Maybe someone can point me to them?
I have also found a Phabricator post which mentions the file mlir/lib/Transforms/BufferPlacement.cpp, but that file is absent from my llvm repository, even after git pull. Is this a patch that has not yet landed on the main branch? If so, how can I use it?

See my reply here: What is the strategy for tensor->memref conversion? (bufferization) - #25 by _sean_silva

Generally, you run one *-bufferize pass for each dialect you want to convert to memrefs, and then finish with func-bufferize.
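
For example, something along these lines (exact pass names may vary with your revision):

mlir-opt input.mlir --scf-bufferize --linalg-bufferize --std-bufferize --func-bufferize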


Thanks a lot. I had already written this message before finding the main thread of discussion. Very nice work.

I’m trying the existing bufferization method. Consider the following function:

#identity = affine_map<(m) -> (m)>
#attrs1 = {
  args_in = 1, args_out = 1,
  indexing_maps = [#identity,#identity],
  iterator_types = ["parallel"]
}
func @myfun4(%i:tensor<10xf32>)->(tensor<10xf32>) {
  %o = linalg.generic #attrs1 ins(%i:tensor<10xf32>) {
  ^bb0(%elt:f32):
    %x = absf %elt : f32
    linalg.yield %elt:f32
  } -> tensor<10xf32>
  return %o:tensor<10xf32>
}

I’m bufferizing it with mlir-opt --linalg-bufferize --std-bufferize --func-bufferize try.mlir.

Currently, the output will (dynamically) allocate the output memref, which then must be deallocated outside the function. Is there some way to require that bufferization turns output tensors into extra memref arguments, so that it’s the caller that makes the allocation? This would, of course, have to be mirrored on the caller side, where the allocation has to be done. The advantage I see is that in many typical cases low-level code generation can then allocate the memref on the stack of the caller, which is easier and more predictable than playing with malloc and free.
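
Concretely, for @myfun4 above I would like bufferization to produce something with this shape (hand-written sketch, not tool output):

func @myfun4(%i: memref<10xf32>, %o: memref<10xf32>) {
  // ... compute directly into %o; no allocation of the result here
  return
}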

This is exactly what BufferizeFuncOpConverter does (mlir/Transforms/Bufferize.h).

/// Converts the signature of the function using BufferizeTypeConverter.
/// Each result type of the function is kept as a function result or appended to
/// the function arguments list based on ResultConversionKind for the converted
/// result type.
class BufferizeFuncOpConverter : public BufferizeOpConversionPattern<FuncOp> {
public:
  using BufferizeOpConversionPattern<FuncOp>::BufferizeOpConversionPattern;

  /// Performs the actual signature rewriting step.
  LogicalResult matchAndRewrite(mlir::FuncOp, ArrayRef<Value>,
                                ConversionPatternRewriter &) const override;
};

You may want to check the codebase to see where this pattern is used and tested. In the TF/MLIR repo, the mhlo → lmhlo conversion uses it.


You can use the buffer-results-to-out-params pass to do this transformation: https://github.com/llvm/llvm-project/blob/master/mlir/test/Transforms/buffer-results-to-out-params.mlir

It’s not a fully general transformation since it doesn’t handle dynamic shapes well. That’s why we do it in a separate pass.
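
Roughly, in the spirit of that test file, the pass rewrites (hand-written sketch with an unregistered test op, not verbatim tool output):

// Before:
func @basic() -> memref<f32> {
  %0 = "test.source"() : () -> memref<f32>
  return %0 : memref<f32>
}

// After --buffer-results-to-out-params:
func @basic(%arg0: memref<f32>) {
  %0 = "test.source"() : () -> memref<f32>
  linalg.copy(%0, %arg0) : memref<f32>, memref<f32>
  return
}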

This feature of BufferizeFuncOpConverter has been superseded by BufferResultsToOutParams.

The option is --test-finalizing-bufferize, but I can’t make it work, and I think I have also found a bug in another bufferization option. Assume the function is:

#identity = affine_map<(m) -> (m)>
#attrs1 = {
  args_in = 1, args_out = 1,
  indexing_maps = [#identity,#identity],
  iterator_types = ["parallel"]
}
func @myfun4(%i:tensor<10xf32>)->(tensor<10xf32>) {
  %o = linalg.generic #attrs1 ins(%i:tensor<10xf32>) {
  ^bb0(%elt:f32):
    %x = absf %elt : f32
    linalg.yield %elt:f32
  } -> tensor<10xf32>
  return %o:tensor<10xf32>
}

Then, applying mlir-opt --linalg-bufferize --convert-linalg-to-std to it gives:

#map = affine_map<(d0)[s0, s1] -> (d0 * s1 + s0)>
module {
  func @myfun4(%arg0: tensor<10xf32>) -> tensor<10xf32> {
    %0 = tensor_to_memref %arg0 : memref<10xf32>
    %1 = alloc() : memref<10xf32>
    %2 = memref_cast %0 : memref<10xf32> to memref<10xf32, #map>
    %3 = memref_cast %1 : memref<10xf32> to memref<10xf32, #map>
    call @op_has_no_registered_library_name(%2, %3) : (memref<10xf32, #map>, memref<10xf32, #map>) -> ()
    %4 = tensor_load %1 : memref<10xf32>
    return %4 : tensor<10xf32>
  }
  func @op_has_no_registered_library_name(memref<10xf32, #map>, memref<10xf32, #map>) attributes {llvm.emit_c_interface}
}

which seems to be incorrect because absf completely vanished (the whole linalg.generic body is replaced by a call to an external library function, and since I did not register a library_call name on the op, the callee gets the placeholder name). If I then apply mlir-opt --test-finalizing-bufferize to this output, I get:

<stdin>:10:5: error: failed to legalize operation 'std.return'
    return %4 : tensor<10xf32>

I find this weird, because return is one of the fundamental operations in SSA.

Use func-bufferize instead of --test-finalizing-bufferize. TestFinalizingBufferize at this point only tests features related to decomposing types along call graph edges; it’s not related to bufferization per se. See https://reviews.llvm.org/D90899 for that functionality being split out and test-finalizing-bufferize being removed.

(again, sorry that you got caught in the middle of this refactoring)

Ok, so I obtain the result I wanted with the following sequence of calls:

mlir-opt --linalg-bufferize --std-bufferize --func-bufferize try.mlir |\
  mlir-opt --buffer-results-to-out-params | \
  mlir-opt --buffer-deallocation
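
For reference, the same pipeline can also be expressed as a single mlir-opt invocation, with the passes run in the same order:

mlir-opt --linalg-bufferize --std-bufferize --func-bufferize \
  --buffer-results-to-out-params --buffer-deallocation try.mlir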

The output is:

#map = affine_map<(d0) -> (d0)>
module {
  func @myfun(%arg0: memref<10xf32>, %arg1: memref<10xf32>) {
    %0 = alloc() : memref<10xf32>
    linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel"]} ins(%arg0 : memref<10xf32>) outs(%0 : memref<10xf32>) {
    ^bb0(%arg2: f32, %arg3: f32):  // no predecessors
      %1 = absf %arg2 : f32
      linalg.yield %arg2 : f32
    }
    linalg.copy(%0, %arg1) : memref<10xf32>, memref<10xf32>
    dealloc %0 : memref<10xf32>
    return
  }
}

and I would expect the LLVM backend to generate stack allocations for such well-nested alloc/dealloc pairs.
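
If the backend does not do this by itself, the upstream promote-buffers-to-stack pass looks like it targets exactly this case, rewriting suitable alloc/dealloc pairs into stack allocations (assuming the pass is present in my revision):

mlir-opt --promote-buffers-to-stack bufferized.mlir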

I have a simple question. I’ve spent quite some time on it, and every time I approach a solution I find that some optimization only works in certain cases, etc. Assume you want to perform tensor concatenation and you don’t want to install TensorFlow. How would you implement such a function? The constraints are:

  • The signature must be func @tensor_concat(%t1:tensor<?xf32>,%t2:tensor<?xf32>)->tensor<?xf32>
  • You should be able to lower it to LLVM using only calls to mlir-opt (one or more).
  • The resulting LLVM code should have the correct deallocation operations, automatically synthesized, so there are no memory leaks after multiple calls.
  • Ideally you should not use memref in the original function (at all).

Hi @dpotop, can you ask that question in a new thread? It sounds like a separate discussion from this thread.

I found a solution that’s good enough for now. :-)

Just looked at this. I’m trying to understand why this has a ‘dependentDialects’ on linalg (in Passes.td). The conversion from memref result values to output memrefs is expected to be dialect-independent; BufferizeFuncOpConverter didn’t depend on any dialects, for example.

For now, linalg.copy is the only copy op we have (which is required to copy into out params). There’s no fundamental reason for that – there’s only one way to define such an op, so it makes sense to have it in some more neutral place than linalg. I think there have been multiple such ops that have incubated in linalg and graduated somewhere more neutral. See also Remove tight coupling of the BufferDeallocation pass to std and linalg operations - #4 by _sean_silva

That’s only true upstream; there are copy ops in other repos downstream, for example lhlo.copy in mlir-hlo, and in other dialects. IIRC, BufferizeFuncOpConverter and related things take a templated CopyOpTy and worked correctly for both linalg and lmhlo; they never depended on linalg. So there is something remaining before BufferResultsToOutParams can subsume BufferizeFuncOpConverter :). And there is the CopyOpInterface to make it cleaner (one can check whether the copy op type casts to it). For example, the copy-removal pass upstream works seamlessly on both the linalg copy op and the out-of-tree lmhlo.copy op. It is weird to me that the utility/pass that moves return values to output arguments depends on linalg.

Agreed that the new formulation does currently embed a dependence on linalg. I don’t think the new behavior is a serious regression when viewed from a larger perspective, because BufferDeallocation has the same problem, and users of BufferResultsToOutParams surely also run BufferDeallocation. This definitely highlights that MLIR upstream needs better support for this – a goal that we can work towards together and apply systematically to all our passes (see other thread for the currently active discussion about this).

I was just pointing out that BufferizeFuncOpConverter/BufferizeReturnOpConverter isn’t superseded by BufferResultsToOutParams yet, and one can’t simply use the latter in place of the former without bringing in an otherwise unnecessary dependence on an entire dialect. It’s really BufferizeReturnOpConverter that I wanted to highlight; removing it would definitely be a regression.

/// Rewrites the `ReturnOp` to conform with the changed function signature.
/// Operands that correspond to return values whose types have been set to
/// AppendToArgumentsList are dropped. In their place, a corresponding copy
/// operation from the operand to the target function argument is inserted.
template <typename ReturnOpSourceTy, typename ReturnOpTargetTy,
          typename CopyOpTy>
class BufferizeReturnOpConverter
    : public BufferizeOpConversionPattern<ReturnOpSourceTy> {

I agree in theory, but in practice it didn’t seem to be an issue.

I was very careful when removing it to reach out to all the parties involved, including the original authors, and I verified that this was the right approach by updating multiple downstream users to the new functionality (which went seamlessly).
