[RFC] Add `memref.generate` and `memref.from_elements`

The tensor dialect has two operations that create a new tensor with specified contents:
a) tensor.generate
b) tensor.from_elements

During bufferization, these are lowered to:
a) a buffer allocation and an scf.parallel loop filling the buffer
b) a buffer allocation and a sequence of stores into the buffer

These lowerings jump abstraction gaps. In particular, the tensor.generate lowering parallelizes the code. Some users may even prefer a sequential version using scf.for (or maybe yet another lowering). In the case of tensor.from_elements, there may be vectorization opportunities instead of creating scalar stores.
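To make lowering (b) concrete, here is a rough sketch of a tensor.from_elements input and the kind of IR bufferization produces for it. The function names are made up for illustration, and the exact output of the pass may differ in detail; the point is only that each element becomes its own scalar memref.store.

```mlir
// Input: build a 2-element tensor from two scalars.
func @from_elements_example(%a: f32, %b: f32) -> tensor<2xf32> {
  %t = tensor.from_elements %a, %b : tensor<2xf32>
  return %t : tensor<2xf32>
}

// Approximate result after bufferization (illustrative, not verbatim pass output):
// one allocation followed by one scalar store per element.
func @from_elements_bufferized(%a: f32, %b: f32) -> memref<2xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %m = memref.alloc() : memref<2xf32>
  memref.store %a, %m[%c0] : memref<2xf32>
  memref.store %b, %m[%c1] : memref<2xf32>
  return %m : memref<2xf32>
}
```

Once the op has been expanded into individual stores like this, a later pass that wants to emit a single vector store has to rediscover that the stores belong together.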

These issues could be solved by introducing two new ops in the memref dialect:
a) memref.generate
b) memref.from_elements

The exact naming of these ops could be different. The key point is that both have the same semantics as their respective tensor counterparts, but operate on a memref value that is passed into the op as an additional operand.

Example:

func @tensor.generate_static_and_dynamic(%arg0: index) -> tensor<16x?xindex> {
  %result = tensor.generate %arg0 {
  ^bb0(%i: index, %j: index):
    %sum = arith.addi %i, %j : index
    tensor.yield %sum : index
  } : tensor<16x?xindex>
  return %result : tensor<16x?xindex>
}

During bufferization, this would lower to:

func @tensor.generate_static_and_dynamic(%arg0: index) -> memref<16x?xindex> {
  %result = memref.alloc(%arg0) : memref<16x?xindex>
  memref.generate %result {
  ^bb0(%i: index, %j: index):
    %sum = arith.addi %i, %j : index
    memref.yield %sum : index
  } : memref<16x?xindex>
  return %result : memref<16x?xindex>
}

Another pass could then lower the memref.generate to scf.parallel or scf.for. We currently have BufferizeGenerateOp, which lowers from tensor.generate all the way to memref.alloc + scf.parallel, so we would effectively split this pattern into two patterns and add a new op for the intermediate step.
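For illustration, the two lowerings of the example above might look roughly like the sketch below. memref.generate itself is only proposed here, so this shows just the resulting loop nests, which use existing ops. The function names are invented for this sketch, and the loop terminators are left implicit.

```mlir
// Possible lowering to scf.parallel (similar to what BufferizeGenerateOp emits today).
func @generate_as_parallel(%arg0: index) -> memref<16x?xindex> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c16 = arith.constant 16 : index
  %result = memref.alloc(%arg0) : memref<16x?xindex>
  // One parallel iteration per element; the body computes and stores the value.
  scf.parallel (%i, %j) = (%c0, %c0) to (%c16, %arg0) step (%c1, %c1) {
    %sum = arith.addi %i, %j : index
    memref.store %sum, %result[%i, %j] : memref<16x?xindex>
  }
  return %result : memref<16x?xindex>
}

// Possible sequential lowering to nested scf.for loops.
func @generate_as_for(%arg0: index) -> memref<16x?xindex> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c16 = arith.constant 16 : index
  %result = memref.alloc(%arg0) : memref<16x?xindex>
  scf.for %i = %c0 to %c16 step %c1 {
    scf.for %j = %c0 to %arg0 step %c1 {
      %sum = arith.addi %i, %j : index
      memref.store %sum, %result[%i, %j] : memref<16x?xindex>
    }
  }
  return %result : memref<16x?xindex>
}
```

With the intermediate op, the choice between these two forms would be made by the second pattern rather than being fixed during bufferization.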

What are your opinions on this?

Thanks Matthias!

For the tensor.from_elements, I am generally curious as to why this op exists.
Is there a fundamental advantage over just using multiple tensor.insert operations?
I am curious about this because similar discussions would occur at the vector level and AFAIK there is no such support in LLVM except for simple one-at-a-time insertelement.

Re. tensor.generate and memref.generate: this is a good example of the need for a generic structured operation on tensors and buffers that knows how to lower to loops and can easily bufferize. If only we had such a general mechanism to avoid creating one-off ops that inevitably lead to conflation of concerns…

I’d be curious to hear people’s opinion about a more general tensor.generic and memref.generic with the above properties.

tensor.from_elements avoids having an undefined or initial value, as the entire tensor is guaranteed to be written by construction. It is also slightly more convenient when optimizing accesses to values constructed using it: there is no need to traverse the use-def chain to find the corresponding insertion.
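A small side-by-side sketch of the difference, with made-up function names: the tensor.insert version needs some starting tensor to insert into (passed in as %init here purely for illustration), and finding the value at a given index means walking back through the chain of inserts.

```mlir
// from_elements: every element is given at once, no initial tensor is needed.
func @build_with_from_elements(%a: f32, %b: f32) -> tensor<2xf32> {
  %t = tensor.from_elements %a, %b : tensor<2xf32>
  return %t : tensor<2xf32>
}

// insert chain: requires an initial tensor whose contents are unspecified
// until every position has been overwritten.
func @build_with_inserts(%init: tensor<2xf32>, %a: f32, %b: f32) -> tensor<2xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %t0 = tensor.insert %a into %init[%c0] : tensor<2xf32>
  %t1 = tensor.insert %b into %t0[%c1] : tensor<2xf32>
  return %t1 : tensor<2xf32>
}
```

In the first form, a fold of tensor.extract at a constant index can read the operand directly; in the second, it has to follow %t1 to %t0 to find which insert wrote that position.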

We use this op to construct small tensors, typically when bridging shape computations between scalarized and vector forms. Something lightweight and well structured is fairly useful there.

Similarly, we use tensor.generate in cases where we do not know the rank and hence have variable-length shapes.

In our use case, we only lower these operations close to final code generation and we do not perform fusion or any other compute optimizations on them. For shape computations on small (max 10 elements) tensors, it is generally not worth it. Oftentimes we scalarize these operations away before we even get to bufferization.

I think there is some value in having these lightweight operations independently of a more powerful structured operation. Both make sense for their intended use and I do not see a need to standardize on one of them.

I would not mind such a split. It would clean up the layering of dialects. I would not use a memref.generate on its own because at the memref level the uninitialized value aspect is gone. Maybe it should be named memref.fill instead, as that is closer to what it does.

I’m not convinced by the “jump an abstraction gap” argument here: scf.parallel just encodes more information than scf.for, and a user could lower the scf.parallel into an scf.for later.
As for “from_elements”, it only exists as a convenience in the value domain to avoid introducing the notion of “undef”, but this argument does not apply to memref as far as I can tell.

I’d be more interested to read about this instead of the series of one-off additions that seem a little bit ad-hoc to me right now.

At this point, these look like unnecessary abstractions that just add to the number of equivalent things to look at and require additional de-abstraction. If the issue is the lowering choice between scf.parallel and scf.for, that could be addressed either by adding a flag to the lowering that selects which of the two to generate, or by generating scf.parallel and converting it to scf.for if desired. There are potentially other design points using native attributes introduced during the lowering.

Thanks for the feedback, everyone! We’ll abandon this RFC and keep everything as is for the moment. We can revisit in the future if necessary.