The tensor dialect has two operations that create a new tensor with specified contents:
a) tensor.generate
b) tensor.from_elements
During bufferization, these are lowered to:
a) a buffer allocation and an scf.parallel
loop filling the buffer
b) a buffer allocation and sequence of stores into the buffer
These lowerings jump abstraction gaps. In particular, the tensor.generate
lowering parallelizes the code. Some users may even prefer a sequential version using scf.for
(or maybe yet another lowering). In the case of tensor.from_elements
, there may be vectorization opportunities instead of creating scalar stores.
These issue could be solved by introducing two new ops in the memref dialect:
a) memref.generate
b) memref.from_elements
The exact naming of these ops could be different. The key point is that both have the same semantics as their respective tensor counterparts, but operate on a memref value that is passed into the op as an additional operand.
Example:
func @tensor.generate_static_and_dynamic(%arg0: index) -> tensor<16x?xindex> {
%result = tensor.generate %arg0 {
^bb0(%i: index, %j: index):
%sum = arith.addi %i, %j : index
tensor.yield %sum : index
} : tensor<16x?xindex>
return %result : tensor<16x?xindex>
}
During bufferization, this would lower to:
func @tensor.generate_static_and_dynamic(%arg0: index) -> memref<16x?xindex> {
%result = memref.alloc(%arg0) : memref<16x?xindex>
memref.generate %result {
^bb0(%i: index, %j: index):
%sum = arith.addi %i, %j : index
memref.yield %sum : index
} : memref<16x?xindex>
return %result : memref<16x?xindex>
}
Another pass could then lower the memref.generate
to scf.parallel
or scf.for
. We currently have BufferizeGenerateOp
, which lowers from tensor.generate
all the way to memref.alloc
+ scf.parallel
, so we would effectively split this pattern into two patterns and add a new op for the intermediate step.
What are your opinions on this?