[RFC] Adding an operation to construct small tensors of dynamic size

In the context of lowering the shape dialect, we have come across the issue that we need to create a small tensor of index values with dynamic length to represent the shape of an unranked array. Currently, we use a workaround like

// Allocate a buffer with room for %0 index values (one per dimension).
%1 = alloca(%0) : memref<?xindex>
scf.for %arg1 = %c0 to %0 step %c1 {
  // Query the extent of dimension %arg1 and store it into the buffer.
  %9 = dim %arg0, %arg1 : tensor<*xf32>
  store %9, %1[%arg1] : memref<?xindex>
}
// Turn the buffer back into a tensor (semantically a copy).
%2 = tensor_load %1 : memref<?xindex>

to model this. However, this introduces memrefs at an early stage, which we would like to avoid. It is also difficult to lower this correctly, as semantically the tensor_load makes a copy of the provided memref, leading to extra allocations in the lowering that we would need to optimize away.

For statically sized tensors, the standard dialect already has a tensor_from_elements operation, but that cannot be used here, as it takes one operand per element and hence fixes the number of elements statically.
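For context, its use looks roughly like this (a sketch; the exact assembly syntax may differ from what is shown here):

// One operand per element, so the result size must be known statically.
%shape = tensor_from_elements(%a, %b, %c) : tensor<3xindex>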

We could model this similarly to LLVM by introducing an std.undef operation that creates a tensor of undefined values and an std.insert_element operation to fill said tensor. However, that is also inconvenient to lower, as it requires some analysis to avoid temporaries, and it would introduce undefined values into the IR.
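For illustration, modeling the construction that way might look roughly as follows (a sketch; undef and insert_element are hypothetical ops that do not exist in std), with the partially initialized tensor threaded through the loop as an SSA value:

// Hypothetical ops: undef creates a tensor of undefined values,
// insert_element yields a new tensor with one element replaced.
%t0 = undef(%0) : tensor<?xindex>
%filled = scf.for %arg1 = %c0 to %0 step %c1
                  iter_args(%t = %t0) -> (tensor<?xindex>) {
  %9 = dim %arg0, %arg1 : tensor<*xf32>
  %t_next = insert_element %9, %t[%arg1] : tensor<?xindex>
  scf.yield %t_next : tensor<?xindex>
}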

Instead, we could have a more tailored operation like

%2 = generate_tensor %0 {
  ^bb0(%arg1 : index):
    %9 = dim %arg0, %arg1 : tensor<*xf32>
    yield %9 : index
} : tensor<?xindex>

The operands to generate_tensor specify the dynamic sizes of the result, and the body region defines the value at each position of the tensor.
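For illustration, a straightforward lowering of the example above could produce essentially the buffer-based code from the beginning of this post (a sketch, reusing the names from above):

// Materialize the body of generate_tensor once per element of the result.
%buf = alloca(%0) : memref<?xindex>
scf.for %arg1 = %c0 to %0 step %c1 {
  %9 = dim %arg0, %arg1 : tensor<*xf32>
  store %9, %buf[%arg1] : memref<?xindex>
}
%2 = tensor_load %buf : memref<?xindex>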

This form is fairly easy to lower (it maps directly to an scf.for nest, as sketched above) and also enables nice canonicalizations with extract_element. For example, we have the following use case:

%2 = generate_tensor %0 {
  ^bb0(%arg1 : index):
    %9 = dim %arg0, %arg1 : tensor<*xf32>
    yield %9 : index
} : tensor<?xindex>
%3 = dim %2, %c0 : tensor<?xindex>
%4 = scf.for %arg1 = %c0 to %3 step %c1
             iter_args(%arg2 = %c1) -> (index) {
  %9 = extract_element %2[%arg1] : tensor<?xindex>
  %10 = muli %9, %arg2 : index
  scf.yield %10 : index
}

where the extract_element can be trivially canonicalized into dim %arg0, %arg1 (assuming extract_element has undefined behavior for out-of-bounds reads).
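For illustration, after that canonicalization the loop would read roughly as follows (a sketch, assuming the dim %2, %c0 on the generate_tensor result can likewise be folded to its size operand %0):

%4 = scf.for %arg1 = %c0 to %0 step %c1
             iter_args(%arg2 = %c1) -> (index) {
  // The extract_element is replaced by the body of the generate_tensor.
  %9 = dim %arg0, %arg1 : tensor<*xf32>
  %10 = muli %9, %arg2 : index
  scf.yield %10 : index
}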

This way, we also avoid undefined values and do not have to tackle optimizing updates on tensors.


+1, this would definitely be useful for progressive lowering.

Also, this seems like a (very) special case of linalg.indexed_generic?

That is true. Using linalg for this seems a little heavy though and I’d like to avoid the dependency.

This is phrased as a mutation. I’m assuming it is more of

%init = tensor.create : tensor<0xindex>
%result = scf.for %arg1 = %c0 to %0 step %c1
                  iter_args(%arg2 = %init) -> (tensor<?xindex>) {
  %9 = dim %arg0, %arg1 : tensor<*xf32>
  %10 = tensor.append %arg2, %9 : tensor<?xindex>
  scf.yield %10 : tensor<?xindex>
}

? As in, you start with a 1-d index tensor with 0 elements and append to it (although one would actually know the eventual size of the result up front, and the above doesn’t capture that).

Isn’t this more generally a map? (I think shape.reduce could also represent this computation, albeit in a cumbersome way: one would start with something like a “constant_like(1, rank(%0))” and then aggregate by multiplying along a dimension, but there isn’t a constant_like op yet.)

Regarding map: you would map over some range and produce values of some type that all get concatenated together. For example, if your input is %0 : tensor<10x20xindex>, how do we know how it is sliced? I’m assuming by the type of the block argument, so that index there says we are elementwise and produce a 10x20; but if we had tensor<10xindex>, then the result would be something with shape 20? Or else just limit the body to scalar types. On entry to the op you know the result shape.

The second part of @herhut’s example is the result of lowering shape.reduce. The problematic part is the lowering of shape.shape_of.

I am not sure if this is what @jpienaar meant. It is certainly nice to think about this op as a map from the index space to element values. The op’s arguments define the dynamic sizes of the resulting tensor (like they would in alloc), and the block arguments span its index space. For a tensor of rank greater than one, this could look as follows:

%tnsr = generate_tensor %m, %n {
  ^bb0(%i : index, %j : index, %k : index):
    ...
    yield %9 : f32
} : tensor<?x3x?xf32>

That is indeed what I would like to avoid. Instead of gradually building the tensor, I want to create it in its final size and then gradually fill it without exposing the tensor as an SSA value before it is fully constructed.

Yes, this can be seen as a map(iota(shape)), where iota(shape) describes your indexing space. I would not want to decompose it that way, though, for code-generation simplicity.

I want to keep this simple and only allow creation of tensors in an element-wise fashion. After all, this is not meant as a general compute primitive in the sense of LinAlg, but rather as a way to create small meta-data structures. So, for now, I would enforce that the body region yields scalar elements.

See https://reviews.llvm.org/D86276