I did some search but couldn’t find a solution on this forum. I have a case that I’d like to represent a computation in LinAlg that performs computation on subtensors rather than elements. For example, the computation defined by the following linalg_on_elements is on each element. What I want instead is something like linalg_on_subtensors where LinAlg only represents the outer most loop and the computation is performed on subtensors. Is this a scenario supported by LinAlg? Any pointer is appreciated.
#map = affine_map<(d0, d1, d2) > (d0, d1, d2)>
func @linalg_on_elements(%arg0: tensor<10x512x512xbf16>, %arg1: tensor<10x512x512xbf16>) > tensor<10x512x512xbf16> {
%0 = linalg.init_tensor [10, 512, 512] : tensor<10x512x512xbf16>
%1 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel", "parallel"]} ins(%arg0, %arg1 : tensor<10x512x512xbf16>, tensor<10x512x512xbf16>) outs(%0 : tensor<10x512x512xbf16>) {
^bb0(%arg2: bf16, %arg3: bf16, %arg4: bf16): // no predecessors
%2 = addf %arg2, %arg3 : bf16
linalg.yield %2 : bf16
} > tensor<10x512x512xbf16>
return %1 : tensor<10x512x512xbf16>
}
#map0 = affine_map<(d0) > (d0)>
func @linalg_on_subtensors(%arg0: tensor<10x512x512xbf16>, %arg1: tensor<10x512x512xbf16>) > tensor<10x512x512xbf16> {
%0 = linalg.init_tensor [10, 512, 512] : tensor<10x512x512xbf16>
%1 = linalg.generic {indexing_maps = [#map0, #map0, #map0], iterator_types = ["parallel"]} ins(%arg0, %arg1 : tensor<10x512x512xbf16>, tensor<10x512x512xbf16>) outs(%0 : tensor<10x512x512xbf16>) {
^bb0(%arg2: tensor<512x512xbf16>, %arg3: tensor<512x512xbf16>, %arg4: tensor<512x512xbf16>): // no predecessors
%2 = addf %arg2, %arg3 : tensor<512x512xbf16>
linalg.yield %2 : tensor<512x512xbf16>
} > tensor<10x512x512xbf16>
return %1 : tensor<10x512x512xbf16>
}
Hi @ruizhang
The general problem of representing nesting in linalg ops does not have a good solution atm.
You could achieve something similar in spirit by using tensor<10xvector<...>>
but this is not satisfactory for your use case as you will get spills faster than you can blink an eye.
The general problem of (a) nested op semantics with (b) attributes that feel natural to use and (c) implicitly extract/insert subtensors in the generality we need; is surprisingly involved.
Atm the closest resembling thing is linalg.tiled_loops
with explicit insert/extract slice, but this comes with its own challenges. See [RFC] Add Linalg TileOp and [RFC] Changes to linalg::TiledLoopOp to unblock reductions for more context.
Also see Encapsulating linalg.convolution inside linalg.generic Region for a previous question on the topic that seems similar to yours.
The viable approaches for more hierarchical representations IMO are:

tensor<...x vector<...x>>
: your vectors better not be too big or spills will occur. As you can imagine you’ll have issues at the boundaries (i.e. tensor<123xf32>
cannot be written as tensor<16xvector<?xf32>>
).
 “going higherD”
tensor<16xf32>
→ tensor<4x4f32>, you will have the same boundary issues as before tensor<123xf32>
→ tensor<16x?xf32>
is technically a valid type but the runtime value of ?
is not the same for every tile so instead you need to:
 using looping constructs (
scf.for
or linalg.tiled_loop
) with explicit insert/extract operations
Longerterm, the representation issue seems like it could be solved with better firstclass support of nested types tensor<16xtensor<?xf32>>
. This is not equivalent to tensor<16x?xf32>
because we now have 16 different runtime values for encoding the ?
which is much more flexible.
It will take a nontrivial amount of time before we can get there and extend operations and transformations to support this. But it will also unlock more general codegen and algorithms such as ragged arrays and indirect algorithms ([1907.02129] The Indirect Convolution Algorithm).
Hi @nicolasvasilache, thanks very much for your response. I have some follow up questions and hope you can kindly help.
 using looping constructs (
scf.for
or linalg.tiled_loop
) with explicit insert/extract operations
About your above comment, since scf.for
or linalg.tiled_loop
are looping constructs different from linalg.generic
's implicit looping structure, has there been any thought on unifying the expression of these looping structure in linalg.generic
? This feels related to the nested tensor type that you mentioned too. I notice LinAlg tiling produces scf.for
or linalg.tiled_loop
too and wonder if unifying the looping constructs is considered beneficial for loop transformations on LinAlg layer.
Yes there has and the result of our best efforts here is linalg.tiled_loop
.
To make it more concrete, we found it quite challenging to come up with an intuitive set of attributes that would make subtensors implicit with the flexibility we need (including transposes, sliding windows etc).
There is a nasty underlying inverse problem that we could not find a good solution for (I wrote a little about it somewhere I’ll link when I am able to find it).
Everything becomes either (a) very verbose and hard to make sense of or (b) very limited in expressiveness.
At which point the alternatives we are left with are:
 use a higherdimensional linalg.generic on either scalar or vector (this has many boundary issues)
 use a linalg.generic on tensors and keep botthe information of “original linalg.generic on scalars” + “tiling/interchange/fusion transformation that is applied to the scalar version to get to the tiled version” as an attribute.
 start thinking about more general codegen of nested
tensor<...tensor<...>>
and richer data type abstractions. This will also have to come with a counterpart on the nested controlflow side. This is what we are calling “Structured and Retargetable Codegen” internally, and the road is pretty long until we have something to show.
At this point in spacetime, I don’t think there is a significantly better way than 3.
But for now, explicit looping around smaller linalg.generic works quite well within the limits of what it can represent.
1 Like