As a rule of thumb, a single `linalg.generic` models a perfectly nested computation.
While you may technically be able to express this as a single perfectly nested loop nest with proper conditionals, in practice we have found this approach to be counter-productive.
In this case, I see 2 possibilities forward:
- write this as (affine) loops and implement the proper analyses and transformations to get the output you want
- introduce a scan / segmented scan op on tensor/buffer and plug it into the system. Note that we have such an op at the vector level: VectorOps.td - llvm/llvm-project - Sourcegraph.
If you go with the second option, you can then rewrite this using the tensor comprehensions notation (assuming inplace buffers):
```
out(i) = 0;          // fill
out(i) += inp(i, j)  // reduce
scan_op(out)         // scan
```
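To make the decomposition concrete, here is a plain-Python sketch of the fill / reduce / scan pipeline on a 2-D input. The function name and structure are illustrative only, not actual linalg ops:

```python
def fill_reduce_scan(inp):
    """Sketch of the three-op pipeline: fill, row-reduce, then scan.

    `inp` is a list of rows (a 2-D "tensor"); the steps mirror the
    tensor-comprehension pseudocode above, not real linalg op names.
    """
    rows = len(inp)
    # fill: out(i) = 0
    out = [0] * rows
    # reduce: out(i) += inp(i, j)
    for i in range(rows):
        for x in inp[i]:
            out[i] += x
    # scan: inclusive prefix sum over out
    for i in range(1, rows):
        out[i] += out[i - 1]
    return out
```

Each of the three steps corresponds to one op, which is what makes them individually tileable and fusable.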
The reasons you may want such a scan op at the tensor/buffer level are:
- it can lower unsurprisingly to the vector abstraction, for which we have (or at least expect to have, I haven’t dug yet) good retargetable lowerings.
- its semantics carry the information that the scan can be computed in a logarithmic number of steps, which is not trivial to recover from the loop domain.
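The logarithmic-depth point can be illustrated with a Hillis–Steele-style prefix sum (a plain-Python sketch, not a lowering the compiler actually emits): each outer step doubles the shift, so only ceil(log2(n)) data-parallel steps are needed, whereas the sequential loop form hides this structure.

```python
def scan_log_depth(xs):
    """Inclusive prefix sum via the Hillis–Steele scheme.

    All additions at a given shift are independent, so each outer
    iteration is one parallel step; the shift doubles every time,
    giving O(log n) depth. This is the structural information a
    first-class scan op carries that a sequential loop nest does not.
    """
    out = list(xs)
    shift = 1
    while shift < len(out):
        out = [out[i] + (out[i - shift] if i >= shift else 0)
               for i in range(len(out))]
        shift *= 2
    return out
```

A sequential scan computes the same result in n - 1 dependent additions; recovering the parallel schedule from that loop requires a non-trivial recurrence analysis.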
More generally, I think there is design space for another type of “generic” op with a region and logarithmic iteration spaces. My intuition is that it does not compose with reductions (at least as we currently represent them; I have not dug into this topic deeply enough).
Note that a pragmatic choice here—rather than designing the next level of generalization along this axis—is to just use ops with the proper interfaces that connect to the vector level, see e.g. IREE’s LinalgExtOps.td - iree-org/iree - Sourcegraph.
Deriving implementations that are more or less fused/parallel/vectorized can then be done with transformations.
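As a toy illustration of deriving a more fused implementation by transformation, the fill / reduce / scan ops can be collapsed into a single sequential pass that folds each row sum into a running prefix total (again plain Python, a hand-written sketch rather than actual compiler output):

```python
def fused_reduce_scan(inp):
    """Fuse fill + row-reduce + scan into one sequential pass.

    Produces the same result as running the three ops separately, but
    each row's sum is consumed by the scan as soon as it is produced,
    so no intermediate row-sum buffer is materialized.
    """
    out = []
    running = 0
    for row in inp:
        running += sum(row)  # fill + reduce for this row
        out.append(running)  # the scan folds the sum in immediately
    return out
```

This trades the parallel/logarithmic schedule for locality; which variant wins is exactly the kind of decision the transformations should make, not the op definition.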
The above addresses representing the scan as a single `linalg.generic` vs. multiple ops. The TL;DR is: use multiple ops.
Separately, there is the question of having an aggregate op, e.g. in your case a single `xxx.mbbs` op in some dialect. This is very similar to @harsh-nod 's great work on Winograd convolutions (in his presentation he also goes through the tradeoffs of implementing this as a single op vs. a sequence of ops).
In both cases, the outcome seems to be that we need a first-class “sequence” or “dag” construct that we can use to build new ops, so that transformations like tiling and fusion carry over to them.
My intuition is that there isn’t a single generic construct that represents all cases, but making the process simpler and more automated would be very useful.