Some additional comments on the specifics of the ops proposed. I think the utility of new ops comes from two directions:
- Do they provide readability, convenience, or less information loss for lowerings into linalg? For this, we look at a few of the usual suspects (and you are specifically looking at advanced fusion heuristics at the high level).
- Do they enable better mapping to specializations in the field (e.g. library calls, hardware blocks, etc.)?
I don’t think this is going to be a generally useful op when lowering to libraries or microkernels, so I would put it in the frontend convenience category. The reason is that common targets for these things actually match a fused `linalg.generic` defined over projected permutations quite well: in such targets, you typically have strides available for free, and the elementwise-fused form is what you want (i.e. broadcasts and transposes get “folded in”). I don’t love pattern matching `linalg.generic` for such cases, but it is actually the right representation for the use cases I see. Separate from this proposal, I wonder whether, if we had a better name/spelling for such n-ary, parallel, projected-permutation mappings, that wouldn’t be the right op to define. I’m not sure how to define that formally with more rigor than `generic` is defined now, though. It is more that I’ve seen a lot of code that does the same preamble dance to identify whether an op has that form.
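To make the “preamble dance” concrete, here is a minimal sketch of the kind of check I mean, in plain Python rather than against the real MLIR API. Everything in it is an assumption for illustration: indexing maps are modeled as lists of loop indices, and `is_projected_permutation`, `is_elementwise_projected_form`, and `loop_strides` are hypothetical helper names, not existing functions.

```python
from typing import List, Optional, Sequence

# Hypothetical model, not an MLIR API: an indexing map is a sequence with one
# entry per operand dimension. Each entry is the loop index it reads (so
# [1, 0] models (d0, d1) -> (d1, d0)), or None for anything that is not a
# plain loop dimension (a constant, d0 + d1, ...).
IndexingMap = Sequence[Optional[int]]


def is_projected_permutation(results: IndexingMap, num_loops: int) -> bool:
    """True if every result is a distinct, in-range loop dimension."""
    seen = set()
    for r in results:
        if r is None or not 0 <= r < num_loops or r in seen:
            return False
        seen.add(r)
    return True


def is_elementwise_projected_form(
    iterator_types: Sequence[str], indexing_maps: Sequence[IndexingMap]
) -> bool:
    """Preamble check: all-parallel loops and projected-permutation maps."""
    num_loops = len(iterator_types)
    if not indexing_maps or any(t != "parallel" for t in iterator_types):
        return False
    if not all(is_projected_permutation(m, num_loops) for m in indexing_maps):
        return False
    # Assumption: the last map is the output and must cover every loop, so
    # each output element is written exactly once.
    return len(indexing_maps[-1]) == num_loops


def loop_strides(
    operand_strides: Sequence[int], results: Sequence[int], num_loops: int
) -> List[int]:
    """Fold an operand's map into per-loop strides; 0 marks a broadcast loop."""
    strides = [0] * num_loops
    for operand_dim, loop_dim in enumerate(results):
        strides[loop_dim] = operand_strides[operand_dim]
    return strides


# out[i][j] = a[j][i] + b[j]: a transpose and a broadcast fused into one op.
maps = [[1, 0], [1], [0, 1]]
fused_ok = is_elementwise_projected_form(["parallel", "parallel"], maps)
a_strides = loop_strides((3, 1), maps[0], 2)  # a is 4x3, row-major
b_strides = loop_strides((1,), maps[1], 2)    # b has 4 elements
```

With these inputs the check passes, `a_strides` is `[1, 3]` (the transpose is just a stride swap), and `b_strides` is `[0, 1]` (the broadcast is a zero stride). That is the sense in which a strided library call or microkernel gets these for free.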
Naming: Elsewhere, we already have the `ElementwiseMappable` trait, which captures the semantics. Maybe `linalg.elementwise_map` for consistency? There is also the `ElementwiseToLinalg` pass, which presently converts anything with this trait to a `linalg.generic`. At a minimum, that should be updated to build this new op.
Even if they get fused later on, I would like a better representation for `transpose` specifically, even if just for pure readability at the frontend->linalg level. The same comment as for `linalg.map` above applies: in real programs it is rare that this should survive as a standalone op, because `generic` represents its trivial fusions (and those are what you want to lower), but I think the ergonomic benefit makes it worth it.
I’m not actually sure what you are proposing here. It looks like a fully static form of HLO’s `broadcast_in_dim`? As written, I believe it can support dynamic broadcasting, but it does not allow the ambiguity where an unknown dimension triggers expansion.
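To spell out my reading, here is a rough shape-level sketch in plain Python. It is only an assumption about the intended semantics, not the proposal’s definition: `broadcast_shapes_ok` and its `dimension_map` argument are hypothetical names, and `None` stands in for a dynamic extent.

```python
from typing import Optional, Sequence

Dim = Optional[int]  # None models a dynamic (unknown) extent


def broadcast_shapes_ok(
    operand_shape: Sequence[Dim],
    result_shape: Sequence[Dim],
    dimension_map: Sequence[int],
) -> bool:
    """Check a broadcast where dimension_map[i] is the result dimension that
    operand dimension i maps to (hypothetical encoding).

    Assumed rule: mapped extents must agree. A size-1 operand dimension never
    expands, so nothing has to be decided at runtime for unknown extents.
    """
    if len(dimension_map) != len(operand_shape):
        return False
    if len(set(dimension_map)) != len(dimension_map):
        return False  # two operand dims cannot map to the same result dim
    for operand_dim, result_dim in enumerate(dimension_map):
        if not 0 <= result_dim < len(result_shape):
            return False
        src, dst = operand_shape[operand_dim], result_shape[result_dim]
        # Dynamic extents are accepted statically and assumed equal at runtime.
        if src is not None and dst is not None and src != dst:
            return False
    return True


static_ok = broadcast_shapes_ok((3,), (2, 3), [1])           # plain broadcast
dynamic_ok = broadcast_shapes_ok((None,), (2, None), [1])    # dynamic extent
size1_expand = broadcast_shapes_ok((1,), (2, 3), [1])        # rejected here
```

The last case is the difference I am pointing at: HLO’s `broadcast_in_dim` lets a size-1 operand dimension expand to the result extent, whereas under this reading the op only ever adds dimensions, so a dynamic extent can never turn out to mean “expand”.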
I’m going to leave this to the better qualified.
I think this is only partially true: perhaps at the mid-level we are over-pattern matching, but there are a lot of cases (even library calls) where `generic` is the most direct match. Yes, you have to pattern match cases out of it, but the associations don’t exist otherwise. Especially at the higher levels, though, less information loss is better, so I can get behind the “missing layer of ops” analysis. Ideally, I would always be able to do access-pattern fusion and get into generic form for some code generation or lowering flows, but your analysis is that we don’t need to start there. (I think I’m just restating what you said in my own words, so please tell me if we are saying something different.)