Here is our proposed design for grouping ops in TCP: [Public] TCP Design - Groups
Please feel free to add your feedback in the document or in this thread.
Thanks,
Raghavan
(on behalf of the ML Compiler Team at Cruise)
Once you have a tcp.group of tcp ops, what do you lower them to?
If you lower them to linalg (which doesn’t have a fusion op), how does that guarantee that I can fuse them later? What is the difference between lowering the same sequence of tcp ops outside of a tcp.group and inside?
If the lowering is just a sequence of linalg.generic, then they can still get reordered before the pass that packs them.
Once we pack and block our tensors, we create parallel loops to add tile linalg ops in the inner loop, which can still be fused together with ops outside of the nested loops. If we could keep tcp.isolated_group with linalg ops inside, we wouldn’t need to look at loop iterations at all, and could pack, tile and fuse inside the group before removing it.
So the last cleanup pass would just remove tcp.isolated_groups because they’ve done their part and have no further semantics after lowering.
Is that the plan for those ops?
It is possible to lower the ops inside the groups in different ways, depending on the use case. Here are some examples of lowering ops in tcp.isolated_group:
- Lowering to linalg.generic. This should have a similar effect to running the pass --linalg-fuse-elementwise-ops on the linalg.generic ops corresponding to the ops inside the group.
- For a conv-relu fusion, it can be lowered to call the corresponding cuDNN API.

That is possible too. We could lower only the region inside a group op to linalg, which will give you what you need IIUC.
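To make the elementwise-fusion effect mentioned in the lowering examples above a bit more concrete, here is a rough Python sketch. It is not TCP or MLIR code. The names `run_unfused`, `fuse`, `run_fused`, `add_one` and `relu` are all made up for illustration, and the "group" is just a Python list of per-element functions. The only point is that fusing the ops of a group yields one loop with a composed body instead of one loop (and one intermediate buffer) per op, while the result stays the same.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for an elementwise op: maps one element to one element.
ElementwiseOp = Callable[[float], float]


def run_unfused(ops: Sequence[ElementwiseOp], xs: Sequence[float]) -> List[float]:
    """Apply each op in its own pass, materializing an intermediate list per op."""
    current = list(xs)
    for op in ops:
        current = [op(x) for x in current]
    return current


def fuse(ops: Sequence[ElementwiseOp]) -> ElementwiseOp:
    """Compose the ops of a group into a single per-element body."""
    def body(x: float) -> float:
        for op in ops:
            x = op(x)
        return x
    return body


def run_fused(ops: Sequence[ElementwiseOp], xs: Sequence[float]) -> List[float]:
    """Apply the fused body in a single pass, with no intermediate lists."""
    body = fuse(ops)
    return [body(x) for x in xs]


def add_one(x: float) -> float:
    return x + 1.0


def relu(x: float) -> float:
    return max(x, 0.0)


# A "group" here is simply the ordered list of ops that should be fused together.
group = [add_one, relu]
data = [-3.0, -1.0, 0.5]

# Fused and unfused execution must agree on the result.
assert run_fused(group, data) == run_unfused(group, data)
```

In this toy model the group boundary is what tells `fuse` which ops belong together. Nothing outside the list can be pulled into the fused body, which is roughly the guarantee being asked about in this thread.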