[RFC] Grouping ops in TCP

Here is our proposed design for grouping ops in TCP: [Public] TCP Design - Groups

Please feel free to add your feedback in the document or in this thread.

Thanks,
Raghavan
(on behalf of the ML Compiler Team at Cruise)

Once you have tcp.group of tcp ops, what do you lower them to?

If you lower them to linalg (which doesn’t have a fusion op), how does that guarantee that I can fuse them later? What is the difference between lowering the same sequence of tcp ops outside of a tcp.group and inside?

If the lowering is just a sequence of linalg.generic, then they can still get reordered before the pass that packs them.

Once we pack and block our tensors, we create parallel loops and place the tiled linalg ops in the inner loop, but those ops can still be fused with ops outside the nested loops. If we could keep tcp.isolated_group with linalg ops inside, we wouldn’t need to look at loop iterations at all, and could pack, tile and fuse inside the group before removing it.

So the last cleanup pass would just remove tcp.isolated_groups because they’ve done their part and have no further semantics after lowering.

Is that the plan for those ops?
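To make the flow in the question above concrete (transform only inside the group, then drop the group in a final cleanup), here is a minimal toy sketch in Python. It is not MLIR and not the TCP implementation: `Op`, `Group`, `transform_in_groups`, `remove_groups` and `fuse_all` are all hypothetical names invented for illustration, and `fuse_all` merely stands in for whatever pack/tile/fuse passes would run.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class Op:
    name: str  # illustrative only, e.g. "linalg.generic"


@dataclass
class Group:
    # Hypothetical stand-in for tcp.isolated_group: a region holding ops.
    body: List[Op] = field(default_factory=list)


Node = Union[Op, Group]


def transform_in_groups(
    program: List[Node], transform: Callable[[List[Op]], List[Op]]
) -> List[Node]:
    """Apply `transform` to each group body; ops outside groups are untouched."""
    return [
        Group(transform(node.body)) if isinstance(node, Group) else node
        for node in program
    ]


def remove_groups(program: List[Node]) -> List[Op]:
    """Final cleanup: inline each group's body in place and drop the wrapper."""
    flat: List[Op] = []
    for node in program:
        if isinstance(node, Group):
            flat.extend(node.body)
        else:
            flat.append(node)
    return flat


def fuse_all(ops: List[Op]) -> List[Op]:
    """Hypothetical stand-in for pack + tile + fuse: merge a body into one op."""
    return [Op("fused(" + "+".join(op.name for op in ops) + ")")]


program = [Op("a"), Group([Op("b"), Op("c")]), Op("d")]
lowered = remove_groups(transform_in_groups(program, fuse_all))
print([op.name for op in lowered])
```

The point of the sketch is only that the group boundary tells the transform which ops belong together, so nothing has to be rediscovered from loop structure, and the wrapper carries no meaning once the transform has run.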

It is possible to lower the ops inside the groups in different ways, depending on the use case. Here are some examples of lowering ops in tcp.isolated_group:

  • If the group indicates how a graph is partitioned for execution across multiple accelerators, then the operations inside it can be lowered via a device-specific pipeline (e.g. linalg-based codegen for CPU vs. linalg-based codegen for GPU).
  • If the group represents an elementwise fusion and needs to be lowered to linalg, we could lower it to a single linalg.generic. This should have a similar effect to running the pass --linalg-fuse-elementwise-ops on the linalg.generic ops corresponding to the ops inside the group.
  • If the group represents a conv-relu fusion, it can be lowered to a call to the corresponding cuDNN API.
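As a rough illustration of the three cases above, the choice of lowering can be thought of as a dispatch on what the group represents. The sketch below is a toy in Python, not part of TCP: `GroupKind`, `choose_lowering` and the returned strategy labels are hypothetical names made up for this example, and how a group would actually record its purpose is an assumption.

```python
from enum import Enum, auto


class GroupKind(Enum):
    # Hypothetical tags for what a group represents.
    DEVICE_PARTITION = auto()
    ELEMENTWISE_FUSION = auto()
    CONV_RELU_FUSION = auto()


def choose_lowering(kind: GroupKind, device: str = "cpu") -> str:
    """Return an illustrative label for the lowering strategy of a group."""
    if kind is GroupKind.DEVICE_PARTITION:
        # Body goes through a device-specific pipeline.
        return f"linalg-codegen-{device}"
    if kind is GroupKind.ELEMENTWISE_FUSION:
        # Whole body becomes one linalg.generic.
        return "single-linalg-generic"
    if kind is GroupKind.CONV_RELU_FUSION:
        # Body is replaced by a vendor library call.
        return "cudnn-call"
    raise ValueError(f"unknown group kind: {kind}")


print(choose_lowering(GroupKind.DEVICE_PARTITION, device="gpu"))
print(choose_lowering(GroupKind.ELEMENTWISE_FUSION))
```

In a real pipeline each branch would be a separate pass or pattern set rather than a string label; the sketch only shows that the group op is the unit on which that decision is made.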

That is possible too. We could lower only the region inside a group op to linalg, which will give you what you need IIUC.
