[RFC] Introduce new pass/transform: fusion by diffusion

Background

Several passes/transforms related to fusion already exist in MLIR, for example:

  1. Pass: ElementwiseOpFusion.
  2. Transform: scf::tileConsumerAndFuseProducer.

However, these solutions have limitations. For example, the ElementwiseOpFusion pass assumes GenericOp and only supports elementwise ops. Besides, both of them run the fusion flow from consumer to producer, in which case the consumer is responsible for generating the outer loops (a.k.a. the iteration domain) and forces the producer to adapt to them. As is well known, in most deep learning workloads, computation-intensive ops such as matmul or convolution (so-called contractions) take up most of the execution time. As a result, many developers prefer to implement them as hand-written templates with specific algorithms, particularly concerning how to tile the iteration domain and dispatch efficient kernels across multiple threads for either GPU or CPU. In this use case, it seems unreasonable to tile a producer matmul within common outer loops generated by a consumer relu. On the contrary, it is better to fuse the consumer relu into the tiled producer matmul. Recently, another useful interface named scf::tileAndFuseConsumerOfSlice, contributed to the community as the counterpart of the existing scf::tileAndFuseProducerOfSlice, has made this possible.

Meanwhile, up to now there has been no combined pass/transform that fuses both the producers and the consumers of one contraction op within an arbitrary nested loop structure.

Take a classic MLP example consisting of pack+fill+matmul+broadcast+add+relu:

 input   weight
   |       /
   |     pack
    \    /    bias
   matmul    /
      \    broadcast 
       \    /
          add 
           |
          relu

where the matmul has already been deeply tiled with a complex nested loop structure before fusion, e.g.:

%input = ...
%weight = tensor.pack (...)
%dest = linalg.fill (...)
%mm = scf.forall (...) {
  %extract_slice_0 = tensor.extract_slice %input [...]
  %extract_slice_1 = tensor.extract_slice %weight [...]
  %extract_slice_2 = tensor.extract_slice %dest [...]
  %0 = scf.for (...) {
    %1 = scf.for (...) {
      %extract_slice_3 = tensor.extract_slice %extract_slice_0 [...]
      %extract_slice_4 = tensor.extract_slice %extract_slice_1 [...]
      %extract_slice_5 = tensor.extract_slice %extract_slice_2 [...]
      %tiled_mm = linalg.matmul ins(%extract_slice_3, %extract_slice_4) outs(%extract_slice_5)
      %insert_slice_0 = tensor.insert_slice %tiled_mm into ...
      scf.yield %insert_slice_0
    }
    scf.yield %1
  }
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %0 into ...
  }
}
%add = linalg.add ins(%mm, ...) outs(...)

Motivation

We intend to fuse all the other ops (pack+fill+broadcast+add+relu) into this deeply tiled loop nest in one pass/transform.

Terminology

To better explain the algorithm below, we first introduce some necessary terminology.

pre-op and post-op fusion

pre-op fusion: fuse a producer into its tiled consumer.
post-op fusion: fuse a consumer into its tiled producer.

diffusion

The term is borrowed from the diffusion process described on Wikipedia. However, unlike the traditional physical definition or the well-known diffusion models in the generative AI field, diffusion here mainly describes the recursive process of pre-op/post-op fusing all ops surrounding one central op.

Algorithm

preOpFusion(OpOperand &tiledOperand)
  1. Collect the chain of extractSliceOps and the originalOperand of tiledOperand. In the above example:
    • extractSliceOp chain (from inner to outer): %extract_slice_3 = tensor.extract_slice %extract_slice_0 [...] → %extract_slice_0 = tensor.extract_slice %input [...]
    • originalOperand: %input
  2. Check the boundary: if the producer of originalOperand is tilable and not itself a contraction, continue; otherwise, stop.
  3. Filter all candidates by Linalg semantics and select the best candidate slice op via a cost model based on the HAL (Hardware Analysis Layer). By default, select the smallest one as measured by the size of the extractSliceOp.
  4. Call tileAndFuseProducerOfSlice.
postOpFusion(OpResult tiledResult)
  1. Collect the chain of [parallel]insertSliceOps and the originalResult of tiledResult. In the above example:
    • [parallel]insertSliceOp chain (from inner to outer): %insert_slice_0 = tensor.insert_slice %tiled_mm into ... → tensor.parallel_insert_slice %0
    • originalResult: %mm
  2. Check the boundary: if the consumer of originalResult is tilable and not itself a contraction, continue; otherwise, stop.
  3. Filter all candidates by Linalg semantics and select the best candidate slice op via a cost model based on the HAL (Hardware Analysis Layer). By default, select the smallest one as measured by the size of the [parallel]insertSliceOp.
  4. Call tileAndFuseConsumerOfSlice.
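Step 1 of both routines walks a use-def chain of slice ops from the innermost operand outwards. The following is a minimal sketch of that collection step on a toy representation; SliceOp, collectSliceChain, and the string-named values are illustrative stand-ins, whereas the real implementation would walk the defining tensor.extract_slice / tensor.[parallel_]insert_slice ops of the SSA values:

```cpp
#include <string>
#include <vector>

// Toy stand-in for a slice op: it records only the value it defines and the
// value it slices from. Hypothetical; a real implementation would inspect
// the defining op of each SSA value instead.
struct SliceOp {
  std::string result;
  std::string source;
};

// Collect the *_slice chain from the tiled operand outwards until the source
// is no longer produced by a slice op; that source is the original operand.
std::vector<SliceOp> collectSliceChain(const std::vector<SliceOp> &sliceOps,
                                       std::string value,
                                       std::string &originalOperand) {
  std::vector<SliceOp> chain;
  bool found = true;
  while (found) {
    found = false;
    for (const SliceOp &s : sliceOps) {
      if (s.result == value) {
        chain.push_back(s); // chain is built in inner-to-outer order
        value = s.source;
        found = true;
        break;
      }
    }
  }
  originalOperand = value;
  return chain;
}
```

On the example above, starting from %extract_slice_3 this yields the two-element chain ending at %extract_slice_0, with %input as the original operand.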
diffusion(Operation *tiledOp)
  1. Prepare a deque<Operation*> initialized with tiledOp and a set<Operation*> to record which operations have already been tiled.
  2. Pop the front of the deque as the new central op.
  3. Execute preOpFusion:
    a. Fuse the producer(s) of the operands of the central op.
    b. On success, push the newly tiled op onto the back of the deque and insert the original op into the set.
  4. Execute postOpFusion:
    a. Fuse the consumer(s) of the results of the central op.
    b. On success, push the newly tiled op onto the back of the deque and insert the original op into the set.
  5. Repeat from step 2 until the deque is empty.
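The worklist loop above can be sketched as follows, on a toy DAG with string-named ops. Graph and fusionOrder are illustrative names; the actual fusion calls, legality checks, and the cost model are elided, so "fusing" a neighbor is reduced to marking it as tiled and enqueuing it:

```cpp
#include <deque>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy DAG: for each op, the ops feeding it (producers) and using it
// (consumers). Stand-in for the real def-use chains of TilingInterface ops.
struct Graph {
  std::map<std::string, std::vector<std::string>> producers;
  std::map<std::string, std::vector<std::string>> consumers;
};

// BFS-style worklist mirroring steps 1-5: pop a central op, attempt pre-op
// fusion (producers) then post-op fusion (consumers), and push every newly
// fused op onto the back of the deque exactly once.
std::vector<std::string> fusionOrder(const Graph &g, const std::string &root) {
  auto neighbors = [](const std::map<std::string, std::vector<std::string>> &m,
                      const std::string &k) {
    auto it = m.find(k);
    return it == m.end() ? std::vector<std::string>{} : it->second;
  };
  std::deque<std::string> worklist{root};
  std::set<std::string> fused{root};
  std::vector<std::string> order;
  while (!worklist.empty()) {
    std::string central = worklist.front();
    worklist.pop_front();
    order.push_back(central);
    for (const std::string &p : neighbors(g.producers, central))
      if (fused.insert(p).second) // pre-op fusion succeeded
        worklist.push_back(p);
    for (const std::string &c : neighbors(g.consumers, central))
      if (fused.insert(c).second) // post-op fusion succeeded
        worklist.push_back(c);
  }
  return order;
}
```

On the MLP example, starting from matmul this visits matmul, pack, fill, add, broadcast, relu; because the traversal is breadth-first, an op that joins two fused branches is only reached after both of its already-fused neighbors.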

Q & A

Q: Why collect the chain of *_slice ops from inner to outer rather than apply the existing transforms repeatedly from outer to inner?
A: Both tileAndFuseProducerOfSlice and tileAndFuseConsumerOfSlice are invoked with a *SliceOp, which in most transform usage is manually assigned before compile time. However, it is possible (and certainly legal) for a user to write an efficient template with multi-level *_slice ops, where it is hard to manually assign which one to use in the next round of application, especially in the tileAndFuseConsumerOfSlice case. Furthermore, collecting the chain is friendlier for a future cost model (perhaps a callback function) to select the better candidate slice op when the transform runs.

Q: How to fuse reduction ops?
A: In general, reduction fusion is blocked by tiling on the reduction dimension. Thanks to the chain of *_slice ops mentioned above, we can use Linalg semantics to keep only the valid candidates.

Q: Why not call the diffusion method recursively instead of using a deque?
A: There are two major reasons:

  1. When the number of operations becomes extremely large, recursion may cause a stack overflow.
  2. From the traversal point of view: recursion is DFS-style while the deque is BFS-style, which is safe with respect to dominance and yields more reasonable resultant IR.

Q: What is the difference between diffusion and a simple combination of tileAndFuseProducerOfSlice and tileAndFuseConsumerOfSlice?
A: scf::tileConsumerAndFuseProducer does use a deque internally to recursively fuse producers and producers of producers. However, it cannot fuse a consumer of a producer at the same time, e.g.:

producer2
  |        \
producer1 consumer2
   |
tiledConsumer

where consumer2 may not be fused in the end.

matmul    
   \   broadcast 
    \    /
     add 
      |
     relu

Similarly, in this sub-pattern of the above MLP example, although add+relu can be fused via multiple applications of tileAndFuseConsumerOfSlice, that approach would still miss the pre-op fusion of broadcast.

Moreover, diffusion can even deal with following complex topology:

contraction    
   \   op1   op3
    \    /  \    / 
     add     op2
      |       |
     relu   op4
       \     /
        op5

NOTE: this topology also explains why we prefer a BFS-like traversal, which ensures op5 is visited only after relu and op4.

Q: What if one op lies between two contraction ops? E.g.:

matmul
  ...
  Op
   |
matmul

Is this Op post-op fused with the first matmul or pre-op fused into the second one?
A: That depends on which matmul executes diffusion first. IMO, diffusion should be executed in topological order. In other words, post-op fusion has higher priority.

Open

It is an open question whether to make this a transform or a pass. I am not sure which one is better and look forward to your suggestions.

I would also be glad to hear comments or any other questions. Thanks a lot!


Overall this looks like just applying producer and consumer fusion iteratively, with additional terminology that brings mostly confusion (pun intended). There’s nothing wrong with this, but let’s not invent additional terminology where one exists already. Producer and consumer fusion are reasonably well-established terms, and there is no visible analogy with the diffusion process in physics here.

Also FYI, the sets of transitive producers(users) of an operation are called backward(forward) slice, respectively.

It’s not clear what “hardware analysis layer” this refers to, we have no such thing in MLIR as far as I know, and it doesn’t seem to be a part of the proposal.

It’s okay to have an iterative implementation of producer+consumer fusion starting from a given op with some cost model upstream IMO, preferably with a possibility to inject that cost model into the iteration logic, e.g., as a callback accepting the currently fused loop structure and a candidate. This can be implemented as a utility function and exercised in a (test) pass.
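For illustration, such an injectable control could be sketched as follows. FusionCandidate, ControlFn, and selectCandidate are hypothetical names, and a real callback would also receive the currently fused loop structure rather than just slice sizes:

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical candidate description: just the sizes of the slice through
// which a producer/consumer would be fused. A real candidate would carry the
// slice op and the current fused loop structure.
struct FusionCandidate {
  std::vector<int64_t> sliceSizes;
};

// User-injected cost model: returns true if fusing the candidate is
// considered profitable.
using ControlFn = std::function<bool(const FusionCandidate &)>;

// Pick, among the candidates accepted by the callback, the one with the
// smallest slice footprint (the default metric proposed in the RFC).
std::optional<FusionCandidate>
selectCandidate(const std::vector<FusionCandidate> &candidates,
                const ControlFn &isProfitable) {
  auto footprint = [](const FusionCandidate &c) {
    int64_t n = 1;
    for (int64_t s : c.sliceSizes)
      n *= s;
    return n;
  };
  std::optional<FusionCandidate> best;
  for (const FusionCandidate &c : candidates) {
    if (!isProfitable(c))
      continue;
    if (!best || footprint(c) < footprint(*best))
      best = c;
  }
  return best;
}
```

The iteration logic stays mechanical; only the callback encodes target- or pipeline-specific knowledge.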

There’s nothing wrong with this, but let’s not invent additional terminology where one exists already.

FYI, the sets of transitive producers(users) of an operation are called backward(forward) slice, respectively.

Thanks for this vital information! It is really useful for me to align with existing terms. Also, I think we can rename this pass to a clearer name in the future.

It’s not clear what “hardware analysis layer” this refers to, we have no such thing in MLIR as far as I know, and it doesn’t seem to be a part of the proposal.

Sorry for the confusion. It appears to be a part of IREE. Alternatively, you can treat it as the Target Description of this proposal. In any case, it just provides resource hints about the specific device so the cost model can make the final decision, such as which candidate to fuse, when the transform runs.

It’s okay to have an iterative implementation of producer+consumer fusion starting from a given op with some cost model upstream IMO, preferably with a possibility to inject that cost model into the iteration logic, e.g., as a callback accepting the currently fused loop structure and a candidate.

Yes, I agree and that is what I am trying to do.

contraction    
  \     op1   op3
   \    /  \    / 
    add     op2
     |       |
    relu   op4
      \     /
       op5

NOTE: this topology also explains why we prefer a BFS-like traversal, which ensures op5 is visited only after relu and op4.

I would caution against consumer fusions like this. If the producer relu and op4 operations have already been tiled, determining whether it is possible to fuse op5 is nontrivial. Taking your example,

%0:2 = scf.forall iter_args(%init0, %init1) {
  %1 = relu
  %2 = op4
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %1 into %init0 offsets[...] sizes[...]
    tensor.parallel_insert_slice %2 into %init1 offsets[...] sizes[...]
  }
}
%3 = op5 %0#0, %0#1

To then fuse op5 into the producer forall loop, you need to prove that the slices of its operands within the forall (%1 and %2) correspond to the same slice of op5’s iteration space. For parallel loop constructs like scf.forall, we could perhaps design “shuffle” like operations that reorganize thread local data across workers of the loop, but for serialized scf.for loops, such fusions are not possible. This is why if you are relying on this fusion to happen, it is most straightforward to start by tiling the root of the DAG first and fuse in producers.

FWIW, I agree with Alex that we don’t need to create new terms for the application of existing passes. While ML papers love the fancy terms, compiler folks tend to name passes like the Germans: IterativelyRunProducerConsumerFusionUntilExhaustion. :smile:

We already fuse producers and consumers into the GEMM-like operation in our project and it works fine. It's less of an iterative process and more of a pipeline construction process, but we're working to combine the two processes in one optimizer:

  1. Packing & distribution with arbitrary multi-level loop nesting (cache awareness)
  2. Arbitrary fusion (prod/cons) at non-perfectly nested loops (GEMM + EW)

It sounds like you’re working on a similar solution, it’d be good to iterate and see if we can find a good common ground.

Right. Any iterative / greedy process needs a guide, and this guide is indeed a cost model based on target descriptions.

Though, IREE isn’t MLIR, so if you want to use their HAL, you’ll have to implement that in IREE directly.

Our target description proposal should replace the “description” part and can be used directly in MLIR.

I’d love if IREE could use those descriptor in their HAL, which is one of the main goals of that proposal, that other projects use it in their own ways. But the ultimate goal is to find the common grounds between all cost models and implement a common infrastructure in MLIR proper (that again, can be reused downstream).


To then fuse op5 into the producer forall loop, you need to prove that the slices of its operands within the forall (%1 and %2) correspond to the same slice of op5’s iteration space.

Thanks for this good catch! IIUC, you are worried about a case like this:

op1  op2
  \   /
  op3

where op3 has indexing maps like {d1,d2}->{d1,d2}, {d1,d2}->{d2,d1}, and {d1,d2}->{d1,d2} for operand#0, operand#1, and result#0 of op3, respectively.

Assuming operand#0 and operand#1 have already been tiled within an scf loop as you illustrated above:

tensor.insert_slice %1 into %init0 offsets1[...] sizes1[...]
tensor.insert_slice %2 into %init1 offsets2[...] sizes2[...]

Although their offsets and sizes are completely identical, it is impossible to fuse op3 because it in fact expects offsets2 to be the permutation of offsets1.

The current solution is simply to give up the op5 fusion. As I mentioned above:

  1. filter all candidates by Linalg semantics and select the best candidate slice op by cost model

The candidate, either tensor.insert_slice %1 (candidate1) or tensor.insert_slice %2 (candidate2), is invalid for op5 and should be filtered out, possibly by computing the expected offsets2 based on offsets1 of candidate1 and then comparing with the real offsets2 of candidate2.
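That comparison could be sketched as follows, assuming the permutation implied by the consumer's indexing maps has already been extracted. slicesCompatible and the plain integer-vector representation are illustrative, not an existing API:

```cpp
#include <cstdint>
#include <vector>

// Check whether candidate2's slice offsets equal candidate1's offsets mapped
// through the permutation that the consumer's indexing maps imply between
// its two operands. If not, fusing the consumer through these slices must be
// given up. (Sizes would need the same treatment; only offsets shown here.)
bool slicesCompatible(const std::vector<int64_t> &offsets1,
                      const std::vector<int64_t> &offsets2,
                      const std::vector<unsigned> &perm) {
  if (offsets1.size() != perm.size() || offsets2.size() != perm.size())
    return false;
  for (unsigned i = 0; i < perm.size(); ++i)
    if (offsets2[i] != offsets1[perm[i]])
      return false;
  return true;
}
```

For the {d1,d2}->{d2,d1} operand above, perm would be {1,0}: identical offsets like {0,16} on both slices fail the check, while the properly permuted {16,0} passes.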

This is why if you are relying on this fusion to happen, it is most straightforward to start by tiling the root of the DAG first and fuse in producers.

BTW, I wonder what happens in the counterpart of your example:

 op1 
 /   \
op2 op3

where op2 and op3 have already been tiled into the same loop.

%1 = op1
scf.for (...) {
  %ex1_2 = tensor.extract_slice %1 offsets[...] sizes[...]
  %ex1_3 = tensor.extract_slice %1 offsets[...] sizes[...]
  %2 = op2(%ex1_2, ...)
  %3 = op3(%ex1_3, ...)
}

We already fuse producers and consumers to the GEMM-like operation in our project

Did you mean TileConsumerAndFuseProducers.cpp? If so, there seem to be two major differences compared with this RFC:

  1. Method: Overall you use an approach similar to scf::tileConsumerAndFuseProducersUsingSCF, going bottom-up from the last elementwise op, so the GEMM-like operation cannot be deeply tiled as I illustrated at the beginning. In contrast, this RFC proposes a pass starting from a hand-written template.
  2. Coverage: Your pass only focuses on elementwise ops, excluding the pack/unpack/reduce ops covered by this RFC. As for fusing producers of the GEMM-like operation, it looks like only fill is supported. In contrast, this RFC intends to fuse any tilable operation (anything implementing the TilingInterface).
  1. Packing & distribution with arbitrary multi-level loop nesting (cache awareness)
  2. Arbitrary fusion (prod/cons) at non-perfectly nested loops (GEMM + EW)

That is one of the original intentions of this RFC. I have actually enabled this downstream and am trying to enhance some existing features to support nested loops, see this PR.

it’d be good to iterate and see if we can find a good common ground.

Maybe we can discuss this in more detail later. CC @ZhennanQin.

BTW, I wonder what happens in the counterpart of your example:

 op1 
 /   \
op2 op3

where op2 and op3 have already been tiled into the same loop.

%1 = op1
scf.for (...) {
  %ex1_2 = tensor.extract_slice %1 offsets[...] sizes[...]
  %ex1_3 = tensor.extract_slice %1 offsets[...] sizes[...]
  %2 = op2(%ex1_2, ...)
  %3 = op3(%ex1_3, ...)
}

Fusion of producers will clone the source operation so we would end up with two instances of op1. If the iteration space of each slice matches, then CSE can clean it up (or ex1_2 and ex1_3 could have already been CSE’d). I was primarily cautioning against relying on such fusions as they are inherently harder to reason about from the perspective of the TilingInterface than the reverse direction of fusing producers.

It’s okay to have an iterative implementation of producer+consumer fusion starting from a given op with some cost model upstream IMO, preferably with a possibility to inject that cost model into the iteration logic, e.g., as a callback accepting the currently fused loop structure and a candidate. This can be implemented as a utility function and exercised in a (test) pass.

+1 to what Alex is saying here. Now that we have the ability to consumer fuse, having a process to iteratively do both consumer and producer fusion is a natural next step. The cost model for fusion is highly context and pipeline dependent, so just as with the other fusion helpers, I would expect this new variant to take a callback to represent it.

I also agree with the sentiment upthread that introducing special terminology for a new algorithm makes it harder to follow.

Sorry for this confusing. It seems a part of IREE. Or you can treat it as Target Description of this proposal. Anyway, it just provides resource hints of specific device for cost model to make final decision like which candidate to fuse during runtime.

IREE’s Hardware Abstraction Layer (HAL) might be slightly mis-referenced here. The HAL is primarily a mirror of Vulkan’s compute API and provides little more than an attribute dictionary for target descriptions. There is more information on the design here, and on the associated dialect here. What it seems as though you are looking for is a unified representation of various critical device resource constraints, which the HAL does not provide. Whether IREE opts to adopt the result of the above RFC is TBD and is tangential to this discussion.

Borrowing @rengolin's name, IMO IterativelyRunProducerConsumerFusionUntilExhaustion should not have any explicit dependence on target descriptions and should instead allow users to encode the target-dependent cost model they want in a user-defined callback, as @ftynse suggested.


As I said, our implementation is an iterative process. We do fuse fill, pack, and unpack later on throughout the compiler, and end up with a single fully-fused kernel. Look for fused_brgemm XSMM calls.

The PR changes a lot of code and it’s hard to know what will be the effects of it elsewhere. We don’t have nearly enough tests upstream for that kind of change to go through.

Our approach seems more nuanced, evolving into a (hopefully more robust) high-level transform that can be used for both simple fusion and graph fusion.

Agree.

If the encoding comes from a user or some cost model, the implementation should be equivalent. I.e., the pass itself should be mechanical, and the parameters that it uses to transform should come from elsewhere.

I think this applies to most things: the more parametric the transforms are, the easier it is to combine them with multiple levels of “cost modelling” (whole graph vs sub-graph, high vs low level, etc).


Fusion of producers will clone the source operation so we would end up with two instances of op1. If the iteration space of each slice matches, then CSE can clean it up (or ex1_2 and ex1_3 could have already been CSE’d).

I see. However, if the iteration space of each slice matches, consumer fusion would also work well. If it does not match, as you said, producer fusion could still succeed at the cost of redundant computation due to one more tiled instance of the producer. Then it is a little confusing to me how to further fuse the producer(s) of op1, which now has two different tiled instances. Do we need to fuse them along both instances?

I would caution against consumer fusions like this. If the producer relu and op4 operations have already been tiled, determining whether it is possible to fuse op5 is nontrivial.
To then fuse op5 into the producer forall loop, you need to prove that the slices of its operands within the forall (%1 and %2 ) correspond to the same slice of op5 ’s iteration space.

If so, could we use a similar approach to solve the consumer fusion issue you cautioned against above? For example, if the slices of %1 and %2 do not match, we could re-tile-and-fuse op4 or relu to adapt to the existing slice of %1 or %2, just as you do in producer fusion.

they are inherently harder to reason about from the perspective of the TilingInterface than the reverse direction of fusing producers.

Strictly speaking, I think it is determined by the concrete DAG. The potential issue found in consumer fusion very likely also exists in producer fusion on the reversed DAG. E.g., another example slightly different from the previous one:

%1_1, %1_2 = op1
scf.for (...) {
  %ex2 = tensor.extract_slice %1_1 offsets[...] sizes[...]
  %ex3 = tensor.extract_slice %1_2 offsets[...] sizes[...]
  %2 = op2(%ex2, ...)
  %3 = op3(%ex3, ...)
}

where op1 actually has two outputs. Could they still be CSE'd as you said?

Besides, for both producer fusion and consumer fusion, it makes sense to discuss how to support the case you cautioned about from a functionality point of view. But, as you saw, this RFC also plans to inject a cost model into the iterative fusion logic, which means it emphasizes performance over mere functionality. That is why I prefer to give up such fusions when the slices are unmatched, as in your example.

BTW, I found this is slightly similar to another topic we discussed at length before. In general, a tiled implementation has so far assumed that none of the operands are tiled yet, let alone all of them as in your case.

Thanks for the notes. BTW, do you mean fill/pack/unpack will be fused in another pass or transform? Also, is that strongly correlated with the XSMM call?

The PR changes a lot of code and it’s hard to know what will be the effects of it elsewhere. We don’t have nearly enough tests upstream for that kind of change to go through.

As I said in that PR, most of the change is code refactoring for better reuse, and the consumer fusion is a recently added feature for initial usage. All upstream-related tests pass so far. If any other corner case is found, I will track and fix it ASAP.

Our approach seems more nuanced, evolving into a (hopefully more robust) high-level transform that can be used for both simple fusion and graph fusion.

Would you share a more concrete RFC or link about this? I would be glad to study it.

Hi all. Since you have all shown interest in the cost model of this RFC and given many useful comments, I need more time to take a deep look at them. Thanks to all!


Yes. The reason is that there is no appropriate group op at tensor level, but there is a fused_brgemm op in XSMM. We’re also discussing group ops in Linalg to implement this (and other things).

It’s better to split refactoring from code changes in separate PRs.

This isn’t a single RFC, but involves everything from lowering from ingress, target description and cost modelling, pattern recognition, pattern matchers and the multiple topics discussed at the round table (link above).

All those discussions are happening upstream in Github or here.


That’s a discussion I’d like to reopen, but let’s not do it here and not condition this change on that discussion (unless really necessary)!

Thanks for looking into this! From reading the post, it looks like there is consensus from commenters about having an iterative fusion algorithm that first checks for fusion validity and, if valid, somehow asks the caller whether it is considered profitable.


Yes, sure. It is certainly worth opening another RFC to discuss this in the future! I just brought it up to clarify that the root cause of the issue may not come from consumer fusion, which should be treated as fairly as producer fusion.

+100. Thanks a lot for the constructive conclusion that makes this RFC much clearer and steers it in the right direction! It's really great to reach this consensus :slight_smile: I will attempt to upstream this part first.

FWIW, I am not sure there is a cost model that can tell you what to do. These are very specific transformations that can be put together to meet some downstream need. In any case, the fusion transformation already has a way to allow downstream users to control what can be fused, which would be where you would inject a cost model.

So this should already be done?

This is research in progress. We can discuss in separate.

I don’t think we should design upstream technology with the low bar that “downstream already do this elsewhere”.


I don’t see how iterative fusion is very specific, it’s one of the oldest loop transformations in the book. If you mean the decision to fuse, meeting some downstream need is exactly why it should be made controllable. One may not have their ideal cost model right now, but if we don’t give everybody a possibility to explore cost models, nobody ever will have a good one.

Can you point me to the code?
