Understanding the affine loop fusion pass

rohany · March 22, 2023, 5:10pm

What is the intuition for why this loop is not getting fused?

➜  build git:(main) ✗ bin/mlir-opt ../testing.mlir -pass-pipeline='builtin.module(func.func(affine-loop-fusion{fusion-maximal=true}))' --dump-pass-pipeline                   (base) 09:53:12
Pass Manager with 1 passes:
builtin.module(func.func(affine-loop-fusion{fusion-compute-tolerance=3.000000e-01 fusion-fast-mem-space=0 fusion-local-buf-threshold=0 fusion-maximal=true mode=producer}))

module {
  func.func @testing2(%arg0: memref<10xf32>) {
    %cst = arith.constant 1.000000e+00 : f32
    %cst_0 = arith.constant 2.000000e+00 : f32
    affine.for %arg1 = 0 to 10 {
      %0 = affine.load %arg0[%arg1] : memref<10xf32>
    }
    affine.for %arg1 = 0 to 10 {
      %0 = affine.load %arg0[%arg1] : memref<10xf32>
    }
    return
  }
}

In the toy applications i’m thinking to use mlir for, fusion is the most important pass to be applied, so I want to understand generally when the pass will succeed and how to write code in a way that helps the pass’s reasoning.

dcaballe · March 22, 2023, 6:16pm

It’s been a while but there are three fusion strategies in Affine fusion: producer-consumer, sibling and greedy. Producer-consumer fusion looks for producer-consumer relationships between the candidate loops (not your case), sibling fusion looks for input reusability between the candidates (your case) and greedy is a mix of producer-consumer and sibling strategies. Looking at your command line, it looks like you are using the producer-consumer strategy and should be using the sibling one.

Hopefully that helps!

rohany · March 22, 2023, 6:19pm

I moved this over from the discussion here – MLIR Affine Dialect Loop Fusion pass appears to not work · Issue #61604 · llvm/llvm-project · GitHub. It seems like there’s a bug in the output of the pass manager where it always says “producer”.

For proof, these loops get fused (which should only occur with the greedy or sibling fusion options):

➜  build git:(main) ✗ cat ../testing.mlir                                                                                                                                     (base) 11:18:23
func.func @testing(%arg0: memref<10xf32>) {
  %const1 = arith.constant 1. : f32
  %const2 = arith.constant 2. : f32
  %const3 = arith.constant 3. : f32
  affine.for %i = 0 to 10 {
    %0 = affine.load %arg0[%i] : memref<10xf32>
    %1 = arith.addf %0, %const1 : f32
    affine.store %1, %arg0[%i] : memref<10xf32>
  }
  affine.for %i = 0 to 10 {
    %0 = affine.load %arg0[%i] : memref<10xf32>
    %1 = arith.addf %0, %const2 : f32
    affine.store %1, %arg0[%i] : memref<10xf32>
  }
  affine.for %i = 0 to 10 {
    %0 = affine.load %arg0[%i] : memref<10xf32>
    %1 = arith.addf %0, %const3 : f32
    affine.store %1, %arg0[%i] : memref<10xf32>
  }
  return
}
➜  build git:(main) ✗ bin/mlir-opt ../testing.mlir -pass-pipeline='builtin.module(func.func(affine-loop-fusion{fusion-maximal=true}, affine-scalrep))' --dump-pass-pipeline   (base) 11:18:20
Pass Manager with 1 passes:
builtin.module(func.func(affine-loop-fusion{fusion-compute-tolerance=3.000000e-01 fusion-fast-mem-space=0 fusion-local-buf-threshold=0 fusion-maximal=true mode=producer},affine-scalrep))

module {
  func.func @testing(%arg0: memref<10xf32>) {
    %cst = arith.constant 1.000000e+00 : f32
    %cst_0 = arith.constant 2.000000e+00 : f32
    %cst_1 = arith.constant 3.000000e+00 : f32
    affine.for %arg1 = 0 to 10 {
      %0 = affine.load %arg0[%arg1] : memref<10xf32>
      %1 = arith.addf %0, %cst : f32
      %2 = arith.addf %1, %cst_0 : f32
      %3 = arith.addf %2, %cst_1 : f32
      affine.store %3, %arg0[%arg1] : memref<10xf32>
    }
    return
  }
}

Since the more complicated loops get fused, i’m unsure why the simple one without other indexing is not getting fused.

bondhugula · March 23, 2023, 9:05pm

I didn’t understand what the question here is. With “producer” mode fusion, the output you pasted above is the desired and expected output: L2 pulled into L3, and L1 pulled into {L2, L3}.

rohany · March 23, 2023, 9:12pm

My question is why the loops in the original post are not getting fused, given that these other loops below can. @dcaballe thought that this was the case because the fusion mode was not set correctly, but as you mentioned in the github issue, it seems like the output from the pass manager when reporting the fusion mode is incorrect. For the original loop, i’ve tried all of the fusion modes without getting fusion.

bondhugula · March 23, 2023, 9:49pm

Okay, reg. the first example, sibling fusion expects the sibling node to have at least one store. The nests you have with unused loaded values are dead; if you store those results somewhere, you’ll see them fused. Arguably, one could expect them to be fused in this form too – could you please file an issue on LLVM Github issues? A patch to enable fusion for these is straightforward.

bondhugula · March 23, 2023, 10:16pm

This fixes it: ⚙ D146763 [MLIR][Affine] Fix/improve affine sibling fusion

Output

module {
  func.func @testing2(%arg0: memref<10xf32>) {
    %cst = arith.constant 1.000000e+00 : f32
    %cst_0 = arith.constant 2.000000e+00 : f32
    affine.for %arg1 = 0 to 10 {
      %0 = affine.load %arg0[%arg1] : memref<10xf32>
      %1 = affine.load %arg0[%arg1] : memref<10xf32>
    }
    return
  }
}

rohany · March 23, 2023, 11:24pm

You got to it before I could open the issue You’re right that in the sense of the whole system this fusion doesn’t matter since both loops would be DCE’d anyway. I was just looking around for potential gotcha’s in generating code that is amenable to the fusion pass, so was confused why this case was missed.

rohany · March 24, 2023, 6:58pm

Another question – do variable memref bounds affect the reasoning of the fusion pass? I haven’t been able to massage these loops into getting fused. It doesn’t seem to me like deeper analysis is needed here than in the case when these loop bounds are constant.

  func.func @testing(%arg0: memref<?xf32>) {
    %cst = arith.constant 1.000000e+00 : f32
    %cst_0 = arith.constant 2.000000e+00 : f32
    %cst_1 = arith.constant 3.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %dim = memref.dim %arg0, %c0 : memref<?xf32>
    affine.for %arg1 = 0 to %dim {
      %0 = affine.load %arg0[%arg1] : memref<?xf32>
      %1 = arith.addf %0, %cst : f32
      affine.store %1, %arg0[%arg1] : memref<?xf32>
    }
    affine.for %arg1 = 0 to %dim {
      %0 = affine.load %arg0[%arg1] : memref<?xf32>
      %1 = arith.addf %0, %cst_0 : f32
      affine.store %1, %arg0[%arg1] : memref<?xf32>
    }
    affine.for %arg1 = 0 to %dim {
      %0 = affine.load %arg0[%arg1] : memref<?xf32>
      %1 = arith.addf %0, %cst_1 : f32
      affine.store %1, %arg0[%arg1] : memref<?xf32>
    }
    return
  }

bondhugula · March 27, 2023, 3:58pm

-debug-only=affine-loop-fusion,loop-fusion-utils will give you a useful log on why fusion didn’t happen. The underlying dependence and slicing analysis used by fusion should normally be able to handle this, but there may be TODOs or limitations that might have prevented it – since constant bounds has been a common case. Extending the underlying utilities will help – one would have to look at the log and dig deeper to see which utility to improve.

rohany · March 27, 2023, 6:07pm

Thanks, I will look into this more! I’m trying the flags you suggested on mlir-opt, but they don’t seem to work:

➜  build ../../llvm/build/bin/mlir-opt ../../llvm/testing.mlir -pass-pipeline='builtin.module(func.func(affine-loop-fusion))' --dump-pass-pipeline -debug-only=affine-loop-fusion,loop-fusion-utils                            (base) 12:04:14
mlir-opt: Unknown command line argument '-debug-only=affine-loop-fusion,loop-fusion-utils'.  Try: '../../llvm/build/bin/mlir-opt --help'
mlir-opt: Did you mean '--debug-pass=affine-loop-fusion,loop-fusion-utils'?
➜  build ../../llvm/build/bin/mlir-opt ../../llvm/testing.mlir -pass-pipeline='builtin.module(func.func(affine-loop-fusion))' --dump-pass-pipeline --debug-only=affine-loop-fusion,loop-fusion-utils                           (base) 12:04:15
mlir-opt: Unknown command line argument '--debug-only=affine-loop-fusion,loop-fusion-utils'.  Try: '../../llvm/build/bin/mlir-opt --help'
mlir-opt: Did you mean '--debug-pass=affine-loop-fusion,loop-fusion-utils'?
➜  build ../../llvm/build/bin/mlir-opt ../../llvm/testing.mlir -pass-pipeline='builtin.module(func.func(affine-loop-fusion))' --dump-pass-pipeline --debug-pass=affine-loop-fusion,loop-fusion-utils                           (base) 12:04:22
mlir-opt: for the --debug-pass option: Cannot find option named 'affine-loop-fusion,loop-fusion-utils'!

I don’t see related flags in the help output of mlir-opt either:

➜  build ../../llvm/build/bin/mlir-opt --help | grep debug                                                                                                                                                                     (base) 12:06:21
  --mlir-pretty-debuginfo                                     - Print pretty debug info in MLIR output
  --mlir-print-debuginfo                                      - Print debug info in MLIR output
      --ensure-debug-info-scope-on-llvm-func                  -   Materialize LLVM debug info subprogram attribute on every LLVMFuncOp
      --print-ir                                              -   Print IR on the debug stream
      --strip-debuginfo                                       -   Strip debug info from all operations
        --debug-payload-root-tag=<string>                     - Select the operation with 'transform.target_tag' attribute having the given value as payload IR root. If empty select the pass anchor operation as the payload IR root.
        --debug-transform-root-tag=<string>                   - Select the operation with 'transform.target_tag' attribute having the given value as container IR for top-level transform ops. This allows user control on what transformation to apply. If empty, select the container of the top-level transform op.

dcaballe · March 27, 2023, 6:29pm

Running cmake with -DCMAKE_BUILD_TYPE=Debug should help.

mehdi_amini · March 27, 2023, 6:32pm

-DLLVM_ENABLE_ASSERTIONS=ON should be enough, no need for full fledged debug build

rohany · March 27, 2023, 6:33pm

I will try the debug build. Unfortunately the lighter-weight assertions on build doesn’t work for me (issue building with `-DLLVM_ENABLE_ASSERTIONS=ON` · Issue #61598 · llvm/llvm-project · GitHub).

rohany · March 28, 2023, 9:09pm

After looking at debug logs, it looks like this is the problem: llvm-project/Utils.cpp at main · llvm/llvm-project · GitHub. The slice validity analysis used by the fusion pass cannot handle symbolic bounds on the slice.

This loop stats calculation is also not able to handle non-constant bounds: llvm-project/LoopFusionUtils.cpp at main · llvm/llvm-project · GitHub.

Some memref analysis also doesn’t work on dynamically sized memrefs: llvm-project/Utils.cpp at main · llvm/llvm-project · GitHub.

It seems like supporting fusion on loops with variable extents is a relatively large feature (i’m not sure how many more of these checks i can circumvent before things explode).

bondhugula · March 29, 2023, 1:33am

I wouldn’t call it a large feature --a medium-sized at best. There isn’t a need to circumvent anything - the utilities can be extended modularly. And I don’t see why anything would explode at all. For the nests you pasted, the computations would be trivial and as scalable as in the constant bound case. Please feel free to file an issue for this on LLVM/Github Issues.

rohany · March 29, 2023, 3:43am

I’ll open an issue about it.

And I don’t see why anything would explode at all.

I’m just unfamiliar with this code base – without me knowing the context, I expect the removal of enough error checks will lead to segfaults or unpredictable behavior.

issue: mlir/Affine: affine loop fusion is unable to fuse loops with non-constant extents · Issue #61784 · llvm/llvm-project · GitHub

rohany · April 28, 2023, 5:23am

Returning here with some more questions (and also going to ask about the scalar replacement pass):

Consider the following function, which is adding an input array to itself 3 times into an output array, using some temporaries along the way.

func.func @testing(%input : memref<100xf32>, %output : memref<100xf32>) {
  %int1 = memref.alloc() : memref<100xf32>
  %int2 = memref.alloc() : memref<100xf32>
  affine.for %i = 0 to 100 {
    %0 = affine.load %input[%i] : memref<100xf32>
    %1 = affine.load %input[%i] : memref<100xf32>
    %2 = arith.addf %0, %1: f32
    affine.store %2, %int1[%i] : memref<100xf32>
  }
  affine.for %i = 0 to 100 {
    %0 = affine.load %input[%i] : memref<100xf32>
    %1 = affine.load %int1[%i] : memref<100xf32>
    %2 = arith.addf %0, %1: f32
    affine.store %2, %int2[%i] : memref<100xf32>
  }
  affine.for %i = 0 to 100 {
    %0 = affine.load %input[%i] : memref<100xf32>
    %1 = affine.load %int2[%i] : memref<100xf32>
    %2 = arith.addf %0, %1: f32
    affine.store %2, %output[%i] : memref<100xf32>
  }
  return
}

First, why is the loop fusion pass also rewriting the sizes of allocated temporary memrefs:

  func.func @testing(%arg0: memref<100xf32>, %arg1: memref<100xf32>) {
    %alloc = memref.alloc() : memref<1xf32>
    %alloc_0 = memref.alloc() : memref<1xf32>
    affine.for %arg2 = 0 to 100 {
      %0 = affine.load %arg0[%arg2] : memref<100xf32>
      %1 = affine.load %arg0[%arg2] : memref<100xf32>
      %2 = arith.addf %0, %1 : f32
      affine.store %2, %alloc_0[0] : memref<1xf32>
      %3 = affine.load %arg0[%arg2] : memref<100xf32>
      %4 = affine.load %alloc_0[0] : memref<1xf32>
      %5 = arith.addf %3, %4 : f32
      affine.store %5, %alloc[0] : memref<1xf32>
      %6 = affine.load %arg0[%arg2] : memref<100xf32>
      %7 = affine.load %alloc[0] : memref<1xf32>
      %8 = arith.addf %6, %7 : f32
      affine.store %8, %arg1[%arg2] : memref<100xf32>
    }
    return
  }

While this is not the final form I’d like to use, it seems strange that the loop fusion pass would do something like that. In particular, the loop is no longer parallelizable! This sort of thing is eliminated by hitting the result with the scalar replacement pass:

  func.func @testing(%arg0: memref<100xf32>, %arg1: memref<100xf32>) {
    affine.for %arg2 = 0 to 100 {
      %0 = affine.load %arg0[%arg2] : memref<100xf32>
      %1 = arith.addf %0, %0 : f32
      %2 = affine.load %arg0[%arg2] : memref<100xf32>
      %3 = arith.addf %2, %1 : f32
      %4 = affine.load %arg0[%arg2] : memref<100xf32>
      %5 = arith.addf %4, %3 : f32
      affine.store %5, %arg1[%arg2] : memref<100xf32>
    }
    return
  }

Onto the second question – the scalar replacement pass didn’t get rid of all of the intermediate loads! Why not, and is that expected? Using the scalar replacement pass twice gets us the desired computation:

  func.func @testing(%arg0: memref<100xf32>, %arg1: memref<100xf32>) {
    affine.for %arg2 = 0 to 100 {
      %0 = affine.load %arg0[%arg2] : memref<100xf32>
      %1 = arith.addf %0, %0 : f32
      %2 = arith.addf %0, %1 : f32
      %3 = arith.addf %0, %2 : f32
      affine.store %3, %arg1[%arg2] : memref<100xf32>
    }
    return
  }

Given some arbitrary function, is the proper use of the affine scalar replacement pass to apply it until a fixed-point is reached?

Thanks.

bondhugula · May 1, 2023, 10:18pm

One shouldn’t have to run the affine scalrep pass multiple times to get the desired result. A single application of the pass should be sufficient. I would consider this a functionality issue. There are three things that the affine-scalrep pass does to replace/eliminate affine references to memory:

store to load forwarding
invariant load elimination
dead store elimination in certain simple cases

What you see above is invariant load elimination not happening completely. This can be addressed.

Reg. the first issue: the affine fusion pass creates “local” buffers to reduce the size of intermediaries: in this case, they are of size 1 and will be swept away by scalrep. In other non-trivial cases, they can be small buffers that fit in cache/local memory. There are pass/cmd-line options to customize these, and all of this is good for locality. You would lose parallelism if you don’t eliminate these temporaries or don’t privatize/expand them on the loops that you would like to parallelize. Fusion for locality isn’t as effective without the elimination of the intermediate memrefs when possible. You will have to really customize/extend the pass if you want to consider parallelism criteria. On-the-fly scalrep during fusion is also possible (the affineScalarReplace utility can be called) - but the fusion pass upstream doesn’t provide this option.

rohany · May 3, 2023, 3:12am

I would consider this a functionality issue.

Ok. I was trying to put together a reproducer here and ran into some weird behavior. For those code snippets I sent above, using the mlir-opt command line utility with the flag -pass-pipeline='builtin.module(func.func(affine-scalrep))' yields the code I sent above, and then I have to add a second affine-scalrep to get the minimized code. However, in some standalone C++ code where I use the PassManager infrastructure and do addPass(mlir::createAffineScalarReplacementPass()) the pass correctly does invariant load elimination on the first try. So maybe there’s just something wrong with the command line utility.

Thanks for the explanation about fusion!

but the fusion pass upstream doesn’t provide this option.

As in there’s a fusion pass sitting around somewhere else that does ?

Topic		Replies	Views
Possible Bug in Affine Loop Fusion MLIR	4	414	September 23, 2024
Affine loop fusion with structured branching MLIR	2	280	May 28, 2023
Beginner Q: Help with loops/affine/linalg MLIR	15	2086	April 29, 2021
Should affine loop fusion work on the contents of another loop? MLIR	1	243	May 6, 2021
Different affine maps for the same affine.load when using a pass pipeline and a standalone pass MLIR	1	141	June 3, 2024

Understanding the affine loop fusion pass

Related topics