Affine fusion legality for vector types

Hi, I was playing around with affine fusion on loops containing affine.vector_load/store operations and came across this particular case:

func.func @main(%a: memref<64x512xf32>, %b: memref<64x512xf32>, %c: memref<64x512xf32>, %d: memref<64x4096xf32>, %e: memref<64x4096xf32>) {

    affine.for %j = 0 to 8 {
        %lhs = affine.vector_load %a[0, %j * 64] : memref<64x512xf32>, vector<64x64xf32>
        %rhs = affine.vector_load %b[0, %j * 64] : memref<64x512xf32>, vector<64x64xf32>
        %res = arith.addf %lhs, %rhs : vector<64x64xf32>
        affine.vector_store %res, %c[0, %j * 64] : memref<64x512xf32>, vector<64x64xf32>
    }

    affine.for %j = 0 to 8 {
        %lhs = affine.vector_load %c[0, 0] : memref<64x512xf32>, vector<64x512xf32>
        %rhs = affine.vector_load %d[0, %j * 512] : memref<64x4096xf32>, vector<64x512xf32>
        %res = arith.subf %lhs, %rhs : vector<64x512xf32>
        affine.vector_store %res, %d[0, %j * 512] : memref<64x4096xf32>, vector<64x512xf32>
    }

    func.return
}

Upon invoking affine-fusion on this IR with the following command:

mlir-opt --pass-pipeline='builtin.module(affine-loop-fusion)' test.mlir

I see that the loops are getting fused as follows:

 func.func @main(%arg0: memref<64x512xf32>, %arg1: memref<64x512xf32>, %arg2: memref<64x512xf32>, %arg3: memref<64x4096xf32>, %arg4: memref<64x4096xf32>) {
    %c0 = arith.constant 0 : index
    %alloc = memref.alloc() : memref<1x1xf32>
    %c0_0 = arith.constant 0 : index
    affine.for %arg5 = 0 to 8 {
      %0 = affine.vector_load %arg0[0, %c0 * 64] : memref<64x512xf32>, vector<64x64xf32>
      %1 = affine.vector_load %arg1[0, %c0 * 64] : memref<64x512xf32>, vector<64x64xf32>
      %2 = arith.addf %0, %1 : vector<64x64xf32>
      affine.vector_store %2, %arg2[0, %c0 * 64] : memref<64x512xf32>, vector<64x64xf32>
      %3 = affine.vector_load %arg0[0, %c0_0 * 64] : memref<64x512xf32>, vector<64x64xf32>
      %4 = affine.vector_load %arg1[0, %c0_0 * 64] : memref<64x512xf32>, vector<64x64xf32>
      %5 = arith.addf %3, %4 : vector<64x64xf32>
      affine.vector_store %5, %alloc[0, 0] : memref<1x1xf32>, vector<64x64xf32>
      %6 = affine.vector_load %alloc[0, 0] : memref<1x1xf32>, vector<64x512xf32>
      %7 = affine.vector_load %arg3[0, %arg5 * 512] : memref<64x4096xf32>, vector<64x512xf32>
      %8 = arith.subf %6, %7 : vector<64x512xf32>
      affine.vector_store %8, %arg3[0, %arg5 * 512] : memref<64x4096xf32>, vector<64x512xf32>
    }
    return
  }

Isn’t this an invalid transformation, since the 2nd loop can only execute once the 1st loop has finished all of its iterations and produced the result that the 2nd loop consumes? Are there any specific flags for affine-loop-fusion that make the analysis account for vector types?

Note: My LLVM source is based on a commit from Nov 12th, 2024.


But the second loop nest is only reading [0, 0] of %c IIUC. Why does it need to wait for all iterations of the 1st loop to be executed? The fusion is valid.

The first loop stores a 64x64 vector into a 64x512 memref on each iteration:

affine.vector_store %res, %c[0, %j * 64] : memref<64x512xf32>, vector<64x64xf32>

whereas the second loop consumes a whole vector<64x512xf32> in a single load:

%lhs = affine.vector_load %c[0, 0] : memref<64x512xf32>, vector<64x512xf32>

Hence all iterations of loop 1 need to be executed first, right?
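To make the footprint argument concrete, here is a small Python sketch (not MLIR code, and not how the pass computes anything) that models only the column range of %c touched by each access. The constants are read off the IR above; the function names are made up for illustration.

```python
# Column footprints along dimension 1 of %c, modelled as sets of indices.
PRODUCER_TRIPS = 8   # affine.for %j = 0 to 8 (first loop)
STORE_WIDTH = 64     # vector<64x64xf32> stored at column %j * 64
LOAD_WIDTH = 512     # vector<64x512xf32> loaded at column 0


def columns_written(num_iterations):
    """Columns of %c written by the first `num_iterations` producer iterations."""
    cols = set()
    for j in range(num_iterations):
        cols.update(range(j * STORE_WIDTH, (j + 1) * STORE_WIDTH))
    return cols


def producer_iterations_needed():
    """Smallest number of producer iterations covering the consumer's load."""
    needed = set(range(LOAD_WIDTH))
    for n in range(1, PRODUCER_TRIPS + 1):
        if needed <= columns_written(n):
            return n
    return None  # the producer never covers the load


if __name__ == "__main__":
    print(len(columns_written(1)), "of", LOAD_WIDTH, "columns ready after 1 iteration")
    print("producer iterations needed:", producer_iterations_needed())
```

After a single producer iteration only 64 of the 512 columns read by the consumer have been written, so every consumer iteration depends on all 8 producer iterations.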

Sorry, I wasn’t looking at the vector widths. This is indeed a bug: the fusion pass (or its underlying analysis) looks only at the subscripts of the memref, not at the widths of the elements being accessed. As a result, the same bug would exist even between an affine.store and an affine.vector_load, or vice versa. Can you please file an issue on GitHub and mark it as a good starter/beginner issue? Thanks.

When the fusion pass was originally introduced, there weren’t affine.vector_load/store operations in MLIR, and so this was later overlooked. A bailout in the presence of different-sized element types in the producer/consumer validity checking is a reasonable fix to start with. If it’s necessary to do fusion after such vectorization, we could consider handling more.
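As a rough illustration of that bailout, the following Python sketch models the check on a toy representation of the accesses. It is not the pass's actual code: `Access` and `has_mismatched_access_shapes` are hypothetical names, and the real fix would live in the C++ producer/consumer validity checks.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Access:
    """Toy model of an affine read/write op (hypothetical, for illustration)."""
    memref: str                    # SSA name of the accessed memref
    is_store: bool
    vector_shape: Tuple[int, ...]  # () models a scalar affine.load/store


def has_mismatched_access_shapes(producer: List[Access],
                                 consumer: List[Access]) -> bool:
    """True if a producer store and a consumer load on the same memref
    access differently shaped elements, in which case fusion should bail out."""
    for store in producer:
        if not store.is_store:
            continue
        for load in consumer:
            if load.is_store or load.memref != store.memref:
                continue
            if load.vector_shape != store.vector_shape:
                return True
    return False


# Accesses from the two loops in the original post.
loop1 = [Access("%a", False, (64, 64)), Access("%b", False, (64, 64)),
         Access("%c", True, (64, 64))]
loop2 = [Access("%c", False, (64, 512)), Access("%d", False, (64, 512)),
         Access("%d", True, (64, 512))]

if __name__ == "__main__":
    print("bail out:", has_mismatched_access_shapes(loop1, loop2))
```

For the loops above, the store to %c uses vector<64x64xf32> while the load uses vector<64x512xf32>, so this conservative check would refuse to fuse. It also rejects some legal cases, which is the trade-off of starting with a bailout.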

@bondhugula thanks for the feedback. I have created an issue on GitHub for this: [MLIR][affine] Illegal affine loop fusion with vector types · Issue #115849 · llvm/llvm-project · GitHub


@bondhugula Further exploration shows that the transformation also seems to produce invalid IR like the following, where a vector is stored to (and loaded from) a memref of type memref<1x1xf32>:

affine.vector_store %5, %alloc[0, 0] : memref<1x1xf32>, vector<64x64xf32>

%6 = affine.vector_load %alloc[0, 0] : memref<1x1xf32>, vector<64x512xf32>

I think the fix needs to address not just legality but also the validity of the transformed IR.
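The size mismatch can be stated as a simple bounds condition. The sketch below is only a model of that condition in Python, assuming the vector rank equals the memref rank (as in the IR above); `access_in_bounds` is a hypothetical helper, not an MLIR API.

```python
def access_in_bounds(memref_shape, start, vector_shape):
    """True if a vector of `vector_shape` accessed at `start` stays inside a
    memref of `memref_shape`. Assumes all three have the same rank."""
    assert len(memref_shape) == len(start) == len(vector_shape)
    return all(s >= 0 and s + v <= m
               for m, s, v in zip(memref_shape, start, vector_shape))


if __name__ == "__main__":
    # Store into the private buffer created by fusion: memref<1x1xf32>.
    print(access_in_bounds((1, 1), (0, 0), (64, 64)))
    # Last store of the original first loop (%j = 7): %c[0, 448].
    print(access_in_bounds((64, 512), (0, 448), (64, 64)))
```

The first check fails because the 1x1 buffer holds a single f32 while the stored vector has 64 * 64 elements. The second passes, matching the original, unfused store into %c.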

That’s correct; that’s a related issue. The pass could end up generating invalid IR due to the same oversight. One would run into a similar issue with affine-scalrep as well if it is (or has already been) extended to work with AffineRead/WriteOpInterfaces.

The fixes should be similarly straightforward: the elemental type can’t be ignored when looking at pairs of affine read/write interface ops, even though it could safely be ignored for plain affine.store/load.