Parallelization of affine.for containing vector.transfer_write/read is not supported

input mlir:

module {
  func.func @elementwise(%arg0: memref<32x1280x768xf32>, %arg1: memref<32x1280x768xf32>, %arg2: memref<32x1280x768xf32>) {
    affine.for %arg3 = 0 to 32 {
      affine.for %arg4 = 0 to 1280 {
        affine.for %arg5 = 0 to 768 {
          %0 = affine.load %arg0[%arg3, %arg4, %arg5] : memref<32x1280x768xf32>
          %1 = affine.load %arg1[%arg3, %arg4, %arg5] : memref<32x1280x768xf32>
          %2 = arith.mulf %0, %1 : f32
          affine.store %2, %arg2[%arg3, %arg4, %arg5] : memref<32x1280x768xf32>
        }
      }
    }
    return
  }
}

run:

mlir-opt %s -affine-super-vectorize="virtual-vector-size=128 test-fastest-varying=0" -affine-parallelize='max-nested=1'

result:

module {
  func.func @elementwise(%arg0: memref<32x1280x768xf32>, %arg1: memref<32x1280x768xf32>, %arg2: memref<32x1280x768xf32>) {
    affine.for %arg3 = 0 to 32 {
      affine.for %arg4 = 0 to 1280 {
        affine.for %arg5 = 0 to 768 step 128 {
          %cst = arith.constant 0.000000e+00 : f32
          %0 = vector.transfer_read %arg0[%arg3, %arg4, %arg5], %cst : memref<32x1280x768xf32>, vector<128xf32>
          %cst_0 = arith.constant 0.000000e+00 : f32
          %1 = vector.transfer_read %arg1[%arg3, %arg4, %arg5], %cst_0 : memref<32x1280x768xf32>, vector<128xf32>
          %2 = arith.mulf %0, %1 : vector<128xf32>
          vector.transfer_write %2, %arg2[%arg3, %arg4, %arg5] : vector<128xf32>, memref<32x1280x768xf32>
        }
      }
    }
    return
  }
}

From the operator's semantics, the outermost affine.for can be parallelized: the computation is elementwise, so there are no loop-carried dependences.
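For illustration, here is a hand-written sketch (not the output of any existing pipeline) of the IR shape one would hope for, with the outermost loop rewritten to affine.parallel:

affine.parallel (%arg3) = (0) to (32) {
  affine.for %arg4 = 0 to 1280 {
    affine.for %arg5 = 0 to 768 step 128 {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = vector.transfer_read %arg0[%arg3, %arg4, %arg5], %cst : memref<32x1280x768xf32>, vector<128xf32>
      %1 = vector.transfer_read %arg1[%arg3, %arg4, %arg5], %cst : memref<32x1280x768xf32>, vector<128xf32>
      %2 = arith.mulf %0, %1 : vector<128xf32>
      vector.transfer_write %2, %arg2[%arg3, %arg4, %arg5] : vector<128xf32>, memref<32x1280x768xf32>
    }
  }
}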

However, in mlir/lib/Dialect/Affine/Analysis/AffineAnalysis.cpp, the functions isLoopMemoryParallel and checkMemrefAccessDependence only analyze dependences between affine.load and affine.store, and cannot handle vector.transfer_read and vector.transfer_write.

Will this scenario be supported in the future?

Thank you.

I have also been wondering about this, but didn't make a post (yet). I doubt that parallelizing after vectorization, once the affine dialect loads/stores have been removed, is going to happen. Getting the affine vectorizer pass to try its rewrites on parallel loop operations instead seems like a better approach. After scanning the code, though, it looks like that will be a bit of a headache, as the pass is currently tied quite heavily to AffineForOp.

My default answer: FAQ - MLIR applies here.

Affine analyses and passes predate most of modern MLIR. Nobody is actively working on them AFAIK. If you need that functionality, feel free to implement it. Given the abstractions now available, it would make sense to have some AffineMemoryAccessOpInterface implemented by ops that can express their subscripts as affine maps or integer sets, and to rewrite the analyses and passes to use that instead of a hardcoded set of ops (interfaces were introduced ~2 years after the affine passes).

I see that the affine dialect actually has affine.vector_load and affine.vector_store operations. Perhaps the simplest way forward is to adjust the affine vectorization pass to emit those operations instead of vector.transfer_read and vector.transfer_write. A conversion pass might be needed, though, to lower the affine vector operations into the vector dialect.
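For example, the vectorized inner loop above would then look like the following hand-written sketch. Both ops implement AffineReadOpInterface/AffineWriteOpInterface, so the affine dependence analysis can reason about them (note there is no padding operand as in vector.transfer_read; affine.vector_load expects in-bounds accesses):

affine.for %arg5 = 0 to 768 step 128 {
  %0 = affine.vector_load %arg0[%arg3, %arg4, %arg5] : memref<32x1280x768xf32>, vector<128xf32>
  %1 = affine.vector_load %arg1[%arg3, %arg4, %arg5] : memref<32x1280x768xf32>, vector<128xf32>
  %2 = arith.mulf %0, %1 : vector<128xf32>
  affine.vector_store %2, %arg2[%arg3, %arg4, %arg5] : memref<32x1280x768xf32>, vector<128xf32>
}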

This is completely inaccurate. Please see the output of git log lib/Dialect/Affine/ include/mlir/Dialect/Affine/.

There is include/mlir/Dialect/Affine/IR/AffineMemoryOpInterfaces.td, which already has:

def AffineMapAccessInterface : OpInterface<"AffineMapAccessInterface"> {
  let description = [{
      Interface to query the AffineMap used to dereference and access a given
      memref. Implementers of this interface must operate on at least one
      memref operand. The memref argument given to this interface must match
      one of those memref operands.
  }];

This isn't accurate either: the affine dependence check on memrefs, as well as the other affine analyses, doesn't use a hardcoded set of ops; they use AffineReadOpInterface/AffineWriteOpInterface. That said, the check perhaps needs to use a more general interface, or the other relevant ops should implement the proper interface so they can be used with it. If there is indeed any usage of hardcoded affine ops, it should be replaced with the existing interfaces. Where did you find hardcoded ops (unless you are looking at LLVM/MLIR from more than three years ago)? (I do see a couple of hardcoded instances of AffineLoadOp/AffineStoreOp in LoopAnalysis.cpp that should be changed to AffineReadOpInterface/AffineWriteOpInterface.)