Parallelization of affine.for loops containing vector.transfer_read/vector.transfer_write is not supported

Input MLIR:

module {
  func.func @elementwise(%arg0: memref<32x1280x768xf32>, %arg1: memref<32x1280x768xf32>, %arg2: memref<32x1280x768xf32>) {
    affine.for %arg3 = 0 to 32 {
      affine.for %arg4 = 0 to 1280 {
        affine.for %arg5 = 0 to 768 {
          %0 = affine.load %arg0[%arg3, %arg4, %arg5] : memref<32x1280x768xf32>
          %1 = affine.load %arg1[%arg3, %arg4, %arg5] : memref<32x1280x768xf32>
          %2 = arith.mulf %0, %1 : f32
          affine.store %2, %arg2[%arg3, %arg4, %arg5] : memref<32x1280x768xf32>
        }
      }
    }
    return
  }
}

RUN:

mlir-opt %s -affine-super-vectorize="virtual-vector-size=128 test-fastest-varying=0" -affine-parallelize='max-nested=1'

Result:

module {
  func.func @elementwise(%arg0: memref<32x1280x768xf32>, %arg1: memref<32x1280x768xf32>, %arg2: memref<32x1280x768xf32>) {
    affine.for %arg3 = 0 to 32 {
      affine.for %arg4 = 0 to 1280 {
        affine.for %arg5 = 0 to 768 step 128 {
          %cst = arith.constant 0.000000e+00 : f32
          %0 = vector.transfer_read %arg0[%arg3, %arg4, %arg5], %cst : memref<32x1280x768xf32>, vector<128xf32>
          %cst_0 = arith.constant 0.000000e+00 : f32
          %1 = vector.transfer_read %arg1[%arg3, %arg4, %arg5], %cst_0 : memref<32x1280x768xf32>, vector<128xf32>
          %2 = arith.mulf %0, %1 : vector<128xf32>
          vector.transfer_write %2, %arg2[%arg3, %arg4, %arg5] : vector<128xf32>, memref<32x1280x768xf32>
        }
      }
    }
    return
  }
}

From the perspective of the operations involved, the outermost affine.for can still be parallelized: iterations at distinct %arg3 values read and write disjoint slices of the memrefs, so there are no loop-carried dependences.
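For reference, this is a hand-written sketch of the IR one would hope -affine-parallelize could produce here (assuming its dependence analysis understood the transfer ops); the inner loops are unchanged, only the outermost loop becomes an affine.parallel:

    affine.parallel (%arg3) = (0) to (32) {
      affine.for %arg4 = 0 to 1280 {
        affine.for %arg5 = 0 to 768 step 128 {
          %cst = arith.constant 0.000000e+00 : f32
          %0 = vector.transfer_read %arg0[%arg3, %arg4, %arg5], %cst : memref<32x1280x768xf32>, vector<128xf32>
          %1 = vector.transfer_read %arg1[%arg3, %arg4, %arg5], %cst : memref<32x1280x768xf32>, vector<128xf32>
          %2 = arith.mulf %0, %1 : vector<128xf32>
          vector.transfer_write %2, %arg2[%arg3, %arg4, %arg5] : vector<128xf32>, memref<32x1280x768xf32>
        }
      }
    }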

However, in mlir/lib/Dialect/Affine/Analysis/AffineAnalysis.cpp, the functions isLoopMemoryParallel and checkMemrefAccessDependence can only analyze dependences arising from affine.load and affine.store (ops implementing the affine read/write interfaces); they cannot handle vector.transfer_read and vector.transfer_write, so the loop is conservatively treated as non-parallel.

Will this scenario be supported in the future?

Thank you.