Failed to legalize operation 'affine.parallel' marked as erased

Hi,

I found a conv2d example that uses affine.parallel, and I defined a conv2d function based on it.

func @conv_2d(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) {
  %c0 = constant 0 : index
  %filter_dim = dim %arg1, %c0 : memref<?x?xf32>
  %output_dim = dim %arg2, %c0 : memref<?x?xf32>
  affine.parallel (%x, %y) = (%c0, %c0) to (%output_dim, %output_dim) {
    %0 = affine.parallel (%kx, %ky) = (%c0, %c0) to (%filter_dim, %filter_dim) reduce ("addf") -> (f32) {
      %1 = affine.load %arg0[%x + %kx, %y + %ky] : memref<?x?xf32>
      %2 = affine.load %arg1[%kx, %ky] : memref<?x?xf32>
      %3 = mulf %1, %2 : f32
      affine.yield %3 : f32
    }
    affine.store %0, %arg2[%x, %y] : memref<?x?xf32>
  }
  return
}

Then I ran mlir-opt with the -lower-affine option, but it reported an error:

error: failed to legalize operation 'affine.parallel' marked as erased
%0 = affine.parallel (%kx, %ky) = (%c0, %c0) to (%filter_dim, %filter_dim) reduce ("addf") -> (f32) {

It seems that the affine.store caused the affine.parallel to be marked as erased? How can I store the result produced by the affine.parallel?

As for efficiency, can affine.parallel execute the conv2d faster than nested affine.for loops?

Thanks!

Hongbin

affine.parallel is like a multi-dimensional affine.for, but with no ordering specified on the dimensions – it’s meant to capture parallel execution over the entire domain. The lowering for affine.parallel is probably missing or broken for ops with result values. Could you try affine.parallel ops that don’t have result values? Please do file a bug anyway.
CC: @jbruestle @flaub for visibility.

The affine.parallel without result values works well:

func @conv_2d(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) {
  %c0 = constant 0 : index
  %filter_dim = dim %arg1, %c0 : memref<?x?xf32>
  %output_dim = dim %arg2, %c0 : memref<?x?xf32>
  affine.parallel (%x, %y) = (%c0, %c0) to (%output_dim, %output_dim) {
    affine.parallel (%kx, %ky) = (%c0, %c0) to (%filter_dim, %filter_dim) {
      %1 = affine.load %arg0[%x + %kx, %y + %ky] : memref<?x?xf32>
      %2 = affine.load %arg1[%kx, %ky] : memref<?x?xf32>
      %3 = mulf %1, %2 : f32
      %4 = affine.load %arg2[%x, %y] : memref<?x?xf32>
      %result = addf %4, %3 : f32
      affine.store %result, %arg2[%x, %y] : memref<?x?xf32>
    }
  }
  return
}

Adding the load/store in the affine.parallel gives the correct result:

Unranked Memref base@ = 0x7fa643f04080 rank = 2 offset = 0 sizes = [6, 6] strides = [6, 1] data = 
[[9,   9,   9,   9,   9,   9], 
 [9,   9,   9,   9,   9,   9], 
 [9,   9,   9,   9,   9,   9], 
 [9,   9,   9,   9,   9,   9], 
 [9,   9,   9,   9,   9,   9], 
 [9,   9,   9,   9,   9,   9]]

But I think the load/store added inside the affine.parallel will slow down execution. I had originally hoped that an affine.parallel with a result value could save the load/store.
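
For illustration, here is a sketch of the reduction carried in SSA values via scf.for with iter_args, which keeps the accumulator in a register and writes %arg2 once per output element. This is illustrative only – %c0, %filter_dim, %x, and %y are assumed from the surrounding loops, and %zero/%c1 are introduced here:

// Sketch: inner reduction carried through iter_args, so the accumulator
// stays in SSA form and %arg2 is written once per output element.
%zero = constant 0.000000e+00 : f32
%c1 = constant 1 : index
%sum = scf.for %kx = %c0 to %filter_dim step %c1 iter_args(%acc = %zero) -> (f32) {
  %row = scf.for %ky = %c0 to %filter_dim step %c1 iter_args(%racc = %acc) -> (f32) {
    // Affine index expressions become explicit index arithmetic.
    %ix = addi %x, %kx : index
    %iy = addi %y, %ky : index
    %1 = load %arg0[%ix, %iy] : memref<?x?xf32>
    %2 = load %arg1[%kx, %ky] : memref<?x?xf32>
    %3 = mulf %1, %2 : f32
    %4 = addf %racc, %3 : f32
    scf.yield %4 : f32
  }
  scf.yield %row : f32
}
store %sum, %arg2[%x, %y] : memref<?x?xf32>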

I haven’t filed a bug before. I should register an account and report the bug on Bugzilla, right?

Even with the load/store, downstream LLVM passes should be able to hoist the loads/stores out and use registers. Are the -O3 LLVM passes being run during the LLVM IR compilation?

Yes, here: https://bugs.llvm.org/

It should be pretty straightforward to handle lowering of result values. scf.parallel supports reductions, so affine.parallel would map directly to it.
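
For reference, a sketch of what the inner loop nest could look like after such a lowering – this is illustrative rather than exact pass output; the init value %zero and step %c1 are introduced here, and %x/%y are assumed to be the outer induction variables:

%zero = constant 0.000000e+00 : f32
%c1 = constant 1 : index
%0 = scf.parallel (%kx, %ky) = (%c0, %c0) to (%filter_dim, %filter_dim)
         step (%c1, %c1) init (%zero) -> f32 {
  %ix = addi %x, %kx : index
  %iy = addi %y, %ky : index
  %1 = load %arg0[%ix, %iy] : memref<?x?xf32>
  %2 = load %arg1[%kx, %ky] : memref<?x?xf32>
  %3 = mulf %1, %2 : f32
  // The reduce region combines partial values with addf,
  // mirroring reduce ("addf") on the affine.parallel.
  scf.reduce(%3) : f32 {
  ^bb0(%lhs: f32, %rhs: f32):
    %s = addf %lhs, %rhs : f32
    scf.reduce.return %s : f32
  }
}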

Actually, I’m not sure about the effect of -O3, and I’m wondering: will -O3 perform automatic vectorization on the conv2d function?

You’ll have to check the LLVM -O3 pipeline description/docs. Depending on the loop order and other things, it may or may not happen or be effective.

I have filed the bug:
Bug 48359 - [Affine] Failed to legalize operation ‘affine.parallel’ marked as erased when using affine.store with affine.parallel

I am working on this issue and will add the desired support soon. Thanks!

Thank you for pointing this out and filing the bug. It has been fixed: both affine.for and affine.parallel operations with result values are now successfully lowered into the scf dialect.
Please refer to the following commit: https://github.com/llvm/llvm-project/commit/3e07b0b9d3363fb767cbbaa2593fa91ac393fb7e
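
With that change, running the original example through the lowering pass should succeed and produce an scf.parallel with an scf.reduce region, e.g. (file name illustrative):

mlir-opt -lower-affine conv_2d.mlir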

It works correctly now! Thanks for fixing the bug!