Hi,

I found a conv2d example that uses `affine.parallel`, and I defined a conv2d function based on that example.

```
func @conv_2d(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) {
  %c0 = constant 0 : index
  %filter_dim = dim %arg1, %c0 : memref<?x?xf32>
  %output_dim = dim %arg2, %c0 : memref<?x?xf32>
  affine.parallel (%x, %y) = (%c0, %c0) to (%output_dim, %output_dim) {
    %0 = affine.parallel (%kx, %ky) = (%c0, %c0) to (%filter_dim, %filter_dim) reduce ("addf") -> (f32) {
      %1 = affine.load %arg0[%x + %kx, %y + %ky] : memref<?x?xf32>
      %2 = affine.load %arg1[%kx, %ky] : memref<?x?xf32>
      %3 = mulf %1, %2 : f32
      affine.yield %3 : f32
    }
    affine.store %0, %arg2[%x, %y] : memref<?x?xf32>
  }
  return
}
```

Then I ran `mlir-opt` with the `-lower-affine` option, but it reported an error:

```
error: failed to legalize operation 'affine.parallel' marked as erased
%0 = affine.parallel (%kx, %ky) = (%c0, %c0) to (%filter_dim, %filter_dim) reduce ("addf") -> (f32) {
```

It seems that the `affine.store` marked the `affine.parallel` as erased? How can I store the result generated by the `affine.parallel`?

As for efficiency, can `affine.parallel` execute the conv2d faster than the nested `affine.for` method?
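
For reference, here is a minimal sketch of the nested `affine.for` version I am comparing against. The function name `@conv_2d_for` is only illustrative, and the sketch assumes the caller has zero-initialized `%arg2`, because it accumulates directly into the output buffer instead of yielding a reduction value.

```
// Sketch only: @conv_2d_for is a hypothetical name, and %arg2 is assumed
// to be filled with zeros before the call.
func @conv_2d_for(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) {
  %c0 = constant 0 : index
  %filter_dim = dim %arg1, %c0 : memref<?x?xf32>
  %output_dim = dim %arg2, %c0 : memref<?x?xf32>
  // Loop over every output element.
  affine.for %x = 0 to %output_dim {
    affine.for %y = 0 to %output_dim {
      // Loop over the filter window for this output element.
      affine.for %kx = 0 to %filter_dim {
        affine.for %ky = 0 to %filter_dim {
          %in = affine.load %arg0[%x + %kx, %y + %ky] : memref<?x?xf32>
          %flt = affine.load %arg1[%kx, %ky] : memref<?x?xf32>
          %acc = affine.load %arg2[%x, %y] : memref<?x?xf32>
          %prod = mulf %in, %flt : f32
          // Accumulate the product into the output element.
          %sum = addf %acc, %prod : f32
          affine.store %sum, %arg2[%x, %y] : memref<?x?xf32>
        }
      }
    }
  }
  return
}
```

Like my `affine.parallel` function above, this uses dimension 0 of the filter and the output for both loop extents, so it assumes both are square.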

Thanks!

Hongbin