Affine-supervectorize produce wrong results

absdzxc · December 8, 2020, 2:22pm

I’m trying to understand how affine-supervectorize pass is working, after few experiments I get the issue when underlying broadcasting of vector.transfer_read operation resulting in incorrect behaviour of original procedure:

minimal example

module {
func @foo() {

%0 = alloc() : memref<1x1x3x3xf32>
%1 = alloc() : memref<1x1x3x1xf32>

%c0 = constant 0 : index

affine.for %arg2 = 0 to 1 {
  affine.for %arg3 = 0 to 1 {
    affine.for %arg4 = 0 to 3 {
      affine.for %arg5 = 0 to 3 {

        %4 = affine.load %0[%c0, %c0, %arg4, %arg5] : memref<1x1x3x3xf32>

        // NOTE(Egor): this operation will be lowered into : 
        // #map5 = affine_map<(d0, d1, d2, d3) -> (0)>
        // %11 = vector.transfer_read %1[%7, %8, %9, %10], %cst_0 {permutation_map = #map5} : memref<1x1x3x1xf32>, vector<4xf32>
        %5 = affine.load %1[%c0, %c0, %arg4, %c0] : memref<1x1x3x1xf32>
        %6 = addf %4, %5 : f32
        affine.store %6, %0[%arg2, %arg3, %arg4, %arg5] : memref<1x1x3x3xf32>
      }
    }
  }
}
return
}
}

this is the minimal example that needed to reproduce behavior, after invoking this command

mlir-opt --affine-super-vectorize=“virtual-vector-size=4” --convert-vector-to-scf MVE.mlir > MVE_vectorized.mlir

it will produce code:

module {
  func @foo() {
    %cst = constant 0.000000e+00 : f32
    %c4 = constant 4 : index
    %c1 = constant 1 : index
    %c0 = constant 0 : index
    %0 = alloca() : memref<vector<4xf32>>
    %1 = alloc() : memref<1x1x3x3xf32>
    %2 = alloc() : memref<1x1x3x1xf32>
    affine.for %arg0 = 0 to 1 {
      affine.for %arg1 = 0 to 1 {
        affine.for %arg2 = 0 to 3 {
          affine.for %arg3 = 0 to 3 step 4 {
            %3 = vector.transfer_read %1[%c0, %c0, %arg2, %arg3], %cst : memref<1x1x3x3xf32>, vector<4xf32>


            // START: this is what transfer read with non-indentity affine map was lowered into:
            scf.for %arg4 = %c0 to %c4 step %c1 {
              %7 = index_cast %arg4 : index to i32
              %8 = load %0[] : memref<vector<4xf32>>
              %9 = cmpi "slt", %arg4, %c0 : index
              scf.if %9 {
                %10 = load %2[%c0, %c0, %arg2, %c0] : memref<1x1x3x1xf32>
                %11 = vector.insertelement %10, %8[%7 : i32] : vector<4xf32>
                store %11, %0[] : memref<vector<4xf32>>
              } else {
                %10 = vector.insertelement %cst, %8[%7 : i32] : vector<4xf32>
                store %10, %0[] : memref<vector<4xf32>>
              }
            }
            %4 = vector.type_cast %0 : memref<vector<4xf32>> to memref<vector<4xf32>>
            %5 = load %4[] : memref<vector<4xf32>>
            // END 


            %6 = addf %3, %5 : vector<4xf32>
            vector.transfer_write %6, %1[%arg0, %arg1, %arg2, %arg3] : vector<4xf32>, memref<1x1x3x3xf32>
          }
        }
      }
    }
    return
  }
}

looking at this implementation it’s determined that the only IF case that will be visited, will be the lower one (where vector is populated with padding value %cst), this is not the correct representation of original loop. Can someone please explain this situation to me? Is this a bug or my understanding of broadcasting is wrong in this situation?

As far as I understood #map5 = affine_map<(d0, d1, d2, d3) -> (0)> means that we will take the first element
%11 = vector.transfer_read %1[%7, %8, %9, %10], %cst_0 {permutation_map = #map5} : memref<1x1x3x1xf32>, vector<4xf32>
vector.transfer_read is pointing into, and we will use that element to populate the whole vector.

nicolasvasilache · December 9, 2020, 11:57am

Hello @absdzxc

This looks like a bug in -vector-to-scf lowering, where broadcasting semantics does not seem to work as intended. Some of that code was written before a vector.broadcast op existed and needs to be revisited.

It is possible the lowering would be better implemented with a more progressive N → N-1 lowering pattern where at each dimension we reduce the rank by 1 and depending on whether this is a broadcast or not just use the for + conditional or a vector.broadcast.

Reg. how to proceed next, are you interested in contributing to this or just want it to work as expected? (your TTS may vary ).

Thanks for reporting!

absdzxc · December 10, 2020, 10:18am

Hi, thanks for so prompt reply. I think, I’m interested in contributing.

absdzxc · December 13, 2020, 10:39pm

@nicolasvasilache I’m sorry I couldn’t find a way to pm you, could you please point me into the right direction?)

nicolasvasilache · December 14, 2020, 8:51am

I just did

Topic		Replies	Views
Parallelization of affine.for containing vector.transfer_write/read is not supported MLIR	0	158	June 8, 2023
Parallelization of affine.for containing vector.transfer_write/read is not supported MLIR	4	219	June 8, 2023
Vector.transfer_write error MLIR	0	165	October 3, 2023
Vector.transfer_read padding and division operations MLIR	1	245	December 2, 2022
`affine-super-vectorize=virtual-vector-size=2` is not working MLIR	2	198	August 17, 2023

Affine-supervectorize produce wrong results

Related Topics