Encapsulating linalg.convolution inside linalg.generic Region

It looks like convolution kernels are not encapsulated inside linalg.generic regions. Is it possible to get them inside the generic regions? The intent is to fuse them with other operations like Add/Relu so that they can run as a single kernel.

For example, the LinalgFusionOfTensorOps pass seems to skip fusing %2 = linalg.depthwise_conv_2d_input_nhwc_filter_hwc with %3 = linalg.generic { Relu }:

 %2 = linalg.depthwise_conv_2d_input_nhwc_filter_hwc {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%cst_1, %cst_0 : tensor<1x10x10x1xf32>, tensor<3x3x1xf32>) outs(%1 : tensor<1x8x8x1xf32>) -> tensor<1x8x8x1xf32>

  %3 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (0, d1, d2, 0)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%2 : tensor<1x8x8x1xf32>) outs(%0 : tensor<1x8x8x1xf32>) {
  ^bb0(%arg0: f32, %arg1: f32):  // no predecessors
    %4 = cmpf ogt, %cst_2, %arg0 : f32
    %5 = select %4, %cst_2, %arg0 : f32
    %6 = cmpf uno, %cst_2, %arg0 : f32
    %7 = select %6, %cst, %5 : f32
    linalg.yield %7 : f32
  } -> tensor<1x8x8x1xf32>

Is it possible to get %2 = linalg.depthwise_conv_2d_input_nhwc_filter_hwc inside a generic region like below, so that LinalgFusionOfTensorOps can take care of fusing this pattern?

%4 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (0, d1, d2, 0)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%2 : tensor<1x8x8x1xf32>) outs(%0 : tensor<1x8x8x1xf32>) {

 linalg.depthwise_conv_2d_input_nhwc_filter_hwc {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%cst_1, %cst_0 : tensor<1x10x10x1xf32>, tensor<3x3x1xf32>) outs(%1 : tensor<1x8x8x1xf32>) -> tensor<1x8x8x1xf32>

}

The LinalgFusionOfTensorOps pass is meant to do fusion where the resulting operation is also a perfectly nested loop. For fusing a convolution with an elementwise operation, you need imperfectly nested loops. I described how you would fuse convolution with elementwise operations on your other post here.
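
For reference, a minimal invocation of that elementwise fusion, assuming the flag name that appears later in this thread (it may differ across MLIR versions; input.mlir is a placeholder):

mlir-opt --linalg-fusion-for-tensor-ops input.mlir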

In general, the region of a linalg.generic is (at least currently) meant to represent the scalar operations to be performed at a point in the iteration space. In particular, as of today, it introduces implicit loads/stores to the inputs/outputs, which is not really compatible with using a linalg.* op within the region of the linalg.generic.

Thanks Mahesh,
I will be looking into the other post where you have detailed the usage. This is a bit off-topic here but has some relevance: we want to leverage the fusion capabilities of Linalg and the memref-based buffer optimizations, along with the Flow IR and HAL dialect capabilities of IREE, for our HW.

The Linalg operators use the affine maps and iterator types to generate the loop and tiling info. We already have similar capabilities and info generated for our HW, provided through some proprietary LLVM passes, so the tiling info is in fact redundant and we don't want to leverage it.

For cases like Conv-Relu, we can run a single fused ConvRelu kernel that saves the stores from the conv, the loads by the Relu, and the intermediate memory needed between these 2 ops. But the problem is that linalg.* ops don't seem to fuse with linalg.generic ops.

I am looking for an optimal solution where we can leverage the node-fusion capabilities.

BTW, is it possible to create an op similar to the Linalg generic op, say a linalg.complex op, which is a subset of linalg.generic: it does not have the affine maps and iterator types, but has the other capabilities? And would it be possible to extend the LinalgFusionOfTensorOps pass so that it can fuse non-generic ops like conv with generic ops like Relu, something like the case below? Basically, linalg.complex is a subset of linalg.generic with region capabilities, and can house a linalg.convolution and a Relu.

%1 = linalg.complex { ins(%2 : tensor<1x8x8x1xf32>) outs(%0 : tensor<1x8x8x1xf32>) {
%arg0 = linalg.convolution %2;
%4 = cmpf ogt, %cst_2, %arg0 : f32
%5 = select %4, %cst_2, %arg0 : f32
%6 = cmpf uno, %cst_2, %arg0 : f32
%7 = select %6, %cst, %5 : f32
}

It seems like what you are looking for is a very specific use case. If your hardware is interested in intercepting such high-level constructs without any compiler transformation, I would just have my own hardware-specific dialect, convert the linalg.*conv* + linalg.generic (for bias-add) pattern to something in your hw-specific dialect, and just bypass all of Linalg, since it looks like the tiling + fusion capabilities in Linalg are not what you are looking for. I am happy to iterate on this if you can share some details about what you are interested in achieving.
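
A rough sketch of what such a conversion pattern could look like; myhw::FusedConvReluOp is a hypothetical hw-specific op, the legality checks (indexing maps, region contents) are elided, and accessor spellings may differ across MLIR versions:

struct FuseConvReluIntoHwOp : public OpRewritePattern<linalg::GenericOp> {
  using OpRewritePattern<linalg::GenericOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(linalg::GenericOp genericOp,
                                PatternRewriter &rewriter) const override {
    // Match a depthwise conv feeding the (presumed elementwise) generic op.
    auto conv = genericOp->getOperand(0)
                    .getDefiningOp<linalg::DepthwiseConvInputNHWCFilterHWCOp>();
    if (!conv)
      return failure();
    // Replace both ops with one hw-specific fused op (hypothetical).
    rewriter.replaceOpWithNewOp<myhw::FusedConvReluOp>(
        genericOp, genericOp->getResultTypes(), conv->getOperands());
    return success();
  }
};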

I am not able to follow the op description. Firstly, I am not sure what the type of %arg0 is. It seems to be a scalar based on its uses. Then it is unclear to me what linalg.convolution is doing, since it is taking a tensor and returning a scalar.

In any case, if I guess what you are trying to do, one option is to convert the convolution to a generic op using this pass and then try LinalgFusionOfTensorOps. Even if that works, you lose the information that you have a convolution. I still believe that this pass is not what you are looking for.
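
Assuming the pass linked above is the named-op generalization pass, that conversion can be tried from the command line; the flag spelling here is my assumption of the upstream name:

mlir-opt -linalg-generalize-named-ops input.mlir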

W.R.T. the linalg.complex op, it depends on the semantics you want. linalg.generic (and all LinalgOps) represent perfectly nested loops. When fusion requires imperfectly nested loops, you need either scf.for or linalg.tiled_loop operations, which capture the outer loop nest. The body of these ops can have one or more Linalg ops which represent perfectly nested loops. Together they give you an imperfect loop nest (and tile + fuse generates the imperfect loop nest that represents the fused computation). So I am not able to follow the semantics you are trying to achieve with the linalg.complex op.

“I would just have my own hardware-specific dialect and convert the linalg.*conv* + linalg.generic”
To be honest, our first preference would be to explore and reuse as much as possible of what is already present and offered by MLIR, which is probably proven/tested too. If nothing works out, then we might go as suggested.

“I still believe that this pass is not what you are looking for.”
Exactly: we don't want the conv getting lowered to mulf/addf.

I think I should present a correct representative example.

So our targeted seed pattern is:

#map0 = affine_map<(d0, d1, d2, d3) -> (0, d1, d2, 0)>
#map1 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>
module  {
  func @main() -> tensor<1x8x8x1xf32> {
    %cst = constant 0x7FC00000 : f32
    %cst_0 = constant dense<[[[8.405740e-01], [0.107482761], [0.885160744]], [[0.879221558], [0.272046864], [0.219075441]], [[0.853731691], [0.786423742], [0.132054776]]]> : tensor<3x3x1xf32>
    %cst_1 = constant dense<[[[[0.61719793], [0.793435096], [0.121170968], [0.573920846], [0.445487499], [0.226183072], [0.973448157], [0.851443469], [0.565707207], [5.227910e-01]], [[0.00731075834], [0.557578623], [0.707973182], [0.959501981], [0.358185142], [0.699126303], [0.466926485], [0.413297445], [0.0725673139], [0.921178698]], [[0.30189532], [0.731407762], [0.370444775], [0.848782122], [0.871007978], [0.748299241], [0.102783702], [0.551843107], [0.816904246], [0.0332700573]], [[0.343397468], [0.982405781], [0.0607045554], [0.541470528], [0.726823389], [0.808600127], [0.256532729], [0.252898484], [0.110048898], [0.478431761]], [[0.979851245], [0.111159958], [0.724086046], [0.982171118], [0.211429045], [0.678284585], [0.0563224852], [0.837513744], [0.657312452], [0.536515653]], [[0.505726278], [0.648696065], [0.999981284], [0.00183360698], [0.745425224], [0.40943867], [0.333478332], [0.644900322], [0.19442305], [0.178031594]], [[0.681895732], [0.0306128804], [0.390298098], [0.893880724], [0.859292924], [0.445982367], [0.303117335], [0.769601822], [0.510563135], [0.387194842]], [[0.426417202], [0.49324289], [0.107445173], [0.790808141], [0.891636371], [0.0934373661], [0.42853874], [0.835353791], [0.0698968098], [0.611316978]], [[0.999428808], [0.938870192], [0.778351902], [0.973116576], [0.0702748299], [0.479627848], [0.455716878], [0.779465734], [0.614541173], [0.207848176]], [[0.847904145], [0.130607277], [0.678844749], [0.03101651], [0.964369654], [0.600809455], [0.0234705787], [0.356526107], [0.636354804], [0.281212419]]]]> : tensor<1x10x10x1xf32>
    %cst_2 = constant 0.000000e+00 : f32
    %0 = linalg.init_tensor [1, 8, 8, 1] : tensor<1x8x8x1xf32>
    %1 = linalg.fill(%cst_2, %0) : f32, tensor<1x8x8x1xf32> -> tensor<1x8x8x1xf32>
    %2 = linalg.depthwise_conv_2d_input_nhwc_filter_hwc {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%cst_1, %cst_0 : tensor<1x10x10x1xf32>, tensor<3x3x1xf32>) outs(%1 : tensor<1x8x8x1xf32>) -> tensor<1x8x8x1xf32>
    %3 = linalg.generic {indexing_maps = [#map0, #map1], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%2 : tensor<1x8x8x1xf32>) outs(%0 : tensor<1x8x8x1xf32>) {
    ^bb0(%arg0: f32, %arg1: f32):  // no predecessors
      %4 = cmpf ogt, %cst_2, %arg0 : f32
      %5 = select %4, %cst_2, %arg0 : f32
      %6 = cmpf uno, %cst_2, %arg0 : f32
      %7 = select %6, %cst, %5 : f32
      linalg.yield %7 : f32
    } -> tensor<1x8x8x1xf32>
    return %3 : tensor<1x8x8x1xf32>
  }
}

wherein, on applying the LinalgFusionOfTensorOps pass, we expect
%2 = linalg.depthwise_conv_2d_input_nhwc_filter_hwc
and
%3 = linalg.generic { %4 = cmpf ogt, %cst_2, … }

both of them to end up inside a single linalg region; or, put differently, %3, which houses the cmp/select, to also house the linalg.depthwise_conv_2d_input_nhwc_filter_hwc.

Our hypothetical output would be something like this:

#map0 = affine_map<(d0, d1, d2, d3) -> (0, d1, d2, 0)>
#map1 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>
#map2 = affine_map<(d0, d1, d2 ) -> (d0, d1, d2 )>
module  {
  func @main() -> tensor<1x8x8x1xf32> {
    %cst = constant 0x7FC00000 : f32
    %cst_0 = constant dense<[[[8.405740e-01], [0.107482761], [0.885160744]], [[0.879221558], [0.272046864], [0.219075441]], [[0.853731691], [0.786423742], [0.132054776]]]> : tensor<3x3x1xf32>
    %cst_1 = constant dense<[[[[0.61719793], [0.793435096], [0.121170968], [0.573920846], [0.445487499], [0.226183072], [0.973448157], [0.851443469], [0.565707207], [5.227910e-01]], [[0.00731075834], [0.557578623], [0.707973182], [0.959501981], [0.358185142], [0.699126303], [0.466926485], [0.413297445], [0.0725673139], [0.921178698]], [[0.30189532], [0.731407762], [0.370444775], [0.848782122], [0.871007978], [0.748299241], [0.102783702], [0.551843107], [0.816904246], [0.0332700573]], [[0.343397468], [0.982405781], [0.0607045554], [0.541470528], [0.726823389], [0.808600127], [0.256532729], [0.252898484], [0.110048898], [0.478431761]], [[0.979851245], [0.111159958], [0.724086046], [0.982171118], [0.211429045], [0.678284585], [0.0563224852], [0.837513744], [0.657312452], [0.536515653]], [[0.505726278], [0.648696065], [0.999981284], [0.00183360698], [0.745425224], [0.40943867], [0.333478332], [0.644900322], [0.19442305], [0.178031594]], [[0.681895732], [0.0306128804], [0.390298098], [0.893880724], [0.859292924], [0.445982367], [0.303117335], [0.769601822], [0.510563135], [0.387194842]], [[0.426417202], [0.49324289], [0.107445173], [0.790808141], [0.891636371], [0.0934373661], [0.42853874], [0.835353791], [0.0698968098], [0.611316978]], [[0.999428808], [0.938870192], [0.778351902], [0.973116576], [0.0702748299], [0.479627848], [0.455716878], [0.779465734], [0.614541173], [0.207848176]], [[0.847904145], [0.130607277], [0.678844749], [0.03101651], [0.964369654], [0.600809455], [0.0234705787], [0.356526107], [0.636354804], [0.281212419]]]]> : tensor<1x10x10x1xf32>
    %cst_2 = constant 0.000000e+00 : f32 
    %0 = linalg.init_tensor [1, 8, 8, 1] : tensor<1x8x8x1xf32> 
    %1 = linalg.fill(%cst_2, %0) : f32, tensor<1x8x8x1xf32> -> tensor<1x8x8x1xf32>
    %2 = linalg.complex {indexing_maps = [#map0, #map2, #map1], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%cst_1, %cst_0 : tensor<1x10x10x1xf32>, tensor<3x3x1xf32> ) outs(%1 : tensor<1x8x8x1xf32>) {
      ^bb0(%arg0: f32, %arg1: f32,  %arg2: f32):  // no predecessors 
      %3 = linalg.depthwise_conv_2d_input_nhwc_filter_hwc {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%cst_1, %cst_0 : tensor<1x10x10x1xf32>, tensor<3x3x1xf32>) outs(%1 : tensor<1x8x8x1xf32>) -> tensor<1x8x8x1xf32>
      %4  = linalg.relu %3 
       linalg.yield %4 : f32
    } -> tensor<1x8x8x1xf32>
    return %2 : tensor<1x8x8x1xf32>
  }
}

%2 = linalg.complex {
%3 = linalg.depthwise_conv_2d_input_nhwc_filter_hwc
%4 = linalg.relu %3
}
So our so-called linalg.complex is an op similar to linalg.generic, but it has neither the indexing_maps nor the iterator_types (right now it has indexing_maps = [#map0, #map2, #map1] and iterator_types = ["parallel", "parallel", "parallel", "parallel"], but if needed we would prefer to strip them).

And it is region-based, now housing the linalg.depthwise_conv_2d_input_nhwc_filter_hwc whose output is consumed by the relu op (and BTW, the conv output is not a scalar; essentially nothing is scalar here).
Further, linalg.relu is equivalent to the elementwise compare & select ops below, or to mhlo.maximum:
%4 = cmpf ogt, %cst_2, %arg0 : f32
%5 = select %4, %cst_2, %arg0 : f32
%6 = cmpf uno, %cst_2, %arg0 : f32
%7 = select %6, %cst, %5 : f32

The motive is to get the linalg.depthwise_conv_2d_input_nhwc_filter_hwc and the relu inside the region.

I think there is a mismatch in expectations of what LinalgFusionOfTensorOpsPass is supposed to do (it might have to do with the bad naming of this pass; it should really be LinalgFuseElementwiseOpsPass). If you look at this pass, it is just invoking populateElementwiseOpsFusionPatterns here. So I will change the name of the pass :) . With that I hope it is clear that this pass is meant to take two linalg.generic operations and create a new linalg.generic operation. The resulting operation is a perfectly nested loop. This is fusion for cases where you have two perfectly nested loops that you fuse to get a single perfectly nested loop.
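
As a minimal sketch of what that pass boils down to (header locations and the exact populateElementwiseOpsFusionPatterns signature may differ across MLIR versions):

#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
#include "mlir/Pass/Pass.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;

struct FuseElementwiseOpsPass
    : public PassWrapper<FuseElementwiseOpsPass, FunctionPass> {
  void runOnFunction() override {
    RewritePatternSet patterns(&getContext());
    // Fuses producer/consumer linalg.generic pairs into a single
    // linalg.generic: perfectly nested loops in, one perfectly nested
    // loop nest out. Named ops like convolutions are not handled here.
    linalg::populateElementwiseOpsFusionPatterns(patterns);
    (void)applyPatternsAndFoldGreedily(getFunction(), std::move(patterns));
  }
};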

What you are reaching for is an imperfectly nested loop operation. I was going to post on the other thread, but you can use tile + fuse to also generate linalg.tiled_loop operations. So for your example it would look something like this:

#map0 = affine_map<(d0) -> (2, -d0 + 1)>
#map1 = affine_map<(d0) -> (3, -d0 + 8)>
#map2 = affine_map<(d0) -> (4, -d0 + 8)>
#map3 = affine_map<(d0, d1) -> (d1, -d0 + 1)>
#map4 = affine_map<(d0, d1) -> (d1 + 2, -d0 + 10)>
#map5 = affine_map<(d0, d1) -> (d1, -d0 + 8)>
#map6 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>
module  {
  func @main(%arg0: tensor<1x10x10x1xf32>, %arg1: tensor<3x3x1xf32>, %arg2: tensor<1x8x8x1xf32>, %arg3: tensor<1x8x8x1xf32>) -> tensor<1x8x8x1xf32> {
    %c2 = constant 2 : index
    %c4 = constant 4 : index
    %c8 = constant 8 : index
    %c3 = constant 3 : index
    %c0 = constant 0 : index
    %c1 = constant 1 : index
    %0 = linalg.init_tensor [1, 8, 8, 1] : tensor<1x8x8x1xf32>
    %1 = linalg.tiled_loop (%arg4, %arg5, %arg6) = (%c0, %c0, %c0) to (%c1, %c8, %c8) step (%c2, %c3, %c4) ins (%arg7 = %arg3: tensor<1x8x8x1xf32>, %arg8 = %arg0: tensor<1x10x10x1xf32>, %arg9 = %arg1: tensor<3x3x1xf32>, %arg10 = %arg2: tensor<1x8x8x1xf32>) outs (%arg11 = %0: tensor<1x8x8x1xf32>) {
      %2 = affine.min #map0(%arg4)
      %3 = affine.min #map1(%arg5)
      %4 = affine.min #map2(%arg6)
      %5 = tensor.extract_slice %arg7[%arg4, %arg5, %arg6, 0] [%2, %3, %4, %c1] [1, 1, 1, 1] : tensor<1x8x8x1xf32> to tensor<?x?x?x?xf32>
      %6 = tensor.extract_slice %arg11[%arg4, %arg5, %arg6, 0] [%2, %3, %4, %c1] [1, 1, 1, 1] : tensor<1x8x8x1xf32> to tensor<?x?x?x?xf32>
      %7 = affine.min #map3(%arg4, %2)
      %8 = affine.min #map4(%arg5, %3)
      %9 = affine.min #map4(%arg6, %4)
      %10 = tensor.extract_slice %arg8[%arg4, %arg5, %arg6, %c0] [%7, %8, %9, %c1] [1, 1, 1, 1] : tensor<1x10x10x1xf32> to tensor<?x?x?x?xf32>
      %11 = tensor.extract_slice %arg9[0, 0, %c0] [%c3, %c3, %c1] [1, 1, 1] : tensor<3x3x1xf32> to tensor<?x?x?xf32>
      %12 = affine.min #map5(%arg5, %3)
      %13 = affine.min #map5(%arg6, %4)
      %14 = tensor.extract_slice %arg10[%arg4, %arg5, %arg6, %c0] [%7, %12, %13, %c1] [1, 1, 1, 1] : tensor<1x8x8x1xf32> to tensor<?x?x?x?xf32>
      %15 = linalg.depthwise_conv_2d_input_nhwc_filter_hwc {__internal_linalg_transform__ = "after_basic_static_fusion_producer", dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%10, %11 : tensor<?x?x?x?xf32>, tensor<?x?x?xf32>) outs(%14 : tensor<?x?x?x?xf32>) -> tensor<?x?x?x?xf32>
      %16 = linalg.generic {indexing_maps = [#map6, #map6, #map6], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%15, %5 : tensor<?x?x?x?xf32>, tensor<?x?x?x?xf32>) outs(%6 : tensor<?x?x?x?xf32>) attrs =  {__internal_linalg_transform__ = "after_basic_static_fusion"} {
      ^bb0(%arg12: f32, %arg13: f32, %arg14: f32):  // no predecessors
        %18 = addf %arg12, %arg13 : f32
        linalg.yield %18 : f32
      } -> tensor<?x?x?x?xf32>
      %17 = tensor.insert_slice %16 into %arg11[%arg4, %arg5, %arg6, 0] [%2, %3, %4, %c1] [1, 1, 1, 1] : tensor<?x?x?x?xf32> into tensor<1x8x8x1xf32>
      linalg.yield %17 : tensor<1x8x8x1xf32>
    }
    return %1 : tensor<1x8x8x1xf32>
  }
}

The linalg.*conv* operation and the linalg.generic are in “one region”. For the most part, the tiled_loop and scf.for are just different representations of the inter-tile loops. The computation above is an imperfectly nested computation.

Thanks for the pointer. Yes, linalg.tiled_loop is probably something that might work out. BTW, I ran the conv-relu combo with -test-linalg-tiled-loop-fusion-transform-patterns, but it didn't seem to generate the linalg.tiled_loop op.

llvm-project/build/bin/mlir-opt -test-linalg-tiled-loop-fusion-transform-patterns -print-ir-before-all -print-ir-after-all linalg_convo_relu.mlir

BTW, grepping for after_basic_static_fusion and after_basic_static_fusion_producer didn't turn up anything either.

On the other hand,

// -----// IR Dump Before {anonymous}::TestLinalgFusionTransforms<(mlir::linalg::LinalgTilingLoopType)3> //----- //
func @matmul_plus_matmul(%arg0: tensor<2x3xf32>, %arg1: tensor<3x4xf32>, %arg2: tensor<2x4xf32>) -> tensor<2x4xf32> {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c2 = constant 2 : index
  %c3 = constant 3 : index
  %c4 = constant 4 : index
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<2x3xf32>, tensor<3x4xf32>) outs(%arg2 : tensor<2x4xf32>) -> tensor<2x4xf32>
  %1 = linalg.init_tensor [2, 4] : tensor<2x4xf32>
  %2 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%0, %0 : tensor<2x4xf32>, tensor<2x4xf32>) outs(%1 : tensor<2x4xf32>) attrs =  {__internal_linalg_transform__ = "transpose_fusion"} {
  ^bb0(%arg3: f32, %arg4: f32, %arg5: f32):  // no predecessors
    %3 = addf %arg3, %arg4 : f32
    linalg.yield %3 : f32
  } -> tensor<2x4xf32>
  return %2 : tensor<2x4xf32>
}

seems to correctly generate the linalg.tiled_loop, encasing the linalg.matmul and linalg.generic in it:

#map0 = affine_map<(d0) -> (32, -d0 + 2)>
#map1 = affine_map<(d0) -> (64, -d0 + 4)>
#map2 = affine_map<(d0) -> (-d0 + 2, 32)>
#map3 = affine_map<(d0) -> (-d0 + 4, 64)>
#map4 = affine_map<(d0, d1) -> (d0, d1)>
module  {
  func @matmul_plus_matmul(%arg0: tensor<2x3xf32>, %arg1: tensor<3x4xf32>, %arg2: tensor<2x4xf32>) -> tensor<2x4xf32> {
    %c0 = constant 0 : index
    %c4 = constant 4 : index
    %c2 = constant 2 : index
    %c64 = constant 64 : index
    %c32 = constant 32 : index
    %0 = linalg.init_tensor [2, 4] : tensor<2x4xf32>
    %1 = linalg.tiled_loop (%arg3, %arg4) = (%c0, %c0) to (%c2, %c4) step (%c32, %c64) ins (%arg5 = %arg0: tensor<2x3xf32>, %arg6 = %arg1: tensor<3x4xf32>, %arg7 = %arg2: tensor<2x4xf32>) outs (%arg8 = %0: tensor<2x4xf32>) {
      %2 = affine.min #map0(%arg3)
      %3 = affine.min #map1(%arg4)
      %4 = tensor.extract_slice %arg8[%arg3, %arg4] [%2, %3] [1, 1] : tensor<2x4xf32> to tensor<?x?xf32>
      %5 = affine.min #map2(%arg3)
      %6 = tensor.extract_slice %arg5[%arg3, 0] [%5, 3] [1, 1] : tensor<2x3xf32> to tensor<?x3xf32>
      %7 = affine.min #map3(%arg4)
      %8 = tensor.extract_slice %arg6[0, %arg4] [3, %7] [1, 1] : tensor<3x4xf32> to tensor<3x?xf32>
      %9 = affine.min #map2(%arg3)
      %10 = affine.min #map3(%arg4)
      %11 = tensor.extract_slice %arg7[%arg3, %arg4] [%9, %10] [1, 1] : tensor<2x4xf32> to tensor<?x?xf32>
      %12 = linalg.matmul {__internal_linalg_transform__ = "after_transpose_fusion_producer"} ins(%6, %8 : tensor<?x3xf32>, tensor<3x?xf32>) outs(%11 : tensor<?x?xf32>) -> tensor<?x?xf32>
      %13 = linalg.generic {indexing_maps = [#map4, #map4], iterator_types = ["parallel", "parallel"]} ins(%12 : tensor<?x?xf32>) outs(%4 : tensor<?x?xf32>) attrs =  {__internal_linalg_transform__ = "after_transpose_fusion"} {
      ^bb0(%arg9: f32, %arg10: f32):  // no predecessors
        %15 = addf %arg9, %arg9 : f32
        linalg.yield %15 : f32
      } -> tensor<?x?xf32>
      %14 = tensor.insert_slice %13 into %arg8[%arg3, %arg4] [%2, %3] [1, 1] : tensor<?x?xf32> into tensor<2x4xf32>
      linalg.yield %14 : tensor<2x4xf32>
    }
    return %1 : tensor<2x4xf32>
  }
}

Is there some other command-line option I need to give so as to get what you obtained for the conv + add case above, or do the conv/generic ops need to be added to the LinalgTransformationFilter()?

One thing to keep in mind is that -test-linalg-tiled-loop-fusion-transform-patterns is just a pass for testing the pattern LinalgTileAndFusePattern (here).

With that disclaimer, to test this for your use case, you could add a new pattern here:

patterns.add<LinalgTileAndFusePattern<GenericOp>>(
      context, dependenceGraph,
      LinalgTilingOptions().setTileSizes({32, 64}).setLoopType(LoopType),
      LinalgFusionOptions().setIndicesToFuse({0, 1}),
      LinalgTransformationFilter(
          Identifier::get("conv_add_fusion", context),
          Identifier::get("after_conv_add_fusion", context)),
      LinalgTransformationFilter(
          ArrayRef<Identifier>(),
          Identifier::get("after_conv_add_fusion_producer", context)),
      LinalgTransformationFilter(
          ArrayRef<Identifier>(),
          Identifier::get("after_conv_add_fusion_original", context)));
}

To the linalg.generic in your test you would just add {__internal_linalg_transform__ = "conv_add_fusion"}.

You can avoid all of that by just using the -test-linalg-tile-and-fuse pass here (example here). It currently only generates nested scf.for loops. If you want to make it generate tiled loops, you can change this to LinalgTilingLoopType::TiledLoops (or better yet, you can add a pass option that allows you to toggle that from the command line; that would be a welcome contribution ;) ).
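
A minimal sketch of such a pass option; the option and pass names below are illustrative, not existing upstream code:

struct TestTileAndFusePass
    : public PassWrapper<TestTileAndFusePass, FunctionPass> {
  // Toggle between generating scf.for and linalg.tiled_loop.
  Option<bool> useTiledLoops{
      *this, "use-tiled-loops",
      llvm::cl::desc("Generate linalg.tiled_loop instead of scf.for"),
      llvm::cl::init(false)};

  void runOnFunction() override {
    linalg::LinalgTilingOptions tilingOptions =
        linalg::LinalgTilingOptions()
            .setTileSizes({32, 64})
            .setLoopType(useTiledLoops
                             ? linalg::LinalgTilingLoopType::TiledLoops
                             : linalg::LinalgTilingLoopType::Loops);
    // ... plug tilingOptions into the LinalgTileAndFusePattern as shown above.
  }
};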

A thing to watch out for: if you have static sizes and your tile size is greater than the problem size, the tiled loops get optimized away by canonicalization.
Also, I'd reiterate that whether you use scf.for or linalg.tiled_loop, they are fundamentally representing an imperfectly nested loop computation. So they are equivalent.

Thanks Mahesh, I tried both approaches and they generated the tiled loops:

  1. Adding the filter string as suggested generated the tiled loops.
  2. The test-linalg-tile-and-fuse pass, after changing TestLinalgFusionTransforms.cpp, also generated the tiled loops.

Tiled loops seem to be something similar to a “generic region”: a container that encloses the needed and identified ops together in a region, and gives us a good hint to pick the requested ops and generate a fused kernel for our HW.

Now we would like to get these tiled loops through the Flow dialect of IREE and get them further enclosed in the “dispatch regions”.

Now coming to IREE, I tried to run these 2 passes with iree-opt, but both approaches seemed to fail.
1)

iree-build/iree/tools/iree-opt --iree-mhlo-to-linalg-ext  --iree-mhlo-to-linalg-on-tensors --canonicalize  -cse --linalg-fusion-for-tensor-ops -iree-hal-target-backends=vmvx -print-ir-before-all  -print-ir-after-all  -iree-hal-target-backends=vmvx -test-linalg-tiled-loop-fusion-transform-patterns --canonicalize  -cse relu.mlir 
iree-opt: Unknown command line argument '-test-linalg-tiled-loop-fusion-transform-patterns'. 

2)

iree-build/iree/tools/iree-opt --iree-mhlo-to-linalg-ext  --iree-mhlo-to-linalg-on-tensors --canonicalize  -cse --linalg-fusion-for-tensor-ops -iree-hal-target-backends=vmvx -print-ir-before-all  -print-ir-after-all  -iree-hal-target-backends=vmvx -pass-pipeline="func(test-linalg-tile-and-fuse{tile-sizes=16,32,64}),resolve-shaped-type-result-dims,canonicalize,cse" relu.mlir
MLIR Textual PassPipeline Parser:1:6: error: 'test-linalg-tile-and-fuse' does not refer to a registered pass or pass pipeline
func(test-linalg-tile-and-fuse{tile-sizes=16,32,64}),resolve-shaped-type-result-dims,canonicalize,cse

So for IREE, is it something I am missing in the options provided, or are iree-opt/iree-run-mlir just not up to date with these passes?

Actually, if you wanted to do this in IREE, we didn't have to go down all of this path. In theory, IREE already does this; it is probably just disabled because we are being more controlled in what is supported.

The first change needed would be to replace this code here:

return isa<linalg::ConvInputNHWCFilterHWCFOp,
             linalg::DepthwiseConvInputNHWCFilterHWCOp,
             linalg::DepthwiseConvInputNHWCFilterHWCFOp,
             linalg::PoolingNHWCMaxI8Op, linalg::PoolingNHWCMaxI16Op,
             linalg::PoolingNHWCMaxI32Op, linalg::PoolingNHWCSumFOp,
             linalg::PoolingNHWCMaxFOp, linalg::PoolingNHWCMinFOp>(op);

with

  return isa<linalg::LinalgOp>(op) && !isa<linalg::GenericOp>(op);

Then you should be able to get dispatch region formation to fuse the linalg.*conv* and linalg.generic. It is disabled by default, but you can enable it with a flag -iree-flow-dispatch-formation-enable-operand-fusion.

This is more IREE specific. You can follow up on IREE Discord with more details.


Thanks Mahesh, I have started looking into the fusion provided by IREE, but it looks like the capabilities coming from the two would be mutually exclusive and we would have to rely on one of them.

I am not sure I follow. Both what is in core and what is in IREE use the exact same transformations. The only difference is that the core pattern tries to do fusion while accounting for dependences, while the transformations in IREE are correct by construction. So they are exactly the same functionally.