Tiling generates complex maps that seem to interfere with unrolling

When tiling a matrix multiplication of size 2048x2048x2048, here is what mlir::tile generates:

affine.for %arg0 = 0 to 2048 step 64 {
  affine.for %arg1 = 0 to 2048 step 2048 {
    affine.for %arg2 = 0 to 2048 step 256 {
      affine.for %arg3 = max #map5(%arg0) to min #map8(%arg0) {
        affine.for %arg4 = max #map5(%arg1) to min #map7(%arg1) {
          affine.for %arg5 = max #map5(%arg2) to min #map6(%arg2) {
            %5 = affine.load %1[%arg3, %arg5] : memref<2048x2048xf64>
            %6 = affine.load %2[%arg5, %arg4] : memref<2048x2048xf64>
            %7 = affine.load %0[%arg3, %arg4] : memref<2048x2048xf64>
            %8 = mulf %5, %6 : f64
            %9 = addf %7, %8 : f64
            affine.store %9, %0[%arg3, %arg4] : memref<2048x2048xf64>
          } {poly_codegen_name = "k"}
        } {poly_codegen_name = "jR"}
      } {poly_codegen_name = "iR"}
    } {poly_codegen_name = "kC"}
  } {poly_codegen_name = "jC"}
} {poly_codegen_name = "iC"}

The code is correct, but uses complex maps with “max” and “min” for the inner loops when in my oppinion it should not. When in later stages I try to unroll (after tiling once more) the transformation fails, and my assumption currently is that it cannot handle inner loops where the maps are not “simple” (BTW, what’s their formal name?).

I’ve tried both the main branch of llvm, and the hop branch of Uday’s fork. The result is the same in both cases.

The questions:

  • is it normal that mlir::tile is unable to generate simple maps?
  • does this influence loop unrolling (I presume so, because to unroll one needs exact bounds)
  • how to work around it?

Dumitru

Could you paste the input file as well? There shouldn’t have been max’s and min’s for intra-tile loops - the tiled code generation utility avoids introducing max/min’s in such cases. You can always unroll/unroll-and-jam in such cases in the presence of max/min (even if the tile sizes didn’t divide problem sizes or the latter were dynamic) after using the if/else based full/partial tile separation utility (mlir::separateFullTiles). That would lead to the most compact code for cleanup - much smaller than traditional unroll/unroll-and-jam with dimension wise cleanup.

Here is the function prior to tiling (I simply comment out the tiling code and this is the result):

  func @main() {
    %cst = constant 0.000000e+00 : f64
    %0 = alloc() : memref<2048x2048xf64>
    %1 = alloc() {alignment = 2048 : i64} : memref<2048x2048xf64>
    %2 = alloc() {alignment = 2048 : i64} : memref<2048x2048xf64>
    %3 = memref_cast %1 : memref<2048x2048xf64> to memref<?x?xf64>
    %4 = memref_cast %2 : memref<2048x2048xf64> to memref<?x?xf64>
    %c2048_i64 = constant 2048 : i64
    call @fill_matrix(%c2048_i64, %c2048_i64, %3) : (i64, i64, memref<?x?xf64>) -> ()
    call @fill_matrix(%c2048_i64, %c2048_i64, %4) : (i64, i64, memref<?x?xf64>) -> ()
    linalg.fill(%0, %cst) : memref<2048x2048xf64>, f64
    affine.for %arg0 = 0 to 2048 {
      affine.for %arg1 = 0 to 2048 {
        affine.for %arg2 = 0 to 2048 {
          %5 = affine.load %1[%arg0, %arg2] : memref<2048x2048xf64>
          %6 = affine.load %2[%arg2, %arg1] : memref<2048x2048xf64>
          %7 = affine.load %0[%arg0, %arg1] : memref<2048x2048xf64>
          %8 = mulf %5, %6 : f64
          %9 = addf %7, %8 : f64
          affine.store %9, %0[%arg0, %arg1] : memref<2048x2048xf64>
        } {poly_codegen_name = "kC"}
      } {poly_codegen_name = "jC"}
    } {poly_codegen_name = "iC"}
    return
  }

There is no file per se, as tiling is part of a larger lowering step coming from my dialect. It’s easier, as I know the structure of the code.

The code doing the tiling is:

{ // BLIS-like loop tiling
	      SmallVector<AffineForOp,3> loop_nest ;
	      loop_nest.push_back(outer_loop) ;
	      loop_nest.push_back(middle_loop) ;
	      loop_nest.push_back(inner_loop) ;
	      SmallVector<uint64_t,3> tile_size ;
	      tile_size.push_back(M_C) ;
	      tile_size.push_back(N_C) ;
	      tile_size.push_back(K_C) ;
	      
	      SmallVector<AffineForOp,8> new_loops =
		mlir::tile(llvm::makeArrayRef(loop_nest),
			   llvm::makeArrayRef(tile_size),
			   inner_loop) ;
	      {
		assert(new_loops.size() == 3) ;
		new_loops[0].setAttr(poly_id,StringAttr::get("iR",op->getContext())) ;
		new_loops[1].setAttr(poly_id,StringAttr::get("jR",op->getContext())) ;
		new_loops[2].setAttr(poly_id,StringAttr::get("k",op->getContext())) ;
	      }
}

Concerning unrolling, this is the loop I get with two levels of tiling:

  func @main() {
    %cst = constant 0.000000e+00 : f64
    %0 = alloc() : memref<2048x2048xf64>
    %1 = alloc() {alignment = 2048 : i64} : memref<2048x2048xf64>
    %2 = alloc() {alignment = 2048 : i64} : memref<2048x2048xf64>
    %3 = memref_cast %1 : memref<2048x2048xf64> to memref<?x?xf64>
    %4 = memref_cast %2 : memref<2048x2048xf64> to memref<?x?xf64>
    %c2048_i64 = constant 2048 : i64
    call @fill_matrix(%c2048_i64, %c2048_i64, %3) : (i64, i64, memref<?x?xf64>) -> ()
    call @fill_matrix(%c2048_i64, %c2048_i64, %4) : (i64, i64, memref<?x?xf64>) -> ()
    linalg.fill(%0, %cst) : memref<2048x2048xf64>, f64
    affine.for %arg0 = 0 to 2048 step 2048 {
      affine.for %arg1 = 0 to 2048 step 256 {
        affine.for %arg2 = 0 to 2048 step 64 {
          affine.for %arg3 = max #map8(%arg0) to min #map11(%arg0) step 8 {
            affine.for %arg4 = max #map8(%arg2) to min #map10(%arg2) step 4 {
              affine.for %arg5 = max #map8(%arg1) to min #map9(%arg1) {
                affine.for %arg6 = max #map5(%arg0, %arg3) to min #map7(%arg0, %arg3) {
                  affine.for %arg7 = max #map5(%arg2, %arg4) to min #map6(%arg2, %arg4) {
                    %5 = affine.load %1[%arg7, %arg5] : memref<2048x2048xf64>
                    %6 = affine.load %2[%arg5, %arg6] : memref<2048x2048xf64>
                    %7 = affine.load %0[%arg7, %arg6] : memref<2048x2048xf64>
                    %8 = mulf %5, %6 : f64
                    %9 = addf %7, %8 : f64
                    affine.store %9, %0[%arg7, %arg6] : memref<2048x2048xf64>
                  } {poly_codegen_name = "iiR"}
                } {poly_codegen_name = "jjR"}
              } {poly_codegen_name = "k"}
            } {poly_codegen_name = "iR"}
          } {poly_codegen_name = "jR"}
        } {poly_codegen_name = "iC"}
      } {poly_codegen_name = "kC"}
    } {poly_codegen_name = "jC"}
    return
  }

On this loop, I applied (in your hop branch, with M_R=4, N_R=8):

		    mlir::loopUnrollJamUpToFactor(inner_loops[0], M_R);
		    mlir::loopUnrollJamUpToFactor(inner_loops[1], N_R);		    

It does nothing. Is this normal?

You are using the wrong tiling utility, but this is the problem with duplicate infrastructure. :frowning: Please use mlir::tileCodeGen, which is the one that that is aware of the bounds and avoids introducing such redundant max/min. It already existed, but unfortunately, another one was introduced with overlapping functionality while being inefficient for the simplest inputs. mlir::tileCodeGen also has exactly the interface you need. You’ll never see max/min’s for trip counts that are multiples of corresponding tile sizes, while mlir::tile does it naively.

For the input loop nest that you have, if you just use mlir::tileCodeGen, you’ll see no max/min’s on the output and calling loopUnrollJamUpToFactor will give you the desired output. (Due to the redundant max/min’s you earlier had from the other tile utility, unroll/unroll-and-jam won’t do anything on such loop nests unless you do separation of full tiles - in that case, the generation of max/min was completely unnecessary for what was trivial input).

1 Like

Thanks. If I understand well, mlir::tile can handle any loop nest, while tileCodeGen only handles perfectly nested loops. Is this correct?

Reading the doc comments, mlir::tile provides a targets option to specify the sink destination. There is absolutely no reason why mlir::tileCodeGen can’t be extended to provide such an option or better something else using mlir::tileCodeGen and providing such a functionality - it’s just that mlir::tileCodeGen is restricted to doing the basic n → 2n loops for tiling.

I’m coding now, and I have to say that (from my “newbie” perspective) mlir::tile is more convenient than mlir::tileCodeGen. It directly provides a handle for the new loops and, for more complex tiling (e.g. the matmul one), it allows specifying the place where to attach the new loops. With mlir::tileCodeGen there’s more to do. Maybe optimizing the output of mlir::tile is the good way to go.

The last parameter for tileCodeGen provides you the handle for the new loops - is that not what you expected? You can always do the sinking separately. For things like the redundant bounds, you’d want to avoid inserting them in the first place by construction instead of having to analyze and eliminate them later.

1 Like

You are dealing with a perfect nest here - so you could get nearly anything using a combination of mlir::tileCodeGen and mlir::permuteLoops.

Ok, I understand now - you provide, say 3 loops, and it gives you back 6. Thanks!

That’s right. You get the new 6-d tiled nest. You can then apply permuteLoops with the map you want on it (or a subsequence of it as you wish) to get the right permutation of intra-tile loops for example. There are also other ways such as specifying tile size one for certain dimensions that could give you the desired ordering without having to do permute loops later.

Tried it, does not seem to work as it should. Here is the code I use:

	    { // BLIS-like loop tiling
	      SmallVector<AffineForOp,3> loop_nest ;
	      loop_nest.push_back(outer_loop) ;
	      loop_nest.push_back(middle_loop) ;
	      loop_nest.push_back(inner_loop) ;
	      SmallVector<unsigned int,3> tile_size ;
	      tile_size.push_back(M_C) ;
	      tile_size.push_back(N_C) ;
	      tile_size.push_back(K_C) ;
	      MutableArrayRef<AffineForOp> loop_nest_mutable(loop_nest) ;

	      
	      if(mlir::failed(mlir::tileCodeGen(loop_nest_mutable,
						tile_size))) {
		err << "Tiling did not work!\n" ;
	      }
	      err << "Tiling worked and resulting size is " << loop_nest_mutable.size() << "\n" ;
	      
	      {
		loop_nest_mutable[0].setAttr(poly_id,StringAttr::get("iR",op->getContext())) ;
		loop_nest_mutable[1].setAttr(poly_id,StringAttr::get("jR",op->getContext())) ;
		loop_nest_mutable[2].setAttr(poly_id,StringAttr::get("k",op->getContext())) ;
	      }

The input loop nest has 3 loops with string attributes (the one above). The output nest has the 6 loops OK, but loop_nest_mutable only contains the 3 original loops. Furthermore, the string attributes in the input are lost:

affine.for %arg0 = 0 to 2048 step 64 {
  affine.for %arg1 = 0 to 2048 step 2048 {
    affine.for %arg2 = 0 to 2048 step 256 {
      affine.for %arg3 = #map5(%arg0) to #map8(%arg0) {
        affine.for %arg4 = #map5(%arg1) to #map7(%arg1) {
          affine.for %arg5 = #map5(%arg2) to #map6(%arg2) {
            %5 = affine.load %1[%arg3, %arg5] : memref<2048x2048xf64>
            %6 = affine.load %2[%arg5, %arg4] : memref<2048x2048xf64>
            %7 = affine.load %0[%arg3, %arg4] : memref<2048x2048xf64>
            %8 = mulf %5, %6 : f64
            %9 = addf %7, %8 : f64
            affine.store %9, %0[%arg3, %arg4] : memref<2048x2048xf64>
          }
        }
      }
    }
  }
}

This means I don’t even have a hook on the outmost affine loop. How could I find it?

loop_nest_mutable would just be erased! (you shouldn’t look at it). The attributes aren’t preserved because there isn’t a clear notion as to where they should go on the tiled nest - for eg. does the client want them on the tile space loops or intra-tile loops (i.e., outer three or inner three here?) - although it’s straightforward to propagate them on either.

Ok, but then my question above remains - how do I get a handle to the new loops? You said “The last parameter for tileCodeGen provides you the handle for the new loops”. But if this parameter is not to be used any more…

The last parameter tiledNest are your new 6 loops - I think this part was clear?

mlir::tileCodeGen only has two arguments, none called tiledNest.
Addendum : the function you probably wanted to give me is constructTiledIndexSetHyperRect. It has an output argument.

So, that was the confusion! There are three on the upstream version.

2 Likes