Shrinking iteration space of loop with condition

When handling some data structure we are facing an opportunity of simplifying loops with condition check inside body. When the condition inside loop body invalidate some of the loop iteration space, it can be shrunk and the condition check can be removed.

A simple case would be like how to pass this test:


#set_half = affine_set<(d0) : (d0 >= 0, d0 <= 511)>

func.func @for_if(%buffer: memref<1024xf32>) -> (f32) {
 %sum_0 = arith.constant 0.0 : f32
 %sum = affine.for %i = 0 to 1024 step 1 iter_args(%sum_iter = %sum_0) -> (f32) {
   // CHECK: affine.for [[IDX_]] = 0 to 512 step 1
   %sum_for = affine.if #set_half(%i) -> (f32) {
   // CHECK-NOT: affine.if
     %t = affine.load %buffer[%i] : memref<1024xf32>
     %sum_next = arith.addf %sum_iter, %t : f32
     affine.yield %sum_next : f32
   } else {
     affine.yield %sum_iter : f32
   }
   affine.yield %sum_for : f32
 }
 return %sum : f32
}

Are there some existing MLIR passes that can transform this? If not what will be a good starting point for a new RFC?

I’m not sure if this pass exists, but @Superty and I gave a tutorial in EuroLLVM '22 on how you could implement such a pass: 2022 EuroLLVM Dev Mtg “Precise Polyhedral Analyses For MLIR using the FPL Presburger Library” - YouTube in case you want to implement this yourself.

I agree with you that FPL should play a critical role here. The tutorial seems to be calculating the set in forward direction to remove condition checks later in the control flow. To resolve the problem I feel it needs to go backward from loop body to loop header to reduce iteration space, and then go forward again to remove the redundant check.

A pass that performs such a simplification has been missing and would be very welcome! In fact, whenever the iteration space can be shrunk to such a convex subset, a simplification is almost always desirable to have. It can be added to -simplify-affine-structures or a separate pass. While a full-fledged polyhedral library like FPL can be used to implement it generically and powerfully, it’s possible to catch the simple cases with lightweight analysis and checks (to easily guarantee that it’s always fast).

1 Like

Hi all,
We are trying to implement your suggestions, but we got stuck.

To make this discussion more concrete, this is the IR we are trying to simplify:

#map = affine_map<(d0, d1)[s0, s1, s2] -> (d0 * s1 + s0 + d1 * s2)>
#set0 = affine_set<(d0, d1) : (d0 >= 0, -d0 + 3 >= 0, d1 - d0 >= 0, -d1 + 3 >= 0)>
#set1 = affine_set<(d0, d1) : (d0 >= 0, -d0 + 3 >= 0, d0 - d1 >= 0, -d1 + 3 >= 0)>
module {
  func.func @test_structure(%arg0: memref<4x4xf32, #map>, %arg1: memref<4x4xf32, #map>, %arg2: memref<4x4xf32, #map>) -> memref<4x4xf32, #map> {
    %cst = arith.constant 0.000000e+00 : f32
    affine.for %arg3 = 0 to 4 {
      affine.for %arg4 = 0 to 4 {
        %0 = affine.if #set0(%arg3, %arg4) -> f32 {
          %3 = affine.load %arg0[%arg3, %arg4] : memref<4x4xf32, #map>
          affine.yield %3 : f32
        } else {
          affine.yield %cst : f32
        }
        %1 = affine.if #set1(%arg3, %arg4) -> f32 {
          %3 = affine.load %arg1[%arg3, %arg4] : memref<4x4xf32, #map>
          affine.yield %3 : f32
        } else {
          affine.yield %cst : f32
        }
        %2 = arith.addf %0, %1 : f32
        affine.store %2, %arg2[%arg3, %arg4] : memref<4x4xf32, #map>
      }
    }
    return %arg2 : memref<4x4xf32, #map>
  }
}

It is a sum between a upper triangular and a lower triangular matrix. I expect the internal loop to be split in three:

  • One for copying arg0 into arg2
  • One for copying arg1into arg2
  • One for computing arg0 + arg1 and storing it into arg2. This would be the loop along the diagonal.

My idea was to intersect the loop iteration space with the set of each of the if and then rewrite the loop. But this is where I have an issue. To be even more concrete, in my pass I do something like:

  LogicalResult matchAndRewrite(AffineIfOp ifOp,
                                PatternRewriter &rewriter) const override {
    FlatAffineValueConstraints cst;
   
    // Step 1 - get the presburger set for the surrounding for loops
    Operation *curOp = ifOp;
    while (isa<AffineForOp>(curOp->getParentOp())){
        AffineForOp forOp = dyn_cast<AffineForOp>(curOp->getParentOp());
        cst.appendDimVar(forOp.getInductionVar());
        curOp = forOp;
    }
    cst.addAffineForOpDomain(dyn_cast<AffineForOp>(curOp));
    // Step 2 
    // newset = cst.intersect(ifOp->getSet());
    // Step 3
    // generate_loop_for(new_set)
  }

My issue is in Step3. How can I go from a presburger set (or any polyhedral object) to a loop? Can this be done in FPL (or in any other way in MLIR)?

Thanks,
Giuseppe

cc @JoeyYe @bondhugula @Groverkss

I don’t think we have implemented that in FPL, but a simple way to implement would be something like this:

depth = 0
numLoops = cst.getNumSetDims()
while depth < numLoops:
  - depthCst = project out loop variables from [depth, numLoops) in cst using fourier motzakin
  - put constraints into 3 classes, lower bound on induction variable at depth, upper bound, and constraints not involving the induction variable at depth.
  - Put constraints not involving induction variables as if conditions
  - Create a loop as affine.max(lower bounds) to affine.min(upper bounds)

Working example for a triangle:

cst = (i, j)[N] : (0 <= i < N, 0 <= j < i)

---- depth = 0 ----

depthCst0 = (i)[N] : (0 <= i < N)

affine.for i = 0 to N {
}

---- depth = 1 ---

depthCst1 = (i, j)[N] : (0 <= i < N, 0 <= j < i)

affine.for i = 0 to N {
  affine.if (0 <= i < N) { // --> Can be removed with a "gist" like function
    affine.for j = 0 to i {
    }
  }
}

FYI, @JoeyYe @stevenvar