[RFC] Add affine.parallel

Hello everyone,

I’d like to suggest an additional operation in the affine dialect to represent parallel for loops. Please see the inline markdown text below. I look forward to everyone’s comments. We are currently experimenting with an implementation of this operation, and based on the feedback here, we hope to finalize the design and implementation for a future code review.

-Jeremy

[RFC] Add affine.parallel

Table of Contents

  • Background
  • Goals
  • Proposal
    • Summary
    • IR Representation
    • Examples
  • Future Work
  • FAQ

Background

This proposal is intended to provide an effective representation of loops over affine induction variables whose iterations may safely be run in parallel. It draws on techniques and experience from the Stripe dialect discussed in the open design meeting on 2019-11-7, as well as a proposal for a parallel loop operation / dialect made by Stephan Herhut. Additionally it was informed by the recent commit which adds parallel for to the loop dialect.

Goals

  • Add representation of parallel for loops with affine memory access patterns

  • Allow lowering from implicitly parallel representations to maintain semantic parallelism

  • Allow dependence analysis and other techniques to infer parallelism and capture it in the IR

  • Enable rearrangement of iterations (such as in tiling) without needing to make changes to the loop body or break perfect nesting via affine.apply

  • Allow representation of parallelizable reductions within parallel for loops

  • As a specific use case, we want to be able to replace our Stripe dialect with the Affine dialect. Adding parallel for loops meeting the above goals would add most of the functionality we need for this.

Proposal

We propose adding a parallel for operation to the Affine dialect with a pretty-printed form like

affine.parallel %i0 = 0 to 16, %i1 = 0 to 10 {
  %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
  affine.store %0, %O[%i1, %i0] : memref<10x16xf32>
}

This parallel for operation iterates over induction variables (in this case %i0 and %i1), each of which has a lower and upper bound, along with an optional step size.

The lower and upper bounds can additionally take affine dimensions and symbols passed to the parallel for as SSA operands, providing further generality.
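For example (illustrative only; the exact pretty syntax is still open, and the sym(...) and step notation matches the examples later in this thread), a loop with a symbolic upper bound and a non-unit step might be written:

affine.parallel %i0 = 0 to sym(%N) step 2 {
  ...
}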

IR Representation

We propose the corresponding IR specification (with details elided) to be

def ParallelOp {
  let arguments = (ins 
     AffineMapAttr:$lowerBounds, 
     AffineMapAttr:$upperBounds, 
     IndexArrayAttr:$steps,
     Variadic<IndexType>:$mapOperands);
  let regions = (region SizedRegion<1>:$region);
}

where $lowerBounds and $upperBounds are affine maps that each produce N results, where N is the number of induction variables in the parallel for. The bounds are interpreted as inclusive lower bounds and exclusive upper bounds. $steps is an array attribute containing N positive step sizes, one per induction variable.

The values in $mapOperands will be used as the map operands of lowerBounds and upperBounds, in that order. In the actual implementation, helper methods similar to those in affine.for will be provided to manage mapOperands more easily.
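As a hypothetical illustration of that ordering (the operand names %x and %N are made up for this example), a loop whose lower bound map takes one dimension operand and whose upper bound map takes one symbol operand would list the lower bound operand first:

affine.parallel(%x, %N) ({
^bb0(%i0: index):
  <body>
  affine.terminator() : () -> ()
}) {
  lowerBounds = (d0) -> (d0),
  upperBounds = ()[s0] -> (s0),
  steps = [1]
} : (index, index) -> ()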

Examples

Here are some examples of simple operations expressed using parallel for.

Fixed Size Transpose

Use a parallel for to transpose a 16 by 10 tensor I to a 10 by 16 tensor O (this is the same example shown above):

func @fixed_transpose(%I: memref<16x10xf32>, %O: memref<10x16xf32>) {
  affine.parallel %i0 = 0 to 16, %i1 = 0 to 10 {
    %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
    affine.store %0, %O[%i1, %i0] : memref<10x16xf32>
  }
  return
}

A direct representation (without pretty printing) of the IR for this would be

func @fixed_transpose(%I: memref<16x10xf32>, %O: memref<10x16xf32>) {
  affine.parallel() ({
  ^bb0(%i0: index, %i1: index): // no predecessors
    %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
    affine.store %0, %O[%i1, %i0] : memref<10x16xf32>
    affine.terminator() : () -> ()
  }) {
    lowerBounds = () -> (0, 0), 
    upperBounds = () -> (16, 10),
    steps = [1, 1]
  } : () -> ()
  return
}

Dynamic Size Transpose

Use a parallel for to transpose an N by M tensor I to an M by N tensor O:

func @sym_transpose(%I: memref<?x?xf32>, %O: memref<?x?xf32>, %N: index, %M: index) {
  affine.parallel %i0 = 0 to sym(%N), %i1 = 0 to sym(%M) {
    %0 = affine.load %I[%i0, %i1] : memref<?x?xf32>
    affine.store %0, %O[%i1, %i0] : memref<?x?xf32>
  }
  return
}

A direct representation (without pretty printing) of the IR for this would be

func @sym_transpose(%I: memref<?x?xf32>, %O: memref<?x?xf32>, %N: index, %M: index) {
  affine.parallel(%N, %M) ({
  ^bb0(%i0: index, %i1: index): // no predecessors
    %0 = affine.load %I[%i0, %i1] : memref<?x?xf32>
    affine.store %0, %O[%i1, %i0] : memref<?x?xf32>
    affine.terminator() : () -> ()
  }) {
    lowerBounds = () -> (0, 0), 
    upperBounds = ()[s0, s1] -> (s0, s1),
    steps = [1, 1]
  } : (index, index) -> ()
  return
}

Tiling of a Transpose

The transpose from the first example tiled into 4 by 5 tiles:

func @fixed_transpose(%I: memref<16x10xf32>, %O: memref<10x16xf32>) {
  affine.parallel %j0 = 0 to 4, %j1 = 0 to 2 {
    affine.parallel %i0 = 4*%j0 to 4*%j0 + 4, %i1 = 5*%j1 to 5*%j1 + 5 {
      %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
      affine.store %0, %O[%i1, %i0] : memref<10x16xf32>
    }
  }
  return
}

A direct representation (without pretty printing) of the IR for this would be

func @fixed_transpose(%I: memref<16x10xf32>, %O: memref<10x16xf32>) {
  affine.parallel() ({
  ^bb0(%j0: index, %j1: index): // no predecessors
    affine.parallel(%j0, %j1, %j0, %j1) ({
    ^bb0(%i0: index, %i1: index): // no predecessors
      %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
      affine.store %0, %O[%i1, %i0] : memref<10x16xf32>
      affine.terminator() : () -> ()
    }) {
      lowerBounds = (d0, d1) -> (4*d0, 5*d1), 
      upperBounds = (d0, d1) -> (4*d0 + 4, 5*d1 + 5), 
      steps = [1, 1]
    } : (index, index, index, index) -> ()
    affine.terminator() : () -> ()
  }) {
    lowerBounds = () -> (0, 0), 
    upperBounds = () -> (4, 2), 
    steps = [1, 1]
  } : () -> ()
  return
}

Future Work

This can be combined with an affine atomic read-modify-write operation to represent parallel reductions. We intend to also implement such an operation, see also this diff adding LLVM’s atomicrmw op to the LLVM dialect.
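As a hypothetical sketch (the op name and syntax below are purely illustrative and not part of this proposal), a row-sum reduction could then remain inside a parallel loop because the read-modify-write on the shared output is atomic:

affine.parallel %i0 = 0 to 16, %i1 = 0 to 10 {
  %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
  // hypothetical atomic read-modify-write: %acc[%i0] += %0
  affine.atomic_rmw "addf" %0, %acc[%i0] : memref<16xf32>
}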

It would also be possible to model reductions using a design similar to the newly added loop.reduce operation.

Additionally, it may be useful to support symbolic step sizes for both affine.parallel and affine.for, since such symbolic steps should be compatible with affine analysis.

FAQ

Here are some questions that came up as we were making design decisions for parallel for, and the corresponding rationales for our decisions.

Why use multiple induction variables in a single for loop?

There is no semantic difference between different nesting orders of multiple parallel for loops. By allowing multiple induction variables, we eliminate redundant forms of expressing the same operation, which yields a more concise representation that is easier to canonicalize.

Moreover, we believe a single parallel for operation with multiple induction variables provides a more intuitive representation of the semantics. We are concerned that nested parallel for loops might be read as “first iterate over this index in parallel, then iterate over that index in parallel”. The intended semantics of “iterate over these indices in parallel” should be clearer with a single loop op and multiple induction variables.

This also means that nested parallel for loops can represent different semantic structures that may be more meaningful. For example, a tiling pass might transform a single parallel for into two nested parallel fors, representing loops whose memory will live at different levels of the cache hierarchy.

Why put this in the Affine dialect instead of using something more general (e.g. loop.parallel)?

Our use cases must interact with other components of the Affine dialect, such as affine.if, affine.load, and affine.store. We also want optimizations that make heavy use of the easier aliasing analysis available with affine accesses. Thus, we benefit both from integrating with other ops within the same dialect and from being able to write optimizations knowing we are dealing specifically with an affine parallel for.

Why have a separate parallel operation instead of adding an attribute to the current for indicating whether it’s parallel?

We think it is better to have serial and parallel for as separate operations because there are enough differences between their ideal structures. This is true both of current differences in the operations and of differences in how possible extensions of the current operations would be represented.

In terms of current structure, we think a single parallel for loop should allow multiple induction variables for semantic clarity. With serial for, this is less clear, as serial for requires an ordering of the induction variables that is typically represented by nesting multiple loop operations. It is certainly possible to represent multiple induction variables in a single for loop operation, and perhaps even desirable. However, we prefer not to make that decision for serial for while designing parallel for.

In terms of possible extensions, there has been discussion on the mailing list regarding adding support for loop-carried values to serial for. However, loop-carried values only make sense in the context of serial for.

Why not make parallelizability a flag on individual induction variables (instead of on the entire multivariable loop)?

Primarily because that would mean we had a single unified serial and parallel for operation. We didn’t want to do this for reasons described in the previous question. Also, parallel induction variables can be freely interchanged in order while serial induction variables can’t. Including both serial and parallel induction variables in a single operation on equal footing strikes us as too likely to cause confusion. It is especially prone to confusion amongst optimization pass writers who are primarily thinking of one case (serial or parallel).


Great to see this proposal! Extending the ideas of loop.parallel to affine dialect seems really useful! As both constructs are very close in their intended semantics, we should aim at making them look and behave as similar as possible.

The syntax here is slightly different from what loop.parallel uses. We have opted for a tuple representation, mostly because that felt more concise than repeating keywords, and I personally find it easier to take in the iteration space at a glance in that form. This is pure bikeshedding, but having the same syntax for similar concepts is quite powerful.

I also agree that this should live in the affine dialect because we already have affine.for. We should also have a lowering from affine.parallel to loop.parallel as part of the general lowering of affine.

Looking forward to a patch adding this!

Cheers
Stephan

Hi @jbruestle,

This proposal looks really great to me. Separate from the pretty-syntax point above (which can be a follow-on step to your proposal in any case), I am really thrilled to see you push on this, the affine dialect certainly needs this.

-Chris

This op appears to be a good addition for the goals you list. This could have subsumed an “affine.parallel_for”, except for a limitation listed below. Some comments.

  1. What you have here is an unordered band (contiguous set) of parallel loops. Something like affine.parallel_for_band or affine.parallel_for_set appears to be a more accurate name - ‘parallel’ itself isn’t descriptive enough?

  2. Per your proposal, lowerBounds and upperBounds are N-dimensional affine maps, i.e., one result each for an induction variable. So you won’t be able to represent max/min bounds using them. You don’t need them? If you do, these will have to be lists of affine maps (list of size N). Getting rid of this limitation allows this op to represent a single parallel affine.for as its trivial case.

  3. Besides the listed goals/rationale, when I try to compare this proposal with just having individual affine.parallel_for’s, what this really appears to help with (first FAQ) is keeping a band (contiguous set) of parallel dimensions together without letting other IR get in between those parallel iterators as it moves through transformation passes and for as long as one wants. This also means that while this op is around, you can’t really hoist code inside the parallel band up to the right depths, replace invariant subscripted accesses by scalars, or perform other transformations that might want to introduce things in between. More importantly, it also makes it difficult to transform any of the individual iterators, for eg., tile one of the N band iterators - because now you’d need a new utility to do that and update the op as opposed to a straightforward stripmine/interchange on an affine.for or affine.parallel_for (both of which could share an interface).

In summary, do you have examples where there is a significant benefit in keeping multiple contiguous parallel loops together in an op? I think it’d be good to carefully weigh the pros/cons of 1-d affine.parallel_for vs. the proposed multi-dimensional affine.parallel_for.

The examples provided in the first FAQ are easily dealt with even with multiple affine.parallel_for’s. It’s common to obtain bands of parallel or permutable loops for tiling purposes.

This shouldn’t be an issue AFAICS. A utility/pass should be looking at bands of affine.parallel_for’s instead of individual ones if that’s the right thing to do. However, if you are concerned about other IR getting in between the individual parallel_for’s, that makes sense - but I think it would be good to understand the lifetime of this op and a concrete example or two of the transformations you’d want to perform on this.

Sorry for the late reply, I was out for an extended weekend. I think your suggestion on syntax is a good one, and it actually makes the parsing easier as well since the affine maps are contiguous. The one disadvantage in my mind would be that step sizes must be specified if any step size is not 1, however I think that is probably a worthwhile trade-off to gain concision, consistency, and easier parsing.

  1. Regarding naming, my original name was parallel_for, however I opted for parallel to match with the loop dialect. Honestly, I don’t have strong feelings about the name, so I’m happy to consider multiple options here, and I’m not really sure the best way to pick one. I wish I could put it up for a vote.

  2. Correct, this proposal does not support min/max on induction variables. Currently, the existing set of Stripe passes we intend to port to the affine dialect does not need any such support. Perhaps you could provide some concrete examples of where it would be valuable? It would be possible to add, but it would incur some syntactic and representational complexity, which I’d like to avoid pending a sufficiently compelling use case.

  3. Generally the argument for multidimensional parallel for is that semantically the order of the indexes doesn’t matter, so representing them by nesting provides needless degrees of freedom and works against canonicalization. Additionally, any IR that exists between what would otherwise be a perfect loop nest can significantly increase the complexity of analysis in our experience, so we’d like to actively avoid it. Transformations like hoisting on purely parallel code can actually reduce flexibility if done too early, so generally, we avoid such things until after most of the relevant tilings / fusions / etc have been done. Finally, things like multidimensional parallel hardware operations (say for example a fixed function GEMM unit) are more clearly represented via multidimensional parallel for.

As for passes, all of our existing pre-MLIR passes are currently implemented using multidimensional parallel for. Probably the most relevant examples are 1) Autotiling for cache usage, which we typically do over a multidimensional index set, considering the full multidimensional inner tile size at once. 2) Tiling for ‘stenciling’, i.e. matching some larger parallel for to some fixed size (possibly lower dimensional) inner operation that is supported by hardware accelerators. 3) Fusion, which we generally do over two parallel for operations that may share only some indexes. Here we look for the intersecting set of indexes, then make an ‘outer’ PF over those indexes, and then two inner PF’s over each original op’s remaining indexes.
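As a rough sketch of that fusion structure (sizes and bodies are purely illustrative), two parallel fors that share only the index %i become an outer PF over %i with two inner PFs over the remaining indexes:

// Before: two parallel fors sharing only %i
affine.parallel %i = 0 to 64, %j = 0 to 32 { <produce %T[%i, %j]> }
affine.parallel %i = 0 to 64, %k = 0 to 16 { <consume %T[%i, ...]> }

// After: outer PF over the shared index, inner PFs over the remaining ones
affine.parallel %i = 0 to 64 {
  affine.parallel %j = 0 to 32 { <produce %T[%i, %j]> }
  affine.parallel %k = 0 to 16 { <consume %T[%i, ...]> }
}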

Clearly it would be possible to ‘group’ single dimensional parallel_for ops via analysis, but I think keeping ‘perfect nesting’ as something which is represented in the IR rather than discovered during analysis is an advantage. Certainly, we have found that considering the multidimensional case during passes is generally required anyway (in tiling for example), and it’s a bit easier to code when you can just perform analysis / transforms on a single operation rather than on a nesting.

When one of your upper bounds is a symbol and you have to tile such a set (since you mention later that autotiling is something you have), you’ll need a min to represent the intra-tile loop upper bound if that symbol is not known to be a multiple of the tile size. If you didn’t need the intra-tile loop to be part of that affine.parallel (?), then you wouldn’t have needed to support a min, but one might just want those new loops included in the band as well. (Because intra-tile loops resulting from tiling a parallel band will also be parallel.)
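For concreteness (hypothetical min syntax, which the current proposal deliberately does not support), tiling %i = 0 to %N by 32 when %N isn’t known to be a multiple of 32 would want something like:

affine.parallel %i0 = 0 to sym(%N) step 32 {
  // the min clamps the last, partial tile
  affine.parallel %i1 = %i0 to min(%i0 + 32, sym(%N)) {
    <body>
  }
}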

Do note however that general fusion often needed with tensor graphs (which goes beyond loop fusion and which could add redundant computation and cut down intermediate storage drastically) in most cases leads to an imperfect nesting. (The affine loop fusion pass does such fusion.) So I suspect you may still want to model the parallel bands in such an output at some point. It’s just that you’ll have narrower bands when you have such an imperfect nesting, which is fine. As I mentioned, in the limiting case of IR getting in between all of those parallel loops, you’ll have one dimension in the affine.parallel.

My general concern here is the need to manipulate / break up / update the affine.parallel for seemingly simple transformations that would have been otherwise really easy to perform on a contiguous set of 1-d affine.parallel_for’s. As a representative example, consider outer loop vectorization on one or more loops of, say, a 3-d affine.parallel.
affine.parallel %i = 0 …, %j = 0 …, %k = 0 … {

}
As a use case, you may want to contrast the mechanics of implementing such outer loop vectorization (potentially even multi-dimensional) on this representation vs a sequence of affine.parallel_for’s (assuming it was beneficial to do this early while you have this affine.parallel around). For some reason, if the best thing to do as part of the transformation was also to hoist some of the generated vector body somewhere in between (breaking the perfect nesting), you’d run into a phase ordering issue here, i.e., you’d either have to break up the band or delay such hoisting to a later pass and risk such hoisting not being performed altogether.

Have you run into or considered such issues surrounding interaction and ordering w.r.t other things that don’t want to see monolithic bands?

Right - but my point was that a pass / utility is already expected to know this when dealing with multiple contiguous parallel_for’s. So I see the “preservation of a perfect parallel nest” as the more important issue/potential gain with this op here.

We use affine.if to handle uneven tilings. So basically, if we apply an inner tile size of 16 to a loop over %i from 0 to %n, we get something like:

affine.parallel %i0 = 0 to sym(%n) step 16 {
  affine.parallel %i1 = %i0 to %i0 + 16 step 1 {
    affine.if affine_set<(d0, d1)[s0] : (d0 + d1 < s0)>(%i0, %i1)[%n] {
      <body>
    }
  }
}

Now, the affine if here is a bit wordy, but that’s largely syntactic, nothing prevents us from supporting something more like:

affine.if %i0 + %i1 < sym(%n) {

Also,

So as an example of a multidimensional tiling/vectorization in practice, consider the following code (simplified to not consider uneven tiling, and with most asserts removed for clarity)

void Tile(AffineParallelOp op, llvm::ArrayRef<int64_t> tileSizes) {
  auto builder = op.getBodyBuilder();
  mlir::Block* outerBody = op.getBody();
  // Number of induction variables (block arguments); assumed == tileSizes.size()
  size_t dimCount = outerBody->getNumArguments();

  // Make the maps for the inner parallel
  llvm::SmallVector<mlir::AffineExpr, 8> lbExprs;
  llvm::SmallVector<mlir::AffineExpr, 8> ubExprs;
  for (size_t i = 0; i < dimCount; i++) {
    auto outerDim = builder.getAffineDimExpr(i);
    auto tileSize = builder.getAffineConstantExpr(tileSizes[i]);
    lbExprs.push_back(outerDim);
    ubExprs.push_back(outerDim + tileSize);
  }
  auto lbMap = AffineMap::get(dimCount, 0, lbExprs);
  auto ubMap = AffineMap::get(dimCount, 0, ubExprs);

  // Generate args for maps
  llvm::SmallVector<mlir::Value, 8> outerIdxs;
  for (size_t i = 0; i < outerBody->getNumArguments(); i++) {
    outerIdxs.push_back(outerBody->getArgument(i));
  }
  // Make the inner parallel for
  auto inner = builder.create<AffineParallelOp>(
     op.getLoc(), lbMap, outerIdxs, ubMap, outerIdxs);

  // Splice instructions into the interior
  auto& innerLoopOps = inner.getBody()->getOperations();
  auto& outerLoopOps = outerBody->getOperations();
  innerLoopOps.splice(std::prev(innerLoopOps.end()), outerLoopOps, outerLoopOps.begin(),
                                                          std::prev(outerLoopOps.end(), 2));

  // Update outer step size
  llvm::SmallVector<int64_t, 8> newSteps;
  auto oldSteps = op.steps().cast<ArrayAttr>().getValue();
  for (size_t i = 0; i < dimCount; i++) {
    newSteps.push_back(oldSteps[i].cast<IntegerAttr>().getInt() * tileSizes[i]);
  }
  op.setAttr("steps", builder.getI64ArrayAttr(newSteps));
}
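For reference, the before/after shown next would correspond to a call along the lines of the following (hypothetical invocation, matching the tile sizes in the output):

Tile(parallelOp, /*tileSizes=*/{10, 10, 10});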

It transforms this:

  func @dot(%o: memref<100x100xf32>, %a: memref<100x100xf32>, %b: memref<100x100xf32>) {
    affine.parallel (%i, %j, %k) = (0, 0, 0) to (100, 100, 100) {
      %0 = affine.load %a[%i, %k] : memref<100x100xf32>
      %1 = affine.load %b[%k, %j] : memref<100x100xf32>
      %2 = mulf %0, %1 : f32
      pxa.reduce add %2, %o[%i, %j] : memref<100x100xf32>
    }
    return
  }

Into this:

  func @dot(%o: memref<100x100xf32>, %a: memref<100x100xf32>, %b: memref<100x100xf32>) {
    affine.parallel (%i0, %j0, %k0) = (0, 0, 0) to (100, 100, 100) step (10, 10, 10) {
      affine.parallel (%i1, %j1, %k1) = (%i0, %j0, %k0) to (%i0 + 10, %j0 + 10, %k0 + 10) {
        %0 = affine.load %a[%i1, %k1] : memref<100x100xf32>
        %1 = affine.load %b[%k1, %j1] : memref<100x100xf32>
        %2 = mulf %0, %1 : f32
        pxa.reduce add %2, %o[%i1, %j1] : memref<100x100xf32>
      }
    }
    return
  }

which I think shows the ease of both the transformation and the readability of the resulting code in the multidimensional case.

Agree.

Hi Jeremy,

My apologies for the long delay on this thread, I have now been able to catch up and can offer some comments.

Generally, I am very much +1 on your proposal: raising the level of abstraction, when possible, is a good thing and participates in making transformations much easier to write, as you show. This simple observation has also been driving my own work for some time now. In the grander order of things I would advocate for much more structure, but that is where Tile and Linalg sit, so I appreciate the separation of concerns of wanting to extend affine just enough that you can cross the bridge.

I am also happy to see convergence with the loop.parallel_for construct and convergence of ideas and implementations.

There are a few details that I am surprised no one picked up yet though, I’ll just reply in bulk where I think something may be missing or clarification needed.

IR Representation

The key thing I see here, and that I think may have been missed, is that you have adopted a notion of HyperRectangular, Parametric Loop Band: within a single parallel_for op, you cannot read the induction variables you define.
The implication is that you cannot represent and transform non-rectangular (e.g. triangular) iteration domains with this op. This is fine as you can still use 1 op per loop but may appear to defeat the purpose.

I think this is an important point and we have discussed this a lot internally. My personal take is that this is totally fine because the HyperRectangular domain is sufficient for a lot of cases. I have personally argued going much further in that direction and separate the HyperRectangular at the level affine expressions directly in the type system.

This follows a somewhat longer trail of experience with the loop skewing transformation which happens to be almost always counter-productive (others will disagree with that statement).

Still, despite my opinions on the direction towards which I would like to see Affine evolve, I think it is important to make that expressiveness restriction very visible.

Various comments

Something like affine.parallel_for_band or affine.parallel_for_set

I think multi-result parallel_for is clear enough. We know from Ken Kennedy and others that this is a set that can be reordered but we still need determinism when lowering to multiple single loops. I think set is more confusing as it would seem to convey that any lowering order may happen.

Per your proposal, lowerBounds and upperBounds are N-dimensional affine maps, i.e., one result each for an induction variable. So you won’t be able to represent max/min bounds using them.

I think this is a red herring as you can use affine.min (and add an affine.max if needed) in-between parallel_for ops. This does not break the intra-op perfect-nested-ness. The real expressiveness limit is triangular loops.

keeping a band of parallel dimensions together without letting other IR get in between those parallel iterators

Yes, this is the key structural property (modulo the expressiveness limit). Note that we also have a generalized version of this in Linalg. For the purpose of a lower-level dialect like affine, @jbruestle’s proposal seems good to me given the major upside (bringing in Stripe’s experience and expertise into affine).

Correct, this proposal does not support min/max on induction variables.

And I don’t think you need a special representation in the type. Since you have restricted the semantics to the HyperRectangular domain, you can just use affine.min/max before the op for similar expressiveness. There would be a difference in chains of affine.min/max but we could make it take variadic operands and enforce some verification constraints.
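As a rough sketch of the shape I mean (affine.min syntax abbreviated, and glossing over operand-legality details that come up below): compute the clamped bound once, in between the parallel ops, so each op’s band stays perfectly nested:

affine.parallel %i0 = 0 to sym(%N) step 32 {
  // computed once per outer iteration, not once per inner iteration
  %ub = affine.min (d0)[s0] -> (d0 + 32, s0) (%i0)[%N]
  affine.parallel %i1 = %i0 to sym(%ub) {
    <body>
  }
}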

Additionally, any IR that exists between what would otherwise be a perfect loop nest can significantly increase the complexity of analysis in our experience, so we’d like to actively avoid it

Yes, absolutely !

Transformations like hoisting on purely parallel code can actually reduce flexibility if done too early, so generally, we avoid such things until after most of the relevant tilings / fusions / etc have been done.

Also +1 on this!

Finally, things like multidimensional parallel hardware operations (say for example a fixed function GEMM unit) are more clearly represented via multidimensional parallel for.

While I fully agree, I would also claim that you’ve already lost key information here. But this is better left for another discussion.

you’ll need a min to represent the intra-tile loop upper bound if that symbol is not known to be a multiple of the tile size

This seems incorrect to me: since it’s HR, why not just have affine ops in between parallel_for ops?

We use affine.if to handle uneven tilings.

How about affine.min/max? Seems like you could benefit from the complexity reduction of not adding extra regions.

I won’t comment on the code examples and simplicity etc. While I am totally sold on this at the level of affine, I also think we can do significantly better at a higher level of abstraction (as you already know :slight_smile: ).

Thanks a lot for pushing this @jbruestle!

Thanks for the comments @nicolasvasilache. I really only have a few follow up points:

  1. Thanks for driving home the fundamental relation between affine.parallel_for’s multiple indexes and hyper-rectangular regions. This relation is intentional. It is certainly possible to represent triangular iteration domains (or really any polyhedral domain), even multidimensional ones, via a single parallel_for, by the addition of an affine.if in the interior. This does however look somewhat different from triangular iteration domains when represented by nested ifs. Personal experience has shown that many real world use cases are purely hyper-rectangular, and even in the cases that are not (for example, edge handling on convolutions), taking an affine.if approach to simply ‘lop off’ the out-of-bounds iterations provides for easy analysis during the early set of transforms like cache tiling and fusion. Generally at some later point we lower into some hardware notion of threads and serial loops, at which point transforming the inner affine.ifs into outer-loop-dependent bounds on the inner loops is fairly straightforward in most cases. (A rough sketch of the affine.if approach follows this list.)

  2. Unless I’m mistaken, there is no affine.min / affine.max, and in fact, as far as I know general min/max over affine variables breaks both affine and semi-affine analysis (where I’m calling a set semi-affine if it can be represented as a polyhedron in an index space extended via slack variables, similar to the method used for ceildiv and mod in the existing AffineExpr). It’s possible an affine.max between an affine variable and constant/symbol is valid (I’d have to look into that), but it’s not currently supported AFAIK. Regardless, I do think affine.if is a fine option, as our current IR never had any min/max support of any sort, just an affine.if like construct, and it seemed sufficient.
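A rough sketch of that affine.if approach for a (lower-)triangular domain (illustrative syntax, in the same spirit as the uneven-tiling example above):

// Iterate the full 16x16 rectangle, but only run the body where %i >= %j.
affine.parallel %i = 0 to 16, %j = 0 to 16 {
  affine.if affine_set<(d0, d1) : (d0 - d1 >= 0)>(%i, %j) {
    <body>
  }
}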

Also, minor point, seems like parallel_for is maybe the winning name?

This relation is intentional. It is certainly possible to represent triangular iteration domains (or really any polyhedral domain), even multidimensional ones, via a single parallel_for, by the addition of an affine.if in the interior.

Yes, I am absolutely sold on this, having used similar tricks in the past to express some pretty intricate ragged array control-flow (reference, transformation-friendly predicated form). Of course I am not advocating that this should be exposed to the user (as we had done) but it has proven itself to work really well in practice (only 20% overhead on GPU wrt tuned non-predicated code impl).

In fact, I try to hammer that point in Property 6: Perfectly Nested Writes To The Whole Output Operands here.

In any case, my remark was not intended to suggest this is limiting or a blocker, but rather that the HR abstraction is fundamentally a good separation of concerns that the design of Affine has overlooked in spite of my raising the point.

Unless I’m mistaken, there is no affine.min / affine.max,

We only have affine.min and would welcome affine.max. But since everything is on a per-need basis and you seem happy with the extra region in affine.if I am not compelled to tell you otherwise.
My use case for affine.min is going to be on canonicalization, padding and vectorization in Linalg transforms. I find that affine.min is easier to reason about than affine.if which may be piecewise, hole-y or both IIRC. I was planning to turn affine.if and other conditionals into predication and compute as per the point above.

Also, minor point, seems like parallel_for is maybe the winning name

I am happy with it and see issues with the currently proposed alternatives, but 66% isn’t what I would call strong consensus :slight_smile:. I think it’s safe to go this route and adapt as others weigh in, if necessary.

I was referring to having the affine.parallel itself encode the min and max instead of following / tracing operands to get max/min. This way a parallelized affine.for could be represented “as is” by an affine.parallel without the need to break it up into an affine.min, affine.max, and affine.parallel in general. The motivation for that would be no different from keeping maps together with affine.loads and affine.stores. Allowing such a min doesn’t break the hyper-rectangular property nor the perfect nested view. If you are using a map in the bounds, you might as well pull that information in.

I tend to prefer adding a suffix that captures multi-dimensionality. If not, parallel_for_band, parallel_multi_for?

When considering inequalities, a <= affine_min or >= affine_max doesn’t break affine analysis (it’s still a conjunction of inequalities). However, equalities involving affine_min/max (when not used in upper/lower bound resp.), or a >= affine_min or <= affine_max lead to a union of pieces / cases, and significantly complicate things in analysis. So, there is the choice of having those pieces materialized in the IR (via if/else constructs) as the alternative to keep analysis simple. So long as you use affine_min for an upper bound and an affine_max for a lower bound (and not the other way round), it shouldn’t limit you. This is also another reason why it may be better to just directly allow a min in the upper bound and a max in the lower bound instead of getting results from affine.min / max resp. But it should be fine to start with a design that doesn’t and see later if it improves things.
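To restate that in symbols: $i \le \min(a, b) \iff (i \le a) \land (i \le b)$, which stays a conjunction of affine inequalities, whereas $i \ge \min(a, b) \iff (i \ge a) \lor (i \ge b)$, which is a disjunction and therefore a union of pieces.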

If it has to do with bounds, you should really be using an affine.min as @nicolasvasilache points out. Just from the representation point of view, a min in the bound is evaluated once for that entire loop, while an affine.if in the loop body is evaluated on every iteration of your entire set.

The op itself is hyper-rectangular, but the IR inside (as @jbruestle’s example shows below) could be using an affine.if that deviates from HR and you could, as you point out too, be using multiple affine.parallel_for’s effectively modeling arbitrary shapes. As such, depending on the analysis/transformation, you may or may not be able to take “local” HR views just because you have the op – it could even lead to wrong analysis/results! Simple eg: looking at a 3-d affine.parallel with bounds %N and then concluding that there are %N^3 iterations in it would appear simple (yes, it looks rectangular locally :-)), but that would just be wrong if you hadn’t explicitly checked for any affine.if’s nested inside. In general, everything up until the leaf operations (including access functions) would need to be looked at. And if you need to anyway explicitly check, it’s trivial to check if a band of affine.for’s is hyperrectangular while at it. This brings us back to the main point that the benefit here is potentially from just keeping the loops together.

@jbruestle - this is still showing one simple transformation going from a perfect nest to perfect nests and introducing an affine.parallel. My question was whether you had run into phase ordering issues with other transformations that didn’t want to see monolithic bands? (i.e., they’d want to insert things in between requiring an existing affine.parallel_for to be broken up.)

Regarding phase ordering issues, the short answer is that our existing optimization passes don’t currently run into any issues, since there are basically two phases: high level multidimensional transforms, followed by transformation of parallel code to serial code, post thread/workgroup/etc assignment, at which point we go down to simple loops and do lower level single dimensional transforms there. Take for example our Gen9 GPU pass configuration. The pre-MLIR pass plan is basically here, which hopefully gives you a high level sense of the pass order even though clearly it’s a custom format. We get SOTA performance for most networks, usually faster than hand coded kernels. Note: all of the passes here operate on multidimensional parallel_fors, at which point each thread is transcoded into serial loops and additional passes are done (not shown above since they happen in a lower level IR).

As I reread the pass configuration, it’s worth noting that we sometimes use PF’s as a way to ‘group’ loop nests into a semantically relevant group, for example, perhaps a particular PF represents the iteration domain of workgroups, i.e. each inner execution is one workgroup, so effectively the range of the PF is the number of workgroups launched. This is usually used in conjunction with an attribute specifying such a relation.

That said, I can see that in general it may be valuable to transform multi-fors into single dimensional parallel for’s or to go the other way (where possible given nesting structure), however these transforms are quite simple.

Also regarding parallel_multi_for as a name, this might be confusing in the case when the parallel_for only has a single induction variable. Also, the loop.parallel operation similarly allows a multi-for operation without it being explicit in the name, so I’d argue there is precedent for that. I do think that parallel_for is better than just parallel in the case of affine however. In the loop dialect, we are clearly dealing with loops, so leaving ‘for’ out of the name is probably appropriate, whereas it’s less so for affine.

@jbruestle All of this sounds good to me. As for the name, I don’t have any strong opinions - in fact, affine.parallel doesn’t give the impression that there is a single dimension while affine.parallel_for does; you anyway have the index vars appearing on the same line, so, even affine.parallel should be fine if you aren’t going for something like affine.parallel_multi_for.

An initial patch in a series of planned patches has been submitted here to implement this proposal: https://reviews.llvm.org/D74277.


This was a very interesting discussion, and thanks for pushing this forward!

FWIW I like “parallel_multi_for”!

The first patch just landed, thanks everyone!

https://reviews.llvm.org/rGfdc7a16a8275b4ef69a17bb61c54c445ef506f90