[RFC] Yield for Affine Dialect

Currently, the affine dialect uses a very simple terminator (affine.terminator) for its flow control structures: affine.if, affine.for, and affine.parallel. However, the scf dialect uses scf.yield as its terminator, and provides a way to pass SSA values to the enclosing operation to useful effect. For example, both branches of an scf.if may yield a value, which becomes a result of the scf.if operation; scf.for uses yield to pass loop-carried values, etc.

It seems reasonable to add similar functionality to the affine dialect (including the appropriate lowerings to scf) by replacing affine.terminator with affine.yield. Yielding no values would be equivalent to the existing implementation. Blocks which yield no values would be printed/parsed with an implicit yield (also similar to scf). In general, this change should be 100% backward compatible.

I’ve put up an initial diff for review making some of the needed changes (affine.terminator -> affine.yield, IR changes to affine.if, IR changes to affine.parallel) which is available here: https://reviews.llvm.org/D82600. I’ll make additional changes based on this discussion as well as providing more complete documentation as suggested by @bondhugula once we have a basic consensus.

As for the detailed semantics of each instruction, I think there is some room for discussion. I’ll give my current opinions to get things started.

affine.if

Similar semantics to scf.if. Specifically, affine.if can return any number of values, and both branches of the if must ‘affine.yield’ the same number of values, with matching types. The values returned by the if would be the values yielded by whichever branch was executed. The ‘else’ branch of affine.if can now only be omitted in the case of no returns.

Example: Initializing a buffer with edge padding

#interior = affine_set<(i, j) : (i - 1 >= 0, j - 1 >= 0, 10 - i >= 0, 10 - j >= 0)>

func @pad_edges(%I : memref<10x10xf32>) -> (memref<12x12xf32>) {
  %O = alloc() : memref<12x12xf32>
  affine.parallel (%i, %j) = (0, 0) to (12, 12) {
    %1 = affine.if #interior (%i, %j) {
      %2 = affine.load %I[%i - 1, %j - 1] : memref<10x10xf32>
      affine.yield %2 : f32
    } else {
      %2 = constant 0.0 : f32
      affine.yield %2 : f32
    }
    affine.store %1, %O[%i, %j] : memref<12x12xf32>
  }
  return %O : memref<12x12xf32>
}

Currently my diff includes the IR changes for these semantics, but not the lowering changes.

affine.for

Nearly identical semantics to scf.for. affine.for would gain a new ‘iter_args’ section which specifies a set of input values. These would initialize a set of loop-carried values. The inner block arguments would then be augmented to include not only the induction variables, but also the ‘current state’ of the loop-carried values. An affine.yield at the end of the block would specify the new values for the next loop iteration, and finally, the ultimately yielded values would be returned by the affine.for instruction.

Example: Serial accumulation:

func @serial_sum(%In : memref<100xf32>) -> f32 {
  %cst0 = constant 0.0 : f32
  %tot = affine.for %i = 0 to 100 iter_args (%subtot = %cst0) -> (f32) {
    %0 = affine.load %In[%i] : memref<100xf32>
    %1 = addf %subtot, %0 : f32
    affine.yield %1 : f32
  }
  return %tot : f32
}

Currently, I don’t intend to make this change as part of my initial commit, as it is perhaps the most complex and it is also not relevant to my particular use case.

affine.parallel

Here my current diff diverges slightly from scf.parallel. Similar to scf.parallel, the interior of affine.parallel is allowed to produce values, which are then reduced and returned by the affine.parallel instruction. However, rather than the method used in scf.parallel, I suggest simply ‘yielding’ the values from the interior, and for each value specifying a reduction operation (which implies an ‘identity’ value, defining the result for 0-trip loops).

Example 1: Parallel accumulation:

func @sum(%A : memref<100x100xf32>) -> f32 {
  %out = affine.parallel (%i, %j) = (0, 0) to (100, 100) reduce ("addf") -> (f32) {
    %0 = affine.load %A[%i, %j] : memref<100x100xf32>
    affine.yield %0 : f32
  }
  return %out : f32
}

Example 2: 2d 3x3 centered valid convolution:

func @conv_2d(%D : memref<100x100xf32>, %K : memref<3x3xf32>) -> (memref<98x98xf32>) {
  %O = alloc() : memref<98x98xf32>
  affine.parallel (%x, %y) = (0, 0) to (98, 98) {
    %0 = affine.parallel (%kx, %ky) = (0, 0) to (3, 3) reduce ("addf") -> (f32) {
      %1 = affine.load %D[%x + %kx, %y + %ky] : memref<100x100xf32>
      %2 = affine.load %K[%kx, %ky] : memref<3x3xf32>
      %3 = mulf %1, %2 : f32
      affine.yield %3 : f32
    }
    affine.store %0, %O[%x, %y] : memref<98x98xf32>
  }
  return %O : memref<98x98xf32>
}

Here, the reduction op specified comes from a fixed set (currently I use the set from std.atomicrmw, although @dcaballe correctly points out that this should probably be renamed at a minimum). The advantages of a fixed set are:

  • Easier pattern matching for lowerings to hardware that has fixed function parallel reduction units.
  • Less complex IR.
  • Initializations (and 0-trip results) can be defined as identity values for the operations in question (i.e. 0 for sum, 1 for mul, -inf for max).

The advantages for following the method from scf are:

  • More consistent with scf of course
  • More general

Perhaps it is worth the additional complexity? I’m somewhat skeptical, due to the fact that our prior work (which basically implements all of the Keras backend API) never ran into any cases needing more than a fixed set of reduction operations. However, I recognize that the MLIR community is broader, and so there may be lots of use cases out there I’m not considering. Certainly some concrete examples of places where more general reduction is desirable would help frame the discussion, if anyone knows some good examples.

Hi Jeremy,

Thanks for bringing this up for discussion! I generally agree with the direction. Some comments regarding reductions below.

affine.for
Nearly identical semantics to scf.for. affine.for

What are the differences?

Perhaps it is worth the additional complexity?

I think it would be important to understand the limitations of this “clause-like” proposal vs the scf proposal for modeling parallel reductions. Would it be possible to elaborate on that with an example? Which scenarios can’t be modeled with the clause approach that can be with the scf approach? As you mentioned in the review, I totally agree with limiting the design to some initial use cases. However, with Fortran, OpenMP and (hopefully) C/C++ knocking on the door, I think we have to make sure that the design will be easily extensible to cover more generic cases in the future. If we had to re-architect the design to support future cases, it would be a lot of work to port everything implemented on the old design to the new one.

Beyond lowering, I’m also thinking about scenarios with multiple nested affine.parallel ops, some with reduction semantics. Do you see any problems with dealing with the reduction clauses when applying optimizations among them? What about the interaction of ‘affine.parallel’ with ‘affine.for’ or even ‘scf’ loop constructs within the same loop nest? I wonder if modeling reductions in the loop body would make them more orthogonal to the loop constructs they are nested in for analyses and optimizations. WDYT?

I guess I should have said ‘no differences’: I don’t see any semantic differences between the intended modifications to affine.for and those of scf.for, at least with regard to the use of yield, loop-carried variables, etc. The minor semantic differences that already exist (induction variables are affine, etc.) remain. This of course does bring up the question of whether it may make sense to merge some of affine and scf at some point, but that is a separate discussion.

So the only real use case of the more complex clause-like proposal I can see is to perform a reduction that is not in the existing enum of reductions. I’m not sure of a good real-world example, although, for example, let’s say I wanted to do computations in the field modulo some prime P; one could define a reduction o = (a + b) % P, which is nicely commutative and associative and could be implemented using the clause-based method.

In my mind, however, there is a question of how common such things would be, since the clause-based mechanism adds quite a bit of complexity, and means matching some custom reduction hardware requires matching the clause (which technically could be written multiple ways for the same logical operation, although canonicalization can fix some of that). Additionally, all the transforms involving reductions (such as, for example, the 1-trip loop canonicalization I have ready to upstream, or the affine.parallel loop fusion pass we currently have in the plaidml repo and plan to upstream) become significantly more complex. Most of that complexity is largely boilerplate pattern matching / IR manipulation rather than fundamental algorithmic complexity, but it does have a real cost.
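
For illustration, under the enum approach the 1-trip fold is a purely local rewrite; a rough sketch (not from the actual pass, %A is assumed to be a memref<1xf32>):

// Before: a single-iteration parallel reduction.
%r = affine.parallel (%i) = (0) to (1) reduce ("addf") -> (f32) {
  %v = affine.load %A[%i] : memref<1xf32>
  affine.yield %v : f32
}
// After folding the 1-trip loop: the "addf" reduction of a single yielded
// value is just that value, so %r can be replaced by %v.
%v = affine.load %A[0] : memref<1xf32>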

I was originally going to write that both are fairly equivalent in terms of various transforms and that I didn’t see an issue for either case, but as I started going through examples, I realized there is one transform for which there is a fairly large issue with the ‘clause’ mechanism. Specifically, let’s presume that we want to tile a parallel loop with 100 steps into two nested loops of 10 steps each, and in either case the reduction is a summation. In the enum case, since the initialization is always the natural identity, this is no problem (i.e. the inner loops always start at 0). However, in the clause-based approach, if one chose to use addition and an initialization of 1, then I don’t see any way to do the tiling without changing the output, since there is no way to infer for general reductions what the appropriate identity value is, and you can’t simply initialize all the inner parallels and the outer parallel to 1, or you end up producing 11 + sum vs 1 + sum. Perhaps the ‘initialization’ value is meant to always be an identity, and using a value of 1 for addition is already semantically invalid? I think I need to look into this more… I’ll try to write something more up regarding this tonight with concrete examples.
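
To make the tiling example concrete, here is a rough sketch of the enum case (a hypothetical split of 100 iterations into 10 x 10; %A is assumed to be a memref<100xf32>):

// Original: a single parallel sum over 100 elements.
%r = affine.parallel (%i) = (0) to (100) reduce ("addf") -> (f32) {
  %v = affine.load %A[%i] : memref<100xf32>
  affine.yield %v : f32
}

// Tiled: both levels reduce with "addf" and both implicitly start from the
// identity (0.0), so the result is unchanged.
%r_tiled = affine.parallel (%io) = (0) to (10) reduce ("addf") -> (f32) {
  %partial = affine.parallel (%ii) = (0) to (10) reduce ("addf") -> (f32) {
    %v = affine.load %A[%io * 10 + %ii] : memref<100xf32>
    affine.yield %v : f32
  }
  affine.yield %partial : f32
}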

Regardless, that might be another advantage of the enum approach.

Overall, all of this looks great to me! A few comments/questions here.

  1. Many of the current affine passes/utilities will have to be guarded on the presence of such yield values. For affine.if, I can think of unswitching (hoistIfElse), and nearly everything for affine.for.

  2. Such yield values (cross-loop / loop live-out / if/else live-out) do not play well with several affine passes/utilities, which tend to prefer memory. Dealing with yield values can often be a pain and one more thing to look at for such “memory-based” transformations. Broadly speaking: there are already two kinds of data flow / dependence information these passes look at (for eg. the affine fusion pass): memory-based (memref load/store) dependences and SSA edges (on values). This would add a third thing to consider for all transformation/analysis purposes. So, at some point, we’d need a pass that converts from this form to a memory-based form (the opposite of mem2reg) so that one could use it as a pre-pass and then seamlessly apply affine/polyhedral passes that prefer just memory, or are better designed to not care about yield values. Then, we will also anyway have a pass that does the mem2reg equivalent to convert to a yield-value form (which is expected to be better for further lowering - this could be done on scf as well). All of this complexity notwithstanding, I’m still in favor of having such yield values, since there are definitely things that benefit from modeling them, including (1) allowing easier entry into the affine dialect for various forms, (2) allowing direct mapping to scf, (3) having at least some opts that are probably able to look at and benefit from yield values on affine.for’s/if’s in addition to memory-based data flow and SSA edges, and finally for completeness’ sake. As one immediate benefit example, sparse conditional const prop would be able to work directly on the affine dialect with this change. If reg2mem is available, I really don’t see a downside.

  3. Minor note: for the reduction operator, I think you’ll need to fully qualify the op name: like “std.addf”?

  4. I think the design choice for affine.parallel requires some more discussion - it sounds quite interesting! I tend to be more in favor of an “open” reduction function form because the cost of changing a form like this on a future requirement can be prohibitive. Reg. scf.parallel’s approach to model reductions: it came up for discussion after the design had already been implemented, and I’ve personally found it a bit problematic (to allow terminators and reduction functions interspersed in the middle).

This all sounds great to me Jeremy, I think you’ve got the semantics right, and I think it makes sense to start out aligned with std.atomicrmw to keep the complexity limited.

So regarding @bondhugula’s point about a reg2mem sort of pass that would remove yields and replace them with the in-memory equivalents, I think that is a great idea. It would additionally make for a nice transition since one could always run such a pass to get the pre-yield equivalent. It might be a good immediate next step, prior even to direct lowering from affine.X to scf.X, since it would make the existing lowering correct. On the affine.parallel side, we would probably need an affine version of atomicrmw to allow for such lowering, but I’ve been meaning to add that for a bit anyway.
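
To sketch what such a reg2mem-style pre-pass might produce for the serial_sum example above (a hypothetical output; the scalar buffer %buf is made up for illustration):

func @serial_sum(%In : memref<100xf32>) -> f32 {
  %cst0 = constant 0.0 : f32
  // Backing storage for the former loop-carried value.
  %buf = alloc() : memref<f32>
  affine.store %cst0, %buf[] : memref<f32>
  affine.for %i = 0 to 100 {
    %subtot = affine.load %buf[] : memref<f32>
    %0 = affine.load %In[%i] : memref<100xf32>
    %1 = addf %subtot, %0 : f32
    affine.store %1, %buf[] : memref<f32>
  }
  %tot = affine.load %buf[] : memref<f32>
  return %tot : f32
}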

Regarding the fully qualified name issue: actually “addf” in my example is just an enum value, which happens to lower typically to std.addf (or std.reduce with an “addf” enum element).

Regarding how reduction is handled in affine.parallel, I actually realize there are really two distinct questions: 1) should the reductions be an open or a closed set, and 2) if it’s an open set, should we use the scf.parallel design. On #1 I think I could be convinced to move to an open set; for #2, I’m less excited. I think one problem with the existing scf design is that it conflates initialization with identity. For any given commutative/associative (or approximately so, in the case of addf) reduction operation, there is some ‘identity’ value which, when reduced with X, produces X. Knowing that value is important for splitting a reduction into some parallel set of reductions, or for handling the zero-trip-count case. Initialization can manage the zero-trip-count case, but doesn’t help with splitting very much, since if it is anything but the actual identity, problems arise. Perhaps we could fix this just by redefining things to specify that it is the identity, not an initialization, but the point is still relevant, as for any sort of open system one must specify for each reduction:

  1. What is the reduction function itself (i.e. addf)
  2. What is the identity value (0.0 in the case of addf)
  3. For each loop ‘iteration’, which value is being reduced.

In the enum case, 1+2 are just defined via the enum value, and 3 comes from the yield. In the reduction clause case, 1 comes from the body of the reduction clause, 2 comes from the init clause (presuming we reinterpret init as identity), and 3 comes from the operand of the reduction clause.
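
For reference, the scf.parallel clause form being contrasted here looks roughly like this (a sketch, details of the syntax may be slightly off):

func @scf_sum(%A : memref<100xf32>) -> f32 {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c100 = constant 100 : index
  %init = constant 0.0 : f32
  // 2) the init clause, 3) the reduced operand, 1) the combiner region.
  %sum = scf.parallel (%i) = (%c0) to (%c100) step (%c1) init (%init) -> f32 {
    %elem = load %A[%i] : memref<100xf32>
    scf.reduce(%elem) : f32 {
      ^bb0(%lhs: f32, %rhs: f32):
        %0 = addf %lhs, %rhs : f32
        scf.reduce.return %0 : f32
    }
  }
  return %sum : f32
}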

I would argue that this design has a number of issues: 1) confusion about initialization vs identity, 2) weak binding between #1 and #2, and 3) a lot of boilerplate for simple uses.

I do understand the desire for a more open set of reductions, however, so what about an open design that is not the same as the scf design? One idea would be to define a ‘reduction’ as either an element of a fixed enum (still valuable, I would argue, because it covers 90% of the cases and is easy to pattern match on) or perhaps as a symbol which would refer to a pair of functions: one which does a reduction of two values, and one which takes no parameters and returns the identity.

More generally, the reduction is really a pair of things: an aggregation function and an identity value. Even for custom reductions, it seems excessive to constantly repeat the full function definition for each use, so dispatching through a symbol might be sensible, and it makes the transforms as easy as the enum case.

Thoughts?

Makes sense to me.

I think I missed it - what is the rationale for allowing an open set here? I’m concerned that this is premature generalization. Starting out conservative and having to generalize later has some advantages - when the time to generalize comes, you know a lot about the specific problems that need to be solved, and can adequately weight different options.

I think we need to distinguish between allowing an open set vs a design that can support an open set but handles a selected subset for now. For eg., the MLIR xla/hlo dialect reduce op uses a block to model an arbitrary reduction function (snippet below), but the lowering and other support will explicitly check and handle a subset (like only add, max, min) when lowering to, say, pooling ops. The proposed design in the RFC itself is closed (it’s using attributes/names while those ops are available) — my point was that you may have to change it a lot when you need more. One can keep the modeling flexible/open while still handling a closed subset now.

%red = "xla_hlo.reduce"(%arg0, %0) ( {
  ^bb0(%arg1: tensor<f32>, %arg2: tensor<f32>):
    %4 = xla_hlo.add %arg1, %arg2 : tensor<f32>
    "xla_hlo.return"(%4) : (tensor<f32>) -> ()
  }) {dimensions = dense<[1]> : tensor<1xi64>} : (tensor<4x8xf32>, tensor<f32>) -> tensor<4xf32>

The types used in the reduction function above are “forced” to be 0-d tensor types (instead of elt types) because the xla HLO abstraction, being derived from / isomorphic to XLA HLO proper, forces tensor types everywhere.

Sure - my point is simply that there is a big design space to consider here. One simple example of this is that we could 1) generalize reduce to always take a region, 2) introduce a new op that takes a region, or 3) have reduce take either an enum or a region.

Each design point has tradeoffs. #2 is appealing if you’re lowering through levels of representation, for example.

The proposed design in the RFC itself is closed (it’s using attributes/names while those ops are available) — my point was that you may have to change it a lot when you need more. One can keep the modeling flexible/open while still handling a closed subset now.

Yeah, this was exactly my concern as well.

In this regard, maybe it’s also worth looking at OpenMP’s user-defined reductions. Maybe we can learn something from that way of modeling generic reductions or the kind of generic reductions that we should expect in the future. See “declare reduction Directive”: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf#page=324

The reg2mem/mem2reg approach sounds like a great idea to simplify analyses and transformations!

I think that OpenMP is in many ways a good model, and is similar to what I was getting at with regard to the idea of separating the definition of reductions from the use of reductions (which keeps an open design while still reducing the level of complexity of reduction uses like affine.parallel).

To make a concrete proposal: what if we make a ‘reduction’ op that lives at module level and defines a reduction, and use symbol attributes to refer to reductions where I currently use an enum? This would allow reductions to be reused (even across dialects) and prevent all the needless boilerplate. The reduction definition would specify both the ‘combiner’ and the ‘initializer’ (to use OpenMP language), although I’d like to rename and clarify the semantics of the ‘initializer’ to be an ‘identity’, i.e. to make explicit that it can be ‘combined’ any number of times without changing the output (which is very critical for proper tiling transformations that want to initialize multiple independent reductions to be combined later).

So to give some code:

// Declare a sum reduction over floats
reduction @fsum : f32 
  combination (%a, %b) {
    %0 = addf %a, %b : f32
    reduction_yield %0
  }
  identity {
    %0 = constant 0.0 : f32
    reduction_yield %0
  }

Here we have a single op, which is part of the module symbol table, with two regions. The types of the inputs/outputs of the regions are determined by the type being reduced over (f32 here) and only need to be specified once, since by definition they must be consistent for reductions. By declaring this a reduction, we are stating that the compiler may reorder/split/etc. the reduction as it sees fit (i.e. the combination function is logically associative and commutative, and even though floating-point addition is only weakly associative, we are saying we don’t care how it is associated). Additionally, we are stating that it is always legal to ‘reduce’ 0.0 into an addition (it is the identity, and the result of a zero-element reduction).

As for a use in affine.parallel, we would have:

func @sum(%A : memref<100x100xf32>) -> f32 {
  %out = affine.parallel (%i, %j) = (0, 0) to (100, 100) reduce @fsum -> f32 {
    %0 = affine.load %A[%i, %j] : memref<100x100xf32>
    affine.yield %0 : f32
  }
  return %out : f32
}

Clearly this would be a bigger change than the current diff, but I think the addition of an explicit reduction concept is valuable, and a design like this would allow an open set of reductions, enable the reuse of reductions, and keep the use of reductions in constructs like affine.parallel nearly as simple as an enum.

As for the actual diff, if folks like this design, I suggest we stage things by putting through a simpler commit that adds yield with a fixed enum, and then replacing the enum with a reduction concept (and possibly reusing it in other places, like atomicrmw or SCF).

Thoughts?

This is a really interesting and great idea, I agree this can work - and this is much more elegant than including the reduction logic as a region on every block. How do the symbols interact with other symbols in the module?

Is there a specific use-case in mind for the more general reduction, or are you just exploring the design space here?

Well, as far as I understand (and my knowledge of symbols in MLIR is somewhat limited), a symbol is really just a reference to an operation. I would presume that symbols used in a reduction context (i.e. by an affine.parallel reduce clause) would be required to refer to ‘reduction’ operations, or the code would fail verification. I don’t believe we would need a separate namespace from functions/globals/etc., but honestly I suspect someone else might be better at deciding the details of that than me.

Regarding reductions as an open set: there are definitely use cases for more general reductions. I mentioned, for example, the sum or product modulo some constant integer (both of which are commutative and associative and have an identity). Another example might be complex numbers, or various binary Galois field operations. I think such examples are uncommon, and if we had a closed enum, I don’t think they would make the cut, but they do show up in crypto code, for example (and in a personal side project of mine, I did write GPU code that in fact does a reduction using ((a * b) % (2^96 - 2^32 + 1))).
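
As an illustration, a sum modulo a prime could be declared with the ‘reduction’ op proposed above; a hypothetical sketch (the name @modsum and the prime P = 7 are made up):

reduction @modsum : i32
  combination (%a, %b) {
    // Combine two residues: (a + b) mod P is associative on values already
    // reduced mod P.
    %p = constant 7 : i32
    %0 = addi %a, %b : i32
    %1 = remi_unsigned %0, %p : i32
    reduction_yield %1
  }
  identity {
    // 0 is the identity for addition mod P.
    %0 = constant 0 : i32
    reduction_yield %0
  }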

As far as the use cases I’m currently working on (mostly ML related) are concerned, I think a closed set is fine, and I am against adding additional complexity just to gain flexibility that is very rarely used. However, if we can have a clean enough design that the complexity cost is low, I do think it’s worth considering open reductions.

This looks good. Since affine.parallel already has a block/region, it makes sense to have the reduction functions at the top level instead of attaching them to the op’s region list or having them inside its main region. A couple of things to think about.

  1. The reduction functions are never going to access anything other than their arguments or locally defined values? (i.e., they are IsolatedFromAbove).
  2. It’d be good to make sure you have the syntax for handling multiple reduction variables. I think you’ll need a parentheses-delimited list for reduce on affine.parallel, with the references matching 1:1 with yield operands (see the sketch after this list).
  3. On a minor note, for the ‘reduction’ declarative op, you could just use return as the terminator instead of reduction_yield.
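
A hypothetical sketch of what multiple reductions might look like with such a list (%A assumed to be a memref<100xf32>):

// Two reductions, matched 1:1 with the yielded operands.
%sum, %max = affine.parallel (%i) = (0) to (100) reduce ("addf", "maxf") -> (f32, f32) {
  %v = affine.load %A[%i] : memref<100xf32>
  affine.yield %v, %v : f32, f32
}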

I think scf.parallel could borrow this design for uniformity - there appears to be nothing specific here to affine.parallel, and the same design could just be used in both places. (Please do update the title of this discussion topic to reflect reductions as well.)

@herhut @mehdi_amini @ftynse

@bondhugula Agree on all numbered points. As for this discussion thread, I think I will start a new RFC on std.reduction (or whatever it ends up being), and in the meantime I’m going to try to land the open diff (with some doc changes) using the existing closed set for affine.parallel, unless there are any objections.

Thanks, Jeremy. It looks very promising to me! I think this ‘reduction’ op would allow us to define the proper semantics and restrictions on the reduction initializer and combiner operations and easily implement specific verifiers for them.

IMO, this demonstrates that the ‘reduce’ clause approach à la OpenMP using enums would be easily extensible to support more generic reductions without having to change the design significantly. I agree with @clattner in that maybe we shouldn’t go with the full-blown implementation right now unless we have specific use cases or other good reasons (aligning the affine and scf reduction models could be one), but even the latter could also be done in the future once we have gained a bit more experience with the ‘reduce’ clause approach. So for me, going with the initial enum implementation seems a fair stepping stone, given that you proved that it is extensible. WDYT?

On a side note, if we don’t go with the implementation right now, I think it would be good to move the reduction op brainstorming to the documentation so that the idea is not left behind in Discourse. Maybe a small ‘Parallel reductions’ section under ‘Rationale’ or somewhere else?

I think scf.parallel could borrow this design for uniformity - there appears to be nothing specific here for affine.parallel, and the best one could just be used at both places.

And perhaps even for the OpenMP dialect!

It’d be good to make sure you have the syntax for handling multiple reduction variables. I think you’ll need a parentheses delimited list for reduce on affine.parallel with the references matching 1:1 with yield operands.

This raises an interesting point. IIUC, all the values yielded from an affine.parallel would need to be “reduced” somehow. If we needed some other behavior that fell outside the ‘reduce’ clause, we would need an explicit way to describe which yielded operands map to each reduction enum/symbol. Not a big issue right now, but something to take into account.

Thanks for the explanation @jbruestle, makes sense. I still think this is a very clever design that would be fun to prototype and explore!

The only concern I have with using Symbols instead of having the reduction “in-line” as a region is that it requires any lowering or transformation that involves one of them to be a module pass.

Thanks for all the feedback, everyone. It seems like there is some consensus to go forward with the affine.yield changes as an enum for now, and I will try to put together an RFC for reductions in more detail early next week, and we can have any further discussion there (including addressing @mehdi_amini’s comment). In the meantime, I’ve updated https://reviews.llvm.org/D82600, and I think it should be ready to land.

Actually, the lowering won’t need to be a module pass if scf also adopts the same design - you would just copy the symbol reference over to scf.parallel. As for transformations, yes - making a pass a module pass just to look into the reduction function may be too much, but I think many transformations may not even want to look into something that’s purely functional, so it may not be that bad. You often just want to know the reduction vars - say, to append two yield lists (fusing two affine.parallel ops, as an example). I think readability is improved with inline if all else doesn’t matter, FWIW.

With an inline design, one could just append the reduction functions as regions after the main body region of affine.parallel. I feel the two design choices could be discussed later with the next update as @jbruestle says.