[RFC] Modify IfOp in Loop dialect to yield values

Hi Everyone,
As part of our work on nGraph dialect, we are trying to enable tensor-level control-flow. MLIR already has a loop dialect that we favor to re-use for our purposes instead of implementing our own control-flow dialect. However, in its current state, the loop dialect is not general enough. This RFC proposes a change that aim to alleviate that.



Currently, Loop dialect operations do not allow any temporary SSA values defined within their regions to escape to outer regions or to live across loop back-edges. While this works well for certain types of IR lowering (e.g Affine dialect to Loop dialect), it is a limiting factor to wider use of the dialect as a general tool to model structured control-flow .

This RFC proposes a change to the Loop dialect that allows IfOp to define values. Similar change to ForOp is proposed here

The RFC proposes the following changes:

  1. Introduce a new YieldOp to Loop dialect to return values out of the ops regions
  2. Modify IfOp IR format and representation to define and yield values
  3. Enable lowering logic of the new representations to standard dialect

Proposed YieldOp

In general, the proposed YieldOp passes a set of values from inside the IfOp regions to the outer enclosing region. In other words, it permits a set of value to escape the inner region. The exact semantics of the operation is dictated by the enclosing Loop dialect operation. This is in agreement with region specifications. See the proposed changes below for concrete use examples.

The proposed format for the operation is

def YieldOp : Loop_Op<”yield”> {
 let arguments = (ins Variadic<AnyType>:$operands);

Proposed Changes to IfOp

Currently the IfOp format is as follows (details elided)

def IfOp : Loop_Op<"if", [SingleBlockImplicitTerminator<"TerminatorOp">]> {
 let arguments = (ins I1:$condition);
 let regions = (region SizedRegion<1>:$thenRegion, AnyRegion:$elseRegion);

Example with pretty-print:

loop.if %b {
} else {

The then-region is mandatory and must have exactly 1 block, while the else-region can be empty. The operation does not define any values, and no values defined inside its region can escape. Operations inside the regions can access outer values, but not the other way around.

The new proposed format is

def IfOp : Loop_Op<"if", [SingleBlockImplicitTerminator<"TerminatorOp">]> {
 let arguments = (ins I1:$condition);
 let results = (outs Variadic<AnyType>);
 let regions = (region SizedRegion<1>:$thenRegion, AnyRegion:$elseRegion);

The result represents all merged values from the then- and else-regions.

Example 1 - Defining and merging two scoped values

func @test(%arg0: i1, %arg1: f32, %arg2: f32) {
  %x = loop.if %arg0 {
    %x1 = addf %arg1, %arg2 : (f32, f32) -> f32
    loop.yield %x1 : f32
  } else {
    %x2 = subf %arg1, %arg2 : (f32, f32) -> f32
    loop.yield %x2 : f32
  } : (i1) -> f32
  return %x : () -> f32

In the context of an IfOp, the yield in the taken execution path assigns the value directly to the destination operands.

Notes about representation:

  1. IfOp can define zero or more operands. The number of YieldOp operands and their types must match the IfOp destination operands.
  2. The YieldOp is always at the end of the basic block (before the TerminatorOp)
  3. If the IfOp defines no operands, the region is not allowed to have a YieldOp
  4. If the IfOp defines any operands, the else block is mandatory since both execution paths must yield values. In case the else-block doesn’t change the value of a variable, it should still be present to re-assign a default value from the outer regions.

Example 2 - Merging outer default value with scoped value

func @test(%arg0: i1, %arg1: f32, %arg2: f32) {
  %x_default = subf %arg1, %arg2 : (f32, f32) -> f32
  %x = loop.if %arg0 {
    %x1 = addf %arg1, %arg2 : (f32, f32) -> f32
    loop.yield %x1 : f32
  } else {
    // yield default value
    loop.yield %x_default : f32
  } : (i1) -> f32
  return %x : () -> f32

Lowering to Standard Dialect

For Example 1 above, the lowered standard dialect IR will be as follows

func @test(%arg0: i1, %arg1: f32, %arg2: f32) {
  cond_br %arg0, %thenBlock, %elseBlock
    %x1 = addf %arg1, %arg2
    br ^continueBlock(%x1)
    %x2 = subf %arg1, %arg2
    br ^continueBlock(%x2)
   return %x

The YieldOp translates to merging of the two values at the continue block.

Thanks Diego +10 on this too :slight_smile:

Re. examples I would go as far as showing how it combines with @herhut’s loop proposal and how we can encode reduction + conditionals without going through memory.

I anticipate this will also remove a lot of annoyances related to turning into predication and will be very useful.

Thanks a lot for pushing on this front!

Thanks @nicolasvasilache, this is Nagy actually :slight_smile:

Here’s a conditional sum reduction using both loop.for and if. I am assuming the syntax that @bondhugula suggested.

// Conditional Sum reduction of positive elements in a memref
func @reduce(%buffer: memref<1024xf32>, %lb: index, %ub: index, %step: index) {
  %sum_0 = std.constant {value = 0: f32}
  %c0 = std.constant {value = 0: f32}
  %sum = loop.for %iv = %lb to %ub step %step iter_args(%sum_iter = %sum_0) 
  -> (f32) {
	  %t = load %buffer[%iv]
	  %cond = cmpf "ugt", %t, %c0 
	  %sum_next = loop.if(%cond) {
       //update sum
	    %new_sum = addf %sum_iter, %t
        loop.yield %new_sum : f32
	  } else {
         // keep old sum 
		loop.yield %sum_iter : f32
	  } : (i1) -> f32
      loop.yield %sum_next : f32
  return %sum : () -> f32


Oops my apologies Nagy!
Thanks for sharing the example, if looks great and compelling to me!

This looks great, and a reduction with a conditional is a good general example to preserve for doc purposes.

Side note: A minor issue you may run into while parsing the proposed loop.for is that the custom parsing API AFAIK doesn’t readily support variadic assignments (%x1 = %y1, %x2 = %y2, …) forcing one to resort to (%x1, %x2, …) = (%y1, %y2, …) – the latter is trivial and supported, but the former form is more readable with more vars. At some point, the library should be updated to support parsing such a list of SSA assignments.

Thanks Nagy, this looks good to me! We considered a similar approach when designing the loop dialect initially, but decided against implementing it since we did not have the use case. Great to see you do.

One question: why put loop.yield before the loop.terminator? I would say it should rather be the terminator. Loop.terminator only exists because the general block semantics says there should be one. We can use loop.yield with no arguments as a terminator when there’s no value returned.

I’m quite sure I used the former syntax in the GPU dialect for blocks, threads and (the now removed) arguments to gpu.launch. You are welcome to factor the code out to common parser and reuse in the loop dialect.

The GPU ops’ region args (like for launch op) have fixed dimensionality (like three), not variadic. Iterator args are variadic.

The GPU ops’ region args (like for launch op) have fixed dimensionality (like three), not variadic. Iterator args are variadic.

Dimensionality does, arguments don’t. We used to have gpu.launch blocks(), threads(), args(...) with the latter being variadic. @herhut removed the args part recently, but the parsing code should still be valid.

I assume you are suggesting yield op as terminator for If/ForOp only ? No strong opinion here, but I proposed it this way to be consistent with other loop dialect ops, and reduce the impact of the change on other Ops. I will use it as a terminator. If the ops define no values, it will be inserted implicitly.

I had to add my own parseAssignmentList(…) to the custom parser to support %x1 = %y1, %x2 = %y2, ... list, if you know of a better way of doing this, please point me to it. Thanks.

I can’t parse what you mean above. :slight_smile: What I meant was: the number of items you have in the list of assignments for the GPU ops is constant, the last time I checked.

Looks like you’ve already solved the issue! This is a good addition to the custom parser library.

It seems to me that loop.for and loop.if should be consistent in how the types are specified, e.g.:

%sum = loop.for %iv = %lb to %ub step %step iter_args(%sum_iter = %sum_0)
-> (f32) {

%sum_next = loop.if(%cond) -> (f32) {

What do others think?

The only other op in the Loop dialect is loop.parallel, and it has a similar need to yield values, it has not been implemented yet. Since you are changing 2 out of 3 ops that have use loop.terminator, it’s the remaining op that should adapt to the majority.

With yield as terminator, the semantics is much cleaner: forward the values to the enclosing op, which decides where to forward them next. Having a separate, non-terminator yield would require something along the lines “register the values to be forwarded to the enclosing op when the terminator is reached” and require modifying the semantics of loop.terminator anyway to be aware of that.

I’m open to always have an explicit terminator. It’s a trade-off between simplicity (always have it, even when it’s trivial) and brevity.

I meant exactly what I said. I wrote that code, so I find it reasonable to think I know what it was doing. We used to have a variadic list of arguments. Stephan removed it in [MLIR] Make gpu.launch implicitly capture uses of values defined above. · llvm/llvm-project@283b5e7 · GitHub. It is trivial to copy-paste that code into the parser and use everywhere, it’s recent enough so it is not bitrotten.

Apparently there is another version already, fine with me.

I submitted a patch https://reviews.llvm.org/D74174

The yielding behavior in loop.parallel is modeled via loop.reduce and that is already implemented. The sequential yield operator does not really apply unless you want non-deterministic choice of result. So I would vote for keeping loop.terminator for that case to differentiate the different semantics.

The way I see it, loop.yield serves as both terminator and optionally passes results to the parent op. The specifics on how the results are passed is defined by the parent op. For loop.parallel, loop.yield is should have no operands and just serve as a region terminator.

By the way, looks like the example in LoopOps.td is missing a result for loop.parallel ?

   loop.parallel (%iv) = (%lb) to (%ub) step (%step) {
     %zero = constant 0.0 : f32
     loop.reduce(%zero) {
       ^bb0(%lhs : f32, %rhs: f32):
         %res = addf %lhs, %rhs : f32
         loop.reduce.return %res : f32
     } : f32

I posted this comment on https://reviews.llvm.org/D74174 as well.

Could we add a reduction_vars / reduction_args list just like iter_args on loop.for and completely get rid of loop.reduce / loop.reduce.return and just use yield from within for’s region to send out both reduction vars and general cross loop/loop live out vars? What would the pros/cons be?