[RFC] Adding operands and results to loop.for

The current structured for operation does not support loop carried dependencies without going through memory. This has been brought up as a limitation multiple times and I propose to add support. The proposal is a write up of ideas that were discussed in chat and conversations previously. It is a focused change to loop.for but we might extend to affine.for if desired.

The basic structure would be as follows

%res = loop.for %iv = %lb to %ub step %step args %arg0 {
  ^bb0(%0 : tensor<32xf32>):
    loop.yield %val0 : tensor<32xf32>
} : (tensor<32xindex>) -> tensor<32xf32>

with the expected constraint that %arg0, %val0, %0 and %res have the same type. The semantic would be that %0 is assigned the value of %arg0 before the first iteration and the value of %val0 at the end of iterations. On termination of the loop, the last value of %0 is assiged to %res.

The example above has a single carried value. It extends to multiple values in the expected way

%res:2 = loop.for %iv = %lb to %ub step %step args %arg0, %arg1 {
  ^bb0(%0 : tensor<32xf32>, %1 : tensor<16xf64>):
    loop.yield %val0, %val1 : tensor<32xf32>, tensor<16xf64>
} : (tensor<32xindex>, tensor<16xf64>) -> (tensor<32xf32>, tensor<16xf64>)

I assume the structure is non-controversial. I welcome suggestions on syntax.

1 Like

This looks good to me, but I think the syntax has inconsistencies - %iv is already an arg for the entry block and so you can’t have just %0 in the arg list (the custom form would be confusing in the face of the generic form). Also, the input type before the trailing return type appears inconsistent. Instead, you probably meant this (with some more syntactic sugar)?

  %res:2 = loop.for %iv = %lb to %ub step %step 
      iter_args (%0 = %arg0 : tensor<32xf32>, %1 = %arg1 : tensor<16xf64>)  
      -> (tensor<32xf32>, tensor<16xf64>) {
    loop.yield %val0, %val1 : tensor<32xf32>, tensor<16xf64>

+10 on this front, thanks Stephan for pushing this!

Thank you @bondhugula. I wanted to leave syntax open and created a mess. iter_args sounds good to me.

This is fine with me, as a nit I’d make sure %0 and %1 are named if possible.

@bondhugula: should we turn loop.yield into std.return? (it returns value now!)

Since the value being returned by the terminator here goes to the next iteration for the first %N - 1 iterations (and to the return value of the loop.for op only for the last iteration), I would avoid using std.return to return values from any regions where there isn’t a 1:1 match between the parent op’s result and what’s being returned from the region – just to avoid confusion from the name. (Using a special name like yield/pass looks better.) The same would be true for any region that is executed more than once as part of the holding op’s semantics.

I know that I’m continuously changing/evolving the guideline that you were asking about on the other patch! (D71961)

Thanks @herhut
I have an RFC ready that covers ForOp/IfOp, and was just about to submit it to that effect myself when I saw this :slight_smile:
We still need to enable the same idea for IfOp.

I have posted IfOp RFC here.

1 Like

Nice @nmostafa ! As the two features have some overlap (for one, both use loop.yield), I think it is better if one person works on it. Would you mind to also implement the part for loop.for?

Sounds good to me. I will work on it.

Seems reasonable to me. Although there seems like alot of duplication of the iter_args types? Maybe they could be elided entirely?

%res:2 = loop.for %iv = %lb to %ub step %step
iter_args (%0 = %arg0 : tensor<32xf32>, %1 = %arg1 : tensor<16xf64>) {

loop.yield %val0, %val1 : tensor<32xf32>, tensor<16xf64>

I think it’s useful to have them appear this way since it’s consistent with the way other things are (%0 = …, %1 = are the first appearances of %0, %1, and we have the types always appearing with terminators like return, cond_br because invalid IR where there is a return/yield type mismatch becomes obvious).

I fully agree with this. It precisely addresses my concerns about std.return being misinterpreted.

I’m generally supportive of the RFC. It has indeed been discussed several times, with the early versions dating back to the introduction of the Loop dialect. We just haven’t had a good enough reason to push through the implementation (loops were a common lowering target for Affine and Linalg, neither of which needed it at that point).

My suggestion for syntax would be something like

%42 = ...
%43 = ...
%res:2 = loop.for %iv = %lb to %ub step %step
         args(%arg1 = %42 : tensor<*xf32>, %arg2 = %43 : tensor<*xf32>) {
  // %arg1 and %arg2 are available here
  %57 = ...
  loop.yield %57, %57 : tensor<*xf32>, tensor<*xf32>

One note: forwarding values into a region-encolsing op seems to be a common pattern, so I would encourage you to come up with a common syntax that is general enough to be reused for different operations, e.g. (the former version of isolated-from-above) gpu.launch and the proposed inlined_call.

Similarly to the loop.if extension, I would suggest removing loop.terminator and replacing it with loop.yield in all cases. We can also require loop.yield to always appear in the custom syntax, even if when has no operands, for consistency. (Practically, I don’t think we have support for conditionally implicit terminators).

I think it is reasonable to have types listed before the region rather than in the training position of the enclosing. These types are necessary to construct the entry block of the region.

Nit on the value naming: %iv will become %arg0 because that’s how the printer names block arguments. So the following loop-carried arguments are %arg1, %arg2 and so on. Using %arg0 outside the region creates confusion. Having the “assignment” form %nameWithinRegion = %nameOutsideRegion : type looks the right approach to me, and it was discussed on the GPU dialect syntax at the time. I think it also addresses @mehdi_amini’s concern of having the arguments named. Practically, it is necessary to list the SSA values and their types before the region start if we want to omit the header (^bb0) for the entry block.

It actually dates back to pre - MLIR design spec itself. It was discussed, shelved, and listed under “design alternative/extensions” in the Rationale document here.

+1 for the loop carried dependencies proposal

I agree having common syntax would be nice (it also shows up in the stencil dialect).

Thanks everyone for the comments. We seem to be in agreement.

I have one comment on the common syntax for region arguments. We have a similar requirement for the loop.parallel where we also want initial values for reductions. However, other than in this proposal, those are used only optionally. If we use the same syntax everywhere, we also create the expectation that things behave the same. So I’d like to avoid common syntax for loop.parallel initials once we add them.

In the case of loop.for, args is pretty generic but not misleading and iter_args does not really convey more meaning. So I am fine with just using args.

Hi Stephan, I suggested iter_args because those arguments correspond to an iteration of a loop.for and not in some way to the entire loop.for operation. (As we know, the last iteration returns it to the loop.for’s result, and each iteration gets its arguments from the previous one’s yield.) iter_args vs args is a bikeshed though given that users will have to read the docs anyway here to understand semantics.

Common does not mean enforced everywhere. Only where the behavior is the same or otherwise unsurprising.

I don’t recall the details of the parallel loop proposal, but could speculate that you might want to use a similar syntax with an explicit flag for the absence of the optional value.