[RFC][OpenMP] Adding `omp.structured_region`

We would like to add an omp.structured_region operation to the OpenMP dialect in MLIR. It represents any block of code within an OpenMP construct. The intended use of this operation is that within an OpenMP operation’s region, there would be a single omp.structured_region operation followed by a terminator operation, as shown below.

// Example with `omp.sections`
omp.sections {
  omp.structured_region {
    call @foo() : () -> ()
  }
  omp.terminator
}

Motivation

This operation is especially useful in operations that use omp.yield as a terminator. For example, in the proposed canonical loop operation, it would fix the arguments of the yield operation so that they are not affected by branches in the region associated with the canonical loop. In cases where the yielded value has to be a compile-time constant, this provides a mechanism to avoid complex static analysis to determine that constant.

With the current implementation, we cannot put the SingleBlockImplicitTerminator trait on the canonical loop operation below, because that would conflict with conditionals and loops in the LLVM IR dialect. Having an omp.structured_region allows us to put those branches within its region and keep a compile-time-constant yield.

omp.target {
  %outer, %inner = omp.canonical_loop %iv1 : i32 in [0, %tripcount) {
    omp.structured_region {
      %inner = omp.canonical_loop %iv2 : i32 in [0, %tc) {
        omp.structured_region {
          %a = load %arrA[%iv1, %iv2] : memref<?x?xf32>
          store %a, %arrB[%iv1, %iv2] : memref<?x?xf32>
        }
        omp.yield
      }
    }
    omp.yield(%inner : !omp.cli)
  }
  %collapsed_loopinfo = omp.collapse(%outer, %inner)
  omp.teams {
    call @foo() : () -> ()
    omp.distribute(%collapsed_loopinfo)
    omp.terminator
  }
}

Thanks @shraiysh for posting an RFC about the omp.structured_region.

The addition of the operation sounds OK to me.

Concerns are:

  • What happens when some hoisting is performed and code is moved outside the omp.structured_region but within the enclosing OpenMP operation? Will this be tolerated? If not, how will it be enforced?
      %inner = omp.canonical_loop %iv2 : i32 in [0, %tc) {
        %a = load %arrA[%iv1, %iv2] : memref<?x?xf32>
        omp.structured_region {
          store %a, %arrB[%iv1, %iv2] : memref<?x?xf32>
        }
        omp.yield
      }
  • Since the canonical loop is not an OpenMP construct per se, the wording might have to be changed a bit.
  • The operation will increase nesting. I don’t think there is any issue here.
  • The motivation for this operation could also be specified as modelling the structured block concept in the OpenMP standard.

For the sake of correctness, I will disallow it for now, in the same fashion as the target construct avoids this. I think we can later allow it under OpenMP constructs, but not between omp.canonical_loop and omp.structured_region.

  • Since the canonical loop is not an OpenMP construct per se, the wording might have to be changed a bit.
  • The operation will increase nesting. I don’t think there is any issue here.
  • The motivation for this operation could also be specified as modelling the structured block concept in the OpenMP standard.

I will edit the post to reflect this.

I had another concern though. Consider the following code

#pragma omp for
for(i = 0; i < N; i++) {
  foo();
  #pragma omp tile sizes(4)
  for(j = 0; j < M; j++) {
    bar();
  }
  baz();
}

I think the MLIR code for this should look like the following:

%outer, %tiled1, %tiled2 = omp.canonical_loop [0, N) {
  omp.structured_region {
    "foo"() : () -> ()
    omp.terminator
  }
  %inner = omp.canonical_loop [0, M) {
    omp.structured_region {
      "bar"() : () -> ()
      omp.terminator
    }
    omp.yield
  }
  %tiled = omp.tile(%inner) { tile_sizes = [4] }
  omp.structured_region {
    "baz"() : () -> ()
    omp.terminator
  }
  omp.yield(%tiled#0, %tiled#1)
}

This is because we want to be able to reference %inner (or some transform of it) in the yield operation. In the snippet I wrote in the first post, the value %inner is out of scope for omp.yield.

During codegen, we will have to enforce that the transforms for an inner loop occur before the next structured region under it, and we will generate the code for the loop as soon as we encounter the next structured region.

This means that -

  • omp.canonical_loop can have omp.canonical_loop and omp.structured_region under it.
  • All the transforms for an omp.canonical_loop must be declared before another omp.structured_region or any other operation.
  • omp.canonical_loop will have only one basic block, with only one yield operation.
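
To make the second rule above concrete, here is a sketch (hypothetical IR, reusing the ops from the snippet above) that these rules would reject, because the transform for the inner loop appears after a later structured region:

%outer, %t = omp.canonical_loop [1, N) {
  %inner = omp.canonical_loop [1, M) {
    omp.structured_region {
      "bar"() : () -> ()
      omp.terminator
    }
    omp.yield
  }
  omp.structured_region {
    "baz"() : () -> ()
    omp.terminator
  }
  // Invalid under rule 2: the omp.tile for %inner must appear
  // before the omp.structured_region above.
  %tiled = omp.tile(%inner) { tile_sizes = [4] }
  omp.yield(%tiled)
}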

Does this look okay? Any comments, concerns or suggestions for this are welcome.

CC: @Meinersbur @kiranchandramohan @jansjodi @raghavendhra for inputs about this.

I can imagine there is some issue with propagating the yields from the omp.structured_region operation. But I am not sure about the example you provided here.

Shouldn’t the representation look like the following instead?

%outer = omp.canonical_loop [0, N) {
  omp.structured_region {
    "foo"() : () -> ()
    %inner = omp.canonical_loop [0, M) {
      omp.structured_region {
        "bar"() : () -> ()
        omp.terminator
      }
      omp.yield
    }
    %tiled = omp.tile(%inner) { tile_sizes = [4] }
    "baz"() : () -> ()
    omp.terminator
  }
  omp.yield
}
omp.ws loop(%outer)

Yes, that seems like a natural way of representing it. But if we want to yield the inner loop (can this situation arise, where we want to yield such an inner loop?), we cannot do that here. You can ignore my suggested IR for now - I was trying to suggest a solution to this problem, but first I want to confirm whether this is a real issue or not.

The question is: is there a case where we would want to refer to the inner loop outside the outer canonical loop? It could be because of a directive or any other reason. I don’t have much experience with OpenMP, so I am looking for community wisdom about this. I have follow-ups in both cases - whether the answer is yes or no.

I guess you are looking for the collapse or any other nested construct transformation.

!$omp do collapse(2)
DO I = 1, N
  DO J = 1, M
  END DO
END DO

In the following MLIR representation of the source above, the structured region around the inner loop prevents the loop identifier from being propagated.

%outer = omp.canonical_loop [1,N) {
  omp.structured_region {
    %inner = omp.canonical_loop [1, M) {
      omp.structured_region {
        omp.terminator
      }
      omp.yield 
    }
    omp.terminator
  }
  omp.yield
}
%collapsed = omp.collapse(%outer, %inner)
omp.ws loop(%collapsed)

If the omp.structured_region can propagate the loop identifier by using yield as the terminator, would that be sufficient? There is some subtle difference here between yields in omp.canonical_loop and omp.structured_region. The former will generate a unique identifier for the loop and propagate every contained loop identifier that needs to be propagated, the latter will only propagate. We can consider using different terminators if the behaviour has to be distinguished.

%outer, %inner1 = omp.canonical_loop [1,N) {
  %inner1 = omp.structured_region {
    %inner = omp.canonical_loop [1, M) {
      omp.structured_region {
        omp.yield
      }
      omp.yield
    }
    omp.yield %inner1
  }
  omp.yield
}
%collapsed = omp.collapse(%outer, %inner1)
omp.ws loop(%collapsed)
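
If the two behaviours ever had to be distinguished, a sketch with a dedicated propagation terminator could look like the following (omp.propagate is a hypothetical name, not part of this proposal):

%outer, %inner1 = omp.canonical_loop [1,N) {
  %inner1 = omp.structured_region {
    %inner = omp.canonical_loop [1, M) {
      omp.structured_region {
        omp.yield
      }
      omp.yield
    }
    omp.propagate %inner // hypothetical: only forwards an existing loop identifier
  }
  omp.yield // generates this loop's own identifier and propagates contained ones
}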

Was this what you were looking for? Is your solution different?

I think we can do this. This seems more elegant and straightforward than what I had suggested.

We can consider using different terminators if the behaviour has to be distinguished.

I don’t think we need a different terminator for this. This looks good.

Thank you Kiran!

Hi @kiranchandramohan, the way I understand the suggested solution, the following would have to be the IR to propagate using the structured region:

%outer, %inner1 = omp.canonical_loop [1,N) {
  %inner1 = omp.structured_region {
    %inner = omp.canonical_loop [1, M) {
      omp.structured_region {
        omp.yield
      }
      omp.yield %inner // Change here - yield the loop.
    }
    omp.yield %inner1
  }
  omp.yield %inner1 // Change here - yield the loop.
}
%collapsed = omp.collapse(%outer, %inner1)
omp.ws loop(%collapsed)

If this is what you had intended, this has the same problem that we wanted to solve using omp.structured_region - statically finding all the yields while allowing branching. This solution wouldn’t work, because when there are branches within the structured region, we would have to do complex analysis to find the yielded loops inside it.

In case you intended something else, please let me know.

The difference here is that every canonical loop is represented by a single yield, and if you can find that yield, you have found that unique canonical loop. This might involve traversing the yields that are propagated, but ultimately you will end up at a single yield.
Note: In the previous case, there could be multiple yields terminating a single canonical loop.

I have not thought deeply into whether branching causes issues. What is the issue that you see?

I have not thought deeply into whether branching causes issues. What is the issue that you see?

Without structured_region, the MLIR could look something like this, and in this case it would require complex analysis to find the yielded loop variable -

%outer, %inner = omp.canonical_loop %i in [0, %tc) {
  %t = llvm.icmp "sgt" %0, %c0 : i32
  llvm.cond_br %t, ^truebb, ^falsebb
^truebb:
  %inner1 = omp.canonical_loop {...}
  omp.yield %inner1
^falsebb:
  %inner2 = omp.canonical_loop {...}
  omp.yield %inner2
}

We came up with omp.structured_region to avoid such cases and force a single basic block with one omp.yield. This would make sure we know at compile time which loop is yielded.

In the same scenario, with the structured region, the code would look like this -

%outer, %inner = omp.canonical_loop %i in [0, %tc) {
  %inner2 = omp.structured_region {
    %t = llvm.icmp "sgt" %0, %c0 : i32
    llvm.cond_br %t, ^truebb, ^falsebb
  ^truebb:
    %inner0 = omp.canonical_loop {...}
    omp.yield %inner0
  ^falsebb:
    %inner1 = omp.canonical_loop {...}
    omp.yield %inner1
  }
  omp.yield %inner2
}

We now have a single omp.yield under the omp.canonical_loop, but we still have the same problem of statically figuring out, at compile time, which loop is yielded from the structured region.

Conversely, if - without structured_region - the frontend can make sure that we have a single omp.yield with a single possible canonical loop inside the region of the canonical loop, even with branches in that region, then we do not need a structured_region operation at all.

This is my understanding of the problem. Am I missing something?

Thanks @shraiysh for the explanation. I guess we will have to wait for @Meinersbur to chime in.

I don’t think this example captures the original problem, because the llvm.cond_br and its associated blocks could easily be an scf.if operation with yields.

I think this is the difference: with an omp.structured_region operation, the single-exit-at-the-end-of-the-block property is guaranteed for every canonical loop, and it is enforced by the presence of the omp.structured_region operation itself. So once you find a canonical loop, you know where it starts and exits without any further analysis.

FYI: The requirement for a Single Block with a terminator was discussed in the Worksharing-Loop context as well. You can see @ftynse’s reply to that in OpenMP Worksharing Loop RFC - #12 by ftynse.

I only skimmed through the thread after a ping from it. Have you folks looked at scf.execute_region that looks related?
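
For readers who have not seen it: scf.execute_region wraps arbitrary, possibly multi-block control flow inside an otherwise single-block context and returns its results through scf.yield. A minimal sketch, using current upstream syntax rather than anything from this thread:

%v = scf.execute_region -> i32 {
  %c = arith.constant 42 : i32
  cf.br ^bb1(%c : i32)
^bb1(%arg : i32):
  scf.yield %arg : i32
}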

Yes, llvm.cond_br could become scf.if, but when lowering from Flang everything lowers to the LLVM IR dialect + OpenMP dialect, so we cannot have scf.if in that case, right?

I guess I am having trouble understanding how structured_region would help solve the problem. IMO it just moves the problem into the structured_region: without structured_region we would have to solve it in the canonical loop, and with structured_region we would have to solve it inside its region. The problem is the same - statically find the yielded loop CLIs. If the solution for the structured region is straightforward (I don’t know the solution yet), why can we not do the same thing under canonical_loop directly, without any structured_region at all?

Hi Alex, thanks for chiming in. We want to be able to tell statically which loop variables are yielded inside a canonical loop. Codegen does not start until we are done with all the loop transformations on the loop info. For example, consider this case, where the IR is in the LLVM IR dialect + OpenMP dialect -

%outer, %inner = omp.canonical_loop %i in [0, %tc) {
  %t = llvm.icmp "sgt" %0, %c0 : i32
  llvm.cond_br %t, ^truebb, ^falsebb
^truebb:
  %inner1 = omp.canonical_loop {...}
  omp.yield %inner1
^falsebb:
  %inner2 = omp.canonical_loop {...}
  omp.yield %inner2
}
%inner_tiled = omp.tile(%inner) {tile_size=4}
%collapse = omp.collapse(%outer, %inner_tiled#1)
omp.unroll(%inner#2)
omp.distribute(%collapse) // Codegen for loop starts here.

It could be more complicated than this, involving multiple branches, switch cases, etc. To solve this, we tried to remove branches within the canonical loop and push them into a structured_region, but now I think we have the same problem - how do we propagate loop infos from inside the region to the outer canonical loop? The difference with scf.execute_region is that we need to know the yielded value at compile time; if I understand correctly, there is no such requirement for scf.execute_region - we can see it as a function returning value(s). Is this correct? And can we do something about this problem using the ideas from scf.execute_region?

Yes. What I was saying was that we would have the same problem even with the structured region containing scf.if instead of llvm.cond_br, as in the following example. Did I miss your point?

%outer, %inner = omp.canonical_loop %i in [0, %tc) {
  %t = llvm.icmp "sgt" %0, %c0 : i32
  %inner3 = scf.if %t -> (!omp.cli) {
    %inner1 = omp.canonical_loop {...}
    scf.yield %inner1 : !omp.cli
  } else {
    %inner2 = omp.canonical_loop {...}
    scf.yield %inner2 : !omp.cli
  }
  omp.yield %inner3
}
%inner_tiled = omp.tile(%inner) {tile_size=4}
%collapse = omp.collapse(%outer, %inner_tiled#1)
omp.unroll(%inner#2)
omp.distribute(%collapse) // Codegen for loop starts here.

Ah, I thought you were suggesting that scf.if could solve the problem. Sorry for the confusion :smile:

Thanks @ftynse for your reply. I added you since we discussed the single block requirement for loops in MLIR in a previous discussion.

Yes, scf.execute_region is related. But the preference was for a separate operation, since it captures the OpenMP concept of structured regions. Also, using SCF operations in OpenMP is sometimes not possible, since the OpenMP operations exist all the way to translation, whereas SCF is not designed to stay that long. And there are several passes that call the conversion of SCF operations to LLVM in the Flang and OpenMP flows.

In this discussion, there are three questions:

  1. Are single blocks required for OpenMP canonical loops?
  2. Is an omp.structured_region required to enforce/support the single block requirement?
  3. Does using omp.structured_region solve the problem of having to support multiple blocks in an omp.canonical_loop?

Can this case happen in the source code? It seems odd that compile-time directives can be affected by runtime values.

Thanks! I’m not necessarily saying that the OpenMP dialect would want to reuse the scf.execute_region, there is often value in having a slight variation of the existing operation to allow for independent evolution. What’s interesting here is the common need for wrapping “unstructured” control flow, which may potentially be factored out into a separate trait/interface. There was an interest in having affine.execute_region at some point, for example, with slightly more requirements than scf. We never pushed all the way through with structured/unstructured separation, but you may :slight_smile: