[RFC][MLIR] OpenMP Loop Transformation : tile and unroll directive operation support for omp.dialect

The post is aimed at starting a discussion on the design and implementation of the tile and unroll directives in the OpenMP dialect. The specification of the loop transformation constructs can be found in Section 9 of the OpenMP 5.2 specification document.

After an initial discussion with the Flang OpenMP community ( thanks @kiranchandramohan from ARM), I came up with the following definition (Only for tile, the same can be done for unroll) :

def TileLoopOp : OpenMP_Op<"tile", [AttrSizedOperandSegments,
                           AutomaticAllocationScope, RecursiveSideEffects,
                           AllTypesMatch<["lowerBound", "upperBound", "step"]>,
                           ReductionClauseInterface]> {
  let summary = "tile construct";
  let description = [{
    
  }];

  let arguments = (ins Variadic<IntLikeType>:$lowerBound,
                       Variadic<IntLikeType>:$upperBound,
                       Variadic<IntLikeType>:$step,
                       UnitAttr:$inclusive,
                       Variadic<IntLikeType>:$tilesize);

  let regions = (region AnyRegion:$region);

  let assemblyFormat = [{
    `size`  $tilesize
    `for` custom<LoopControl>($region, $lowerBound, $upperBound, $step,
                                  type($step), $inclusive) attr-dict
  }];

  let hasVerifier = 1;
}

However, there are cases that need to be considered in order to streamline the up definition. One such case ( from @kiranchandramohan ) is (Fortran code) :

!$omp tile
!$omp unroll
do i=1,100
end do

Then we have :

#pragma omp tile size (4,4)
#pragma omp tile size (5, 16)
 for (int i = 0, i < 100; i++)
  for (int j =0; j <128; ++j)
      A[i][j] = i*1000 + j;

Should the nesting be handled through a parent-child relationship (ParentOpType)? I also looked into other dialects. Transformation support loop transformation. I am not sure; can we benefit from this support?

1 Like

@abidmalikwaterloo Since the OpenMP dialect sits in MLIR, this is better posted in the MLIR section.

I think you can generalize the Fortran code by saying that a Loop Transformation constructs can be closely nested inside other loop Transformation constructs. The nested construct applies before the outer construct.

Thanks @abidmalikwaterloo for this post.

I think the ReductionClauseInterface is not needed for this operation.

The design looks OK for a single operation. But there will be differences between unroll and tile since unroll is only applied to the outermost loop, whereas tiling is for a loop nest. Also, as you point out, we will have issues when there are multiple operations and since these operations can be closely nested or chained.

The loop control has to be part of the operation to avoid other transformations happening to the loop. But combining the operations into a single operation (say omp.transform_loop) is probably not possible since the transformations can potentially happen to different sets of the nested loop.

From the looks of it, the transform dialect seems to model this. It might be worth looking into it. Also the motivational example provided in the transform dialect seems to talk of tiling and unrolling. Transform Dialect - MLIR.

May be @ftynse can provide better suggestions.

  • Quick question: Do you mean we need to use ReductionClauseInterface or do we need an interface similar to this for tile. I checked D122402 and my understanding is that it is very specific to reduction.
  • I am looking into the transform dialect. However, a person from the transform dialect team might be very helpful.

It should be possible to model these clauses as region-holding ops with additional verification. (Speaking about which, you may want SingleBlockRegion instead of AnyRegion for the body.) The verification can be along the following lines. Operations modeling transform-related clauses all implement some OpenMPTransformClauseOpInterface. Each of these also verifies that its body contains exactly one non-terminator operation and that this operation either implements the same interface (as a series of transformations) or is some loop-like construct to which it can be attached. The latter is potentially another interface that is attached to relevant ops through the ExternalModel mechanism.

Another alternative is to attach the transformation to loop-level operations that are modeled by the OpenMP dialect, such as omp.wsloop. This alternative comes with two flavors. (a) have an optional ArrayAttribute that contains a description of transformations to perform, in order, modeled as attributes, potentially with an attribute interface for additional verification.

omp.wsloop (%i) = (...)
attributes { transform = [#omp.tile<1,2,3>, #omp.tile<5,6>, #omp.unroll] } {
  "some_op"(%i) ...
}

(b) have an optional additional region that contains a description of transformations to perform as operations. The latter flavor, unlike the former and the nested-region approach, would allow for combining transformations in non-linear way by using SSA values as identifiers, but it is unclear to me whether OpenMP itself supports or intends to support such combinations.

omp.wsloop (%i) = (...)
transforms {
  omp.tile {sizes = [1,2,3]}
  omp.tile {sizes = [5,6]}
  omp.unroll
} {
  "some_op"(%i) ...
}

The transform dialect can be seen as a standalone form of the flavor (b) in the second proposal. It may be possible to use it to encode transformations but likely not to perform them as I expect that OpenMPIRBuilder should be in charge of that rather than some code in MLIR. The OpenMP dialect will likely need its own operations for tiling and unrolling that follow the spec and are decoupled from whatever design decisions we take for SCF or affine loop transformations using the transform dialect.

[cc @Meinersbur who works on this part of the spec}

You are in luck, https://github.com/llvm/llvm-project/blame/main/mlir/include/mlir/Dialect/Transform/IR/TransformOps.td shows my picture on every other line :wink:

Thanks @ftynse for the suggestions.

It is nice that you bring the cases where the the loop transformation constructs produce loops on which there can be enclosing OpenMP directives like do/for.

!$omp do 
!$omp tile sizes(1,2,3)
!$omp tile sizes(5,6)
!$omp unroll
do i=1,100
do j=1,100
do k=1,100
end do
end do
end do
!$omp end do

I am imagining that your point is that in the absence of such enclosing OpenMP directives, there can be a default OpenMP dialect operation (say omp.loop_transform) that will be present.

The only issue I can think of with Option (a) is that the OpenMP spec might add more clauses to the tile and unroll construct which will make the attributes complicated.
With Option (b) the issue is that the tile and unroll stand-alone operation might need the loop control attached to it, but when it is present in the transforms region of an operation it will not have the loop control.

Yes, the OpenMPIRBuilder implements the tile and unroll transformations and they can be chained.

Thinking about it again, we might need further changes or clarifications.

→ By default whenever we have more than one set of loops in omp.wsloop it is meant to be a collapse transformation. We will need to change and clarify the new meanings. Particularly we might have to introduce collapse as an attribute with an integer value.
→ The omp.wsloop’s set of loops will be more than the set of loops to which the worksharing applies since there can be attribute values like tile that can potentially apply to a larger set of nested loops.

I’m just exploring options…

This complexity will have to go somewhere. These days, it is almost equally easy to add a new attribute to an operation (if the directive is modeled as an operation) or a parameter to an attribute via AttrDef ODS (if the directive is modeled as an attribute).

I’m not sure to understand what “loop control” means. Could you give an example?

It is possible to also mix-and-match. Like have a single omp.loop_transform operation that you suggested above with a list of transformation directives applying to loops immediately nested in it as opposed to having multiple nested operations or attaching transformations directly to loops.

I was talking about the case where there is a standalone tile OpenMP construct. In this case for the tiling to happen the loop should be part of the MLIR operation. So for the following Fortran + OpenMP code,

!$ omp tile sizes(5,16)
do i=1,100
do j=1, 100
..
end do
end do

we will probably have an MLIR tile operation like the following

omp.tile {sizes = [5,6]} (%i, %j) = (%st_i, %st_j) to (%end_i, %end_j) step(%step_i,%step_j) {
...
}

Whereas in the case where there is a worksharing directive that is applied to the result of the tile operation like the following,

!$omp do 
!$omp tile sizes(5,6)
do i=1,100
do j=1,100
...
end do
end do
!$omp end do

as per my interpretation of your suggestion, we will have an MLIR wsloop operation like the following MLIR code. Notice that the tile operation here is without the loop indices. The loop indices of the wsloop capture two loops (although in the original source it is only for the outermost loop of the result of the tiling).

omp.wsloop (%i,%j) = (%start_i,%start_j) to (%end_i,%end_j) step (%step_i,%step_j)
transforms {
  omp.tile {sizes = [5,6]}
} {
 ...
}

The standard says that the loops involved should be of canonical loop nest form. Canonical loop nests are not allowed to have CYCLE, EXIT but does not say anything about a goto to a label inside the loop.

I was saying we do NOT need a ReductionClauseInterface.

Thanks! I see what you mean.

In this light, since we don’t have the op equivalent of canonical loop nest in the dialect, we can’t really attach the directives to something using attributes or regions because the foreign-dialect attributes are discardable and the own-dialect attributes or regions are intrusive in the client dialect. So the originally proposed approach to nest operations makes most sense.

That being said, giving loop transformation constructs some control flow semantics conflates things. This is especially noticeable if you take your two examples and put them in the single module. It also works for a combination of tile+unroll. Maybe even the original approach to workshape loops wasn’t quite right and we should treat workshare constructs and loop transform constructs the same way, that is, not giving them any control flow. An alternative approach using modern infrastructure would be to introduce an OpenMPCanonicalLoopOpInterface that gives the OpenMP dialect and lowering access to the loop aspects they need like the iteration variable and the body. SCF ForOp can then have an external implementation of the interface registered by the OpenMP dialect, FIR can take a similar route if it doesn’t want to depend on OpenMP with a fully separate registration. (Note that this was not technically possible when we added wsloop.) Then the ops that correspond to directives are basically tag ops with regions that require their bodies to have a single op with OpenMPCanonicalLoopOpInterface:

// This is fine.
omp.wsloop private(...) schedule(...) { // no loop here
  omp.tile sizes(4, 16) {        // no loop here either
    scf.for %i = ...             // this is the loop
  }
}

// So is this.
omp.tile sizes(4, 16) { // no workshare, just tiling
// the verifier of omp.tile looks inside its body to verify
// the nesting level using the interface
  scf.for %i = ... {
    scf.for %j = ... {
    }
  }
}

This breaks the current abstraction, but at the same time looks closer to the spec. WDYT?

I did not fully understand this point. From your original proposal, the attributes are always on the OpenMP dialect operations. For the workshare-loop it is on omp.wsloop, and so is it for the omp.transform_loop standalone operation. There are no foreign-dialect attributes here.

When we discussed this before, there were some strong objections against the nested approach and there was a requirement that worksharing loop be represented as a loop. Some of it is captured in this comment. Would adding the OpenMPLoopInterface to the scf.for or fir.do_loop be sufficient to prevent transformations of these loops by other passes?

Thank you @ftynse @kiranchandramohan for the thoughtful discussion. @Meinersbur Would like to chime in?

My proposal didn’t have omp.transform_loop, but that approach looks interesting.

It wouldn’t. Generic passes are not required to check if an operation implements an interface, and are not even expected to know about dialect-specific interfaces. I am also not entirely convinced that having a different operation is sufficient by itself to prevent all transformations.

I am rather repulsed by the idea of giving omp.tile optional loop semantics just to prevent some transformations. Maybe we could have an omp.canonical_loop that matches the definition in the spec and have frontends emit that instead of scf.for / fir.do_loop (potentially via an internal transform before any other passes can run).

Sorry for being late to the discussion.

The original and most-straightforward approach was/is to have each OpenMP directive directly represented in the OpenMP dialect, which would eventually lead to a call to the respective OpenMPIRBuilder method, with the intend that we can share the lowering implementation between Clang and MLIR/Flang and have consistent behaviour between them.

Lowering to scf.for/transform.loop.unroll/etc. would effectively be lowering in MLIR itself instead. That’s possible, but not the original idea. It also break when not done completely and already detailed here comment. There are semantic differences as well, like #pragma omp unroll does not require specifying an unroll factor while transform.loop.unroll does (unless I don’t read its specification directly). #pragma omp tile gives additional guarantees that allows to optimize remainder iterations separately.

A dedicated omp.canonical_loop would exactly fit into that model and is how the implementation in Clang does it (the OMPCanonicalLoop AST node, which guarantees conformance to OpenMP constraints). It would be possible to lower it to scf.for, but only if not associated with OpenMP directive. OpenMP loop transformations also would only apply to omp.canonical_loop.