[RFC] OpenACC dialect data operation improvements

The OpenACC dialect currently keeps track of data operations in various operand
lists which encode semantics based on the name of the list.

acc parallel copyin(var)
===>
acc.parallel copyin(%var) {
  // use %var
}

I would like to separate the data clauses into separate MLIR operations:

acc parallel copyin(var)
===>
%accvar = acc.copyin(%var)
acc.parallel dataOperands(%accvar) {
  // use %accvar
}

The motivation:

  1. Friendlier to add attributes such as MemoryEffects to a single operation. This will better reflect semantics (like the fact that an acc.copyin operation only reads). Additionally, operations can be moved/optimized (eg CSE).
  2. Correctly represent dataflow. Currently, a write to %var in the acc region appears as a mutation to the original variable. But when compiling this code for GPU, a mutation to var inside acc region is not visible outside of region (unless there is a copyout)
  3. Easier to keep track of debug information. Line location can point to the text representing the data clause instead of the construct (as it does right now). Additionally, we can use attributes and keep track of variable names in clauses without having to walk the IR tree in attempt to recover the information (this makes acc dialect more agnostic with regards to what other dialect it is used with).

Data clause operations which have semantics at both entry and exit of region will be decomposed:
acc copy => acc.copyin (before region) + acc.copyout (after region)
acc copyout => acc.create (before region) + acc.copyout (after region)
acc attach => acc.attach (before region) + acc.detach (after region)
acc create => acc.create (before region) + acc.delete (after region)

Here is the format I would like to recommend for acc data entry operations (acc.copyin, acc.deviceptr, acc.present, acc.create):

  let arguments = (ins OpenACC_PointerLikeTypeInterface:$varPtr,
                       Optional<OpenACC_PointerLikeTypeInterface>:$varPtrPtr,
                       Variadic<OpenACC_DataBoundsType>:$bounds,
                       DefaultValuedAttr<OpenACC_DataClauseEnum,clause>:$dataClause,
                       OptionalAttr<OpenACC_DataClauseEnum>:$decomposedFrom,
                       DefaultValuedAttr<BoolAttr, "true">:$structured,
                       DefaultValuedAttr<BoolAttr, "false">:$implicit,
                       OptionalAttr<StrAttr>:$name);
  let results = (outs OpenACC_PointerLikeTypeInterface:$accPtr);

Explanation of the non-evident various pieces:

  • varPtr - the address of the variable to copy.
  • varPtrPtr - Used for data structures with pointers. This is important for OpenACC due to implicit attach semantics on data clauses (2.6.4).
  • bounds - used when copying just slice of array or array’s bounds are not encoded in type
  • decomposedFrom - keeps track of original user specified clause such as “copy”
  • structured - flag to note whether this is associated with structured region (parallel, kernels, data) or unstructured (enter data, exit data). This is important due to spec specifically calling out structured and dynamic reference counters (2.6.7).

Data exit operations (acc.copyout, acc.delete, acc.detach) and acc.attach will have a similar format as the entry operations except they won’t produce a result.

  let arguments = (ins Optional<OpenACC_PointerLikeTypeInterface>:$varPtr,
                       Optional<OpenACC_PointerLikeTypeInterface>:$accPtr, …

So an acc.copyout operation will be modeled like a “store”:

acc parallel copy(var)
===>
%accvar = acc.copyin(%var)
acc.parallel dataOperands(%accvar) {
  // use %accvar
}
acc.copyout(%var, %accvar)

Currently the acc dialect also has no way of representing bounds information. When used with flang, a FIR descriptor is created to encode bounds and then the descriptor is placed in the copy operand. This coupling is not ideal since the bounds are an OpenACC language concept (2.7.1). Thus I would like to also recommend creating an acc.bounds operation (use of its result is shown in the data entry operation specification above):

  let arguments = (ins Optional<AnyType>:$lowerbound,
                       Optional<AnyType>:$upperbound,
                       Optional<AnyType>:$extent,
                       Optional<AnyType>:$stride,
                       DefaultValuedAttr<BoolAttr, "false">:$strideInBytes,
                       Optional<AnyType>:$startIdx);
  let results = (outs OpenACC_DataBoundsType:$result);

Explanation of non-evident pieces:

  • Both/either upperbound and extent can be specified. In the OpenACC spec 2.7.1, for C++ extent is used in the acc clause, while for Fortran upperbound is used. That said, FIR uses extent in its descriptor. So flexibility is useful.
  • The strideInBytes flag and startIdx are more useful for Fortran since a slice of derived type may be a stride that is not divisible by element size. Additionally, having startIdx allows the possibility of better error messages that match the user’s code (instead of assuming zero based array).

Let me know if you have any questions or suggestions.

6 Likes

Thanks for the proposition @razvan.lupusoru. It makes sense to me and this decomposition will help when translating to LLVM.

It was initially tried here https://github.com/flang-compiler/f18-llvm-project/pull/915 but having it earlier will make this legalization from FIR easier.

1 Like

Many thanks @razvan.lupusoru for sharing this design. Very interesting. Since similar constructs are being implemented in the OpenMP dialect this will be a very useful reference.

Of the three motivations that you provided, I am particularly intrigued by (2). From your explanation there is really a question of the visibility of the mutation outside the region and we might have overlooked this in OpenMP as well.

Is there a destroy operation as well to signify the end of the lifetime of the copy (%accvar)? Otherwise there can be a question about the value of %accvar in the code after the region or in a subsequent acc.parallel.

%accvar = acc.copyin(%var)
acc.parallel dataOperands(%accvar) {
  // mutate %accvar
}
// What is the value of %accvar here?
acc.parallel dataOperands(%accvar) {
// What is the value of %accvar here?
}
acc.parallel {
// What is the value of %accvar here?
}

The alternative here could be be move the copyin inside the region and have %accvar as an entry block argument. In this case, since %accvar is local to the region, there is no question about its value outside.

%accvar = acc.copyin(%var)
acc.parallel dataOperands(%accvar) {
  %accvar = acc.copyin(%var)
  // use %accvar
}

The bounds operation is also interesting, we really have an issue with finding out the value of these from the descriptors. I am not sure whether I missed the usage here. Is this additional information that is going to be added to a variable? Or are you recommending replacing the variable representation itself with the bounds?

Using block arguments is a nice idea to limit the lifetime to the region. Another idea is to blindly create an alloca for %accvar in the region. The lifetime is limited to the region. The optimizer can take care of finding a better solution.