[RFC] OpenACC dialect data operation improvements

The OpenACC dialect currently keeps track of data operations in various operand
lists, where the semantics of each operand are encoded in the name of the list it
appears in.

acc parallel copyin(var)
===>
acc.parallel copyin(%var) {
  // use %var
}

I would like to separate the data clauses into separate MLIR operations:

acc parallel copyin(var)
===>
%accvar = acc.copyin(%var)
acc.parallel dataOperands(%accvar) {
  // use %accvar
}

The motivation:

  1. Friendlier to add attributes such as MemoryEffects to a single operation. This better reflects the semantics (e.g., the fact that an acc.copyin operation only reads). Additionally, the operations can be moved and optimized (e.g., CSE).
  2. Correctly represents dataflow. Currently, a write to %var in the acc region appears as a mutation of the original variable. But when compiling this code for a GPU, a mutation of var inside the acc region is not visible outside of the region (unless there is a copyout).
  3. Easier to keep track of debug information. The line location can point to the text representing the data clause instead of the construct (as it does right now). Additionally, we can use attributes to keep track of variable names in clauses without having to walk the IR tree in an attempt to recover that information (this makes the acc dialect more agnostic with regard to which other dialects it is used with).

Data clause operations which have semantics at both the entry and exit of a region will be decomposed (one such decomposition is sketched after this list):
acc copy => acc.copyin (before region) + acc.copyout (after region)
acc copyout => acc.create (before region) + acc.copyout (after region)
acc attach => acc.attach (before region) + acc.detach (after region)
acc create => acc.create (before region) + acc.delete (after region)
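
For example, applying this model to a create clause on a structured data construct would look roughly as follows (a sketch in the same informal notation as the examples above):

acc data create(var)
===>
%accvar = acc.create(%var)
acc.data dataOperands(%accvar) {
  // use %accvar
}
acc.delete(%accvar)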

Here is the format I would like to recommend for acc data entry operations (acc.copyin, acc.deviceptr, acc.present, acc.create):

  let arguments = (ins OpenACC_PointerLikeTypeInterface:$varPtr,
                       Optional<OpenACC_PointerLikeTypeInterface>:$varPtrPtr,
                       Variadic<OpenACC_DataBoundsType>:$bounds,
                       DefaultValuedAttr<OpenACC_DataClauseEnum,clause>:$dataClause,
                       OptionalAttr<OpenACC_DataClauseEnum>:$decomposedFrom,
                       DefaultValuedAttr<BoolAttr, "true">:$structured,
                       DefaultValuedAttr<BoolAttr, "false">:$implicit,
                       OptionalAttr<StrAttr>:$name);
  let results = (outs OpenACC_PointerLikeTypeInterface:$accPtr);

Explanation of the various non-evident pieces (a small usage sketch follows the list):

  • varPtr - the address of the variable to copy.
  • varPtrPtr - Used for data structures containing pointers. This is important for OpenACC due to the implicit attach semantics on data clauses (2.6.4).
  • bounds - used when copying just a slice of an array or when the array’s bounds are not encoded in its type
  • decomposedFrom - keeps track of the original user-specified clause, such as “copy”
  • structured - a flag noting whether the operation is associated with a structured region (parallel, kernels, data) or an unstructured one (enter data, exit data). This is important because the spec specifically calls out structured and dynamic reference counters (2.6.7).
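
To illustrate, a structured copyin of a simple scalar might look roughly like this (a hand-written sketch; %var, the !fir.ref<i32> type, and the attribute values are illustrative, and the assembly format mirrors the larger example further down):

%accvar = acc.copyin varPtr(%var : !fir.ref<i32>) -> !fir.ref<i32> {name = "var"}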

Data exit operations (acc.copyout, acc.delete, acc.detach) and acc.attach will have a format similar to the entry operations, except that they won’t produce a result.

  let arguments = (ins Optional<OpenACC_PointerLikeTypeInterface>:$varPtr,
                       Optional<OpenACC_PointerLikeTypeInterface>:$accPtr, …

So an acc.copyout operation will be modeled like a “store”:

acc parallel copy(var)
===>
%accvar = acc.copyin(%var)
acc.parallel dataOperands(%accvar) {
  // use %accvar
}
acc.copyout(%var, %accvar)

Currently the acc dialect also has no way of representing bounds information. When used with flang, a FIR descriptor is created to encode the bounds, and the descriptor is then placed in the copy operand. This coupling is not ideal since bounds are an OpenACC language concept (2.7.1). Thus I would also like to recommend creating an acc.bounds operation (use of its result is shown in the data entry operation specification above):

  let arguments = (ins Optional<AnyType>:$lowerbound,
                       Optional<AnyType>:$upperbound,
                       Optional<AnyType>:$extent,
                       Optional<AnyType>:$stride,
                       DefaultValuedAttr<BoolAttr, "false">:$strideInBytes,
                       Optional<AnyType>:$startIdx);
  let results = (outs OpenACC_DataBoundsType:$result);

Explanation of non-evident pieces (a short sketch follows the list):

  • Either or both of upperbound and extent can be specified. In the OpenACC spec 2.7.1, extent is used in the acc clause for C++, while upperbound is used for Fortran. That said, FIR uses extents in its descriptors, so the flexibility is useful.
  • The strideInBytes flag and startIdx are more useful for Fortran, since a slice of a derived type may have a stride that is not divisible by the element size. Additionally, having startIdx allows for better error messages that match the user’s code (instead of assuming zero-based arrays).
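
For instance, a bound describing array(5:10) of a 1-based array could be written with either upperbound or extent (a sketch; the %c* values are assumed to be materialized index constants, and the printed form mirrors the larger example below):

// bounds are zero-based relative to startIdx = 1, so elements 5:10 become 4..9
%b0 = acc.bounds lowerbound(%c4 : index) upperbound(%c9 : index) stride(%c1 : index) startIdx(%c1 : index)
// equivalent form: 6 elements starting at the same lowerbound
%b1 = acc.bounds lowerbound(%c4 : index) extent(%c6 : index) stride(%c1 : index) startIdx(%c1 : index)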

Let me know if you have any questions or suggestions.

Thanks for the proposal @razvan.lupusoru. It makes sense to me, and this decomposition will help when translating to LLVM.

It was initially tried in https://github.com/flang-compiler/f18-llvm-project/pull/915 but having it earlier will make the legalization from FIR easier.

Many thanks @razvan.lupusoru for sharing this design. Very interesting. Since similar constructs are being implemented in the OpenMP dialect this will be a very useful reference.

Of the three motivations that you provided, I am particularly intrigued by (2). From your explanation, there is really a question of the visibility of the mutation outside the region, and we might have overlooked this in OpenMP as well.

Is there a destroy operation as well to signify the end of the lifetime of the copy (%accvar)? Otherwise there can be a question about the value of %accvar in the code after the region or in a subsequent acc.parallel.

%accvar = acc.copyin(%var)
acc.parallel dataOperands(%accvar) {
  // mutate %accvar
}
// What is the value of %accvar here?
acc.parallel dataOperands(%accvar) {
// What is the value of %accvar here?
}
acc.parallel {
// What is the value of %accvar here?
}

The alternative here could be to move the copyin inside the region and have %accvar as an entry block argument. In this case, since %accvar is local to the region, there is no question about its value outside.

acc.parallel dataOperands(%var) {
  // %accvar is local to the region
  %accvar = acc.copyin(%var)
  // use %accvar
}

The bounds operation is also interesting; we really have an issue with finding out the values of these from the descriptors. I am not sure whether I missed the usage here. Is this additional information that is going to be added to a variable? Or are you recommending replacing the variable representation itself with the bounds?

Using block arguments is a nice idea to limit the lifetime to the region. Another idea is to blindly create an alloca for %accvar in the region. The lifetime is limited to the region. The optimizer can take care of finding a better solution.

Hi @razvan.lupusoru ,
thank you for your RFC. It looks interesting and I think that it is a step in the right direction.

I am interested in how you want to mark host variables and their copies on the device when the user allows a given kernel to execute either on the host or on the device. For example:

acc kernels copy(var) if (cond)
//kernel code

My understanding of your proposal is that we should generate the following code:

%accvar = acc.copyin(%var)
acc.parallel dataOperands(%accvar) if (%cond)
//use of %accvar
acc.copyout(%var, %accvar)

IMO this MLIR code implicitly assumes that the copy operation takes place even if the kernel is executed on the host because the if clause evaluated to false.

How do you want to mark that the copy operations are not necessary for kernels executed on the host?

First of all, I would like to thank every commenter so far for taking a look :slight_smile:

This dataflow model should work well for OpenMP target mapping operations. I would be happy to provide additional clarifications if you see this as something you’d like to mimic in the omp dialect.

Yes. There are 3 “data exit” operations that take %accvar: acc.copyout, acc.delete, acc.detach. So this model does indeed use a structured marker for lifetime.

That said, all of these operations are intended to match OpenACC semantics, including those about reference counters. What I mean is that a delete operation does not mean the data will be destroyed (I wanted to clarify this since you specifically mentioned a destroy operation).

Consider the following example:

%accvar1 = acc.copyin(%var)
acc.data dataOperands(%accvar1) {
  %accvar2 = acc.copyin(%var)
  acc.parallel dataOperands(%accvar2) {
  }
  acc.delete(%accvar2)
  %accvar3 = acc.copyin(%var)
  acc.parallel dataOperands(%accvar3) {
  }
  acc.delete(%accvar3)
}
acc.delete(%accvar1)

In the above example, %accvar1, %accvar2, and %accvar3 will all point to the same memory location (aka these memory references alias). This means acc.delete(%accvar2) does not actually delete the data from the offload device but simply decrements the counter.

In this example, without a data exit operation after any of the regions, the value of %accvar is exactly the same (aka it is the same pointer). From a dataflow perspective, using %accvar in multiple regions does mean that we are reading/mutating the same memory location.
From a verifier perspective (which I still need to fully implement), operations with the “structured” flag should require a matching exit operation. So your example would need a final exit operation to be well-formed.
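
Concretely, the first snippet from your question would need to end with a matching exit operation, roughly like this (a sketch; acc.delete(%accvar) would be used instead of the copyout if no copy back is wanted):

%accvar = acc.copyin(%var)
acc.parallel dataOperands(%accvar) {
  // mutate %accvar
}
acc.parallel dataOperands(%accvar) {
  // the same memory location is read/mutated here
}
acc.copyout(%var, %accvar)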

I would also like to add that your example is one of the reasons I decided to model these operations outside of the compute region. I envisioned being able to apply optimizations like CSE so that we don’t have a data upload per region. Getting data to the offload device should not have to be coupled with the compute.

The acc.bounds operation is not associated with a variable but with a data operation.

Let me share an example to demonstrate its usage and also to clarify its purpose and semantics. This example is probably more than you wanted to see, but it highlights a few interesting considerations I took into account when designing the acc.bounds operation.

Consider the following Fortran example (since you mentioned descriptors, I assume you were referring to FIR):

program main
  integer, allocatable :: array(:)
  integer :: arraysize
  allocate(array(10))
  !$acc data copy(array)
    !$acc serial copy(array(5:10)) copyout(arraysize)
      do ii = 1, 10
        array(ii) = ii
      end do
      arraysize = size(array)
    !$acc end serial
  !$acc end data
  print *, array
  print *, arraysize
end program

This example highlights a few interesting points:

  • It copies the whole array first and then, in a nested region, copies a slice.
  • It accesses the full array even though its immediate parent region only copied a slice.
  • It reads the array size from the descriptor.

Running this without OpenACC prints:

 1 2 3 4 5 6 7 8 9 10
 10

What about with OpenACC targeting an offload device? Well, let’s break this down.

  • The first copy moves the array data (10 elements) to the offload device.
  • The second copy increments the structured reference counter since the data is already on the device. This behavior also matches the OpenMP spec as far as I can tell (page 153 of the 5.2 spec).
  • Iterating through the array from the first element is legal and not an out-of-bounds violation - the whole array is on the device.
  • What about the size of the array? The OpenACC specification is primarily focused on the data. It only provides this guidance for the descriptor: “For Fortran array pointers and allocatable arrays, this includes copying any associated descriptor (dope vector) to the device copy of the pointer”.
  • So what should the value be? We obviously have the whole array on the device.
  • Well, the descriptor associated with the array has a size of 10. So on the device we should also have a size of 10.

(I also just tested with two offload implementations, nvfortran and gfortran, and confirmed the results.)

  !$acc data copy(array)
  !$acc serial copy(array(5:10)) copyout(arraysize)

==>

  %0 = fir.address_of(@_QFEarray) : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>
  %6 = fir.load %0 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>
  %7:3 = fir.box_dims %6, %c0 : (!fir.box<!fir.heap<!fir.array<?xi32>>>, index) -> (index, index, index)
  %8 = arith.subi %7#1, %c1 : index
  %9 = acc.bounds   lowerbound(%c0 : index) upperbound(%8 : index) stride(%7#2 : index) startIdx(%7#0 : index) {strideInBytes = true}
  %10 = fir.box_addr %6 : (!fir.box<!fir.heap<!fir.array<?xi32>>>) -> !fir.heap<!fir.array<?xi32>>
  %11 = acc.copyin varPtr(%10 : !fir.heap<!fir.array<?xi32>>)   bounds(%9) -> !fir.heap<!fir.array<?xi32>> {dataClause = 3 : i64, name = "array"}
  acc.data   dataOperands(%11 : !fir.heap<!fir.array<?xi32>>) {
    %27 = fir.load %0 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>
    %28:3 = fir.box_dims %27, %c0 : (!fir.box<!fir.heap<!fir.array<?xi32>>>, index) -> (index, index, index)
    %29 = arith.subi %c5, %28#0 : index
    %30 = arith.subi %c10, %28#0 : index
    %31 = acc.bounds   lowerbound(%29 : index) upperbound(%30 : index) stride(%28#2 : index) startIdx(%28#0 : index) {strideInBytes = true}
    %32 = fir.box_addr %27 : (!fir.box<!fir.heap<!fir.array<?xi32>>>) -> !fir.heap<!fir.array<?xi32>>
    %33 = acc.copyin varPtr(%32 : !fir.heap<!fir.array<?xi32>>)   bounds(%31) -> !fir.heap<!fir.array<?xi32>> {dataClause = 3 : i64, name = "array(5:10)"}
    %34 = acc.create varPtr(%1 : !fir.ref<i32>)   -> !fir.ref<i32> {dataClause = 4 : i64, name = "arraysize"}
    acc.serial   dataOperands(%33, %34 : !fir.heap<!fir.array<?xi32>>, !fir.ref<i32>) {
      ...
      %50 = fir.load %0 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>
      %51:3 = fir.box_dims %50, %c0 : (!fir.box<!fir.heap<!fir.array<?xi32>>>, index) -> (index, index, index)
      %52 = fir.convert %51#1 : (index) -> i32
      fir.store %52 to %1 : !fir.ref<i32>
      ...

So the points I wanted to highlight:

  • Notice how the acc.bounds operation is associated with the data action (acc.copyin). The original variable “array” is not reboxed. OpenACC slice semantics are not the same as slice semantics in the source language.
  • The two copyin operations produce %11 and %33, respectively. These point to the same memory location. The IR here makes this more evident because it keeps the bounds separate from the “varPtr”. And in both cases, the varPtr is the address loaded from the same descriptor.
  • The FIR inside the region is unaffected by the slicing operation. The original descriptor is still used.

I imagined the data operations being decoupled from the compute regions, so to me it made sense to have them outside of the region. I also imagined being able to optimize/move these operations (CSE, hoisting), in which case it made sense to have them dominate the compute region.

Dominik, this is a great example. The scenario you mentioned is actually not just a consideration for the “if” clause, but also for other clauses on compute/data constructs, including wait, async, and self.

The way I imagined this, though I am open to suggestions, is that we would have something like:

%acccond = acc.if(%cond)
%accvar = acc.copyin(%var) if(%acccond)
acc.parallel dataOperands(%accvar) if(%acccond)
//use of %accvar
acc.copyout(%var, %accvar) if(%acccond)

Meaning the same condition is associated with all of the data actions.

Let me simplify your example without the if condition to highlight a point. The OpenACC spec allows for a multicore CPU to be a device (3.3, section 1.2). It also clarifies in section 1.3 that “If a device shares memory with the local thread, its device data environment will be shared with the local thread. In that case, the implementation need not create new copies of the data for the device and no data movement need be done”. So “shared memory” from the OpenACC perspective has this specific meaning.

So the question is what a frontend should generate when it sees the following while we are compiling the program for a CPU:
acc kernels copyin(var)

Well I think it should generate this, regardless of target:

%accvar = acc.copyin(%var)
acc.kernels dataOperands(%accvar)

Namely, the acc dialect should be representative of the user’s code. At this point, there is no marking that the operations are not necessary. I strongly believe in the “progressive lowering” philosophy of MLIR - so we should not interpret these actions earlier than we need to.

However, later on, when we actually want to start interpreting the OpenACC semantics, we should do further optimizations. For example, on a multicore CPU memory system, we should determine that %accvar and %var alias to the same memory location, in which case the copy is unnecessary and can be removed.
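
As a sketch of what such a later pass could do (purely illustrative; the rewrite below is one possible outcome on a shared-memory device, not something the frontend would produce):

%accvar = acc.copyin(%var)
acc.kernels dataOperands(%accvar)
===>
acc.kernels dataOperands(%var)   // copyin removed because %accvar aliases %var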

I would say that a similar thing should happen in your example with the if condition. Assuming the condition is compile-time computable (often it may not be) and that it evaluates to false, a later optimization pass should determine that %accvar and %var are the same and remove the copy then.

For other “shared memory” systems (as per OpenACC), we may decide not to remove it. Consider NVIDIA unified memory: we should not need to copy the data to the device since it should be migrated automatically. But what if we wanted to insert hints that the data will be used on the device, and to prefetch it before use? In that case, it is still useful to do something with that copy action. So the point I am making here is that early marking that the operations are not necessary is not useful, but removal or another action later might be.

@razvan.lupusoru thank you for your explanation. I work on OpenMP and I strongly support your idea of progressive lowering, but let me ask a few questions:

  1. Do you need to refactor a lot of code to support your idea?
  2. Do you think that your code can be reused in OpenMP as well? If yes, could you implement it in a common directory so that OpenMP can reuse it?

Razvan, you might consider that it doesn’t have to be either/or. You can have a high-level operation that folds in the copy-in, etc. semantics and is easy for the front end to target. Then also define a number of more primitive operations that can be used to describe the computational steps at a lower level of granularity.

Ultimately, you probably just end up with an allocation, some sort of copy/initialization/load-store, and a copy-back/deallocation and let the optimizer handle the nuts and bolts.

  1. Do you need to refactor a lot of code to support your idea?

Yes, we did refactor a lot of things. This is not a light change, but it was needed.

  2. Do you think that your code can be reused in OpenMP as well? If yes, could you implement it in a common directory so that OpenMP can reuse it?

It can probably be used in OpenMP if you define data entry/data exit operations that carry the OpenMP semantics for data mapping. The portion of the code that gathers the bounds and base address is of course FIR-specific and can be reused; it is already implemented in flang/lib/Lower/OpenACC.cpp and could be shared if needed.