[RFC] OpenACC dialect data operation improvements

razvan.lupusoru · June 2, 2023, 7:08pm

First of all, I would like to thank every commenter so far for taking a look.

This dataflow model should work well for OpenMP target mapping operations. I would be happy to provide additional clarifications if you see this as something you’d like to mimic in the omp dialect.

Yes. There are 3 “data exit” operations that take %accvar: acc.copyout, acc.delete, acc.detach. So this model does indeed use a structured marker for lifetime.

That said, all of these operations are intended to match OpenACC semantics, including those about reference counters. What I mean is that a delete operation does not necessarily mean the data will be destroyed (I wanted to clarify this since you specifically mentioned a destroy operation).

Consider the following example:

%accvar1 = acc.copyin(%var)
acc.data dataOperands(%accvar1) {
  %accvar2 = acc.copyin(%var)
  acc.parallel dataOperands(%accvar2) {
  }
  acc.delete(%accvar2)
  %accvar3 = acc.copyin(%var)
  acc.parallel dataOperands(%accvar3) {
  }
  acc.delete(%accvar3)
}
acc.delete(%accvar1)

In the above example, %accvar1, %accvar2, and %accvar3 all point to the same memory location (that is, these memory references alias). This means acc.delete(%accvar2) does not actually delete data from the offload device; it simply decrements the reference counter.
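
The counter behavior described above can be sketched with a small Python model. This is only an illustration of OpenACC-style reference counting, not how the dialect or any runtime implements it; `DeviceTable`, its methods, and the made-up device addresses are hypothetical names introduced for this sketch.

```python
class DeviceTable:
    """Toy present table: one device address and one counter per host variable."""

    def __init__(self):
        self._entries = {}         # host name -> [device address, reference count]
        self._next_addr = 0x1000   # made-up device address space (assumption)

    def copyin(self, host_name):
        """Return the device address, allocating only on the first reference."""
        if host_name not in self._entries:
            self._entries[host_name] = [self._next_addr, 0]
            self._next_addr += 0x100
        entry = self._entries[host_name]
        entry[1] += 1
        return entry[0]

    def delete(self, host_name):
        """Decrement the counter; release the device copy only at zero."""
        entry = self._entries[host_name]
        entry[1] -= 1
        if entry[1] == 0:
            del self._entries[host_name]

    def is_present(self, host_name):
        return host_name in self._entries


table = DeviceTable()
accvar1 = table.copyin("var")   # counter 0 -> 1, allocates
accvar2 = table.copyin("var")   # counter 1 -> 2, same address
table.delete("var")             # counter 2 -> 1, data stays on the device
present_after_inner_delete = table.is_present("var")
accvar3 = table.copyin("var")   # counter 1 -> 2, same address again
table.delete("var")             # counter 2 -> 1
table.delete("var")             # counter 1 -> 0, device copy released
present_at_end = table.is_present("var")
```

In this model the three handles compare equal, and only the last `delete` removes the entry, which mirrors the aliasing of %accvar1, %accvar2, and %accvar3.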

In this example, without a data exit operation after any of the regions, the value of %accvar is exactly the same (aka it is the same pointer). From a dataflow perspective, using %accvar in multiple regions does mean that we are reading/mutating the same memory location.
From a verifier perspective (which I still need to fully implement), operations with the “structured” flag should require a matching exit operation. So your example would need a final exit operation to be well-formed.

I would also like to add that your example is one of the reasons I decided to model them outside of the compute region. I envisioned being able to apply optimizations like CSE so that we don’t have a data upload per region. Getting data to the offload device should not have to be coupled with the compute.

The acc.bounds operation is not associated with a variable but with a data operation.

Let me share an example to demonstrate its usage and also to clarify its purpose and semantics. I think this example is probably more than you wanted to see but it highlights a few interesting considerations I took when thinking about how to design the acc.bounds operation.

Consider the following Fortran example (since you mentioned descriptors, I assume you were referring to FIR):

program main
  integer, allocatable :: array(:)
  integer :: arraysize
  allocate(array(10))
  !$acc data copy(array)
    !$acc serial copy(array(5:10)) copyout(arraysize)
      do ii = 1, 10
        array(ii) = ii
      end do
      arraysize = size(array)
    !$acc end serial
  !$acc end data
  print *, array
  print *, arraysize
end program

This example highlights a few interesting points:

It copies the whole array first and then, in a nested region, it copies a slice.
It accesses the full array even though its immediate parent region only copied a slice.
It reads the array size from the descriptor.

Running this without OpenACC prints:

 1 2 3 4 5 6 7 8 9 10
 10

What about with OpenACC targeting an offload device? Well, let’s break this down.

The first copy moves the array data (10 elements) to the offload device.
The second copy increments the structured reference counter since the data is already on the device. This behavior also matches the OpenMP spec as far as I can tell (page 153 of the 5.2 spec).
Iterating through the array from first element is legal and not an out of bounds violation - the whole array is on the device.
What about the size of the array? OpenACC specification is primarily focused on the data. It only provides this guidance for the descriptor: “For Fortran array pointers and allocatable arrays, this includes copying any associated descriptor (dope vector) to the device copy of the pointer”.
So what should the value be? We obviously have the whole array on the device.
Well, the associated descriptor with the array has a size of 10. So on device we should also have a size of 10.

(I also just tested with two offload implementations, nvfortran and gfortran, and confirmed the results).
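
The walkthrough above can be replayed with a small Python model of the Fortran program. This is a simplified sketch, not a description of any runtime: `enter_copy` and `exit_copy` are hypothetical helpers, presence is looked up by variable name rather than by address range, and the host array is zero-initialized only so the model has defined contents.

```python
device = {}  # variable name -> {"data": device copy, "refcount": structured counter}


def enter_copy(name, host_data):
    """Entry action of a copy clause: transfer only if the data is not present."""
    if name in device:
        device[name]["refcount"] += 1
    else:
        device[name] = {"data": list(host_data), "refcount": 1}


def exit_copy(name, host_data):
    """Exit action of a copy clause: copy back and free only at counter zero."""
    entry = device[name]
    entry["refcount"] -= 1
    if entry["refcount"] == 0:
        host_data[:] = entry["data"]
        del device[name]


host_array = [0] * 10             # allocate(array(10))

enter_copy("array", host_array)   # !$acc data copy(array): 10 elements transferred
enter_copy("array", host_array)   # !$acc serial copy(array(5:10)): already present,
                                  # so only the counter is incremented

device_array = device["array"]["data"]
for ii in range(1, 11):           # do ii = 1, 10: the whole array is on the device
    device_array[ii - 1] = ii
arraysize = len(device_array)     # stands in for size(array) read from the descriptor

exit_copy("array", host_array)    # !$acc end serial: counter 2 -> 1, no transfer
host_after_serial = list(host_array)
exit_copy("array", host_array)    # !$acc end data: counter 1 -> 0, copy back
```

Under these assumptions the host only sees the values 1 through 10 after the outer data region ends, and the size read inside the compute region is 10.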

  !$acc data copy(array)
  !$acc serial copy(array(5:10)) copyout(arraysize)

==>

  %0 = fir.address_of(@_QFEarray) : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>
  %6 = fir.load %0 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>
  %7:3 = fir.box_dims %6, %c0 : (!fir.box<!fir.heap<!fir.array<?xi32>>>, index) -> (index, index, index)
  %8 = arith.subi %7#1, %c1 : index
  %9 = acc.bounds   lowerbound(%c0 : index) upperbound(%8 : index) stride(%7#2 : index) startIdx(%7#0 : index) {strideInBytes = true}
  %10 = fir.box_addr %6 : (!fir.box<!fir.heap<!fir.array<?xi32>>>) -> !fir.heap<!fir.array<?xi32>>
  %11 = acc.copyin varPtr(%10 : !fir.heap<!fir.array<?xi32>>)   bounds(%9) -> !fir.heap<!fir.array<?xi32>> {dataClause = 3 : i64, name = "array"}
  acc.data   dataOperands(%11 : !fir.heap<!fir.array<?xi32>>) {
    %27 = fir.load %0 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>
    %28:3 = fir.box_dims %27, %c0 : (!fir.box<!fir.heap<!fir.array<?xi32>>>, index) -> (index, index, index)
    %29 = arith.subi %c5, %28#0 : index
    %30 = arith.subi %c10, %28#0 : index
    %31 = acc.bounds   lowerbound(%29 : index) upperbound(%30 : index) stride(%28#2 : index) startIdx(%28#0 : index) {strideInBytes = true}
    %32 = fir.box_addr %27 : (!fir.box<!fir.heap<!fir.array<?xi32>>>) -> !fir.heap<!fir.array<?xi32>>
    %33 = acc.copyin varPtr(%32 : !fir.heap<!fir.array<?xi32>>)   bounds(%31) -> !fir.heap<!fir.array<?xi32>> {dataClause = 3 : i64, name = "array(5:10)"}
    %34 = acc.create varPtr(%1 : !fir.ref<i32>)   -> !fir.ref<i32> {dataClause = 4 : i64, name = "arraysize"}
    acc.serial   dataOperands(%33, %34 : !fir.heap<!fir.array<?xi32>>, !fir.ref<i32>) {
      ...
      %50 = fir.load %0 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>
      %51:3 = fir.box_dims %50, %c0 : (!fir.box<!fir.heap<!fir.array<?xi32>>>, index) -> (index, index, index)
      %52 = fir.convert %51#1 : (index) -> i32
      fir.store %52 to %1 : !fir.ref<i32>
      ...

So the points I wanted to highlight:

Notice how the acc.bounds operation is associated with the data action (acc.copyin). The original variable “array” is not reboxed. OpenACC slice semantics are not the same as slice semantics in the source language.
The two copyin operations produce %11 and %33 respectively. These point to the same memory location. The IR here makes it more evident because it keeps the bounds separate from the “varPtr”. And in both cases, the varPtr is the loaded address from the same descriptor.
The FIR inside the region is unaffected by the slicing operation. The original descriptor is still used.
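
The bounds arithmetic in the IR above can be checked with a short Python sketch. It mirrors the `arith.subi` operations feeding the two acc.bounds operations; the `Bounds` class and the two helper functions are hypothetical names, and the descriptor values (lower bound 1, extent 10, 4-byte stride) are assumptions for a contiguous default-integer array allocated as `array(10)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    lowerbound: int  # zero-based offset of the first element
    upperbound: int  # zero-based offset of the last element
    stride: int      # in bytes, matching strideInBytes = true
    start_idx: int   # declared lower bound taken from the descriptor


def whole_array_bounds(lb, extent, byte_stride):
    """Bounds for copy(array): lowerbound is 0, upperbound is extent - 1."""
    return Bounds(0, extent - 1, byte_stride, lb)


def slice_bounds(first, last, lb, byte_stride):
    """Bounds for copy(array(first:last)): both ends are normalized by lb."""
    return Bounds(first - lb, last - lb, byte_stride, lb)


# Assumed results of fir.box_dims for allocate(array(10)) of 4-byte integers.
lb, extent, byte_stride = 1, 10, 4

full = whole_array_bounds(lb, extent, byte_stride)  # corresponds to %9
part = slice_bounds(5, 10, lb, byte_stride)         # corresponds to %31
```

With these inputs `full` covers offsets 0 through 9 and `part` covers offsets 4 through 9. Both share the same stride and start index, since the bounds only describe the data action and the descriptor itself is untouched.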
