First of all, I would like to thank every commenter so far for taking a look.
This dataflow model should work well for OpenMP target mapping operations. I would be happy to provide additional clarifications if you see this as something you’d like to mimic in the omp dialect.
Yes. There are 3 “data exit” operations that take %accvar: acc.copyout, acc.delete, acc.detach. So this model does indeed use a structured marker for lifetime.
That said, all of these operations are intended to match OpenACC semantics, including the rules about reference counters. In particular, a delete operation does not necessarily mean the data will be destroyed (I want to clarify this since you specifically mentioned a destroy operation).
Consider the following example:
%accvar1 = acc.copyin(%var)
acc.data dataOperands(%accvar1) {
  %accvar2 = acc.copyin(%var)
  acc.parallel dataOperands(%accvar2) {
  }
  acc.delete(%accvar2)
  %accvar3 = acc.copyin(%var)
  acc.parallel dataOperands(%accvar3) {
  }
  acc.delete(%accvar3)
}
acc.delete(%accvar1)
In the above example, %accvar1, %accvar2, and %accvar3 all point to the same memory location (that is, these memory references alias). This means acc.delete(%accvar2) does not actually delete the data from the offload device; it simply decrements the reference counter.
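To make the counter behavior concrete, here is a minimal Python sketch of the example above. It is only a toy model: the class `DeviceDataModel`, its methods, and the log strings are hypothetical names invented for illustration, not part of the acc dialect or any OpenACC runtime. It assumes a single structured reference counter per host variable.

```python
class DeviceDataModel:
    """Toy model (hypothetical) of per-variable reference counting on a device."""

    def __init__(self):
        self.refcount = {}  # host variable name -> reference count
        self.log = []       # data actions that would actually reach the device

    def copyin(self, var):
        # Only the first copyin allocates and uploads; later ones just count.
        if self.refcount.get(var, 0) == 0:
            self.log.append(f"allocate+upload {var}")
        self.refcount[var] = self.refcount.get(var, 0) + 1
        # Every copyin of the same variable yields the same device handle.
        return ("device", var)

    def delete(self, accvar):
        _, var = accvar
        self.refcount[var] -= 1
        # The device copy is released only when the counter reaches zero.
        if self.refcount[var] == 0:
            self.log.append(f"free {var}")
            del self.refcount[var]


def run_example():
    dev = DeviceDataModel()
    accvar1 = dev.copyin("var")        # outer acc.copyin
    accvar2 = dev.copyin("var")        # first inner acc.copyin
    dev.delete(accvar2)                # decrements only
    count_inside = dev.refcount["var"] # still held by the outer copyin
    accvar3 = dev.copyin("var")        # second inner acc.copyin
    dev.delete(accvar3)                # decrements only
    dev.delete(accvar1)                # last reference: data is released
    assert accvar1 == accvar2 == accvar3  # all three handles alias
    return dev.log, count_inside
```

In this model `run_example()` records exactly one upload and one free, which is the aliasing and counting behavior described above.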
In this example, without a data exit operation after any of the regions, the value of %accvar is exactly the same (aka it is the same pointer). From a dataflow perspective, using %accvar in multiple regions does mean that we are reading/mutating the same memory location.
From a verifier perspective (which I still need to fully implement), operations with the “structured” flag should require a matching exit operation. So your example would need a final exit operation to be well-formed.
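As a rough illustration of that check, the following Python sketch scans a flat list of operations and reports structured entry results that never reach an exit operation. This is not the MLIR verifier: the tuple encoding and the function name `unmatched_structured_entries` are assumptions made up for this sketch; only the three exit operation names come from the discussion above.

```python
# Data exit operations named earlier in this post.
EXIT_OPS = {"acc.copyout", "acc.delete", "acc.detach"}


def unmatched_structured_entries(ops):
    """Return structured entry values that no exit operation consumes.

    `ops` is a hypothetical encoding: a list of (op_name, value, structured)
    tuples in program order. For an entry operation, `value` is its result;
    for an exit operation, `value` is the operand it consumes.
    """
    pending = []
    for op_name, value, structured in ops:
        if op_name in EXIT_OPS:
            # An exit operation closes the lifetime of the value it consumes.
            if value in pending:
                pending.remove(value)
        elif structured:
            # A structured entry operation opens a lifetime that must be closed.
            pending.append(value)
    return pending
```

An empty result corresponds to well-formed IR under this simplified rule; any returned value is an entry operation still missing its exit.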
I would also like to add that your example is one of the reasons I decided to model them outside of the compute region. I envisioned being able to apply optimizations like CSE so that we don’t have a data upload per region. Getting data to the offload device should not have to be coupled with the compute.
The acc.bounds operation is not associated with a variable but with a data operation.
Let me share an example to demonstrate its usage and also to clarify its purpose and semantics. I think this example is probably more than you wanted to see but it highlights a few interesting considerations I took when thinking about how to design the acc.bounds operation.
Consider the following Fortran example (since you mentioned descriptors, I assume you were referring to FIR):
program main
integer, allocatable :: array(:)
integer :: arraysize
allocate(array(10))
!$acc data copy(array)
!$acc serial copy(array(5:10)) copyout(arraysize)
do ii = 1, 10
array(ii) = ii
end do
arraysize = size(array)
!$acc end serial
!$acc end data
print *, array
print *, arraysize
end program
This example highlights a few interesting points:
- It copies the whole array first and then, in a nested region, it copies a slice.
- It accesses the full array even though its immediate parent region only copied a slice.
- It reads the array size from the descriptor.
Running this without OpenACC prints:
1 2 3 4 5 6 7 8 9 10
10
What about with OpenACC targeting an offload device? Well, let’s break this down.
- The first copy moves the array data (10 elements) to the offload device.
- The second copy increments the structured reference counter since the data is already on the device. This behavior also matches the OpenMP spec as far as I can tell (page 153 of the 5.2 spec).
- Iterating through the array from the first element is legal and not an out-of-bounds violation, since the whole array is on the device.
- What about the size of the array? The OpenACC specification is primarily focused on the data. It only provides this guidance for the descriptor: “For Fortran array pointers and allocatable arrays, this includes copying any associated descriptor (dope vector) to the device copy of the pointer”.
- So what should the value be? We obviously have the whole array on the device.
- Well, the descriptor associated with the array has a size of 10. So on the device we should also have a size of 10.
(I also just tested with two offload implementations, nvfortran and gfortran, and confirmed the results.)
!$acc data copy(array)
!$acc serial copy(array(5:10)) copyout(arraysize)
==>
%0 = fir.address_of(@_QFEarray) : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>
%6 = fir.load %0 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>
%7:3 = fir.box_dims %6, %c0 : (!fir.box<!fir.heap<!fir.array<?xi32>>>, index) -> (index, index, index)
%8 = arith.subi %7#1, %c1 : index
%9 = acc.bounds lowerbound(%c0 : index) upperbound(%8 : index) stride(%7#2 : index) startIdx(%7#0 : index) {strideInBytes = true}
%10 = fir.box_addr %6 : (!fir.box<!fir.heap<!fir.array<?xi32>>>) -> !fir.heap<!fir.array<?xi32>>
%11 = acc.copyin varPtr(%10 : !fir.heap<!fir.array<?xi32>>) bounds(%9) -> !fir.heap<!fir.array<?xi32>> {dataClause = 3 : i64, name = "array"}
acc.data dataOperands(%11 : !fir.heap<!fir.array<?xi32>>) {
%27 = fir.load %0 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>
%28:3 = fir.box_dims %27, %c0 : (!fir.box<!fir.heap<!fir.array<?xi32>>>, index) -> (index, index, index)
%29 = arith.subi %c5, %28#0 : index
%30 = arith.subi %c10, %28#0 : index
%31 = acc.bounds lowerbound(%29 : index) upperbound(%30 : index) stride(%28#2 : index) startIdx(%28#0 : index) {strideInBytes = true}
%32 = fir.box_addr %27 : (!fir.box<!fir.heap<!fir.array<?xi32>>>) -> !fir.heap<!fir.array<?xi32>>
%33 = acc.copyin varPtr(%32 : !fir.heap<!fir.array<?xi32>>) bounds(%31) -> !fir.heap<!fir.array<?xi32>> {dataClause = 3 : i64, name = "array(5:10)"}
%34 = acc.create varPtr(%1 : !fir.ref<i32>) -> !fir.ref<i32> {dataClause = 4 : i64, name = "arraysize"}
acc.serial dataOperands(%33, %34 : !fir.heap<!fir.array<?xi32>>, !fir.ref<i32>) {
...
%50 = fir.load %0 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>
%51:3 = fir.box_dims %50, %c0 : (!fir.box<!fir.heap<!fir.array<?xi32>>>, index) -> (index, index, index)
%52 = fir.convert %51#1 : (index) -> i32
fir.store %52 to %1 : !fir.ref<i32>
...
So the points I wanted to highlight:
- Notice how the acc.bounds operation is associated with the data action (acc.copyin). The original variable “array” is not reboxed. OpenACC slice semantics are not the same as slice semantics in the source language.
- The two copyin operations produce %11 and %33 respectively. These point to the same memory location. The IR here makes it more evident because it keeps the bounds separate from the “varPtr”. And in both cases, the varPtr is the loaded address from the same descriptor.
- The FIR inside the region is unaffected by the slicing operation. The original descriptor is still used.
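The bounds arithmetic in the IR above can be restated as a small Python sketch. This is only my reading of the snippet: the helper names `whole_array_bounds` and `slice_bounds` are hypothetical, and it assumes the three results of fir.box_dims are the lower bound, extent, and stride of the dimension, with a default Fortran lower bound of 1 for this array.

```python
def whole_array_bounds(descr_lb, extent):
    """Bounds for copy(array): mirrors %9 in the IR above.

    Returns (lowerbound, upperbound, startIdx), where lowerbound and
    upperbound are zero-based offsets and startIdx is the descriptor's
    lower bound.
    """
    return (0, extent - 1, descr_lb)


def slice_bounds(descr_lb, first, last):
    """Bounds for copy(array(first:last)): mirrors %31 in the IR above.

    The source-level indices are normalized by subtracting the
    descriptor's lower bound; the descriptor itself is left untouched.
    """
    return (first - descr_lb, last - descr_lb, descr_lb)


# Assumed values for the example: lower bound 1, extent 10.
full = whole_array_bounds(1, 10)  # (0, 9, 1)
part = slice_bounds(1, 5, 10)     # (4, 9, 1)
```

Both results carry the same startIdx and differ only in the normalized range, which matches the point that the slice lives on the data operation rather than in a reboxed descriptor.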