[RFC][OpenMP] supporting delayed task execution with firstprivate variables

Context

OpenMP tasks are executed asynchronously. By default, most variables referenced in a task are firstprivate, meaning that the initial value of the private variable has to be copied from the source variable. Our current OpenMP codegen puts the firstprivate copy region(s) inside the task’s body. That copy is not safe to execute some time later, because the source variables may be out of scope by then. Clang solves this problem by storing some variable data in the task descriptor. The data in the task descriptor is initialized synchronously, when the task is created, which avoids the problem.

program test
  integer :: x(1)
  !$omp parallel
    call repo(1)
  !$omp end parallel
contains
  subroutine repo(n)
    ! n has to be a dummy argument to reproduce observable issue
    integer :: n
    ! x is implicitly firstprivate
    !$omp task private(n)
      x(n) = 1
    !$omp end task
    ! a taskwait here takes the bug away
    ! otherwise if the task is not executed quickly enough, x is no longer in
    ! scope and may become deinitialized (when the outlined parallel function
    ! returns). This causes indexing into random memory.
    ! OpenMP 5.2 section 12.5 only says it is the programmer's responsibility
    ! to ensure that storage does not reach the end of its lifetime when that
    ! storage is shared.
  end subroutine
end program

OpenMPIRBuilder already has some support for storing data in a task descriptor. It gathers all live-in values of the task body into a structure, which is initialized when the task is created. The OpenMP runtime passes this structure to the task when it is eventually scheduled. We need to reuse this OpenMPIRBuilder infrastructure to play nicely with clang (there are also code structure reasons that would make it hard to re-implement without breaking clang). Unfortunately, in flang private variables are always modeled as passed by reference into task regions (which makes good sense for the design of (HL)FIR because it allows us to generate (hl)fir.declare operations). OpenMPIRBuilder dutifully copies the pointer value into the structure and passes that to the task. The task then uses the pointers to access memory which may no longer be in scope.

I intend to structure the solution something like this:

  1. Allocate a structure containing the information we need on the heap
  2. Copy variable data into that structure using the copy region of omp.private
  3. Insert code at the start of the task body to extract each firstprivate variable from the structure
  4. Map all references to the firstprivate variables in the task body to the variables stored in the structure allocated in step 1
  5. Deallocate the structure (from step 1) at the end of the task body.

This plays well with OpenMPIRBuilder because it will see the pointer to the structure (from step 1) as the only live input value to the task body and then allocate its own structure containing only this pointer. It is unfortunate that we end up with this double allocation and double pointer dereference, but it is hard to work around with the current design of OpenMPIRBuilder. Perhaps there could be an optimization added to OpenMPIRBuilder at some future time where if the structure would only be up to the length of one pointer, it is stored directly instead. I don’t think this optimization would get in clang’s way.
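To make the five steps concrete, here is a minimal sketch in plain C rather than MLIR or LLVM IR. All names (task_context, task_args, create_task, task_body, run_deferred_task) are hypothetical and only model the idea: task_args stands in for the structure OpenMPIRBuilder would allocate, which sees the context pointer as its only live-in value.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical context holding the firstprivate data for one task. */
typedef struct {
    int x[4];   /* firstprivate copy of an array */
    int n;      /* firstprivate copy of a scalar */
} task_context;

/* Stand-in for the structure built by outlining: its only live-in
   value is the pointer to our context (the double indirection). */
typedef struct {
    task_context *ctx;
} task_args;

static int task_result;

/* Outlined task body, possibly run long after the task was created. */
static void task_body(task_args *args) {
    task_context *ctx = args->ctx;   /* step 3: extract the variables */
    ctx->x[ctx->n] = 1;              /* step 4: body uses the private copies */
    task_result = ctx->x[0] + ctx->x[1] + ctx->x[2] + ctx->x[3];
    free(ctx);                       /* step 5: release the context */
}

/* "Encounters" the task: prepares it but does not run it. */
static task_args create_task(void) {
    int x[4] = {10, 20, 30, 40};     /* stack storage that dies on return */
    int n = 2;
    task_context *ctx = malloc(sizeof *ctx);   /* step 1: heap allocation */
    assert(ctx != NULL);
    memcpy(ctx->x, x, sizeof x);               /* step 2: copy the data */
    ctx->n = n;
    task_args args = { ctx };
    return args;
}

/* The source variables are out of scope before the body runs. */
int run_deferred_task(void) {
    task_args args = create_task();
    task_body(&args);
    return task_result;
}
```

Calling run_deferred_task() returns 71 (10 + 20 + 1 + 40): the body only ever touches the heap copy, so it does not matter that the stack frame of create_task is gone.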

Problem

The difficulty is with step 1 above. The existing allocation regions of omp.private always allocate using alloca. What we need instead is the type of each variable, so that it can be placed inside the structure. Some of the allocation regions also initialize the variables: e.g. for an allocatable private variable, its allocation status must match the allocation status of the source variable (required by the OpenMP standard).

Potential solutions

  1. Redefine omp.private to give a type that should be allocated, and then replace the alloc region with an optional initialization region for the above cases. This has the advantage of making it easier to group all of the allocas together when not supporting firstprivate variables for tasks. The disadvantage here is a lot of code churn.
  2. Always allocate private variables on the heap. The pointer returned by the omp.private alloc region could then be copied into the structure and the memory freed at the end of the task using omp.private’s deallocation region. This incurs the runtime cost of heap allocation but involves a lot less code churn. Implicit in this is the assumption that omp.private always uses heap pointers even when lowering things that aren’t Fortran. The structure allocated and initialized by OpenMPIRBuilder is probably sufficient in this case: therefore avoiding the double wrapping in my solution discussed above.
  3. Programmatically modify the alloc region to support initializing a structure member (the type can be found from the alloca). This shouldn’t be too hard to get working for flang’s cases but is very brittle in general.

My preferred solution is (2) but I am very open to other ideas.
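For comparison, option 2 could be pictured like this in plain C. This is only a sketch under assumptions: the privatizer struct with its alloc, copy and dealloc callbacks is a hypothetical stand-in for the regions of omp.private, and task_args stands in for the structure OpenMPIRBuilder already builds.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of a privatizer whose regions work on heap pointers. */
typedef struct {
    void *(*alloc)(void);                        /* alloc region */
    void  (*copy)(const void *src, void *dst);   /* copy region */
    void  (*dealloc)(void *priv);                /* dealloc region */
} privatizer;

static void *alloc_int(void) {
    int *p = malloc(sizeof *p);
    assert(p != NULL);
    return p;
}
static void copy_int(const void *src, void *dst) {
    *(int *)dst = *(const int *)src;
}
static void dealloc_int(void *priv) { free(priv); }

static const privatizer int_firstprivate = { alloc_int, copy_int, dealloc_int };

/* What gets packaged for the task: heap pointers, safe to copy verbatim. */
typedef struct { int *x; } task_args;

static task_args encounter_task(void) {
    int x = 41;                       /* source variable, dies on return */
    task_args a;
    a.x = int_firstprivate.alloc();   /* heap instead of alloca */
    int_firstprivate.copy(&x, a.x);   /* initialize synchronously */
    return a;
}

static int task_body(task_args a) {
    int result = *a.x + 1;            /* use the private copy */
    int_firstprivate.dealloc(a.x);    /* free at the end of the task */
    return result;
}

int run_option2_task(void) {
    task_args a = encounter_task();
    return task_body(a);
}
```

Here run_option2_task() returns 42. There is no extra wrapper structure, at the price of one heap allocation per private variable.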

CC: @bob.belcher

Thanks @tblah for picking up this work. This is probably the last remaining major work for OpenMP 3.1.

Just for strictness, there are also undeferred/included tasks, depending on how the task is created or whether the if clause evaluates to false.

I am probably reading this wrong. n is private, right? Or did you mean it to be firstprivate? Isn’t the problem that the array x is firstprivate and is initialized from values outside the task region, whose lifetime might have ended by the time the task executes? n is allocated on the task’s stack, so its lifetime matches the lifetime of the task.

How will we represent this type? By the time it is llvm dialect, all that info will probably not be useful. Do you mean to create a structure/descriptor?

This is probably the easiest way to make progress here. But there could be synchronization overheads for malloc from different tasks along with the cost of heap allocation. Also, if the Clang implementation is different, the OpenMPIRBuilder code cannot be shared. I am assuming you will extend the existing OpenMPIRBuilder code for tasks to have the kmp_private_t structure as well.

Worth checking with @alexey-bataev about the implementation of deferred execution of tasks in Clang and whether heap allocation is reasonable. Also @mjklemm for the runtime/standards perspective.

Thanks for taking a look Kiran.

You are right. I will update the description.

The omp.private operation would need to know which MLIR type should be allocated, e.g. i32, fir.box<>, etc. Then we will need to make sure that type is appropriately lowered to the llvm dialect (for flang this might need some special handling to ensure we get the box structure, not a pointer to it). For flang, the omp.private type is !fir.ref<SOMETHING>. Here the type stored would be that SOMETHING (this has to be stored separately because by the time we reach the llvm dialect fir.ref<> is an opaque pointer).

The whole point here is not to modify the existing OpenMPIRBuilder code. Doing so (without affecting clang) would be quite challenging.

OpenMPIRBuilder already allocates a structure containing the private variables, which is populated with a bit-for-bit copy of each llvm value used inside the task body. The problem for flang is that the private variables are all pointers to stack-allocated variables or descriptors. Therefore, if the task is scheduled after the function returns, these pointers will no longer be valid. OpenMPIRBuilder works something like this:

  1. Generate the task body at the location that the task directive was encountered
  2. Return from OpenMPIRBuilder and process other directives
  3. Finalize OpenMPIRBuilder: this performs all outlining. The outlined functions package live input values in a structure argument.
  4. After outlining takes place, a callback scans for the allocation of the structure generated by function outlining, then packages this into the runtime calls for scheduling the task. The structure is populated by doing a memcpy of each variable into it.
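The effect of that bit-for-bit copy can be illustrated with a small C sketch. The live_ins structure and capture function are hypothetical stand-ins for the outlined structure argument and the code that populates it. The example deliberately avoids dereferencing a dangling pointer: it only shows that the captured pointer still aliases the source variable, which is exactly what goes wrong once that variable is gone.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the structure argument produced by outlining:
   one slot per live-in value of the task body. */
typedef struct {
    int *x_ref;   /* a variable passed by reference, as flang does */
    int  n_val;   /* a plain scalar value */
} live_ins;

/* Populate the structure with a bit-for-bit copy of each live-in value.
   For x that value is the pointer itself, not the data it points to. */
static live_ins capture(int *x, int n) {
    live_ins s;
    memcpy(&s.x_ref, &x, sizeof x);
    memcpy(&s.n_val, &n, sizeof n);
    return s;
}

/* Returns 1 if n was snapshotted while x still aliases its source. */
int pointer_capture_aliases_source(void) {
    int x = 5, n = 7;
    live_ins s = capture(&x, n);
    x = 99;   /* change both sources after the capture */
    n = 0;
    /* The scalar kept its captured value; the pointer sees the new x. */
    return s.n_val == 7 && *s.x_ref == 99 && s.x_ref == &x;
}
```

pointer_capture_aliases_source() returns 1: copying a scalar gives the task its own value, while copying a pointer ties the task to the lifetime of whatever the pointer refers to.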

In solutions 1 and 3, we set things up so that OpenMPIRBuilder sees a pointer to the structure containing our firstprivate variables as the only live input value. It is safe for it to just duplicate the pointer because we control the lifetime of the structure manually.

In solution 2, we make sure each individual firstprivate variable is only accessed via a heap pointer (whose lifetime we control). This way the pointers can be safely copied into the structure passed to the outlined task body (as OpenMPIRBuilder already does).

Progress update:

Option 2 was difficult to implement for Flang because fir.box<> gets lowered to a pointer to a stack allocation. As this isn’t modeled in MLIR at all, I couldn’t find a clean way to put the box on the heap so that the pointer stays valid until the end of the task execution. The main attraction of this approach was that I hoped it wouldn’t be too disruptive to implement. It stops looking like a good option if it needs intrusive changes to non-OpenMP flang codegen.

So I have switched to working on option 1. This is how it looks for a simple case:

subroutine test
  integer :: arg
  !$omp parallel private(arg)
  call use_integer(arg)
  !$omp end parallel
end subroutine

  omp.private {type = private} @_QFtestEarg_private_ref_i32 : i32

  func.func @_QPtest() {
    %0 = fir.alloca i32 {bindc_name = "arg", uniq_name = "_QFtestEarg"}
    %1:2 = hlfir.declare %0 {uniq_name = "_QFtestEarg"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
    omp.parallel private(@_QFtestEarg_private_ref_i32 %1#0 -> %arg0 : !fir.ref<i32>) {
      %2:2 = hlfir.declare %arg0 {uniq_name = "_QFtestEarg"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
      fir.call @_QPuse_integer(%2#1) fastmath<contract> : (!fir.ref<i32>) -> ()
      omp.terminator
    }
    return
  }

Note:

  • The type stored in the omp.private is now the type that must be allocated for this variable (usually by llvm.alloca, but in the case of firstprivate task variables this type would be allocated as part of a structure containing context required by the task).
  • There is no allocation region in the omp.private because the allocation is now handled by openmp to llvmir conversion
  • For types that require some runtime initialization (e.g. allocatables), an initialization region would replace the allocation region. This region has the same signature as the copy region.

The allocation is added implicitly to the llvmir (see %omp.private.alloc). In this case, the only changes to the LLVMIR are that the omp.reduction.latealloc region has been renamed to omp.reduction.init because it only performs initialization.

; Function Attrs: nounwind
define internal void @test_..omp_par(ptr noalias %tid.addr, ptr noalias %zero.addr) #1 {
omp.par.entry:
  %tid.addr.local = alloca i32, align 4
  %0 = load i32, ptr %tid.addr, align 4
  store i32 %0, ptr %tid.addr.local, align 4
  %tid = load i32, ptr %tid.addr.local, align 4
  %omp.private.alloc = alloca i32, align 4
  br label %omp.reduction.init

omp.reduction.init:                               ; preds = %omp.par.entry
  br label %omp.private.init

omp.private.init:                                 ; preds = %omp.reduction.init
  br label %omp.par.region

omp.par.region:                                   ; preds = %omp.private.init
  br label %omp.par.region1

omp.par.region1:                                  ; preds = %omp.par.region
  call void @use_integer_(ptr %omp.private.alloc)
  br label %omp.region.cont

omp.region.cont:                                  ; preds = %omp.par.region1
  br label %omp.par.pre_finalize

omp.par.pre_finalize:                             ; preds = %omp.region.cont
  br label %omp.par.outlined.exit.exitStub

omp.par.outlined.exit.exitStub:                   ; preds = %omp.par.pre_finalize
  ret void
}

There are details to be worked out around how unboxed array types should be represented.

Please get in touch if you have any questions or feedback about how I am redefining omp.private.

Progress update:

I am working on the redefinition of omp.private so that type information is available to generate the task context structure. This is nearly there, with only a handful of Fujitsu tests still failing.

My work-in-progress branch is here: Comparing llvm:main...tblah:ecclescake/delayed-task-execution-init-rebased · llvm/llvm-project · GitHub

I just saw this, and reading over it, I’m not sure which option is the “do what clang does” option?

Thanks for taking a look. Flang already does what clang does (so far as I understand). Roughly what happens is that all live-in values for the task region are copied verbatim into the structure wrapping the arguments to the outlined task body. This works great for the simple types allowed in C (e.g. a scalar integer).

The problem comes for the more complex data types which may be privatized in Fortran such as arrays. In this case, the existing infrastructure will just make a copy of a pointer to a stack allocated type descriptor (which will contain a pointer to either stack or heap memory). If the task is not executed until after that stack frame is gone, the copied pointer will point to deallocated memory.
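A rough C sketch of the difference, assuming a much simplified descriptor (the descriptor struct here is hypothetical and is not flang’s real box layout). Copying only the pointer to the descriptor leaves two levels of storage that can disappear; a deep copy onto the heap removes both dependencies.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, much simplified array descriptor. */
typedef struct {
    int   *base;     /* address of the array data (stack or heap) */
    size_t extent;   /* number of elements */
} descriptor;

/* Deep copy: a new descriptor and new data, both on the heap. */
static descriptor *clone_descriptor(const descriptor *src) {
    descriptor *d = malloc(sizeof *d);
    assert(d != NULL);
    d->extent = src->extent;
    d->base = malloc(src->extent * sizeof *d->base);
    assert(d->base != NULL);
    memcpy(d->base, src->base, src->extent * sizeof *d->base);
    return d;
}

static void free_descriptor(descriptor *d) {
    free(d->base);
    free(d);
}

/* Returns 1 when the shallow copy aliases the source while the deep
   copy kept its own snapshot. */
int compare_copies(void) {
    int data[3] = {1, 2, 3};
    descriptor desc = { data, 3 };              /* descriptor on the stack */

    descriptor *shallow = &desc;                /* pointer copy only */
    descriptor *deep = clone_descriptor(&desc); /* independent heap copy */

    data[0] = 100;                              /* source changes afterwards */

    int ok = shallow->base[0] == 100            /* still reads the source */
          && deep->base[0] == 1                 /* unaffected snapshot */
          && deep->extent == 3;
    free_descriptor(deep);
    return ok;
}
```

compare_copies() returns 1. The shallow pointer is only usable while both desc and data are alive, whereas the deep copy stays valid until it is explicitly freed.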

PR for omp.private operation definition changes: [mlir][OpenMP][flang] make private variable allocation implicit in omp.private by tblah · Pull Request #124019 · llvm/llvm-project · GitHub

Draft PR implementing the rest of the RFC: WIP: [mlir][OpenMP] Pack task private variables into a heap-allocated context struct by tblah · Pull Request #125307 · llvm/llvm-project · GitHub