Issues with the current implementation of privatization in OpenMP with Fortran

The issue

While lowering the firstprivate clause, the copy from the original variable into the private copy is currently emitted inside the omp.parallel region. This can lead to data races.

The following example demonstrates the current lowering of the firstprivate clause in Fortran, with a possible unintentional read-write data race. The data race happens when a slower thread has not yet copied the value of t1 while a faster thread has already executed foo on the pointer:

subroutine baz
  integer, pointer :: p1
  integer, target :: t1
  p1=>t1
  !$omp parallel firstprivate(t1)
  call foo(p1) ! foo modifies value at p1
  call bar(t1)
  !$omp end parallel
end subroutine baz
module attributes {fir.defaultkind = "a1c4d8i4l4r4", fir.kindmap = "", llvm.target_triple = "x86_64-unknown-linux-gnu"} {
  func @_QPbaz() {
    %0 = fir.alloca !fir.box<!fir.ptr<i32>> {bindc_name = "p1", uniq_name = "_QFbazEp1"}
    %p1_addr = fir.alloca !fir.ptr<i32> {uniq_name = "_QFbazEp1.addr"}
    %2 = fir.zero_bits !fir.ptr<i32>
    fir.store %2 to %p1_addr : !fir.ref<!fir.ptr<i32>>
    %t1_addr = fir.alloca i32 {bindc_name = "t1", fir.target, uniq_name = "_QFbazEt1"}
    %t1_addr_ = fir.convert %t1_addr : (!fir.ref<i32>) -> !fir.ptr<i32>
    fir.store %t1_addr_ to %p1_addr : !fir.ref<!fir.ptr<i32>>
    omp.parallel   {
      %t1_addr_priv = fir.alloca i32 {bindc_name = "t1", fir.target, pinned, uniq_name = "_QFbazEt1"}
      %t1 = fir.load %t1_addr : !fir.ref<i32>
      fir.store %t1 to %t1_addr_priv : !fir.ref<i32>
      %p1 = fir.load %p1_addr : !fir.ref<!fir.ptr<i32>>
      %p1_ = fir.convert %p1 : (!fir.ptr<i32>) -> !fir.ref<i32>
      fir.call @_QPfoo(%p1_) : (!fir.ref<i32>) -> ()
      fir.call @_QPbar(%t1_addr_priv) : (!fir.ref<i32>) -> ()
      omp.terminator
    }
    return
  }
  func private @_QPfoo(!fir.ref<i32>)
  func private @_QPbar(!fir.ref<i32>)
}

Proposed Solution

I have not thought a lot about the solution for this, and I do not have much experience with the intricacies of OpenMP.

I think a valid path forward would be to pass firstprivate values as arguments to the region. The addresses of those values can be the operands of the omp.parallel operation (these operands could also be the values themselves instead of addresses). Lastprivate values should be returned as results of the omp.parallel operation. This would also bring the privatization clauses into MLIR and abstract privatization so that it is handled during the MLIR-to-LLVM-IR phase.

omp.parallel firstprivate(%t1_addr : !fir.ref<mytype>) {
^bb0(%t1_priv: mytype):
  // ... work with %t1_priv
}
// OR
omp.parallel firstprivate(%t1 : mytype) {
^bb0(%t1_priv: mytype):
  // ... work with %t1_priv
}

Questions

I am not too sure if this solution would work for all cases, and so I would love to hear everyone’s opinion on this.

  • Is this really an issue or did I misunderstand something?
  • Does the proposed solution actually solve the problem? Would this work for both C/C++ and Fortran? If there are some concerns, it would be nice to discuss them early.

Thoughts?

Thanks @shraiysh for bringing up this issue. I think there is a data-race issue here, but we can wait for opinions from the others before confirming.

The standard specifically says the following.

To avoid data races, concurrent updates of the original list item must be synchronized with the read of the original list item that occurs as a result of the firstprivate clause.

Would a simpler solution be to:

  • create a copy of the original variable outside the parallel region and then use this for firstprivatisation in the region? (this is probably similar to passing as an argument)
  • insert a barrier after privatisation (which is generally the first thing to happen in the parallel region)? See the sketch below.
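
For illustration, here is a rough sketch of the barrier option applied to the lowering above. It assumes the OpenMP dialect's omp.barrier operation is emitted right after the firstprivate copy; the rest is copied from the IR shown earlier:

omp.parallel {
  %t1_addr_priv = fir.alloca i32 {bindc_name = "t1", fir.target, pinned, uniq_name = "_QFbazEt1"}
  %t1 = fir.load %t1_addr : !fir.ref<i32>
  fir.store %t1 to %t1_addr_priv : !fir.ref<i32>
  // no thread proceeds until every thread has finished its firstprivate copy
  omp.barrier
  %p1 = fir.load %p1_addr : !fir.ref<!fir.ptr<i32>>
  %p1_ = fir.convert %p1 : (!fir.ptr<i32>) -> !fir.ref<i32>
  fir.call @_QPfoo(%p1_) : (!fir.ref<i32>) -> ()
  fir.call @_QPbar(%t1_addr_priv) : (!fir.ref<i32>) -> ()
  omp.terminator
}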

Regarding the proposed solution:

  • I think the result of privatisation is still a variable, so it might still need to have some memory associated with it. This is also important if arguments are passed by reference to functions called in the parallel region.
  • Where can the variable be allocated? Either in the sequential region or in the parallel region. If it is in the sequential portion, then there is a sequential overhead to allocate a private variable for each thread and to copy the value into each private variable, particularly in the case of arrays.
  • In general, privatisation of Fortran variables is easier/more natural at the lowering stage or while FIR is available, because all the information about Fortran types is available there. Once it has been converted to the llvm dialect (which is the point where we interface with the OpenMPIRBuilder), we no longer know, e.g., whether the reference is to a scalar, an array or some other type, or whether the type has constructors or destructors (finalizers in Fortran); see the small example below. I think one way to handle this is to convert all Fortran/FIR types to some well-defined structure and then use that to perform the privatisation. I think @clementval has been trying to do something like this for some other constructs ([openacc] Conversion of FIR type data operands to llvm type by clementval · Pull Request #915 · flang-compiler/f18-llvm-project · GitHub, ⚙ D102170 [mlir][openacc] Conversion of data operand to LLVM IR dialect). This has the disadvantage that types will be lowered differently for the OpenMP and non-OpenMP flows.
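
As a minimal, made-up illustration of the type-information point (the variable names here are invented), in FIR the type of the allocation already tells us what privatisation has to do:

%i = fir.alloca i32 {uniq_name = "_QFEi"}                                     // scalar: copy the value
%a = fir.alloca !fir.box<!fir.heap<!fir.array<?xf32>>> {uniq_name = "_QFEa"}  // allocatable array: descriptor and data need copying
%c = fir.alloca !fir.char<1,10> {uniq_name = "_QFEc"}                         // character: copy with a known length
// After conversion to the llvm dialect only low-level pointers and structs remain,
// and these Fortran-level copy semantics can no longer be recovered from the type alone.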

Longer-term, I would personally like privatisation clauses to be modelled in the Dialect.


create a copy of the original variable outside the parallel region and then use this for firstprivatisation in the region?

This sounds more reasonable to me.

insert a barrier after privatisation (which generally is the first thing to happen in the parallel region)

I am concerned that it could lead to some overhead when copying a large array.

  1. We don’t need to do this for all data types. For Fortran, we only need to do this for pointers (allocatables?), variables in an EQUIVALENCE where the equivalenced variable is defined in the parallel region, and variables in an ASSOCIATE construct. Maybe more scenarios? For a large array without such usage, copying the whole array causes unexpected overhead. So, considering the specific language usage, it seems better to do this in FIR for Fortran.

  2. Another issue I noticed is that the private copies currently live on the stack. What if the stack is not large enough? Maybe use fir.allocmem to allocate on the heap, and add an option similar to -fstack-arrays to use the stack for better performance? A rough sketch follows below.
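
As a minimal sketch of the heap-based variant (the array type and the extent %n are made up for this example, and the copy from the original is elided), the idea is simply to pair fir.allocmem with fir.freemem inside the region instead of using fir.alloca:

omp.parallel {
  // private copy on the heap instead of the stack
  %a_priv = fir.allocmem !fir.array<?xf32>, %n
  // ... copy from the original array and work on %a_priv ...
  fir.freemem %a_priv : !fir.heap<!fir.array<?xf32>>
  omp.terminator
}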

We want both. If it is a value that fits in a register, “make it available” prior to the parallel region and pass it in. If it doesn’t, pass the “address” in and use a barrier. Lastprivate needs a barrier too, among other things.

You are specifically talking about codegen (FIR → LLVM IR) here, correct?

Constructing a sort of anonymous aggregate to collect and manage the set of privatized variables/values for a region sounds like the right direction. Later is potentially better if variable to value promotion is desirable. The plan is that FIR will eventually perform that transformation (again).


This makes me think that there is no way to avoid a barrier. If that is true, for the short term we can insert a barrier in every case to ensure correctness. In the long term, for values that fit in a register, we can copy them by value as an optimization.
Does that sound reasonable?

I think you are right about this and I like the approach, but please wait for others’ opinion if you decide to work on this.

Yes, that is what I meant.

Sounds good to me.

Is that what other compilers do? Besides the overhead of allocating memory from the heap, this might also affect the promotion to values by LLVM and hence other optimisations. -fstack-arrays is generally switched ON only with -Ofast, meaning we might not get the benefits at -O3. HPC systems usually have unlimited stack switched ON. Anyway, there is definitely a point to having two versions, one on the stack and the other on the heap, switchable by a flag; we can decide the policy later.

Yes. If the barrier is marked “aligned”, it would already be eliminated by the openmp-opt pass in LLVM if there is no side effect before it. Though, we have never tested this on the host.

Alright, I am working on this fix. I am not aware of the aligned attribute, so I don’t fully understand what you mean by marking the barrier as aligned. Can you please give an example/link for reference, @jdoerfert? Thank you!

Is that what other compilers do? Besides the overhead of allocating memory from the heap, this might also affect the promotion to values by LLVM and hence other optimisations. -fstack-arrays is generally switched ON only with -Ofast, meaning we might not get the benefits at -O3. HPC systems usually have unlimited stack switched ON. Anyway, there is definitely a point to having two versions, one on the stack and the other on the heap, switchable by a flag; we can decide the policy later.

I didn’t check other compilers. I tested that allocating one large string (a character variable in Fortran code) on the stack failed on my computer. I am not sure whether it will fail for the privatized array. I agree this is not a priority work item. We can delay it and check whether it fails when we test a lot of workloads.

Don’t worry about this now. This is what we do in the GPU device runtime ⚙ D112153 [OpenMP] Introduce aligned synchronization into the new device RT. As I said, this is not really something we have looked at on the Host yet.