[RFC] Privatisation in OpenMP dialect

OpenMP constructs can define a variable to be a copy of another variable so that each thread executing the construct gets a separate copy and there are no dependencies. This is called privatisation in OpenMP.

We can handle privatisation in two different ways:

  1. Privatisation clauses can be dissolved to allocas/allocs by the frontend preceding MLIR.
  • Advantages:
    • Simpler engineering
    • Other layers need not be aware of privatisation.
  • Disadvantages:
    • All frontends would need to handle privatisation clauses themselves
    • Some clauses (allocate) specify that the allocation for privatisation has to be performed by the OpenMP runtime; the frontend would then have to insert a runtime call.
  2. Privatisation clauses can be represented in the OpenMP MLIR dialect and handled appropriately.
  • Advantages:
    • Representation in MLIR would mean frontends can leave it to MLIR to handle privatisation
    • MLIR dialect will be more representative of OpenMP and can do some checks.
  • Disadvantages:
    • Additional engineering effort to make another layer aware of privatisation
    • Has to handle copying, constructors, destructors, etc., as well as the differences between C++ and Fortran privatisation

Handling in the frontend is fairly obvious, so I will discuss privatisation in the OpenMP dialect here. One way to represent a private clause is by having an operand which is of the same type as the original variable. This operand will be an argument to the entry block. Then a transformation pass can perform the privatisation transformation. This is probably straightforward.

Things become a bit more involved when there are constructors, destructors or if the variable is marked as allocatable (in which case it might have to be allocated at runtime). This would probably require the following:

  1. The constructor, destructor functions.
  2. The operation (call op) to call the constructor/destructor.
  3. The operation to perform the allocation, deallocation.

I guess (1) can be stored in the type. The pass which performs the transformation can be aware of (2) and (3), but this would restrict the transformation to being performed at that particular dialect (e.g. FIR for Fortran, and not in the LLVM dialect or during translation).

For discussion:

  • Should MLIR have a representation for private clauses?
  • Is it OK to represent a private clause as an operand which is a basic block argument of the entry block?
  • Should it be a transformation pass which performs the privatisation?
  • Can this transformation pass be generic or should it sit with the source dialects?
  • Is storing constructor, destructor information in the type the right approach?

A simple Fortran example with FIR and OpenMP dialects is given below. We have a Fortran program with an OpenMP parallel region containing a loop whose index is marked as private. Initially the representation will have a private operand. After transformation the operand will be replaced with an alloca.

Fortran source, followed by the MLIR (FIR + OpenMP) and the MLIR after privatisation:
program pvt
    integer :: i
    integer :: arr(10)
    !$OMP PARALLEL PRIVATE(i)
    do i=1, 10
      arr(i) = i
    end do
    !$OMP END PARALLEL
end program

MLIR (FIR + OpenMP):
func @_QQmain() {
  %0 = fir.address_of(@_QEarr) : !fir.ref<!fir.array<10xi32>>
  %1 = fir.alloca i32 {bindc_name = "i", uniq_name = "_QEi"}
  omp.parallel private(%pvt_i : !fir.ref<i32>) {
    %c1_i32 = constant 1 : i32
    %2 = fir.convert %c1_i32 : (i32) -> index
    %c10_i32 = constant 10 : i32
    %3 = fir.convert %c10_i32 : (i32) -> index
    %c1 = constant 1 : index
    %4 = fir.do_loop %arg0 = %2 to %3 step %c1 -> index {
      %6 = fir.convert %arg0 : (index) -> i32
      fir.store %6 to %pvt_i : !fir.ref<i32>
      %7 = fir.load %pvt_i : !fir.ref<i32>
      %8 = fir.convert %7 : (i32) -> i64
      %c1_i64 = constant 1 : i64
      %9 = subi %8, %c1_i64 : i64
      %10 = fir.coordinate_of %0, %9 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
      %11 = fir.load %pvt_i : !fir.ref<i32>
      fir.store %11 to %10 : !fir.ref<i32>
      %12 = addi %arg0, %c1 : index
      fir.result %12 : index
    }
    %5 = fir.convert %4 : (index) -> i32
    fir.store %5 to %1 : !fir.ref<i32>
    omp.terminator
  }
  return
}

MLIR (after privatisation):
func @_QQmain() {
  %0 = fir.address_of(@_QEarr) : !fir.ref<!fir.array<10xi32>>
  %1 = fir.alloca i32 {bindc_name = "i", uniq_name = "_QEi"}
  omp.parallel {
    %pvt_i = fir.alloca i32 {uniq_name = "i"}
    %c1_i32 = constant 1 : i32
    %2 = fir.convert %c1_i32 : (i32) -> index
    %c10_i32 = constant 10 : i32
    %3 = fir.convert %c10_i32 : (i32) -> index
    %c1 = constant 1 : index
    %4 = fir.do_loop %arg0 = %2 to %3 step %c1 -> index {
      %6 = fir.convert %arg0 : (index) -> i32
      fir.store %6 to %pvt_i : !fir.ref<i32>
      %7 = fir.load %pvt_i : !fir.ref<i32>
      %8 = fir.convert %7 : (i32) -> i64
      %c1_i64 = constant 1 : i64
      %9 = subi %8, %c1_i64 : i64
      %10 = fir.coordinate_of %0, %9 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
      %11 = fir.load %pvt_i : !fir.ref<i32>
      fir.store %11 to %10 : !fir.ref<i32>
      %12 = addi %arg0, %c1 : index
      fir.result %12 : index
    }
    %5 = fir.convert %4 : (index) -> i32
    fir.store %5 to %1 : !fir.ref<i32>
    omp.terminator
  }
  return
}

MLIR : @ftynse @schweitz @jeanPerier
OpenMP : @jdoerfert @Meinersbur
Team : @clementval @abidmalikwaterloo @kirankumartp @SouraVX

My overall take on this is that having a first-class representation in MLIR for a concept is necessary if we want MLIR to reason about this concept (i.e., perform analyses or transformations). The converse is not necessarily true: a representation can still make sense from a layering or simplicity perspective even if there are no transformations that require it. In the former case, the design of the representation should be driven by the analysis needs; in the latter case, by simplicity.

I also caution against involving the notion of a frontend in the design. In general, we are not in the situation of a classical “frontend - ‘mlir IR’ - backend” compiler; the stack is deeper and more heterogeneous than that. As a concrete example, Polygeist has a C++ frontend that produces a mix of Affine, SCF, ex-Standard and LLVM dialect ops. Affine gets parallelized within MLIR and lowered to SCF, which may or may not be converted to the OpenMP dialect. If we decide that “the frontend” is expected to produce some form of the IR, we need to define what “the frontend” is (it appears that Polygeist would have to do some things twice: first time on the input C++ with potential pragmas, and second time when introducing new OpenMP constructs) and make sure that this form persists across all layers of the representation stack. A more extreme example would be a TF → HLO → MHLO → Linalg → Affine → SCF → OpenMP pipeline. That being said, individual pipelines need not necessarily reimplement all of the privatization every time; the dialect can provide utility functions for them to use.

On whether MLIR should have a representation for private clauses: I’m inclined to say yes, but can accept it either way.

Representing a private clause as an operand that is a basic block argument of the entry block is one possibility.

Another one is to have an operation omp.mlir.privatize that can appear in the region of an operation to which the clause is attached. The benefit of having such an operation is the ease of lowering it differently based on almost arbitrary criteria (operand types, attributes, etc.) using patterns. The drawback is weird semantics should it appear under control flow, i.e. outside of the entry block of the region.

On whether a transformation pass should perform the privatisation: I’m generally in favor of having small passes that are easy to test and replace. However, having such a pass means being able to represent the IR before and after it, which means we would ultimately implement both the first-class support for privatization clauses and the dissolved-to-allocas form.

Having the pass sit with the source dialects may not compose. Imagine having several dialects that each perform privatization differently for their types (simple example, a mix of FIR and memref dialects). The representation with a dedicated operation I mentioned above is a workaround. Another scalable alternative is a generic pass that operates on type interfaces with each supported type implementing the interface.

On storing constructor and destructor information in the type: I’m not certain I understand what is intended here. Store function pointers to a constructor and a destructor in the runtime representation of a type? This would make it impossible to privatize any built-in types because they are not going to be modified just to support this.

Type interfaces sound like a solution here, again. Types that are willing to be privatizable by OpenMP can implement an interface that takes an OpBuilder and some extra parameters and emits the IR for construction, destruction, copying, etc. in a dialect-specific way.
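
To make the type-interface idea a bit more concrete, here is a minimal Python model of it. This is only a sketch under stated assumptions: `Builder` is a toy stand-in for OpBuilder that records strings, `PrivatizableType`, `TrivialI32`, `ClassLike` and `privatize` are hypothetical names, and the `toy.*` op spellings are placeholders rather than ops of any real dialect. The point is only that the generic pass talks to the interface and never inspects concrete types.

```python
from abc import ABC, abstractmethod


class Builder:
    """Toy stand-in for OpBuilder: records textual pseudo-ops in order."""

    def __init__(self):
        self.ops = []

    def emit(self, op):
        self.ops.append(op)


class PrivatizableType(ABC):
    """Hypothetical type interface; not an existing MLIR interface."""

    @abstractmethod
    def emit_construct(self, builder, name):
        """Emit the ops that create and initialise a private copy."""

    @abstractmethod
    def emit_destruct(self, builder, name):
        """Emit the ops that tear the private copy down."""


class TrivialI32(PrivatizableType):
    # Built-in-like type: an allocation is all that is needed.
    def emit_construct(self, builder, name):
        builder.emit(f"{name} = toy.alloca i32")

    def emit_destruct(self, builder, name):
        pass  # nothing to destroy


class ClassLike(PrivatizableType):
    # Type with a user-defined constructor and destructor.
    def emit_construct(self, builder, name):
        builder.emit(f"{name} = toy.alloca !toy.obj")
        builder.emit(f"toy.call @ctor({name})")

    def emit_destruct(self, builder, name):
        builder.emit(f"toy.call @dtor({name})")


def privatize(builder, private_vars, body_ops):
    """Generic pass: it only talks to the interface, never to concrete types."""
    for name, ty in private_vars:
        if not isinstance(ty, PrivatizableType):
            raise TypeError(f"{name}: type does not implement the interface")
        ty.emit_construct(builder, name)
    for op in body_ops:
        builder.emit(op)
    for name, ty in reversed(private_vars):  # destroy in reverse order
        ty.emit_destruct(builder, name)
    return builder.ops
```

Calling `privatize(Builder(), [("%a", TrivialI32()), ("%b", ClassLike())], ["toy.body"])` yields the allocation and constructor call before the body and the destructor call after it, while a type that does not implement the interface is rejected with a diagnostic instead of being silently mishandled.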

Thanks @ftynse for the reply.

I am not aware of any transformations/analysis that will immediately benefit from the privatisation representation (other than privatisation itself).
Another factor that we are considering for the representation in the dialect is whether the information in the representation is needed for creating the OpenMP runtime calls. AFAIU, the privatisation information is not needed for this purpose.
Not having a representation for privatisation will mean that we cannot perform some semantic checks that are specified in the OpenMP standard in the dialect.

I felt that it is unlikely that a lowering from another dialect will use privatisation. This can happen only if there is an analysis pass which determines that privatising an SSA value (corresponding to some variable) will help in parallelisation. And unless that dialect is specifically created for lowering to OpenMP, it will have other mechanisms to perform a privatisation-like transformation.
I believe the likely use case is when frontends lower to the MLIR representation. Will Polygeist use the privatisation representation in the OpenMP dialect if it is there?

Yes, the semantics will be weird and the conversion will have to hoist the allocas into the entry block of the region. Finding the entry block when multiple dialects are present is also going to be a problem. In OpenMP there are operations which are going to be outlined (like parallel, task) and these will have entry blocks. An interface is being created for these ops as per an earlier suggestion of yours. What about other dialects?

I was referring to high-level language types (like classes in C++) which have custom constructors and destructors. The high-level dialect will hopefully store information about the constructor, destructor etc somewhere (I was assuming it has to be in the type). If it does not then it will not be possible to insert calls to the constructor and destructor when the privatised copy is created. For builtin types are there constructors/destructors? Anyway this information about constructors/destructors is optional.

OK

The lastprivate clause is implemented with a runtime call in the OpenMP worksharing loop. The generated code below uses the omp.is_last variable (initially passed to __kmpc_for_static_init_4) to check whether it is the last iteration, and then stores the value of the lastprivate copy to the original variable in the .omp.lastprivate.then basic block.

call void @__kmpc_for_static_init_4(%struct.ident_t* nonnull @1, i32 %4, i32 34, i32* nonnull %.omp.is_last, i32* nonnull %.omp.lb, i32* nonnull %.omp.ub, i32* nonnull %.omp.stride, i32 1, i32 1) #3
...
...
omp.loop.exit:                                    ; preds = %omp.inner.for.body, %entry
  %x1.0.lcssa = phi i32 [ undef, %entry ], [ %call, %omp.inner.for.body ]
  call void @__kmpc_for_static_fini(%struct.ident_t* nonnull @1, i32 %4)
  %11 = load i32, i32* %.omp.is_last, align 4, !tbaa !6
  %.not = icmp eq i32 %11, 0
  br i1 %.not, label %.omp.lastprivate.done, label %.omp.lastprivate.then
.omp.lastprivate.then:                            ; preds = %omp.loop.exit
  store i32 %x1.0.lcssa, i32* %x, align 4, !tbaa !6
  br label %.omp.lastprivate.done
.omp.lastprivate.done:                            ; preds = %.omp.lastprivate.then, %omp.loop.exit
  call void @__kmpc_barrier(%struct.ident_t* nonnull @2, i32 %4)
  ret void

Also, if code has to be inserted in the header (firstprivate) or footer (lastprivate) rather than in the loop body, then the privatisation transformation has to be performed while the CFG is generated for the loop. The CFG is currently generated in translation.
An alternative is to insert conditional loads/stores in the body of the loop for firstprivate and lastprivate.

We are leaning towards implementing privatisation outside the OpenMP dialect, in the hope that it will lead to a simpler implementation in which the dialect does not have to deal with constructors and finalizers/destructors. I was performing an audit of privatisation to check whether there are places where the privatisation information is needed for making runtime calls. If some information is needed for creating runtime calls, then that information should be represented in the OpenMP dialect. I came across the lastprivate clause in the worksharing loop (and a few others), where the update to the original variable depends on the is_last variable, which is set by the runtime in the Clang-generated code. That code is the LLVM IR listing shown above: it checks omp.is_last (initially passed to __kmpc_for_static_init_4) to determine whether this is the last iteration, and then stores the value of the lastprivate copy to the original variable in the .omp.lastprivate.then basic block.

I was not sure whether the is_last variable set by the runtime call is required, and was considering the lowering shown in the source comparison at the end of this post. This will lead to an additional comparison for the last iteration and will also introduce an assignment (or a call to an assignment function and the destructor/finalizer) in the body of the loop. I was wondering whether this additional comparison and assignment might interfere with the optimisations of the loop.

@Meinersbur suggested having a variable is_last, representing it as an argument of the entry block (or an operand) of the OpenMP wsloop operation, and using it both in the runtime calls and as the predicate for the lastprivate update.

Another possibility is to introduce an additional region for constructs which have lastprivate or which need finalization. This region can have is_last as the argument of the entry block. This additional region will contain the lastprivate update, finalization/destructor of lastprivate variables. While lowering during translation, this region can be fitted into the exit block of the worksharing loop. This would avoid introducing the lastprivate update into the body of the loop.
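
As a rough illustration of the extra-region idea, the following Python model runs a worksharing loop with a separate finalisation callback that receives is_last. It is a sketch only: threads are simulated one after another, the schedule is a simple static chunking, and `static_chunks`, `run_wsloop`, `body` and `finalize` are hypothetical names, not part of any OpenMP runtime or of the dialect. The loop body stays free of the lastprivate update; the update runs once per thread in the position of the exit block.

```python
def static_chunks(n_iters, n_threads):
    """Split iterations 1..n_iters into contiguous chunks, one per thread."""
    base, extra = divmod(n_iters, n_threads)
    chunks, start = [], 1
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def run_wsloop(n_iters, n_threads, body, finalize):
    """Toy worksharing loop: `body` is the loop region, `finalize` the extra region.

    `finalize(is_last, private)` runs once per simulated thread after its chunk,
    which mirrors placing that region in the exit block of the loop.
    """
    for chunk in static_chunks(n_iters, n_threads):
        private = {}  # this thread's private copies
        for i in chunk:
            body(i, private)
        # The thread that owns the sequentially last iteration gets is_last.
        is_last = len(chunk) > 0 and chunk[-1] == n_iters
        finalize(is_last, private)


shared = {"x": 0}


def body(i, private):
    private["x"] = i * i  # no lastprivate logic in the loop body


def finalize(is_last, private):
    if is_last:
        shared["x"] = private["x"]  # lastprivate write-back


run_wsloop(10, 4, body, finalize)
```

After the call, `shared["x"]` holds the value computed by iteration 10, regardless of how many simulated threads are used, and threads with an empty chunk simply see is_last as false.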

I guess we could also model the terminator op to have a region and then include the lastprivate update in that region.

@Meinersbur also raised the point whether the private variables could be values and this would be good for optimisations. This was also something that @ftynse raised in the reduction RFC.

@ftynse Are any of these approaches OK? Did I miss something simpler? Or should we go on to have a representation for private clauses in the OpenMP dialect?

FYI @jdoerfert .
CC @clementval , @kirankumartp , @SouraVX , @abidmalikwaterloo.

Source:
integer :: x
!$omp parallel
!$omp do lastprivate(x)
do i=1,N
...
end do
!$omp end do

Source (after privatisation):
integer :: x
!$omp parallel
integer :: x_priv !Not real code
!$omp do
do i=1,N
...
if (i .eq. N) then
  x = x_priv
end if
end do
!$omp end do

In the flang_OpenMP call, we discussed something like this

  omp.parallel {
    %priv_x = fir.alloca i32 {bindc_name = "x", uniq_name = "_QEx"}
    ...
    omp.wsloop (%arg0) : i32 = (%c1_i32) to (%c9_i32) step (%c1_i32_0) inclusive {
      fir.store %arg0 to %priv_x : !fir.ref<i32>
      ...
      omp.yield {
        %last_x_val = fir.load %priv_x : !fir.ref<i32>
        fir.store %last_x_val to %1 : !fir.ref<i32>
        omp.yield
      }
    }
    omp.terminator
  }

which I found overly complicated, especially with an omp.yield nested within an omp.yield, and lastprivate alone might not be worth adding another construct.

A dedicated clause would be e.g.

omp.wsloop (%arg0) : i32 = (0) to (N) step (1) inclusive lastprivate(%priv_x -> %last_x_val) { ...

which handles the writeback when lowering and can only be used for that.

Since lastprivate is not that common, and therefore may not be worth specifically recognizing in an omp dialect optimizer, my suggestion (as @kiranchandramohan already mentioned) was to just lower it on the fly:

omp.wsloop (%arg0) : i32 = (0) to (N) step (1) inclusive {
  %cmp = %arg0 eq N ; 
  mlir.if (%cmp) {
    fir.store %last_x_val to %1
  }
}

or

omp.wsloop (%arg0, %is_last) : i32 = (0) to (N) step (1) inclusive {
  mlir.if (%is_last) {
    fir.store %last_x_val to %1
  }
}

This was the only representation I could think of that holds all the lastprivate update info (the update, calling the destructor/finalizer, etc.) and has it happen in the last iteration. While interfacing with the OpenMPIRBuilder we can pluck this region out and put it in the exit path.

The issue with a clause-like operand for lastprivate is that, when this is converted to the LLVM dialect, the lastprivate operation might not be just an update: it can involve calling the assignment operator, calling the destructor/finalizer, etc. Holding all this information requires a region. Also, there is no guarantee that high-level dialects like FIR store information about the destructor/finalizer in the type. As of now, FIR does not have the finalizer info in derived types (struct-like).

I am OK with proceeding with an if-operation in the body. As you say, this is not a common clause.
(The point I failed to mention in the meeting was that the lastprivate clause also exists on other constructs, e.g. omp simd, loop, etc. Will inserting the if, the load/store, and the call to the destructor/finalizer in the body spoil the transformation intended by omp simd, for example?)

Good point.

However, if arbitrary code can be executed, then there is no point in special-casing lastprivate for the purpose of optimization. An analyzer can itself look for code snippets that are executed only in the last iteration, protected by a condition that either checks is_last or tests the induction variable for a specific value.

It’s a chicken-and-egg problem, unfortunately. It may or may not use the privatization if it is available, but we won’t know for sure until there is at least a concrete proposal on how it is implemented.

However, following the usual MLIR philosophy, I’d expect and encourage privatization to happen at the same level as loop parallelization (e.g., affine or scf + memref) and not below.

This is actually one of the arguments I have against using OpenMPIRBuilder in MLIR translation. Despite the original goal of not duplicating code, we start introducing features that second-guess what OpenMPIRBuilder will do…

For the general case, we have the AutomaticAllocationScope trait. Operations with this trait would be the natural place to put allocas.

I suppose what you want is something like

!llvm.struct<"some-type", (
  actual-data-elements,
  func<void (ptr<struct<"some-type">>)> // destructor
)>

The type that lowers to this doesn’t have any knowledge about the destructor, only the lowered type does by carrying around the pointer to the destructor.

In general, MLIR has no knowledge of constructors/destructors for types nor does it have extensive lifetime control. For custom types with a destructor, I suggest having a “type.destruct” custom op that calls the destructor and optionally a scoping op that gets lowered to its body followed by “type.destruct” for any value of that type that goes out of scope.

Does the spec require that the iteration with the smallest/largest induction variable value perform the load/store, or the first/last one actually executed (e.g., in case of dynamic scheduling)? The latter would require an auxiliary operation that is equivalent to reading %.omp.is_last (and I would consider an operation preferable to a magic variable at MLIR level).

If you go in this direction, I would very strongly encourage you not to do this on the fly. Have a proper pass instead. The translation complexity of the OpenMP is already significantly higher than the rest of translations we have.

Thanks for the reply.

The standard says that the value from the sequentially last iteration should be used to update the original item (Not the first/last actually executed).
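
A small Python model may help show why the distinction matters under dynamic scheduling. It is a sketch with hypothetical helper names (`lastprivate_value`, `last_executed_value`), and `order` stands for whatever order the iterations happen to execute in. The first helper applies the "sequentially last" rule by testing the induction variable; the second shows, for contrast only, what a "last actually executed" rule would produce.

```python
def lastprivate_value(n_iters, compute, order):
    """Value written back under the 'sequentially last iteration' rule.

    `order` is the order in which iterations happen to execute; the rule
    ignores it and always takes the value produced by iteration `n_iters`.
    """
    written_back = None
    for i in order:
        private_x = compute(i)  # each iteration assigns its private copy
        if i == n_iters:  # predicate on the induction variable
            written_back = private_x
    return written_back


def last_executed_value(compute, order):
    """What a 'last actually executed' rule would give instead (contrast only)."""
    private_x = None
    for i in order:
        private_x = compute(i)
    return private_x


n = 8
in_order = list(range(1, n + 1))
reverse_order = list(reversed(in_order))  # iteration 8 runs first, 1 runs last

seq_a = lastprivate_value(n, lambda i: 10 * i, in_order)
seq_b = lastprivate_value(n, lambda i: 10 * i, reverse_order)
exe_b = last_executed_value(lambda i: 10 * i, reverse_order)
```

Here `seq_a` and `seq_b` agree because the rule does not depend on execution order, whereas `exe_b` picks up the value from iteration 1, which is not what the standard asks for.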

With “on the fly” I meant to not lower it during the flang AST->fir translation, i.e. not having a dedicated construct in the OpenMP ir dialect and emit it as if (last iteration) { do_lastprivate() }.

I read in the following LLVM doc that allocas should be placed in the entry block if possible, and that some LLVM optimisations will not work if they are found elsewhere. Hence the preference to hoist to the entry block.
https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas

OK. I would like to wait and see the implementation of types with finalizers/destructors in FIR before proceeding.

This discussion is deferred till we have something from FIR?

Yes.
On the Flang side, we are proceeding with privatisation outside the MLIR layer. Once we have a better picture of how FIR models data types with finalizers, locality specifiers (similar to data-sharing clauses) in do-concurrent loops etc we can come back to handling privatisation in the OpenMP dialect.

I’d like to challenge these points:

How do other dialects handle the same issue? Let’s say, I want to transfer an object from host to device with GPU dialect, using OpenCL as target backend. These are the options to do such a memory transfer:

  1. OpenCL 1.2 provides buffers, which are referenced as cl_mem objects, and one has to explicitly call APIs for that.
  2. One can use USM or SVM, do placement new and use that pointer instead of an original object.

In both cases we have to deal with a number of platform restrictions:

  1. Such data transfer breaks C++ RAII idiom, and there’s no way around it.
  2. If the original data has value semantics, we need to somehow dereference our data. This is a problem with custom operator=. Should we call it? Should we treat our data as POD?
  3. Some programming models actually allow you to ignore points 1 and 2 and pretend that your type is a POD. You won’t get full access to all class member functions, but you’ll be able to copy your data over to the GPU and use it in a manner that does not require you to call a proper copy constructor and destructor.

All of the above may not seem like a problem if we have built our IR from scratch. But if our frontend generates some high-level constructs, it is not aware of these restrictions. And when an optimizer later finds parallelism in generated code, it has to deal with all of the complexity of data transfers.

That being said, I think it is more convenient to treat this problem as a contract between the type, the dialect, and the target environment:

  1. The type should generally be aware of whether it’s a trivial data type, or a container, or something else. It should also know how to generate code that would perform a copy. All of those semantics can be moved to a type interface, and types that do not implement that interface can either be treated as plain types, or diagnosed as unsupported.
  2. The dialect should be aware of what kinds of transfers it can do. For classic OpenMP it is possible to honor all those constructor/destructor semantics of C++/Fortran, and for GPU offloading only POD types are generally allowed.
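
The contract described in the two points above could be modelled roughly as follows. This is a hedged sketch, not a proposal for actual API: `Kind`, `TypeInfo`, `Target` and `check_transfer` are hypothetical names, and the two targets in the usage note are assumptions made for illustration (a host that honours constructor/destructor semantics and an offload device that only accepts POD-like data).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    TRIVIAL = "trivial"  # POD-like: a bitwise copy is enough
    NON_TRIVIAL = "non_trivial"  # needs constructor/copy/destructor calls


@dataclass(frozen=True)
class TypeInfo:
    name: str
    kind: Optional[Kind]  # None: the type does not implement the interface


@dataclass(frozen=True)
class Target:
    name: str
    supports_non_trivial: bool  # can it honour ctor/dtor semantics?


def check_transfer(ty, target, treat_unknown_as_plain=False):
    """Decide how a value of type `ty` may be copied for `target`.

    Returns "bitwise" or "interface"; raises TypeError for combinations
    that should be diagnosed as unsupported.
    """
    if ty.kind is None:
        if treat_unknown_as_plain:
            return "bitwise"
        raise TypeError(f"{ty.name}: type does not implement the interface")
    if ty.kind is Kind.TRIVIAL:
        return "bitwise"
    if not target.supports_non_trivial:
        raise TypeError(f"{ty.name}: non-trivial type not allowed on {target.name}")
    return "interface"
```

With `host = Target("host", True)` and `device = Target("device", False)`, a trivial type is copied bitwise on both, a non-trivial type goes through the interface on the host, and the same type on the device is rejected with a diagnostic.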

So, whenever one wants to re-use existing OpenMP infrastructure for enabling parallelism in their code, privatization and other stuff may be a tough call, and it would be great if Dialect could offer some help in generating the correct code. WDYT?

Thanks @alexbatashev for replying to this thread. Personally, I am in favour of supporting privatisation in the OpenMP dialect. It is just that,

  1. I could not find a good use case for it.
  2. Could not agree on the approach to implementing privatisation.

The cases that you are identifying here do not seem to be directly related to privatisation. These probably correspond to the target and target data constructs, and their implementation will require maintaining information about the variables being mapped to and from the device.

The requirement of an interface has been discussed in one of the above replies. I agree that there should be some contract between the type and the dialect.

If you can modify your reply to talk in terms of privatisation then it will be helpful to make a strong case.

@kiranchandramohan one of the things I’m thinking about is that some analysis pass (or maybe even the frontend) can prove a variable to be private, and add that clause. This may not make much sense on CPUs, but on other accelerators the compiler may decide to put private data into a different physical memory, which may improve latencies at the cost of losing data sharing between threads. As far as I understand, Flang is not able to handle such data movements today: even if there was support for OMP target offload, the produced IR has no additional information about allocations that would allow the compiler to generate optimal code. One could teach the frontend to mark allocations with address spaces, but that would require the frontend to obtain some secret knowledge about the target hardware, which a) is hard to do in general (it’s not enough to simply know the kind of hardware; one often needs to know the exact architecture to make sure we’re not going above the HW limits) and b) would require each frontend to duplicate those mechanics.

There’s also another way of marking private variables by adding some attributes to operation arguments, rather than defining a new argument. That seems to be less intrusive, but it’ll probably make IR either harder to read or parse.

OpenMP has the allocate directive through which we can specify where the memory should be allocated for variables. This could be one way to achieve what you want. The allocate directive is not implemented fully in Flang and OpenMP dialect.
https://www.openmp.org/spec-html/5.1/openmpsu62.html

The privatization clauses are being handled in the flang frontend. The data copying clauses are not being handled anywhere for now. Once we have a better picture of how to handle these clauses in OpenMP Dialect, we can add these. For the time being, we are removing the unneeded clauses from OpenMP Dialect in D120029.