[OpenMP] Parallel Operation design issues

kiranchandramohan · March 9, 2020, 10:31pm

This post covers the design issues in the design and implementation of the parallel operation in the OpenMP dialect. The post is written to layout all the issues and to discuss and take decisions.

The specification of the parallel construct can be found in Section 2.6 of the OpenMP 5.0 specification document. https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf
A trivial example of an OpenMP parallel construct in Fortran is given below.

  integer :: a, b, c
  !$omp parallel shared(a,b,c)
  c = a + b
  !$omp end parallel

The design issues to consider here are,

Representation of the Parallel Construct
The construct can be represented by an operation (omp.parallel) with one region.
Representation of the clauses
i) Clauses whose values are constants
default and proc_bind clauses are optional and they take a constant argument. These clauses can be represented as Optional attributes, possibly StrEnumAttributes.
ii) Clauses with values as integer/logical expression
Clauses if and num_threads take in integer/logical expressions as input. These can be modelled as Operands with I1 and AnyInteger (or i32) type. These clauses are also optional and we should have a way of modelling the optional part.
iii) Clauses having a list of variables
Clauses private, firstprivate, shared, copyin, allocate take in a list of variables as input. These clauses are also optional.
a) These clauses should be modelled as operands since they are a list of variables. I don’t see a way of modelling as attributes.
b) The type of the variable
What should be the type of the variable?
b1) Should it be a ref type? Or the type of the value referenced?
b2) Should it be types from all co-existing dialect types like (fir, std, llvm)?
b3) Should it just be the standard dialect? We can mandate that there be conversion operations from the respective dialects to standard dialect.
c) How to model the list of variables?
→ Variadic (with segment length)
→ Tuples
Some related discussion Handling optional input/output arguments
d) Examples below
d1) using tuples, ref type, omp dialect co-existing with fir types.

module {
  func @MAIN_() {
    %0 = fir.alloca i32 {name = "c"}
    %1 = fir.alloca i32 {name = "b"}
    %2 = fir.alloca i32 {name = "a"}
    %shared = tuple %0, %1, %2
    omp.parallel(%shared : tuple<!fir.ref<i32>, !fir.ref<i32>, !fir.ref<i32>>) {
       %3 = tuple_get %shared, 2
       %4 = tuple_get %shared, 1
       %5 = tuple_get %shared, 0
       %6 = fir.load %3 : !fir.ref<i32>
       %7 = fir.load %4 : !fir.ref<i32>
       %8 = addi %6, %7 : i32
       fir.store %8 to %5 : !fir.ref<i32>
       omp.terminator
    }
    return
  }
}

d2) using tuples, ref type, omp dialect co-existing only with standard type.

module {
  func @MAIN_() {
    %0 = fir.alloca i32 {name = "c"}
    %1 = fir.alloca i32 {name = "b"}
    %2 = fir.alloca i32 {name = "a"}
    %std_0 = fir.convert %0
    %std_1 = fir.convert %1
    %std_2 = fir.convert %2
    %shared = tuple %std_0, %std_1, %std_2
    omp.parallel(%shared : tuple<memref<i32>, memref<i32>, memref<i32>>) {
       %3 = tuple_get %shared, 2
       %4 = tuple_get %shared, 1
       %5 = tuple_get %shared, 0
       %6 = std.load %3 : memref<i32>
       %7 = std.load %4 : memref<i32>
       %8 = addi %6, %7 : i32
       std.store %8, %5[] : memref<i32>
       omp.terminator
    }
    return
  }
}

iv) Reduction clause
TBD. The GPU dialect has a reduction

Normalized operation
i) Define a normalized operation as one with minimal number of clauses and expanded to cover all variables.
Terminator
i) There is an implicit barrier at the end of a parallel region and all threads in the parallel region will hit this barrier as the last operation. Should this implicit barrier be modelled explicitly?
ii) Should a dummy parallel operation be added for this operation?
Where should transformations like privatisation be performed?
i) During lowering to MLIR from parse-tree.
ii) In MLIR.
iii) During translation to LLVM IR (using OpenMPIRBuilder).
Optimisations with parallel region
i) Constant propagation into parallel region.
ii) Removal of adjacent barriers in a parallel region (unlikely to occur in practice).
iii) ?
An example representation

def ClauseDefault : StrEnumAttr<
    "ClauseDefault",
    "default clause",
    [ClauseDefaultPrivate, ClauseDefaultFirstPrivate, ClauseDefaultShared, ClauseDefaultNone]> {
  let cppNamespace = "::mlir::omp";
}

def ClauseProcBind : StrEnumAttr<
    "ClauseProcBind",
    "procbind clause",
    [ClauseProcMaster, ClauseProcClose, ClauseProcSpread]> {
  let cppNamespace = "::mlir::omp";
}

def ParallelOp : OpenMP_Op<"parallel">,
  Arguments<(ins I1:$if_expr_var,
             AnyInteger:$num_threads_var,
             OptionalAttr<ClauseDefault>:$default_val,
             TupleOf<[AnyMemRef]>:$private_vars,
             TupleOf<[AnyMemRef]>:$firstprivate_vars,
             TupleOf<[AnyMemRef]>:$shared_vars,
             TupleOf<[AnyMemRef]>:$copyin_vars,
             OptionalAttr<ClauseProcBind>:$proc_bind_val)> {
  let summary = "parallel construct";
  let description = [{
    The parallel construct includes a region of code which is to be executed
    by a team of threads.
  }];

  let regions = (region AnyRegion:$region);
  let parser = [{ return parseParallelOp(parser, result); }];
  let printer = [{ printParallelOp(p, *this); }];
}

ftynse · March 16, 2020, 10:24am

Hi Kiran, and thanks for pushing on this!

Generally, I would encourage you to design the OpenMP dialect in a way that is composable with other dialects. As a particular example, we may be targeting OpenMP from Linalg on tensors that has absolutely no knowledge of FIR. A couple of specific comments below that go in this direction.

Indeed, if you need to refer to SSA values, you should use operands. The common way of doing so in a complex operation is to have variadic operands and use the variadic segments mechanism to specify which operands have which semantics.

I’d argue that it cannot be anything else than the type of the SSA value you take as operand. While it may not make sense in OpenMP context to have a value with non-referential semantics be shared (since SSA values are not variables and cannot be assigned), we should not unnecessarily restrict the dialect. More importantly, OpenMP lives in core MLIR and should not be anyhow bound to FIR. I would suggest these clauses to accept any type since it cannot know whether a type defined in an external dialect has reference semantics (unless we implement something like type interfaces). This way, you can introduce OpenMP around FIR types, then lower FIR to Standard if you want. I would place the limitation into the conversion: the conversion from the OpenMP to the LLVM dialect can only support standard memrefs; it is up to the user to ensure that their types are converted to memrefs if they want to use the conversion to LLVM (they may decide to, e.g., emit C instead).

Variadic operands seems better suited. With tuples, you would have to also introduce tuple packing/unpacking operations. We used that for gpu.launch back in the day where it required explicilt argument forwarding and it worked just fine.

I’d rather suggest to look into the Loop dialect. GPU reduction ops are specific to GPU

This seems like a transformation or a canonicalization pattern, which is orthogonal from the op definition itself.

It sounds like the synchronization can be included in the semantics of the openmp.parallel operation. The built-in semantics of a region is that the terminator can transfer the control back to the enclosing operation. The operation can than require that all parallel executions of a region did so before transferring the control to its control-flow successor.

This sounds frontend-specific. Generally, we encourage optimizing transformations to happen within MLIR. I heavily discourage non-trivial transformations when translating the MLIR LLVM dialect to LLVM IR.

If you need to model the explicit private clause in the input code, I would suggest to directly construct the corresponding IR.

Similarly to canonicalization above, I would suggest to consider transformations later, unless they affect the structure of the operation. On a side note, we have a small pass that inlines constants and some other non-side-effecting operations into gpu.launch. It would make sense to generalize it.

kiranchandramohan · March 17, 2020, 10:25am

Thanks @ftynse as always for the helpful comments.

Yes, we would also like to keep it generic and composable. Thanks for this point about using any type and placing restrictions during the conversion process. I was a bit confused about where to place the restrictions.

Thanks, we will investigate and switch to variadic operands if possible. I was aware of the segment mechanism through some other discussion in discourse but I guess the documentation page for variadic operands is a bit incomplete in this regard. The documentation currently says most operations have no variadic operands or just one operand.
https://mlir.llvm.org/docs/OpDefinitions/

Sure.

Where can this requirement for all parallel executions be enforced? Do you mean here that all control flow paths in the region should have this property? And are you also saying that this will be the default behaviour? I had to add a terminator operation to avoid. Are you suggesting here that this terminator should be placed on all control paths as the last operation. Is the property of transferring control back to the enclosing operation the default behaviour of a terminator operation that we add? Is this different for functions and operations?

OK.

The reason for including it here was for us to implement and ensure/convince ourselves that the way we have defined the operation does not affect the ability to perform transformations optimisations.

I have made a note of this. And we will definitely consider generalizing it for our use.

ftynse · March 17, 2020, 10:56am

Ping @antiagainst

I’m not sure I completely understand what you need here. My understanding is that OpenMP parallel mandates implicit synchronization of all threads at the end of the parallel region. My proposition is to model to embed that into the semantics of omp.parallel directly. That is, the semantics of omp.parallel operation is that it executes the attached in parallel by several threads and waits for all threads to exit the region before the omp.parallel operation is considered “executed completely”. This way, you don’t have to attach any special semantics to terminators inside omp.parallel, and you can “exit” from the region in several places.

See MLIR Language Reference - MLIR for the relation between terminators and enclosing operations. In short, a terminator can either transfer control to another block, or transfer it to the operation to which the immediately enclosing region is attached. For OpenMP, the latter makes most sense. You can have multiple terminators that transfer the control back to the immediately enclosing omp.parallel and omp.parallel does the synchronization.

Now that I think about it, you may also want to consider what you do in case of nested omp.parallel.

I totally agree with the motivation! It’s more about making sure we don’t block the introduction of the operation on transformation being developed.

Don’t hesitate to ping me or @herhut on this.

abidmalikwaterloo · April 30, 2020, 5:08pm

Is this work still in progress? What is the status?

clementval · April 30, 2020, 7:05pm

This is still win progress. Don’t know about the current status. Probably @kiranchandramohan can have a better answer.

kiranchandramohan · April 30, 2020, 7:58pm

Yes, we have updated the prototype to use variadic operands. Will be making a review in phabricator soon.

Thanks @clementval.

abidmalikwaterloo · May 15, 2020, 5:36pm

@kiranchandramohan Sorry for the late. reply . I was expecting update on my e-mails but did not get any. I am looking into how the DNN computational graph can be ported to OpenMP runtime for portability and efficient resource utilization especially. after OpenMP 5.0 support for various offloading strategies. Are you looking also in this direction?

kiranchandramohan · May 18, 2020, 9:43pm

Thanks @abidmalikwaterloo, Our primary interest is support for OpenMP in the Flang Fortran Frontend of LLVM.

giuseros · August 11, 2021, 10:01am

Hi @kiranchandramohan ,
I was trying to use the OpenMP dialect, but getting some errors in lowering omp.parallel operation to LLVM and saw this post. May I ask you what is the status of this operation?

Thank you in advance,
Giuseppe

kiranchandramohan · August 11, 2021, 10:54am

Hello @giuseros,
The omp.parallel operation with if, num_threads, and proc_bind clauses are expected to work. The other clauses are not supported.

Can you provide some more details (like the error message, a reproducer etc) so that I can have a look later today?

Thanks,
Kiran

giuseros · August 11, 2021, 11:50am

Hi @kiranchandramohan ,
Thanks for the prompt reply. I am an MLIR newbie, so I am sorry in advance for anything wrong I might say . So I am trying to parallelize the outer for loop of a GEMM operation (from linalg). This means a omp.parallel with a scf.for loop inside. From your message, I gather this is not supported yet. Am I correct?

Thanks,
Giuseppe

kiranchandramohan · August 11, 2021, 12:59pm

Thanks, @giuseros for the reply. No worries, thanks for using the dialect.

I guess you are trying the following. This should work, so I am interested in seeing the error that you are getting. But this might not be what you actually want since the following will run the entire “nested for loops” in each of the threads.

omp.parallel {
  scf.for .... {
    scf.for ... {
      scf.for ... {
      }
    }
  }
}

If you want to parallelise the outermost for loop of GEMM then you have to replace the outermost for with an omp.wsloop operation like given below,

omp.parallel {
  omp.wsloop ... {
    scf.for ... {
      scf.for ... {
      }
    }
}

Joejiong · January 11, 2022, 10:24am

what if I have some return from the scf.for loop
so I mean since I use iterargs in %some = scf.for …

kiranchandramohan · January 11, 2022, 12:00pm

Is the return in the scf.for that is meant to be parallelised? Is that because you have a loop carried dependence? If so I think it is likely not a good candidate for parallelisation.

Topic		Replies	Views
About OpenMP dialect in MLIR LLVM Dev List Archives	23	232	February 19, 2020
RFC: OpenMP dialect in MLIR MLIR	10	3509	April 17, 2020
[RFC] Disambiguation between loop and block associated omp::ParallelOp MLIR openmp	21	353	August 2, 2024
RFC for omp.target construct MLIR	13	1243	July 1, 2021
[RFC] Clause-based representation of OpenMP dialect operations MLIR	7	401	June 6, 2024

[OpenMP] Parallel Operation design issues

Related topics