Flang Technical Call : Summary of presentation on OpenMP for Flang

Thanks, Eric, for the clarification.

Also, sharing this write-up of the flow through the compiler for an OpenMP construct. The first one (Proposed Plan) is as per the presentation. The second one (Modified Plan) incorporates Eric's feedback to lower the F18 AST to a mix of the OpenMP and FIR dialects.

I. Proposed plan

  1. Example OpenMP code

    !$omp parallel
    c = a + b
    !$omp end parallel

  2. Parse tree (Copied relevant section from -fdebug-dump-parse-tree)
    | | ExecutionPartConstruct → ExecutableConstruct → OpenMPConstruct → OpenMPBlockConstruct
    | | | OmpBlockDirective → Directive = Parallel
    | | | OmpClauseList →
    | | | Block
    | | | | ExecutionPartConstruct → ExecutableConstruct → ActionStmt → AssignmentStmt
    | | | | | Variable → Designator → DataRef → Name = 'c'
    | | | | | Expr → Add
    | | | | | | Expr → Designator → DataRef → Name = 'a'
    | | | | | | Expr → Designator → DataRef → Name = 'b'
    | | | OmpEndBlockDirective → OmpBlockDirective → Directive = Parallel

  3. The first lowering will be to the FIR dialect, which has a pass-through operation for OpenMP. This operation has a nested region containing the code influenced by the OpenMP directive; the contained region will have other FIR (or standard dialect) operations.
    mlir.region(…) {
      %1 = fir.x(…) …
      %20 = fir.omp attribute:parallel {
        %1 = addf %2, %3 : f32
      }
      %21 = …
    }

  4. The next lowering will be to the OpenMP and LLVM dialects. The OpenMP dialect has an operation called parallel with a nested region of code; the nested region will have LLVM dialect operations.
    mlir.region(…) {
      %1 = llvm.xyz(…) …
      %20 = omp.parallel {
        %1 = llvm.fadd %2, %3 : !llvm.float
      }
      %21 = …
    }

  5. The next conversion will be to LLVM IR. Here the OpenMP dialect will be lowered using the OpenMP IRBuilder, and the LLVM dialect through its translation library. The IRBuilder will see that there is a region under the omp.parallel operation, collect all the basic blocks inside that region, and generate an outlined function from those basic blocks. Suitable calls to the OpenMP runtime API will be inserted.

define void @outlined_parallel_fn(…) {
  %1 = fadd float %2, %3
}

define void @xyz(…) {
  %1 = alloca float
  …
  call void @__kmpc_fork_call(…, @outlined_parallel_fn, …)
}
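For reference, here is a slightly fuller sketch of the final IR with the actual runtime entry point. The ident_t location argument, the argument count, and the microtask signature below are assumptions based on the current libomp ABI (the outlined function receives the global and bound thread ids first), not output produced by the IRBuilder:

define internal void @outlined_parallel_fn(i32* %global_tid, i32* %bound_tid, float* %a, float* %b, float* %c) {
  ; body of the parallel region, executed by each thread of the team
  ret void
}

define void @xyz(…) {
  ; __kmpc_fork_call(loc, argc, microtask, args...) forks the team and
  ; runs the outlined microtask on each thread
  call void (%struct.ident_t*, i32, void (i32*, i32*, ...)*, ...)
      @__kmpc_fork_call(%struct.ident_t* @.loc, i32 3,
                        void (i32*, i32*, ...)* bitcast (void (i32*, i32*, float*, float*, float*)* @outlined_parallel_fn to void (i32*, i32*, ...)*),
                        float* %a, float* %b, float* %c)
  ret void
}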

II. Modified plan

The differences are only in steps 3 and 4. Other steps remain the same.

  3. The first lowering will be to a mix of the FIR and OpenMP dialects. The OpenMP dialect has an operation called parallel with a nested region of code; the nested region will have FIR (and standard dialect) operations.
    mlir.region(…) {
      %1 = fir.x(…) …
      %20 = omp.parallel {
        %1 = addf %2, %3 : f32
      }
      %21 = …
    }

  4. The next lowering will be to the OpenMP and LLVM dialects.
    mlir.region(…) {
      %1 = llvm.xyz(…) …
      %20 = omp.parallel {
        %1 = llvm.fadd %2, %3 : !llvm.float
      }
      %21 = …
    }

Thanks,
Kiran

A walkthrough for the collapse clause on an OpenMP loop construct is given below. This is an example where the transformation (collapse) is performed in the MLIR layer itself.

1) Fortran OpenMP code with collapse
!$omp parallel do private(j) collapse(2)
do i=lb1,ub1
  do j=lb2,ub2
    …
  end do
end do

2) The Fortran source with OpenMP will be converted to an AST by the F18 parser. The parse tree is not shown here to keep it short.

3.a) The parse tree will be lowered to a mix of the FIR and OpenMP dialects. There are omp.parallel and omp.do operations in the OpenMP dialect, which represent the parallel and OpenMP loop constructs. The omp.do operation has a "collapse" attribute which specifies the number of loops to be collapsed.
omp.parallel {
  omp.do {collapse = 2} %i = %lb1 to %ub1 : !fir.integer {
    fir.do %j = %lb2 to %ub2 : !fir.integer {
      …
    }
  }
}

3.b) A transformation pass in MLIR will perform the collapsing. The collapse attribute will cause the omp.do loop to be coalesced with the loop nested immediately inside it. Note: loop-coalescing passes already exist among the MLIR transformation passes; we should try to make use of them. A sketch of the coalescing arithmetic follows the code below.

omp.parallel {
  %ub3 = …
  omp.do %i = 0 to %ub3 : !fir.integer {
    …
  }
}
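For illustration, the computation of %ub3 and the recovery of the original induction variables inside the coalesced loop might look roughly as follows. This is only a sketch: the standard-dialect op names (subi, addi, muli, divi_signed, remi_signed) and the index type are assumptions about what the lowering would use, and inclusive bounds are assumed so that each trip count is ub - lb + 1.

omp.parallel {
  %c1 = constant 1 : index
  %t1 = subi %ub1, %lb1 : index
  %tc1 = addi %t1, %c1 : index      // trip count of the outer loop
  %t2 = subi %ub2, %lb2 : index
  %tc2 = addi %t2, %c1 : index      // trip count of the inner loop
  %ub3 = muli %tc1, %tc2 : index    // trip count of the coalesced loop
  omp.do %iv = 0 to %ub3 : index {
    %q = divi_signed %iv, %tc2 : index
    %r = remi_signed %iv, %tc2 : index
    %i = addi %q, %lb1 : index      // recovered outer induction variable
    %j = addi %r, %lb2 : index      // recovered inner induction variable
    …
  }
}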

4) The next conversion will be to a mix of the LLVM and OpenMP dialects.

omp.parallel {
  %ub3 = …
  omp.do %i = 0 to %ub3 : !llvm.integer {
    …
  }
}

5) Finally, LLVM IR will be generated for this code. The translation to LLVM IR can make use of the OpenMP IRBuilder. The LLVM IR is not shown here to keep it short.

Thanks,
Kiran

Where can we find more details of FIR (a dialect of MLIR)?

thanks,
-Prashanth

Hello Prashanth,

You can find some information about the FIR dialect from the following page.
https://github.com/flang-compiler/f18/blob/master/documentation/Investigating-FIR-as-an-MLIR-dialect.md

Two patches, which list the FIR types and operations, are under review or merged.
https://github.com/flang-compiler/f18/pull/668
https://github.com/flang-compiler/f18/pull/696

Thanks,
Kiran


Eric suggested nesting the outer loop inside the omp.do operation rather than having it as part of the omp.do operation. This will help maintain the semantics of the original source program if region inlining happens and the OpenMP operations are removed. I agree and have updated step 3 to the following.

3a) Mix of the OpenMP and FIR dialects after lowering the F18 AST to MLIR

omp.parallel {
  omp.do {collapse = 2} {
    fir.do %i = %lb1 to %ub1 : !fir.integer {
      fir.do %j = %lb2 to %ub2 : !fir.integer {
        …
      }
    }
  }
}

3b) After collapsing the loops

omp.parallel {
  omp.do {
    fir.do %i = 0 to %ub3 : !fir.integer {
      …
    }
  }
}

Thanks,
Kiran


This mail summarises the handling of the simd construct.

!$omp simd: The simd construct tells the compiler that the loop can be vectorised. Since vectorisation is performed by LLVM (see Note 1), the frontend passes the simd information to LLVM through metadata. Since the simd construct is handled entirely through metadata, we can skip the OpenMP IRBuilder for this construct (see Note 2).

  1. Consider the following source, which has a loop that adds two arrays and stores the result in another array. Assume that this loop is not trivially vectorisable due to aliasing issues (e.g. the arrays being pointers). An omp simd construct is used to inform the compiler that this loop can be vectorised.

!$omp simd simdlen(4)
do i=1,n
  c(i) = a(i) + b(i)
end do

  2. The Fortran program will be parsed and represented as a parse tree. The parse tree representation is skipped to keep it short.

  3. The parse tree is lowered to a mix of the OpenMP and FIR dialects in MLIR. A representation of this mix is given below. We have an operation omp.simd in the dialect which represents OpenMP simd. It has attributes for the various constant clauses like simdlen, safelen, etc. A reduction, if present on the omp simd statement, can be represented by another operation, omp.reduction (see the sketch after the code below). Any transformation necessary to expose reduction operations/variables (as specified in the reduction clause) can be performed in the OpenMP MLIR layer itself. The fir.do loop is nested inside the simd region.

omp.simd {simdlen=4} {
  fir.do %i = 1 to %n : !fir.integer {
    %a_val = fir.load %a_addr[%i] : memref
    %b_val = fir.load %b_addr[%i] : memref
    %c_val = addf %a_val, %b_val : !fir.float
    fir.store %c_val, %c_addr[%i] : memref
  }
}
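As a sketch of the reduction case mentioned in step 3, assuming a sum reduction over the elements of a: the omp.reduction operation and its textual form here are hypothetical, shown only to indicate where such an operation would sit relative to the loop.

omp.simd {simdlen=4} {
  fir.do %i = 1 to %n : !fir.integer {
    %a_val = fir.load %a_addr[%i] : memref
    // hypothetical operation marking the update of the reduction variable
    omp.reduction add %a_val, %sum_addr : !fir.float
  }
}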

  4. For this construct, the next step is to lower the OpenMP and FIR dialects to the LLVM dialect. During this lowering, information is added via attributes to the memory instructions and to the loop branch instruction:
    a) the memory access instructions carry an attribute which denotes that they can be executed in parallel;
    b) the loop branch instruction carries attributes for enabling vectorisation, setting the vectorisation width, and pointing (via the access group) to all memory access operations which can be parallelised.

^body:
  %a_val = llvm.load %a_addr : !llvm<"float*"> {access_group = 1}
  %b_val = llvm.load %b_addr : !llvm<"float*"> {access_group = 1}
  %c_val = llvm.fadd %a_val, %b_val : !llvm.float
  llvm.store %c_val, %c_addr : !llvm<"float*"> {access_group = 1}
  llvm.cond_br %check, ^s_exit, ^body {vectorize_width = 4, vectorize_enable = 1, parallel_loop_accesses = 1}

^s_exit:
  llvm.cond_br %7, ^bb6, ^bb7

  5. The LLVM dialect MLIR is translated to LLVM IR. In this stage, all the attributes from step 4 are translated to metadata.

body:
  %a_val = load float, float* %a_addr, !llvm.access.group !1
  %b_val = load float, float* %b_addr, !llvm.access.group !1
  %c_val = fadd float %a_val, %b_val
  store float %c_val, float* %c_addr, !llvm.access.group !1
  br i1 %check, label %s_exit, label %body, !llvm.loop !2

s_exit:
  …

!1 = distinct !{}
!2 = distinct !{!2, !3, !4, !5}
!3 = !{!"llvm.loop.vectorize.width", i32 4}
!4 = !{!"llvm.loop.vectorize.enable", i1 true}
!5 = !{!"llvm.loop.parallel_accesses", !1}

Note:

  1. There is also support for vectorisation in MLIR. I am assuming that it is not as good as the LLVM vectoriser, and hence am not using MLIR vectorisation.
  2. For this construct, we have chosen not to use the OpenMP IRBuilder. There is still one possible reason to use it even for this simple case: if LLVM decides to change the loop metadata, it would only have to be changed in the OpenMP IRBuilder; if we do not use the IRBuilder, developers will have to change the metadata generation in both Clang and Flang. I assume that this happens rarely and hence is OK.


Handling of Target construct

Hi Kiran-

So, the entire lowering to LLVM IR will happen in the frontend itself, right? That would mean that high-level opts on OpenMP loops will now need to happen in the frontend?

-Thx

dd

> Hi Kiran-
>
> So, the entire lowering to LLVM IR will happen in the frontend itself, right? That would mean that high-level opts on OpenMP loops will now need to happen in the frontend?

I don't know that we know yet where these need to happen, but based on my discussions with Johannes, the loop transformations can be implemented in the OpenMPIRBuilder.

-Hal


Hello Dibyendu,

MLIR supports several loop transformation passes, as can be seen from the following link. So we can perform all the supported loop transformations in this layer.
https://github.com/tensorflow/mlir/tree/master/lib/Transforms

As Hal says, it can be done in the OpenMP IRBuilder too, if the IRBuilder supports it. But this would require the loop information to be carried in the OpenMP dialect: all other dialects will be converted to the LLVM dialect before translation to LLVM IR, and the LLVM dialect does not have loops (see the sketch below).
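To illustrate that last point: after conversion to the LLVM dialect, a loop survives only as a CFG of blocks and branches, so there is no loop operation left to transform. A rough sketch (block and value names are illustrative):

^header(%i : !llvm.i64):
  %cond = llvm.icmp "slt" %i, %n : !llvm.i64
  llvm.cond_br %cond, ^body(%i : !llvm.i64), ^exit
^body(%iv : !llvm.i64):
  // loop body, followed by the increment and the back edge
  %next = llvm.add %iv, %one : !llvm.i64
  llvm.br ^header(%next : !llvm.i64)
^exit:
  …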

What loop transformations are you thinking about here?

Thanks,
Kiran


Thanks Kiran. As a start I was thinking of fusion/fission/blocking, which seem to be covered by MLIR (we must look at the efficacy though), plus some other stuff as outlined in https://www2.cs.arizona.edu/~ianbertolacci/publications/IWOMP18_paper_Extending_OpenMP_to_Facilitate_Loop_Optimization.pdf

Also, depend clauses on OpenMP tasks can create task graphs which we may want to optimize in some form; for example, see below.
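For example (illustrative Fortran), the depend clauses below create an edge from the first task to the second; a task-graph-aware optimizer could merge or reorder such tasks when legal:

!$omp task depend(out: x)
x = f()
!$omp end task

!$omp task depend(in: x)
y = g(x)
!$omp end task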

I'm traveling and have not yet gone through the entire thread, so please take this as me thinking out loud based on what I've read so far:

  1. We also want/have to optimize OpenMP that is produced by Clang, so solutions in LLVM-IR are, at least for the foreseeable future, preferable.

  2. We can do most transformations that are actually legal in LLVM-IR with the right abstractions:

a) for scalar transformations, see my LLVM-dev talk last year and the IWOMP and LCPC papers from last year;

b) for actual optimization of parallel code in LLVM mainline, there are two patches under review that will already give you that (through the Attributor, featured at this year's LLVM-dev if you're interested);

c) my IWOMP papers (last year and this year) describe parallelism-specific optimizations and the problems you face due to the dynamic nature of OpenMP, as well as solutions to overcome problems with the current encoding of OpenMP;

d) keep in mind that OpenMP often dictates certain behavior, so you need to be aware of this when you reuse (high-level) transformations. With the exception of maybe the loop construct and taskloop, most directives provide the user with strong execution guarantees; e.g. omp parallel does not mean there are no dependences, and omp for loops cannot generally be peeled.

I'll be back in the office next week. I'd love to discuss problems and opportunities wrt OpenMP-specific optimizations in detail with all interested :)

Cheers,

Johannes

[+Michael, who is driving the loop-optimization effort for the OpenMP standard now]

It seems to me that the future will bring OpenMP directive combinations which combine loop transformations and parallelism transformations in arbitrary orderings. Thus, at least for the directive-driven transformations, doing these inside the OpenMPIRBuilder may make the most sense. For cost-model-driven, or canonicalization, transformations, we can have these performed at the MLIR level, at the LLVM level, or both. Hopefully, regardless, we'll be able to share a lot of infrastructure for dependence analysis and the like.

-Hal