Introduction
The workdistribute construct is scheduled to be officially released later this year in OpenMP 6.0 (see the version 6.0 preview).
Below is an example of what it looks like:
real, dimension(n, n) :: x, y, z, tmp
real :: a
logical :: any_less
!$omp target teams workdistribute
y = a * x + y
!$omp end target teams workdistribute
!$omp target teams workdistribute
y(1:n/2,1:n) = 1.0
y = y + x
tmp = a * matmul(x, y + 1.0)
any_less = any(tmp(1:n/2,1:n/3) < 1.0)
!$omp end target teams workdistribute
In short, all teams share the work contained in the workdistribute construct, while preserving the semantics of the Fortran code (e.g. the ordering of statements is enforced, and the RHS of an assignment must appear to be fully evaluated before it is assigned to the LHS).
Early prototype
I have implemented an early proof-of-concept prototype here.
This works for simple cases on AMD GPUs.
Notably missing are strided array assignments and most of the array intrinsics.
(It also uses the old name coexecute, which has been renamed to workdistribute in the current OpenMP 6.0 draft.)
Restrictions
A workdistribute construct must be closely nested inside a teams construct, and only the following statements are legal inside a workdistribute:
- array assignments
- scalar assignments
- calls to pure and elemental procedures
According to the standard, statements are divided into units of work as follows:
- Evaluation of individual elements in array expressions (assignments and elementals) is a unit of work
- Intrinsics (matmul, any, etc.) can be divided into any number of units of work (see the sketch after this list)
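As a concrete illustration of these rules, here is a hedged sketch in the style of the introductory example (n is assumed to be declared elsewhere):
real, dimension(n, n) :: x, y, z
real :: a
!$omp target teams workdistribute
y = x + z          ! array assignment: each element evaluation is a unit of work
a = 2.0            ! scalar assignment
y = sin(x)         ! elemental intrinsic: one unit of work per element
z = matmul(x, y)   ! transformational intrinsic: may be split into any number of units of work
!$omp end target teams workdistribute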
Implementation
Trivial implementation
A trivial implementation would, for each statement, allocate temporaries for the RHS, execute a separate kernel for each expression in the RHS, and finally assign the result to the LHS in a separate kernel.
This can be implemented in the Fortran → MLIR frontend and would be fairly robust. However, it would suffer from poor performance.
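As a rough Fortran-level analogy (not the actual FIR-level codegen), the trivial scheme would expand the first statement of the introductory example, y = a * x + y, into something like the following, where tmp_rhs is a hypothetical temporary:
real, allocatable :: tmp_rhs(:, :)
integer :: i, j
allocate(tmp_rhs(n, n))
! kernel 1: evaluate the RHS into the temporary
!$omp target teams distribute parallel do collapse(2)
do j = 1, n
   do i = 1, n
      tmp_rhs(i, j) = a * x(i, j) + y(i, j)
   end do
end do
! kernel 2: assign the temporary to the LHS
!$omp target teams distribute parallel do collapse(2)
do j = 1, n
   do i = 1, n
      y(i, j) = tmp_rhs(i, j)
   end do
end do
deallocate(tmp_rhs)
The extra temporary and the second kernel launch are exactly the kind of overhead the optimizations below aim to remove.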
Optimizations
There is plenty of room for optimizations that we would like to perform, such as removal of temporaries and merging of kernels. Especially for GPU targets, we would also like to launch optimized versions of intrinsic functions such as matmul (e.g. cublas for CUDA targets).
Proposed pipeline
As we can see from the types of statements that can be included in a workdistribute, nearly all of them can be modelled by hlfir, and optimizations at the hlfir level have an opportunity to greatly impact the performance of the resulting code. We would therefore like to perform the division into units of work described above only after the high-level hlfir optimizations have run.
Thus, I propose the following pipeline for handling workdistribute:
We introduce a new workdistribute MLIR operation and emit the contents of the Fortran construct into it. We use the existing frontend codegen for this.
omp.target {
  omp.teams {
    omp.workdistribute {
      <mix of temp allocations, array exprs (elementals), intrinsics>
    }
  }
}
Then, we reuse the existing high-level optimizations to optimize the contents of the workdistribute and then bufferize hlfir. This will already give us many of the optimizations that we want (buffer elimination, kernel merging).
This leaves us in a state like this:
omp.target {
  omp.teams {
    omp.workdistribute {
      fir.allocmem ...
      fir.do_loop ... unordered {
        ...
      }
      fir.call RTMatmul(...)
      fir.do_loop ... unordered {
        ...
      }
      fir.call RTAssign(...)
      fir.freemem ...
    }
  }
}
Now, we need to chunk this computation into appropriate kernels. Because we want to be able to replace intrinsic calls such as RTMatmul with appropriate runtime calls from the host, we need to split the target region. This is also needed in order to allocate temporary memory (the fir.allocmem) and to execute code that needs to be executed on a single thread, such as scalar assignments.
Thus, we fission the workdistribute into either different target regions or regions executed on the host:
%a = omp_target_allocmem ...
omp.target {
  omp.teams {
    omp.workdistribute {
      fir.do_loop ... unordered {
        ...
      }
    }
  }
}
omp_target_RTMatmul(...)
omp.target {
  omp.teams {
    omp.workdistribute {
      fir.do_loop ... unordered {
        ...
      }
    }
  }
}
omp_target_RTAssign(...)
omp_target_freemem ...
This process is iterative and starts from the top of the original workdistribute region. We identify loop nests that we would like to parallelize and split them off into their own target regions. We identify what needs to be executed on the host (temporary memory allocations and frees, array intrinsic operations) and we put that on the host. Any other operations we either put on the host, if we can compute them there, or wrap in single-threaded target regions if they have memory effects that need to be preserved on the device. This happens, for example, when a read from device memory is used to construct the array descriptor of an array later in the computation.
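As a hypothetical illustration of that last case (the names below are made up for this example, in the style of the earlier snippets), the scalar m is read from device-resident memory and then determines the shape of a later array section, so its assignment has to stay on the device in a single-threaded target region:
integer :: m
integer, dimension(n) :: idx
real, dimension(n, n) :: x, y
!$omp target teams workdistribute
m = idx(1)                    ! scalar read of device memory
y(1:m, :) = x(1:m, :) + 1.0   ! the descriptor of this section depends on m
!$omp end target teams workdistribute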
Note that we also need to allocate new variables, which get mapped to/from the device, in order to communicate values that are used across the now-split regions.
Note also that when we hoisted the array intrinsic runtime calls, we converted them to appropriate OpenMP target-enabled runtime calls (omp_target_*).
Now, we can transform the teams{workdistribute{do_loop{A}}} nests into teams{distribute{parallel{wsloop{A}}}}:
%a = omp_target_allocmem ...
omp.target { omp.teams { distribute { parallel { wsloop {...}}}}}
omp_target_RTMatmul(...)
omp.target { omp.teams { distribute { parallel { wsloop {...}}}}}
omp_target_RTAssign(...)
omp_target_freemem ...
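To make this concrete, here is a rough Fortran-level analogy of what the pipeline produces for the statement tmp = a * matmul(x, y + 1.0) from the introductory example. The names t1, t2 and omp_target_matmul are hypothetical, the real temporaries live in device memory (the omp_target_allocmem above) rather than in host allocatables, and the final RTAssign is folded into the second loop for brevity:
real, allocatable :: t1(:, :), t2(:, :)
integer :: i, j
allocate(t1(n, n), t2(n, n))                 ! stands in for omp_target_allocmem
! first split-off loop nest: evaluate y + 1.0
!$omp target teams distribute parallel do collapse(2)
do j = 1, n
   do i = 1, n
      t1(i, j) = y(i, j) + 1.0
   end do
end do
call omp_target_matmul(t2, x, t1)            ! host call into the OpenMP-enabled runtime
! second split-off loop nest: scale by a and assign to tmp
!$omp target teams distribute parallel do collapse(2)
do j = 1, n
   do i = 1, n
      tmp(i, j) = a * t2(i, j)
   end do
end do
deallocate(t1, t2)                           ! stands in for omp_target_freemem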
Note that, alternatively, we can have a custom lowering for hlfir.elemental operations when they are inside a workdistribute region: we can lower them directly to parallel{wsloop{}} nests instead of first going to fir.do_loop ... unordered and converting that to parallel{wsloop{}} afterwards. I think that is how it ultimately should work, but the prototype reuses the existing lowering for simplicity.
With this, we have successfully converted a workdistribute into existing OpenMP constructs that can be lowered to LLVM.
Fissioning omp.target and implications for the host/target interface
As we can see from the above example, by splitting the target region we generated new target regions, and we also needed to allocate temporary variables to pass values from one target region to the next one in the split. This means we added new variable mappings (arguments to each target region), which means we changed the interface between the host and the target.
I think the need to change the host/target interface during the compilation is
unavoidable if a high-performance implementation is desired.
My current prototype tries its best to be deterministic, so that the host module and the target module get transformed in the same way and we end up with matching interfaces in the host and device modules. However, a more robust solution would be to bring the host module and all device modules being compiled into the same process and to ensure during the target splitting that we generate a valid interface.
That, however, may have a substantial effect on the memory usage of the compilation. It could also result in a lot of added complexity and, consequently, maintenance problems, so I would be interested in what people's thoughts are on this.
OpenMP-enabled array intrinsic runtime
As I showed above, we convert array intrinsics which appear inside target regions into OpenMP-enabled versions of the same array intrinsics that are called from the host.
This enables us to use vendor libraries such as cublas and rocblas in them, and if that is not possible, we can fall back to a generic OpenMP target implementation.
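For illustration, here is a hedged sketch of what such a generic fallback could look like for a real matmul. The subroutine name and interface are hypothetical and do not correspond to the actual Flang runtime entry points; in the real runtime the operands would already be resident on the device rather than mapped per call:
subroutine omp_target_matmul_real4(c, a, b, m, n, k)
   integer, intent(in) :: m, n, k
   real, intent(in)    :: a(m, k), b(k, n)
   real, intent(out)   :: c(m, n)
   integer :: i, j, l
   ! generic device kernel, used when no vendor library (cublas/rocblas) applies
   !$omp target teams distribute parallel do collapse(2) map(to: a, b) map(from: c)
   do j = 1, n
      do i = 1, m
         c(i, j) = 0.0
         do l = 1, k
            c(i, j) = c(i, j) + a(i, l) * b(l, j)
         end do
      end do
   end do
end subroutine omp_target_matmul_real4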
An example of a simple rocblas implementation for a subset of matmuls can be found here.
Conclusion
I would love to hear back from the community about this and would very much appreciate any feedback on the pipeline, the overall approach, or anything else.
Please feel free to ask questions, as I may have missed important details.
I can start work on upstreaming parts of this if people deem it useful, or we can work together on making the approach better.