[RFC] OpenMP workdistribute construct implementation in Flang

Introduction

The workdistribute construct is scheduled to be officially released later this year in OpenMP 6.0. (see ver 6.0 preview)

Below is an example of what it looks like:

real, dimension(n, n) :: x, y, z, tmp
real :: a, any_less

!$omp target teams workdistribute
      y = a * x + y
!$omp end target teams workdistribute

!$omp target teams workdistribute
      y(1:n/2,1:n) = 1.0
      y = y + x
      tmp = a * matmul(x, y + 1.0)
      any_less = any(tmp(1:n/2,1:n/3) < 1.0)
!$omp end target teams workdistribute

In short, all teams share the work contained in the workdistribute construct,
while preserving the semantics of the Fortran code (e.g. the ordering of
statements is enforced, and the RHS of an assignment must appear to complete
before the assignment to the LHS).

Early prototype

I have implemented an early proof-of-concept prototype here.
It works for simple cases on AMD GPUs.
Notably missing are strided array assignments (an example is sketched below)
and most of the array intrinsics.
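
For reference, a strided array assignment of the kind the prototype does not yet
handle would look something like this (illustrative only):

real, dimension(n, n) :: x, y

!$omp target teams workdistribute
      y(1:n:2, :) = x(1:n:2, :) + 1.0
!$omp end target teams workdistribute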

(It also uses the old name coexecute. This has been renamed to
workdistribute in the current OpenMP 6.0 draft)

Restrictions

workdistribute must be closely nested inside a teams construct, and these are
the statements that are legal to nest inside a workdistribute:

  • array assignments
  • scalar assignments
  • calls to pure and elemental procedures

According to the standard, statements are divided into units of work as follows
(see the sketch after this list):

  • Evaluation of an individual element in an array expression (assignments and elementals) is a unit of work
  • Intrinsics (matmul, any, etc.) can be divided into any number of units of work
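
As a rough sketch (not how the compiler emits code), the element-wise units of
work for an array assignment such as y = a * x + y can be pictured as an
explicitly distributed and work-shared loop nest:

real, dimension(n, n) :: x, y
real :: a

! Conceptually, each element evaluation y(i, j) = a * x(i, j) + y(i, j)
! is one unit of work, shared among the teams and their threads.
!$omp target teams distribute parallel do collapse(2)
      do j = 1, n
        do i = 1, n
          y(i, j) = a * x(i, j) + y(i, j)
        end do
      end do
!$omp end target teams distribute parallel do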

Implementation

Trivial implementation

A trivial implementation would, for each statement, allocate temporaries for the
RHS, execute a separate kernel for each expression in the RHS, and finally
assign the result to the LHS in a separate kernel.

This can be implemented in the Fortran → MLIR frontend, and will be fairly
robust. However, it would suffer from poor performance.
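
For instance, for y = a * x + y the trivial scheme would behave roughly like the
following hand-written equivalent, with tmp as a hypothetical compiler-generated
temporary (data mapping omitted for brevity):

real, dimension(n, n) :: x, y
real, allocatable :: tmp(:, :)
real :: a

allocate(tmp(n, n))
! Kernel 1: evaluate the RHS into the temporary.
!$omp target teams distribute parallel do collapse(2)
      do j = 1, n
        do i = 1, n
          tmp(i, j) = a * x(i, j) + y(i, j)
        end do
      end do
!$omp end target teams distribute parallel do
! Kernel 2: assign the temporary to the LHS.
!$omp target teams distribute parallel do collapse(2)
      do j = 1, n
        do i = 1, n
          y(i, j) = tmp(i, j)
        end do
      end do
!$omp end target teams distribute parallel do
deallocate(tmp)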

Optimizations

There is significant room for optimizations which we would like to perform, such
as removal of temporaries and merging of kernels. Especially for GPU targets, we
would also like to launch optimized versions of intrinsic functions such as
matmul (e.g. cublas for CUDA targets).

Proposed pipeline

As we can see from the types of statements that can be included in a
workdistribute, nearly all of them can be modelled by hlfir. Optimizations on the
hlfir level have an opportunity to greatly impact the performance of the
resulting code.

We would therefore like to perform the division into units of work described
above only after we have performed high-level optimizations on hlfir.

Thus, I propose the following pipeline for handling workdistribute:

We introduce a new workdistribute MLIR operation and emit the contents of the
Fortran construct into it, reusing the existing frontend codegen.

omp.target {
  omp.teams {
    omp.workdistribute {
      <mix of temp allocations, array exprs (elementals), intrinsics>
    }
  }
}

Then, we reuse the existing high-level optimizations on the contents of the
workdistribute and bufferize hlfir. This already gives us many of the
optimizations that we want (buffer elimination, kernel merging).

This leaves us in a state like this:

omp.target {
  omp.teams {
    omp.workdistribute {
      fir.allocmem ...
      fir.do_loop ... unordered {
        ...
      }
      fir.call RTMatmul(...)
      fir.do_loop ... unordered {
      }
      RTAssign(...)
      fir.freemem ...
    }
  }
}

Now, we need to chunk this computation into appropriate kernels. Because we want
to be able to replace intrinsic calls such as RTMatmul with runtime calls issued
from the host, we need to split the target region. Splitting is also needed in
order to allocate temporary memory (the fir.allocmem) and to execute code that
must run on a single thread, such as scalar assignments.

Thus, we fission the workdistribute into either different target regions or
regions executed on the host:

%a = omp_target_allocmem ...
omp.target {
  omp.teams {
    omp.workdistribute {
      fir.do_loop ... unordered {
        ...
      }
    }
  }
}
omp_target_RTMatmul(...)
omp.target {
  omp.teams {
    omp.workdistribute {
      fir.do_loop ... unordered {
        ...
      }
    }
  }
}
omp_target_RTAssign(...)
omp_target_freemem ...

This process is iterative and starts from the top of the original workdistribute
region. We identify loop nests that we would like to parallelize and split them
off into their own target regions. We identify what needs to be executed on the
host (temporary memory allocations and frees, array intrinsic operations) and
place that on the host. Any other operations we either place on the host, if we
can compute them there, or wrap in single-threaded target regions if they have
memory effects which need to be preserved on the device. The latter happens
when, for example, a read from device memory is used to construct the array
descriptor of an array later in the computation.
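
A hypothetical example of that last case: a scalar read from device memory that
later feeds an array descriptor, so it cannot simply be dropped or recomputed on
the host:

integer :: m
integer, dimension(n) :: idx
real, dimension(n) :: x, y

!$omp target teams workdistribute
      m = idx(1)              ! scalar assignment reading device memory
      y(1:m) = x(1:m) + 1.0   ! the descriptor of y(1:m) depends on m
!$omp end target teams workdistribute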

Note that we also need to allocate new variables, mapped to/from the device, in
order to communicate values which are used across the now-split regions.

Note also that when we hoist the array intrinsic runtime calls, we convert them
to appropriate OpenMP target-enabled runtime calls (omp_target_*).

Now, we can transform the teams{workdistribute{do_loop{A}}} nests into teams{distribute{parallel{wsloop{A}}}} nests:

%a = omp_target_allocmem ...
omp.target { omp.teams { distribute { parallel { wsloop {...}}}}}
omp_target_RTMatmul(...)
omp.target { omp.teams { distribute { parallel { wsloop {...}}}}}
omp_target_RTAssign(...)
omp_target_freemem ...

Note that, alternatively, we can have a custom lowering for hlfir.elemental
operations when they are inside a workdistribute region. We can lower them
directly to parallel{wsloop{}} nests instead of going through
fir.do_loop ... unordered and converting afterwards. I think that is how it
should ultimately work, but the prototype reuses the existing lowering for
simplicity.

With this, we have successfully converted a workdistribute into existing OpenMP constructs which can be lowered to LLVM.

Fissioning omp.target and implications on the host/target interface

As we can see from the above example, splitting the target region generates new
target regions. We also need to allocate temporary variables to pass values from
one target region to the next in the split. This means that we add new variable
mappings (arguments to each target region), and therefore change the interface
between the host and the target.

I think the need to change the host/target interface during the compilation is
unavoidable if a high-performance implementation is desired.

My current prototype tries its best to be deterministic so that the host module
and the target module are transformed in the same way and we end up with
matching interfaces in the host and device modules. However, a more robust
solution would be to bring the host module and all device modules being compiled
into the same process and ensure during the target splitting that we generate a
valid interface.

That, however, may have a substantial effect on the memory usage of the
compilation. It can also potentially result in a lot of added complexity and,
consequently, maintenance problems, so I would be interested in what people’s
thoughts are on this.

OpenMP enabled array intrinsic runtime

As I showed above, we convert array intrinsics which are inside target regions
to OpenMP-enabled versions of the same intrinsics, called from the host.

This enables us to use vendor libraries such as cublas and rocblas in them, and
if that is not possible, we can fall back to a generic OpenMP target
implementation.
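
As an illustration of that fallback path, a generic OpenMP-offloaded matmul
could look roughly like the sketch below. The routine name and interface are
hypothetical; the real runtime entry point would dispatch to a vendor library
first and only fall back to a loop nest like this one.

! Hypothetical generic fallback for an OpenMP-enabled matmul runtime.
subroutine workdistribute_matmul_fallback(a, b, c, m, n, k)
  integer, intent(in) :: m, n, k
  real, intent(in) :: a(m, k), b(k, n)
  real, intent(out) :: c(m, n)
  integer :: i, j, l

  ! In practice the buffers would already be device-resident; the map
  ! clauses are only here to keep the sketch self-contained.
  !$omp target teams distribute parallel do collapse(2) map(to: a, b) map(from: c)
  do j = 1, n
    do i = 1, m
      c(i, j) = 0.0
      do l = 1, k
        c(i, j) = c(i, j) + a(i, l) * b(l, j)
      end do
    end do
  end do
  !$omp end target teams distribute parallel do
end subroutine workdistribute_matmul_fallback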

A simple example rocblas implementation for a subset of matmuls can be found
here.

Conclusion

I would love to hear back from the community about this and would very much
appreciate any feedback on the pipeline, the overall approach, or anything else.
Please feel free to ask questions, as I may have missed important details.

I can start work on upstreaming parts of this if people deem it useful or we can
work together on making the approach better.

@kiranchandramohan @jdoerfert


Thanks for writing this RFC. Since this involves factors relating to OpenMP, device offloading, HLFIR, etc., it might be good to gather some feedback from more people. A good venue for this would be either the flang call this Wednesday or the technical call next Monday (both at 4.30pm UK time). Let me know if you are interested.

Of immediate interest is whether your work can be reused for implementing the workshare construct. It is probably simpler there since there are no host and device issues.

Is it always the case that memory allocations should be performed on the host? If so, we will be hitting similar issues already in OpenMP/OpenACC target/device regions. @agozillon did you face this issue?

This increased data transfer between the host and device might kill all the benefits.

This is an interesting alternative. How will we handle calls to the runtime assign function?

Currently, there is separate host and device compilation due to preprocessing differences for host and device.

We have to check whether what is proposed here works with the pipeline that is set up currently. @skatrak @agozillon @jansjodi would the proposed changes work correctly with the pipeline?

I think it may be different here because we are talking about array-sized temporary allocations and not just scalar variables, but I do not know what other places may have this.

This happens in the following example:

real, dimension(n, n) :: x, y, z
!$omp target teams workdistribute
      z = matmul(x, sqrt(y))
!$omp end target teams workdistribute

There will be two kernels here, one which does the matmul and one for the sqrt. We need a temporary array for the result of sqrt. Then this array needs to be communicated to the next kernel, which does the matmul.

In this case the current implementation does omp_target_alloc on the host and passes the array as an argument to both sqrt and matmul.

I am not sure how ubiquitous array allocation capabilities are across vendors and accelerators, which is why I went for this approach.

Here, the need is for small allocations of scalars. These will stay on the device if they are not needed on the host. For example here:

real, dimension(n) :: x, y, z, t
real :: a

a = maxval(x)
y = a * x

I expect a to stay on the device as there is no need for it on the host.

The places where I have observed we would definitely need to bring values back to the host are in cases such as:

a = maxval(x)
y(1:a,1:a) = matmul(x(1:a,1:a),z(1:a,1:a))

This is where the vendor *blas implementation needs to know the matrix size before calling the GPU kernel. This a will get passed back to the host using a map(from:) clause.
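
At the directive level that mapping would look roughly like this (illustrative only):

! a is produced on the device but mapped back because the host needs it
! to set up the vendor *gemm call for the following statement.
!$omp target map(to: x) map(from: a)
      a = maxval(x)
!$omp end target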

I also think there are optimizations that can be done to limit the need for this or even reduce the amount of transfer that was generated by the frontend if we have both the device and host code in a single module.

Assign would still result in a runtime call in this case. I was thinking of this alternative lowering for hlfir.elemental operations.

(I was actually considering implementing a loop-like lowering for hlfir.assign, which would make life a lot easier since we would not need to implement an OpenMP version of the assign runtime call. However, I did not find an easy way to do this immediately, and it may need a conditional for the aliasing/non-aliasing cases, so I have put it on hold.)
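
For example, an overlapping assignment such as the following is one reason a plain element-wise parallel loop is not always enough:

real, dimension(n) :: y

! Fortran semantics require the RHS to appear fully evaluated before the
! LHS is written, so a naive in-place element-wise copy would read values
! that have already been overwritten.
y(2:n) = y(1:n-1)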

What does preprocessing here refer to? I assume the frontend preprocessor?

This transformation happens at the end of the hlfir pipeline / start of the fir pipeline, so differences after that point in the pipeline should be fine, but we would need to be deterministic across host and device until then. I think this may be too flaky, so in my opinion the best option would be to bring the device and host modules into the same process/module (after we are done with preprocessing?) so that we can transform them at the same time. However, I am not sure if that is acceptable due to compile-time memory and time constraints and implementation complexity considerations.

I like this idea for an alternative lowering of hlfir.elemental. I think you could also do something similar for transformational intrinsic operations like hlfir.matmul - thereby giving a bit more control over how these are bufferized (instead of having to work with whatever was done for the single-threaded host runtime library implementation).

Thank you @ivanradanov for this work and the detailed explanation. I think the overall approach makes sense. I have a couple of general questions and comments.

With regards to the splitting of target regions, what would happen to operations inside of the teams construct but outside of workdistribute? And if there are multiple workdistribute constructs in the same target region, possibly with other statements in between? I can’t see anything in the spec to disallow that, but let me know if I’m wrong, since I’m not that familiar with this construct.

!$omp target teams
  !$omp distribute parallel do
  do ...
  end do
  !$omp end distribute parallel do
  !$omp workdistribute
    ...
  !$omp end workdistribute
  call foo()
  !$omp workdistribute
    ...
  !$omp end workdistribute
  call bar()
!$omp end target teams

Regarding the data transfer concerns raised by @kiranchandramohan as a result of splitting target regions, wouldn’t these OpenMP intrinsics optimized for the device take buffers already allocated and updated on the device? We wouldn’t want to add to or from map clauses when going back and forth between host and device for these intermediate buffers if the host is not going to read from or write to them. I’m guessing there is a set of OpenMP constructs/clauses we could generate to reduce this overhead to the time it takes to allocate/deallocate buffers and the time lost to launching each kernel. The only data transfers would be those that we’d have to do regardless of this splitting.

Also, if array sizes for intermediate results are known in advance, we could avoid splitting some parts of target regions containing none of these optimized intrinsics by hoisting allocations out and passing them via map(alloc) clauses or something similar, reducing a bit of the splitting overhead.
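
At the directive level, that could look roughly like the following, with tmp standing in for a hoisted intermediate buffer (illustrative only):

real, dimension(n, n) :: x, y, z, tmp

! tmp is device-only: map(alloc:) makes it available to all of the split
! target regions without ever copying it to or from the host.
!$omp target data map(to: x, y) map(from: z) map(alloc: tmp)
  ! ... the split target regions and device runtime calls use tmp here ...
!$omp end target data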

I think the idea of merging host and device compilation into a single compiler invocation is a complex one to implement. Current support for offloading is already based on having separate compilation processes for the host and each offload target, which is shared with clang. The lowering process is about the same for both, and they diverge in specific points mainly based on the omp.is_target_device MLIR module attribute. It’s not perfect, and we do need to be careful about making sure that kernel interfaces for both host and device match, but the bulk of the code is shared and differences are relatively localized.

In my view, in addition to what Kiran mentioned, the main problem with transitioning to a single invocation is that the current approach is already shared between flang and clang, and if we were to change it I think we wouldn’t want them to diverge. Maybe it’d give us the ability to notice host/device interface mismatches earlier during compilation, rather than getting linker or runtime errors, but this change wouldn’t really give us any features we need to support workdistribute or any other constructs. So, my opinion is that it could be discussed separately involving the clang community (of which I think there is precedent), but we should avoid depending on it for this work and just base it on the current approach.

TR12 says the following:

The coexecute construct must be a closely nested construct inside a teams construct.

So having operations between teams and workdistribute would be illegal, and there can only be one workdistribute in a teams.

I am sorry, that seems to be a detail I missed when I wrote the initial post. In fact, we first wrap the omp.target in an omp.target_data construct which has the same mappings as the original omp.target, so as not to move the device arrays back and forth between the fissioned omp.target regions. Something like this (please excuse the incorrect syntax):

omp.target map %big_array(tofrom) { // original
  A
  B
}

Becomes

omp.target_data map %big_array(tofrom) {
  omp.target map %big_array(tofrom) { // original
    A
    B
  }
}

and then

omp.target_data map %big_array(tofrom) {
  omp.target map %big_array(tofrom) {
    A
  }
  omp.target map %big_array(tofrom) {
    B
  }
}

This means that no copies will be made between the two omp.target regions, only when we exit the omp.target_data.

This omp.target_data acts as a way to preserve the original memory movement semantics of the workdistribute as a whole. (There is some need to edit the clauses a bit; for example, we need to get rid of the always modifier for the inner omp.target, and there may be other adjustments as well.)

I agree on the complexity. There are other benefits beyond a more robust workdistribute, though: it would give us the ability to do host-device optimizations such as propagating constants into kernels or removing unneeded mappings, but the complexity would be (too?) high. (It would also massively increase the scope of this discussion.)

We can opt to go for the separate-invocation implementation, and if in the future flang and/or clang move to a single-module representation, we should be able to adapt this pipeline to make use of it.

Yes, but as far as I understand, being closely nested only means there are no other constructs nested in between them. So this is not allowed:

!$omp teams
  !$omp parallel
    !$omp workdistribute
      ...
    !$omp end workdistribute
  !$omp end parallel
!$omp end teams

But having other statements and constructs at the same nesting level as workdistribute seems to me to be allowed, as long as they are allowed inside of a teams region. Maybe @mjklemm can give us a clarification on that.

Thank you for the explanation; that was the kind of thing I was thinking about. The only thing I’d add is that this addresses the original map clauses of the target region, but not the intermediate buffers. If their sizes are known in advance, perhaps their allocations could be hoisted outside of the target region and then mapped in a way that never makes data copies.

I agree that this would be the best way forward on that issue. We don’t want to block this work based on it.

You can have some code (e.g., omp_get_num_teams()) between teams and workdistribute. But AFAIR, we have specified that only one workdistribute can be inside a teams construct.