[RFC] Adding support for OpenMP GPU target offload

TLDR;

This RFC proposes adding GPU compilation support for OpenMP Target offload constructs within MLIR, i.e., not requiring flang or clang to compile OMP MLIR programs.

The idea is to leverage the existing compilation infrastructure in the GPU dialect to enable OpenMP compilation.

Edit:
This proposal is not about lowering OMP operations to the GPU dialect; instead, it’s about using the GPU dialect compilation infrastructure to get to an executable using only MLIR, the OpenMPIRBuilder, libomptarget*.bc and the existing OpenMP runtime.

Why?

Currently, there is no way to compile MLIR OMP target offload ops for GPU targets without using flang or clang. This lack of GPU support has multiple consequences:

  • There are no integration tests for OMP target offload, as it is currently impossible to test it entirely within MLIR.
  • Dialect development becomes more complicated than needed: flang first has to support the OMP constructs in the front end before they can be tested, creating a development barrier.
  • The OMP dialect should be almost fully supported within MLIR.

Proposal:

Major:

  • Add the OffloadEmbedding GPU compilation attribute. This attribute translates GPU binaries in a way that’s compatible with Libomptarget, CUDART and HIP RT (a sketch of the resulting gpu.binary is shown after this list). This attribute could then be used in combination with the LLVM Offload project to provide a more general GPU runtime.
  • Add the omp.tgt_entry_info attribute for representing Target Entry information. This makes the entry explicit in the IR, making the mapping between host and device symbols easier.
  • Add the omp-target-outline-to-gpu pass. This pass outlines omp.target ops to a GPU module, making it possible to leverage GPU compilation infrastructure.
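
As a rough sketch of how these pieces would fit together (not a definitive design): once gpu-module-to-binary runs on the outlined module, the result could be a gpu.binary that keeps the proposed offload-embedding handler. The handler spelling matches the outlined example below; the object payload is purely illustrative:

// Hypothetical sketch: the offload-embedding handler stays on the binary so it
// can be registered with the OpenMP offload runtime at program startup.
gpu.binary @omp_offload <#gpu.offload_embedding<omp>> [
  #gpu.object<#nvvm.target<chip = "sm_70">, bin = "BINARY">
]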

All:

The proposed set of PRs would also enable JIT compilation with mlir-cpu-runner*, meaning that integration tests would now be possible.

  • There’s a small bug where the cl option march gets registered twice, once by mlir-cpu-runner and once as a consequence of libomptarget. But if the double registration is avoided, then it works.

Small example:

Host only module:

module attributes {omp.is_target_device = false, omp.is_gpu = false} {
  func.func @targetFn() -> () attributes {omp.declare_target = #omp.declaretarget<device_type = (any), capture_clause = (to)>} {
    return
  }
  llvm.func @main() {
    omp.target {
      func.call @targetFn() : () -> ()
      omp.terminator
    }
    llvm.return
  }
}

After applying mlir-opt --omp-target-outline-to-gpu, the omp.target ops get outlined, as do the declare target symbols. Furthermore, the entry information becomes explicit:

module attributes {gpu.container_module, omp.is_gpu = false, omp.is_target_device = false} {
  gpu.module @omp_offload <#gpu.offload_embedding<omp>> attributes {omp.is_gpu = true, omp.is_target_device = true} {
    func.func @main() attributes {omp.outline_parent_name = "main"} {
      omp.target info = #omp.tgt_entry_info<deviceID = 64771, fileID = 12453258, line = 6, section = @omp_offload> {
        func.call @targetFn() : () -> ()
        omp.terminator
      }
      return
    }
    func.func @targetFn() attributes {omp.declare_target = #omp.declaretarget<device_type = (any), capture_clause = (to)>} {
      return
    }
  }
  func.func @targetFn() attributes {omp.declare_target = #omp.declaretarget<device_type = (any), capture_clause = (to)>} {
    return
  }
  llvm.func @main() {
    omp.target info = #omp.tgt_entry_info<deviceID = 64771, fileID = 12453258, line = 6, section = @omp_offload> {
      func.call @targetFn() : () -> ()
      omp.terminator
    }
    llvm.return
  }
}

Example from flang:

I took the following program:

program main
  integer :: x;
  integer :: y;
  x = 0
  y = 1
!$omp target map(from:x)
    x = 5 + y
!$omp end target
  print *, "x = ", x
end program main

  • Saved the *-llvmir.mlir file produced by flang. Result: test-lvmir.mlir.
  • Applied the outlining pass with mlir-opt. Result: test-outlined.mlir
  • Applied GPU dialect compilation passes with mlir-opt, and then translated to llvm. Result: omp.ll
  • Compiled the IR and then ran it on a NVIDIA V100. Results: exec.log

Github PR:

Shoutout to @jhuber6 for the linker-wrapper work, as this proposal relies heavily on it.

CC’ing people that might be interested in the proposal:
@jdoerfert
@kiranchandramohan
@clementval
@jeanPerier
@banach-space
@ftynse
@mehdi_amini
@grypp


These are all important points. And I support the proposal in spirit.

I have not looked into the proposal in detail. How do you make the OpenMP runtime available? Could you expand the flow a bit more and detail whether the execution can be achieved starting from a single module that is then split into a host and a device module, or whether two modules are required?

It would be good to discuss this proposal in detail with the engineers working exclusively on the OpenMP target offload side. @skatrak @DominikAdamski @agozillon @TIFitis @jansjodi @kparzysz.


At this moment that’s the only hard dependency. One needs to pass libomptarget*.bc as a command line option or in the IR, but linking against libomptarget*.bc is handled by the GPU compilation infra.
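
For the in-IR variant, here is a minimal sketch of what it could look like for an NVIDIA target; the bitcode path is a placeholder (the actual file name depends on the OpenMP runtime build), and the libs parameter spelling follows the multi-target example later in this thread:

// Hypothetical sketch: attaching the device bitcode library to the target
// attribute of the outlined module, so gpu-module-to-binary links against it.
gpu.module @omp_offload <#gpu.offload_embedding<omp>>
    [#nvvm.target<chip = "sm_70", libs = ["path/to/libomptarget-nvptx.bc"]>]
    attributes {omp.is_gpu = true, omp.is_target_device = true} {
  // ... outlined omp.target regions and declare_target functions ...
}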

It starts with the host module. Then omp-target-outline-to-gpu creates the device module and adds it to the host module; at that point the GPU dialect takes over to compile it to a binary.

The omp-target-outline-to-gpu pass goes through all the target ops and outlines them to the GPU module; it also outlines declare_target functions, making the device module self-contained.

See this gist for an example of compilation from a single module:

However, it should be possible to use all of this work to combine two separate modules without using omp-target-outline-to-gpu.
The following is an example of that:

module attributes {omp.is_target_device = false, omp.is_gpu = false} {
  gpu.module @module <#gpu.offload_object<"omp">>
      attributes {omp.is_target_device = true, omp.is_gpu = true} {
    llvm.func @main() {
      omp.target info = #omp.tgt_entry_info<deviceID = 0, fileID = 0, line = 0> {
        omp.parallel {
          %0 = gpu.thread_id x
          %csti8 = arith.constant 2 : i8
          %cstf32 = arith.constant 3.0 : f32
          gpu.printf "Hello from %lld, %d, %f\n" %0, %csti8, %cstf32  : index, i8, f32
          omp.terminator
        }
        omp.terminator
      }
      llvm.return
    }
  }
  llvm.func @main() {
    omp.target info = #omp.tgt_entry_info<deviceID = 0, fileID = 0, line = 0, offloadModule = @module> {
      omp.terminator
    }
    llvm.return
  }
}

Hi,
thanks for your proposal. I have couple of questions:

  1. OpenMP directives can be applied to CPUs and GPUs. How would you like to lower CPU-related clauses?
  2. How do you want to support compilation of omp target if the user specifies both AMD and NVIDIA GPUs as target GPUs for OpenMP target?
  3. What changes are required for the Flang driver to enable MLIR GPU codegen?
  4. How do you want to express OpenMP implicit logic? Currently we use the OMPIRBuilder, which generates (roughly) the same *.ll code as Clang does. The generated LLVM IR contains calls to the OpenMP runtime which reflect implicit synchronization barriers or worksharing loops. The LLVM IR code can be heavily optimized by OpenMP-aware LLVM IR optimization passes and is common to Clang and Flang.

Could you provide a description of the tests which are missing? We have end-to-end GPU tests for Fortran: example. We verify MLIR lowering to LLVM IR for GPU as well.

Could you describe what you mean? Which components are missing?

I have no strong opinion on this proposal, but just want to make sure we set expectations straight…

This is similar for LLVM and it’s not a problem per se. I’m not saying it’s ideal, but I want to make sure we don’t create silo’d technology in MLIR just because of this non-problem.

Basically:

  1. If a front-end needs to check the IR they emit, they can do so in their own tests.
  2. If a front-end depends on LLVM/MLIR, they can create LLVM IR/MLIR tests.
  3. If an LLVM IR functionality needs to be created by a front-end, we use the front-end to create IR and use that as a test (this is not perfect, I know).
  4. If an MLIR dialect is created by some front-end, the tests need to be in the front-end.
  5. If an MLIR dialect has local passes, we can repeat the pattern and create the input IR from the front-end and have tests in MLIR.
  6. If we want to create tests that span across multiple projects, we need to discuss this on a wider audience to avoid creating local solutions that are incompatible (silo’d) from the rest of the project.

(3) and (5) above are not perfect solutions, but they’re the current ones. If we want a new one, we need to have a wider discussion (6).

If you have another reason for your proposal, please make it clear in the RFC above. If this is just for the sake of testing, then this needs to be a wider discussion.


Preface: this proposal is not about lowering OMP operations to the GPU dialect; instead, it’s about using the GPU dialect compilation infrastructure to get to an executable using only MLIR, the OpenMPIRBuilder, libomptarget*.bc and the existing OpenMP runtime.

The proposal doesn’t add lowerings; everything translation-related is still managed by the OpenMPIRBuilder and the OMP translation to LLVM.

The GPU infrastructure already has some support for multiple targets; for example, gpu.binary fully supports binaries with many targets:

gpu.binary @omp_binary [
  #gpu.object<#rocdl.target<chip="gfx90a">, bin = "BINARY">,
  #gpu.object<#nvvm.target<chip="sm_90">, bin = "BINARY">,
  #gpu.object<#nvvm.target<chip="sm_70">, bin = "BINARY">
]

However, the pass converting the device module to a binary (gpu-module-to-binary) does have some restrictions, as it uses the same device module for every target (some of those restrictions can be lifted). For example, the following will fail because the module contains intrinsics for different targets:

gpu.module @module [#nvvm.target, #rocdl.target] {
  llvm.func @func() {
    nvvm.barrier0
    rocdl.barrier
    llvm.return
  }
}

However, AFAIK all target-specific intrinsics in OpenMP are introduced by libomptarget*.bc, and if that’s the case then multiple targets are fully supported, because one can specify target-specific linking libraries. For example, let’s say nvptx.bc and amdgpu.bc each have a function called barrier calling the correct intrinsic for their target; then the following IR can be compiled for multiple targets using gpu-module-to-binary:

gpu.module @module [#nvvm.target<libs = ["nvptx.bc"]>, #rocdl.target<libs = ["amdgpu.bc"]>] {
  llvm.func @barrier()
  llvm.func @func() {
    llvm.call @barrier() : () -> ()
    llvm.return
  }
}

If flang were to use this path, the biggest change is that there wouldn’t be separate compiler invocations for each target; instead, the pass manager would need to be adapted to run all the necessary passes in a single invocation, and that shouldn’t be that hard. All device-specific passes would have to be explicitly scheduled to run on the GPU module op.

All of that is still managed by the OpenMPIRBuilder; this proposal is only about getting to an executable.

There are no integration tests in MLIR (llvm-project/mlir/test/Integration/Dialect); as it stands right now, flang is the source of truth for testing the OMP dialect for offload targets.

What I mean is that, IMHO, it should be possible to get to an executable without requiring flang or clang, and as a consequence mlir-cpu-runner should be capable of JIT-ing MLIR OMP offload code.

My bad, the title should’ve been clearer: this proposal is only about compilation; there’s no new set of lowerings or paths that are unique to MLIR.

So IMO it’s not silo’d technology; it’d be adding support for something that can be supported in MLIR with a few lines of code.
All the heavy lifting of OpenMP code generation is still performed by the OpenMPIRBuilder; the proposal only adds compilation support.

That’s why I CC’ed people from OpenMP, Flang and MLIR.

If execution of the dialect is fully supported within MLIR, it means that downstream users can also use LLVM OpenMP in their projects without flang or clang.

It also means that development of the OMP dialect and the OpenMPIRBuilder could potentially go faster, because there’s no dependency on the flang frontend. For example, the teams clause is not supported in device compilation in flang; the question then becomes whether that’s because the front end doesn’t support it or because of something else. When I tried to use the teams clause with this path, it turned out that the OpenMPIRBuilder is not emitting the correct code for the device side.

I don’t know a lot about OpenMP, but this seems like a reasonable use of the target / “call a separate LLVM to compile an offload” infrastructure.


I should’ve said this in the RFC more explicitly, but most of the changes are not related to OpenMP or are happening outside of OpenMP. The biggest change is the introduction of the OffloadEmbeddingAttr (see GH PR #78117).

The OffloadEmbeddingAttr allows the use of the CUDA, HIP and LibOMPTarget runtimes for calling kernels. Instead of loading a module and getting a kernel pointer every time there’s a kernel call, this attribute registers kernels with the runtime at program startup and then uses traditional runtime functions like cudaLaunchKernel or hipLaunchKernel to launch the kernel.
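
As a rough host-side illustration (the cuda handler variant, chip and kernel names are assumptions, not the final design), a launch through an embedded binary could look like the sketch below; the attribute would register @kernels with the runtime at startup and translate gpu.launch_func into a cudaLaunchKernel-style call instead of a module load plus kernel lookup at every call site:

// Hypothetical sketch: host module using the proposed offload-embedding
// handler on a gpu.binary; payload and names are illustrative only.
module attributes {gpu.container_module} {
  gpu.binary @kernels <#gpu.offload_embedding<cuda>> [
    #gpu.object<#nvvm.target<chip = "sm_70">, bin = "BINARY">
  ]
  func.func @host() {
    %c1 = arith.constant 1 : index
    // Launched through the registered runtime entry point (e.g. cudaLaunchKernel),
    // not by loading the module and querying the kernel pointer per call.
    gpu.launch_func @kernels::@kernel blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
    return
  }
}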

I believe OffloadEmbeddingAttr could be of interest for MLIR outside of being the cornerstone for providing OpenMP support in MLIR.


I’ll note I’m overall without objection to the new offload embedding attr so long as I can get a standalone binary that works with hipModuleLoad() and company, because that’s what my library needs to generate.
