[RFC] Adding support for OpenMP GPU target offload

TLDR;

This RFC proposes adding GPU compilation support for OpenMP Target offload constructs within MLIR, i.e., not requiring flang or clang to compile OMP MLIR programs.

The idea is to leverage the existing compilation infrastructure in the GPU dialect to enable OpenMP compilation.

Edit:
This proposal is not about lowering OMP operations to the GPU dialect; instead, it’s about using the GPU dialect compilation infrastructure to get to an executable using only MLIR, the OpenMPIRBuilder, libomptarget*.bc and the existing OpenMP runtime.

Why?

Currently, there is no way to compile MLIR OMP target offload ops for GPU targets without using flang or clang. This lack of GPU support has multiple consequences:

  • There are no integration tests for OMP target offload, as it is currently impossible to test it entirely within MLIR.
  • Dialect development becomes more complicated than needed: flang first has to support the OMP constructs in the front end before they can be tested, creating a development barrier.
  • The OMP dialect should be almost fully supported within MLIR.

Proposal:

Major:

  • Add the OffloadEmbedding GPU compilation attribute. This attribute translates GPU binaries in a way that’s compatible with Libomptarget, CUDART and HIP RT (a sketch of the resulting gpu.binary is shown after this list). This attribute could then be used in combination with the LLVM Offload project to provide a more general GPU runtime.
  • Add the omp.tgt_entry_info attribute for representing Target Entry information. This makes the entry explicit in the IR, making the mapping between host and device symbols easier.
  • Add the omp-target-outline-to-gpu pass. This pass outlines omp.target ops to a GPU module, making it possible to leverage GPU compilation infrastructure.
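
As a rough sketch of how these pieces would fit together (not a definitive design): once gpu-module-to-binary runs on the outlined module, the result could be a gpu.binary that keeps the proposed offload-embedding handler. The handler spelling matches the outlined example below; the object payload is purely illustrative:

// Hypothetical sketch: the offload-embedding handler stays on the binary so it
// can be registered with the OpenMP offload runtime at program startup.
gpu.binary @omp_offload <#gpu.offload_embedding<omp>> [
  #gpu.object<#nvvm.target<chip = "sm_70">, bin = "BINARY">
]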

All:

The proposed set of PRs would also enable JIT compilation with mlir-cpu-runner*, meaning that integration tests would now be possible.

  • There’s a small bug where the cl option march gets registered twice, once by mlir-cpu-runner and once as a consequence of libomptarget. But if the double registration is avoided, then it works.

Small example:

Host only module:

module attributes {omp.is_target_device = false, omp.is_gpu = false} {
  func.func @targetFn() -> () attributes {omp.declare_target = #omp.declaretarget<device_type = (any), capture_clause = (to)>} {
    return
  }
  llvm.func @main() {
    omp.target {
      func.call @targetFn() : () -> ()
      omp.terminator
    }
    llvm.return
  }
}

After applying mlir-opt --omp-target-outline-to-gpu, the omp.target ops get outlined, as do the declare target symbols. Furthermore, the entry information becomes explicit:

module attributes {gpu.container_module, omp.is_gpu = false, omp.is_target_device = false} {
  gpu.module @omp_offload <#gpu.offload_embedding<omp>> attributes {omp.is_gpu = true, omp.is_target_device = true} {
    func.func @main() attributes {omp.outline_parent_name = "main"} {
      omp.target info = #omp.tgt_entry_info<deviceID = 64771, fileID = 12453258, line = 6, section = @omp_offload> {
        func.call @targetFn() : () -> ()
        omp.terminator
      }
      return
    }
    func.func @targetFn() attributes {omp.declare_target = #omp.declaretarget<device_type = (any), capture_clause = (to)>} {
      return
    }
  }
  func.func @targetFn() attributes {omp.declare_target = #omp.declaretarget<device_type = (any), capture_clause = (to)>} {
    return
  }
  llvm.func @main() {
    omp.target info = #omp.tgt_entry_info<deviceID = 64771, fileID = 12453258, line = 6, section = @omp_offload> {
      func.call @targetFn() : () -> ()
      omp.terminator
    }
    llvm.return
  }
}

Example from flang:

I took the following program:

program main
  integer :: x;
  integer :: y;
  x = 0
  y = 1
!$omp target map(from:x)
    x = 5 + y
!$omp end target
  print *, "x = ", x
end program main

  • Saved the *-llvmir.mlir file produced by flang. Result: test-lvmir.mlir.
  • Applied the outlining pass with mlir-opt. Result: test-outlined.mlir
  • Applied GPU dialect compilation passes with mlir-opt, and then translated to llvm. Result: omp.ll
  • Compiled the IR and then ran it on a NVIDIA V100. Results: exec.log

Github PR:

Shoutout to @jhuber6 for the linker-wrapper work, as this proposal relies heavily on it.

CC’ing people that might be interested in the proposal:
@jdoerfert
@kiranchandramohan
@clementval
@jeanPerier
@banach-space
@ftynse
@mehdi_amini
@grypp


These are all important points. And I support the proposal in spirit.

I have not looked into the proposal in detail. How do you make the OpenMP runtime available? Could you expand the flow a bit more and detail whether the execution can be achieved starting from a single module that is then split into a host and a device module, or whether two modules are required?

It would be good to discuss this proposal in detail with the engineers working exclusively on the OpenMP target offload side. @skatrak @DominikAdamski @agozillon @TIFitis @jansjodi @kparzysz.


At this moment that’s the only hard dependency. One needs to pass libomptarget*.bc as a command line option or in the IR, but linking against libomptarget*.bc is handled by the GPU compilation infra.
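
For the in-IR variant, here is a minimal sketch of what it could look like for an NVIDIA target; the bitcode path is a placeholder (the actual file name depends on the OpenMP runtime build), and the libs parameter spelling follows the multi-target example later in this thread:

// Hypothetical sketch: attaching the device bitcode library to the target
// attribute of the outlined module, so gpu-module-to-binary links against it.
gpu.module @omp_offload <#gpu.offload_embedding<omp>>
    [#nvvm.target<chip = "sm_70", libs = ["path/to/libomptarget-nvptx.bc"]>]
    attributes {omp.is_gpu = true, omp.is_target_device = true} {
  // ... outlined omp.target regions and declare_target functions ...
}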

It starts with the host module. Then omp-target-outline-to-gpu creates the device module and adds it to the host module; at that point the GPU dialect takes over to compile it to a binary.

The omp-target-outline-to-gpu pass goes through all the target ops and outlines them to the GPU module; it also outlines declare_target functions, making the device module self-contained.

See this gist for an example of compilation from a single module:

However, it should be possible to use all of this work to combine two separate modules without using omp-target-outline-to-gpu.
The following is an example of that:

module attributes {omp.is_target_device = false, omp.is_gpu = false} {
  gpu.module @module <#gpu.offload_object<"omp">>
      attributes {omp.is_target_device = true, omp.is_gpu = true} {
    llvm.func @main() {
      omp.target info = #omp.tgt_entry_info<deviceID = 0, fileID = 0, line = 0> {
        omp.parallel {
          %0 = gpu.thread_id x
          %csti8 = arith.constant 2 : i8
          %cstf32 = arith.constant 3.0 : f32
          gpu.printf "Hello from %lld, %d, %f\n" %0, %csti8, %cstf32  : index, i8, f32
          omp.terminator
        }
        omp.terminator
      }
      llvm.return
    }
  }
  llvm.func @main() {
    omp.target info = #omp.tgt_entry_info<deviceID = 0, fileID = 0, line = 0, offloadModule = @module> {
      omp.terminator
    }
    llvm.return
  }
}

Hi,
thanks for your proposal. I have couple of questions:

  1. OpenMP directives can be applied to CPUs and GPUs. How would you like to lower CPU-related clauses?
  2. How do you want to support compilation of omp target if the user specifies both AMD and NVIDIA GPUs as target GPUs for OpenMP target?
  3. What changes are required for the Flang driver to enable MLIR GPU codegen?
  4. How do you want to express OpenMP implicit logic? Currently we use the OMPIRBuilder, which generates (roughly) the same *.ll code as Clang does. The generated LLVM IR contains calls to the OpenMP runtime which reflect implicit synchronization barriers or worksharing loops. The LLVM IR code can be heavily optimized by OpenMP-aware LLVM IR optimization passes and is common to Clang and Flang.

Could you provide a description of the tests which are missing? We have end-to-end GPU tests for Fortran: example. We verify MLIR lowering to LLVM IR for GPU as well.

Could you describe what you mean? Which components are missing?

I have no strong opinion on this proposal, but just want to make sure we set expectations straight…

This is similar for LLVM and it’s not a problem per se. I’m not saying it’s ideal, but I want to make sure we don’t create silo’d technology in MLIR just because of this non-problem.

Basically:

  1. If a front-end needs to check the IR they emit, they can do so in their own tests.
  2. If a front-end depends on LLVM/MLIR, they can create LLVM IR/MLIR tests.
  3. If an LLVM IR functionality needs to be created by a front-end, we use the front-end to create IR and use that as a test (this is not perfect, I know).
  4. If an MLIR dialect is created by some front-end, the tests need to be in the front-end.
  5. If an MLIR dialect has local passes, we can repeat the pattern and create the input IR from the front-end and have tests in MLIR.
  6. If we want to create tests that span across multiple projects, we need to discuss this on a wider audience to avoid creating local solutions that are incompatible (silo’d) from the rest of the project.

(3) and (5) above are not perfect solutions, but they’re the current ones. If we want a new one, we need to have a wider discussion (6).

If you have another reason for your proposal, please make it clear in the RFC above. If this is just for the sake of testing, then this needs to be a wider discussion.


Preface: this proposal is not about lowering OMP operations to the GPU dialect; instead, it’s about using the GPU dialect compilation infrastructure to get to an executable using only MLIR, the OpenMPIRBuilder, libomptarget*.bc and the existing OpenMP runtime.

The proposal doesn’t add lowerings; everything translation-related is still managed by the OpenMPIRBuilder and the OMP translation to LLVM.

The GPU infrastructure already has some support for multiple targets; for example, gpu.binary fully supports binaries with many targets:

gpu.binary @omp_binary [
  #gpu.object<#rocdl.target<chip="gfx90a">, bin = "BINARY">,
  #gpu.object<#nvvm.target<chip="sm_90">, bin = "BINARY">,
  #gpu.object<#nvvm.target<chip="sm_70">, bin = "BINARY">
]

However, the pass converting the device module to a binary (gpu-module-to-binary) does have some restrictions, as it uses the same device module for every target (some of those restrictions can be lifted). For example, the following will fail because the module contains intrinsics for different targets:

gpu.module @module [#nvvm.target, #rocdl.target] {
  llvm.func @func() {
    nvvm.barrier0
    rocdl.barrier
    llvm.return
  }
}

However, AFAIK all target-specific intrinsics in OpenMP are introduced by libomptarget*.bc, and if that’s the case then multiple targets are fully supported, because one can specify target-specific linking libraries. For example, let’s say nvptx.bc and amdgpu.bc each have a function called barrier calling the correct intrinsic for their target; then the following IR can be compiled for multiple targets using gpu-module-to-binary:

gpu.module @module [#nvvm.target<libs = ["nvptx.bc"]>, #rocdl.target<libs = ["amdgpu.bc"]>] {
  llvm.func @barrier()
  llvm.func @func() {
    llvm.call @barrier() : () -> ()
    llvm.return
  }
}

If flang were to use this path, the biggest change is that there wouldn’t be separate compiler invocations for each target; instead, the pass manager would need to be adapted to run all the necessary passes in a single invocation, and that shouldn’t be that hard. All device-specific passes would have to be explicitly scheduled to run on the GPU module op.

All of that is still managed by the OpenMPIRBuilder; this proposal is only about getting to an executable.

There are no integration tests in MLIR (llvm-project/mlir/test/Integration/Dialect); as it stands right now, flang is the source of truth for testing the OMP dialect for offload targets.

What I mean is that, IMHO, it should be possible to get to an executable without requiring flang or clang, and as a consequence mlir-cpu-runner should be capable of JIT-ing MLIR OMP offload code.

My bad, the title should’ve been clearer: this proposal is only about compilation; there’s no new set of lowerings or paths that are unique to MLIR.

So IMO it’s not silo’d technology; it’d be adding support for something that can be supported in MLIR with a few lines of code.
All the heavy lifting of OpenMP code generation is still performed by the OpenMPIRBuilder; the proposal only adds compilation support.

That’s why I CC’ed people from OpenMP, Flang and MLIR.

If execution of the dialect is fully supported within MLIR, it means that downstream users can also use LLVM OpenMP in their projects without flang or clang.

It also means that development of the OMP dialect and the OpenMPIRBuilder could potentially go faster, because there’s no dependency on the flang frontend. For example, the teams clause is not supported in device compilation in flang; the question then becomes whether that’s because the front end doesn’t support it or because of something else. When I tried to use the teams clause with this path, it turned out that the OpenMPIRBuilder is not emitting the correct code for the device side.

I don’t know a lot about OpenMP, but this seems like a reasonable use of the target / “call a separate LLVM to compile an offload” infrastructure.


I should’ve said this in the RFC more explicitly, but most of the changes are not related to OpenMP or are happening outside of OpenMP. The biggest change is the introduction of the OffloadEmbeddingAttr (see GH PR #78117).

The OffloadEmbeddingAttr allows the use of the CUDA, HIP and LibOMPTarget runtimes for calling kernels. Instead of loading a module and getting a kernel pointer every time there’s a kernel call, this attribute registers kernels with the runtime at program startup and then uses traditional runtime functions like cudaLaunchKernel or hipLaunchKernel to launch the kernel.
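
As a rough host-side illustration (the cuda handler variant, chip and kernel names are assumptions, not the final design), a launch through an embedded binary could look like the sketch below; the attribute would register @kernels with the runtime at startup and translate gpu.launch_func into a cudaLaunchKernel-style call instead of a module load plus kernel lookup at every call site:

// Hypothetical sketch: host module using the proposed offload-embedding
// handler on a gpu.binary; payload and names are illustrative only.
module attributes {gpu.container_module} {
  gpu.binary @kernels <#gpu.offload_embedding<cuda>> [
    #gpu.object<#nvvm.target<chip = "sm_70">, bin = "BINARY">
  ]
  func.func @host() {
    %c1 = arith.constant 1 : index
    // Launched through the registered runtime entry point (e.g. cudaLaunchKernel),
    // not by loading the module and querying the kernel pointer per call.
    gpu.launch_func @kernels::@kernel blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
    return
  }
}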

I believe OffloadEmbeddingAttr could be of interest for MLIR outside of being the cornerstone for providing OpenMP support in MLIR.


I’ll note I’m overall without objection to the new offload embedding attr so long as I can get a standalone binary that works with hipModuleLoad() and company, because that’s what my library needs to generate.
