MLIR for GPU offloading


I would like to use the omp.target construct in MLIR in order to offload computation to the GPU. Could you please help me with the steps required to generate the final executable for the very simple example provided in llvm-project/mlir/test/Target/LLVMIR/omptarget-region-llvm.mlir?
Thank you!

@jansjodi @DominikAdamski Could you help here?

You should be able to convert the MLIR code to LLVM IR with the command:
mlir-translate -mlir-to-llvmir file.mlir
and then you should be able to generate binary code with the standard LLVM tools.
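The full flow can be sketched roughly as follows (a sketch only; the file and output names are hypothetical, and this assumes the MLIR module lowers to host code):

```shell
# Translate the MLIR module to LLVM IR (host side).
mlir-translate -mlir-to-llvmir file.mlir -o file.ll

# Compile the LLVM IR to an object file with llc, then let clang
# drive the linker to produce the final executable.
llc -filetype=obj file.ll -o file.o
clang file.o -o file.exe
```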

The output of the mlir-translate tool for the omp.target operation depends on the omp attributes, because the OpenMP target operation is translated into a GPU kernel plus fallback code. The MLIR code in the test file is converted into host code (the attribute omp.is_target_device is set to false).

If you need to generate simple MLIR code for the GPU, please write simple Fortran code with an omp target pragma, then compile it with flang-new -save-temps -v and the OpenMP-related flags. The <filename>-openmp-<your-gpu-arch>-llvmir.mlir file should contain the code which will be launched as the GPU kernel.
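As a concrete sketch (the Fortran program is a minimal illustrative example, and the GPU architecture flags match the AMD example used later in this thread; adjust them for your toolchain):

```shell
# Write a minimal Fortran program containing an omp target region.
cat <<'EOF' > test.f90
program main
  integer :: x
  x = 0
  !$omp target map(tofrom: x)
  x = x + 1
  !$omp end target
end program main
EOF

# Compile with OpenMP offloading flags; -save-temps keeps the
# intermediate MLIR files (host and device) next to the source.
flang-new -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa \
  -Xopenmp-target -march=gfx1010 -save-temps -v test.f90
```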

Thank you for your response. What happens if omp.is_target_device is true?
It’s still a little difficult for me to understand since I am new to MLIR.
Would it be possible to compile a binary from an mlir file that offloads to the GPU without using a frontend like flang?

omp.target is an MLIR operation which is used to model the OpenMP target construct. Please remember that we model the OpenMP operation, not the GPU kernel. We need to follow the OpenMP standard, which defines the execution model of the target construct. If you are interested only in GPU kernels, I think you should consider the 'gpu' dialect in MLIR.

We use the omp attributes to pass information which is used for lowering to LLVM IR. We need to pass information about the target device, the host file, etc. That’s why I encouraged you to generate the MLIR file with flang-new: it will generate the set of attributes which are required for lowering to LLVM IR.

Once you have a valid MLIR file, you can use the mlir-translate tool to generate LLVM IR, and then you can use the LLVM tools to convert the LLVM IR to binary code.

Ok, thank you. I will try to generate MLIR files using flang-new. I am particularly interested in the OpenMP construct because I want to develop lowering for another hardware target, and I wanted to first see what the flow for the GPU looks like. We do not have CUDA-like kernels and keep the programming construct generic.

If I execute the following, I get an error for -save-temps:

./flang-new  -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target -march=gfx1010 -save-temps -v test.f90
error: unknown integrated tool '-cc1as'. Valid tools include '-fc1'

The last command executed is:

flang-new -cc1as -triple x86_64-unknown-linux-gnu -filetype obj -main-file-name test.f90 -target-cpu x86-64 -fdebug-compilation-dir=/home/darby/llvm-project/flang/build/bin -dwarf-version=5 -mrelocation-model pic -mrelax-all -o test-host-x86_64-unknown-linux-gnu.o test-host-x86_64-unknown-linux-gnu.s

You can skip that issue, because you generated two MLIR files before the error occurred. The file for the GPU: test-openmp-amdgcn-amd-amdhsa-llvmir.mlir, and for the host: test-host-x86_64-unknown-linux-gnu-llvmir.mlir. You can compare them to see the differences, and then you are ready for your own experiments.

Hi, I see the attributes in the mlir files.
I use the following module attributes in my mlir file:
module attributes {llvm.data_layout = "", llvm.target_triple = "x86_64-unknown-linux-gnu", omp.is_device = false, omp.target = #omp.target<target_cpu = "x86_64", target_features = "">}

But when I use mlir-translate with --mlir-to-llvmir, the offloading entries are not generated. Please could you tell me how to create the offloading entries structure in mlir? Thank you

These attributes (module attributes {llvm.data_layout = "", llvm.target_triple = "x86_64-unknown-linux-gnu", omp.is_device = false, omp.target = #omp.target<target_cpu = "x86_64", target_features = "">}) are for host-side (x86) generation. The host side is responsible for the kernel call. If you want to see the GPU-related LLVM IR code, please use the command:
mlir-translate --mlir-to-llvmir test-openmp-amdgcn-amd-amdhsa-llvmir.mlir
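To go one step further and produce a device object from that LLVM IR, something like the following should work (a sketch; the output file names are hypothetical and the -mcpu value is illustrative):

```shell
# Translate the device-side MLIR to LLVM IR, then compile it to an
# AMDGPU object file with llc (target flags mirror the flang invocation).
mlir-translate --mlir-to-llvmir test-openmp-amdgcn-amd-amdhsa-llvmir.mlir -o device.ll
llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1010 -filetype=obj device.ll -o device.o
```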

Yes, they are for the host. But aren’t the offloading entries generated for the host rather than for the device? They are not being generated for the host.

Could you share your mlir file?
For the host *.ll file (generated from test-host-x86_64-unknown-linux-gnu-llvmir.mlir) you should see:

  1. The definition of the function which contains the omp target pragma
  2. The definition of the target fallback function which is executed if the given offload device is not present
  3. The module target triple pointing to the x86 target triple

For the device *.ll file (generated from test-openmp-amdgcn-amd-amdhsa-llvmir.mlir) you should see:

  1. The definition of the kernel function (something like: define weak_odr protected amdgpu_kernel void)
  2. The module target triple set to the GPU triple, for example: target triple = "amdgcn-amd-amdhsa"
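A quick way to sanity-check both translated files is to grep for the expected markers (a sketch; the .ll output names are hypothetical, and the input names follow the flang-new naming above):

```shell
# Translate both MLIR files to LLVM IR.
mlir-translate --mlir-to-llvmir test-host-x86_64-unknown-linux-gnu-llvmir.mlir -o host.ll
mlir-translate --mlir-to-llvmir test-openmp-amdgcn-amd-amdhsa-llvmir.mlir -o device.ll

# Look for the host-side offload entries and the device-side kernel.
grep "__tgt_offload_entry" host.ll
grep "amdgpu_kernel" device.ll
grep "target triple" device.ll
```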

Ok, I attach two mlir files here, one for the host and one for the device. For both, I don’t see the offload entries in the *.ll file generated using mlir-translate. Ideally one should see the offloading entry in the .ll file, e.g.:

%struct.__tgt_offload_entry = type { ptr, ptr, i64, i32, i32 }
!omp_offload.info = !{!2}

omphostllvm.txt (2.7 KB)
omptargetllvm.txt (2.8 KB)

Could you try swapping omp.is_device to omp.is_target_device in the test files you’ve presented in the last post? I believe we changed it to this recently (there is also an omp.is_gpu, but I don't think that's necessary in this example). That might give you the correct results. I’ve only tested this on a downstream version currently, unfortunately, but there it yields the correct results by emitting the offload info, so hopefully enough is upstreamed for that to work.

You may also currently encounter an error when lowering the device side of your code, related to this: Request for input on a fix for a bug utilizing omp.TargetOp in conjunction with mlir-translate's -split-input-file - #5 by agozillon. If you use the split-input command, I believe you can work around it by adding location information to the target op via an MLIR loc attribute. If you don’t use the split-input command, you should be fine.

I tried swapping to omp.is_target_device. Unfortunately it doesn’t work for me. I use upstream llvm at commit 7cc57c07e36fc6b4d176cebb28a9bbe637772175 on the master branch.

Thank you for actively trying to help me out.

I think there is a possibility that the commit is too far behind (May-ish, if I read it correctly) to have the changes. A lot of the work on OpenMP offloading takes a while to filter into upstream, and actually being able to generate a kernel that offloads from a TargetOp is a fairly recent addition and still a WIP (e.g. I am working on map support, and I think the initial device-side generation support landed in the last month or so, although others can perhaps refute that claim).

That’s no problem at all, I am building an upstream llvm-project to see if the offload info is emitted in the most recent commit, and then I’ll regress it to the commit you referenced and see if it does the same.

Yes, that commit is unfortunately too far back to get the desired output. The upstream top-of-the-tree version does seem to emit most of the offload info, but it is missing at least the __tgt_kernel_arguments that the downstream version I have contains. Again, it’s a WIP!

Hopefully the original flang command mentioned in this thread would also work on top-of-the-tree; I have not tested it, but it’s very likely.

Ok, so what would I need to get the offloading to work without the flang driver? I only want to start from MLIR files, not from Fortran source code.
If I update to the latest commit, would things work? I am not sure what __tgt_kernel_arguments is about.
Thank you!

Updating to top-of-the-tree will, I believe, get you the desired offload information without having to go through the flang driver, just via the mlir-translate command specified above.

I believe actually offloading will require this patch to land, however: ⚙ D155633 [OpenMP][OpenMPIRBuilder] Add kernel launch codegen to emitTargetCall. But you can always apply it yourself to your branch. Your mileage will still vary, though, as I think the branch would then be at a state where it will launch the kernel and allow you to write to/from a scalar, but not much more (that’s the test case we’ve been focused on for the first offloading step; other things could possibly work, but anything complex is still a while away). @jansjodi is very likely the best person to answer what the upstream kernel launch can currently do, so he can correct me if I’ve said anything wrong.

Once that patch (⚙ D155633 [OpenMP][OpenMPIRBuilder] Add kernel launch codegen to emitTargetCall) lands, it would be good to write up an example in the docs pages of both Flang (using the flang driver) and MLIR (using the mlir tools), and also to add an integration test if possible.
