OpenMP target regions and intrinsic Fortran math functions

Hi,

I am bringing this up because it seems that OpenMP development is really ramping up now, and I want to raise a common use case that does not seem to be supported by some of the other vendor compilers.

It is a common use case to call the Fortran intrinsic math functions in an OpenMP target region. I am not sure how this is implemented in the Fortran + OpenMP compilers of the vendors who do support it. I suspect it is done by inlining, but it appears that other vendors keep their Fortran math functions in a backend runtime library, which somehow prevents these functions from being called in an OpenMP target region.
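
For concreteness, the kind of code I have in mind is something like the following (a minimal, illustrative example):

    program target_math
      implicit none
      integer :: i
      real(8) :: x(1024)

      ! Fortran intrinsic sin/cos called inside an OpenMP target region.
      !$omp target teams distribute parallel do map(from: x)
      do i = 1, 1024
         x(i) = sin(real(i, 8)) + cos(real(i, 8))
      end do

      print *, x(1)
    end program target_math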

Hi,

Kiran Chandramohan is working on OpenMP lowering and might have a plan here. Math intrinsics are currently lowered in MLIR to a mix of inlined code, calls to LLVM intrinsics, and calls to the runtime when the first two options are not possible. Attributes are added to these runtime calls so that they can easily be identified as intrinsic calls and later rewritten if needed. These attributes could be adapted based on what the team working on OpenMP lowering needs here.

Jean

Hi Nick, Jean,

Thanks for bringing this topic up. I must confess that I am not an expert in target and device handling in OpenMP and we have not yet finalized the approach for handling target regions.

But here is what I can share; the GPU folks from Nvidia/AMD and Johannes (who implemented this in Clang) can correct me here.

Vendors provide device libraries (https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_sin.html) with math function support. The compiler can/should convert calls to the math library functions in a target region into calls to the device library functions. OpenMP provides the `declare variant` directive, which can be used to specify specialized variants of functions and the context in which those variants should be used. This mechanism can be used in a header file, and each vendor can declare variants (with calls to their device library) for each math function. If the frontend supports OpenMP `declare variant` handling, then the calls to math library functions are automatically converted to calls to device library functions.

For example, Clang has the following:

1) clang/lib/Headers/openmp_wrappers/math.h

    #pragma omp begin declare variant match( \
         device = {arch(nvptx, nvptx64)}, implementation = {extension(match_any)})

    #define __CUDA__
    #define __OPENMP_NVPTX__
    #include <__clang_cuda_math.h>
    #undef __OPENMP_NVPTX__
    #undef __CUDA__

    #pragma omp end declare variant

2) clang/lib/Headers/__clang_cuda_math.h

    __DEVICE__ double sin(double __a) { return __nv_sin(__a); }
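
I have not worked out how this would be spelled for the Fortran intrinsics themselves, but a rough, untested sketch of the same idea for a user-callable wrapper function, assuming the frontend supports `declare variant` in Fortran (the `nv_sin`/`my_sin` names below are only illustrative), could look like:

    module device_math_sketch
      use iso_c_binding, only: c_double
      implicit none
      interface
        ! Thin interface to the Nvidia device library entry point.
        function nv_sin(x) bind(c, name="__nv_sin") result(r)
          import :: c_double
          real(c_double), value :: x
          real(c_double) :: r
        end function nv_sin
      end interface
    contains
      ! Base function; on nvptx devices the nv_sin variant is used instead.
      function my_sin(x) result(r)
        !$omp declare variant(nv_sin) match(device={arch(nvptx, nvptx64)})
        real(c_double), value :: x
        real(c_double) :: r
        r = sin(x)
      end function my_sin
    end module device_math_sketch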

Thanks,
–Kiran

First, I can confirm this.

Second, I am unsure if this is "necessary" for Fortran as well.

One of the tricky parts for C/C++ is that we want to get the system math
headers and the device math headers as both might provide unique,
non-standard functions.

From the LLVM perspective, not all math functions have generic LLVM
intrinsics and you cannot always use the generic LLVM intrinsics as they
have requirements wrt. the floating point environment. There are usually
some target intrinsics, e.g., `@llvm.nvvm.fabs.f`, but I'm unsure if
everything has an `nvvm` intrinsic or if certain things remain function
calls. In the end, it hardly matters anyway.

If there are any questions about how Clang handles this, or just generic
things about this, please let me know :)

Thanks,

Johannes

Hi,

Another possibility (as with the OpenACC dialect) is to use the GPU dialect in MLIR to model target regions.

The gpu dialect can be converted to vendor dialects like nvvm and rocdl. During these conversions, calls to math library functions are converted to calls to device library functions.

For example, the following call to the cos math function is converted to either of the calls to __nv_cos or __ocml_cos_f64 below, depending on the conversion chosen.

     %result64 = std.cos %arg_f64 : f64

       %1 = llvm.call @__nv_cos(%arg1) : (!llvm.double) -> !llvm.double
       %1 = llvm.call @__ocml_cos_f64(%arg1) : (!llvm.double) -> !llvm.double

https://github.com/llvm/llvm-project/blob/647e9a54c758a6fdd85a569f019f00a653b2bc40/mlir/test/Conversion/GPUToNVVM/gpu-to-nvvm.mlir#L183

https://github.com/llvm/llvm-project/blob/73c12bd8ff1a9cd8375a357ea06f171e127ec1b8/mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl.mlir#L125

–Kiran

Cool. Where are those math functions and their conversion defined? I
grepped the mlir code but didn't (immediately) find them.

The conversions are in the following locations. Only 8 function conversions exist there now.

-> mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp

     patterns.insert<OpToFuncCallLowering>(converter, "__nv_cosf", "__nv_cos");

-> mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp

     patterns.insert<OpToFuncCallLowering>(converter, "__ocml_cos_f32", "__ocml_cos_f64");

–Kiran

Kiran et al.,

My apologies for the delayed reply. I went on vacation right before I sent this message and then I was very focused on a meeting that I was organizing/co-hosting last week.

What I have seen with other vendor compilers (PGI + OpenACC, Cray + OpenMP target, Intel + OpenMP target) is that `omp declare variant` is NOT needed for the intrinsic Fortran math library functions (e.g. cos, sin, etc.). I think this implies that these intrinsic math library functions get inlined into the target region, but I am not entirely sure.

FWIW, while `target` + `math` in Clang is entirely based on `omp begin declare variant`, it does *not* imply anything about the ability to inline (or optimize). We inline/optimize as well as Clang does for math calls in CUDA code.

~ Johannes