[RFC] Building Flang runtime for offload devices

Hello,

I would like to discuss enabling the Flang runtime library for offload devices. OpenACC/OpenMP compiler support is improving every day, and the dependency of Flang-generated code on the Flang runtime might soon become a blocking factor. I would like to propose an initial patch [1] for building the Flang runtime for CUDA devices, and I want to make sure that the changes required to do this will also work, in general, for the different offload devices that LLVM/Flang community members care about. I would also like to collect a list of people from different parties who want to actively participate in making the Flang runtime offload friendly.

Goals of this project:

  • Come up with the guidelines for adjusting/disabling some functionalities that cannot be efficiently supported on all or some offload devices.
  • Be able to incrementally enable parts of Flang runtime for the offload devices.
  • Establish a process of continuous testing of the offload builds of Flang runtime, and a process of dealing with the breakages.

The initial patch provides a way to build a small part of the Flang runtime, using the Clang compiler with the NVPTX backend, as a CUDA static fat library. There are other distribution models for offload libraries (e.g. a PTX library for CUDA devices, or an LLVM Bitcode library such as the ones used for LLVM Libomptarget/DeviceRTL, etc.), but it is not a goal of this project to define the distribution model of the Flang runtime for CUDA devices. My assumption is that solving the task of building the Flang runtime one way will allow us to build it all the other ways with some minor modifications of the CMake files.

I chose Clang-NVPTX as the build compiler simply because it is available to the whole community. To be fully transparent, I also want to make sure that the Flang runtime builds with NVCC, but any changes required to do this will need to be justified and have community approval. At this stage, building with Clang-NVPTX is a good path forward.

As you may see in the patch, the major change in the library source code is applying the CUDA __host__ __device__ attributes to all declarations and definitions of the Flang runtime entry points. Adding the attributes to the entry points requires adding them to the whole call-graph closure. The attributes are added by means of new RTDECL and HOST_DEVICE_ATTRS macros that are to be used only for declarations and definitions of the functions. The existing RTNAME macro can still be used to get the mangled name of a runtime entry point. I suppose one may decide instead to build the Flang runtime as a C++ library with offload directives defining the necessary entry points as “device” functions (e.g. with OpenMP declare target/end declare target annotating particular APIs), and the current usage of the RTDECL/HOST_DEVICE_ATTRS macros at the declaration/definition points may not allow doing this cleanly. So I am open to other alternatives.
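For illustration, the new macros could expand roughly like this (the exact guards and definitions in the patch may differ, RTNAME is shown in a simplified form, and Foo is a hypothetical entry point used only for this example):

// Sketch only: attribute macros for CUDA host/device compilation.
#if defined(__CUDACC__) || defined(__CUDA__)
#define HOST_DEVICE_ATTRS __host__ __device__
#else
#define HOST_DEVICE_ATTRS
#endif
#define RTNAME(name) _FortranA##name // existing mangling macro (simplified)
#define RTDECL(name) HOST_DEVICE_ATTRS RTNAME(name)

// An entry point declared this way is callable from both host and device code.
void RTDECL(Foo)(int n);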

Next, there is some functionality, such as std::fill_n, that is not currently supported by CUDA toolchains. In the initial patch I showed that an internal implementation is one option for getting rid of such dependencies. This approach may be taken alone or be accompanied by a feature request for the device toolchains to support std::fill_n, std::unique_ptr, etc. I used CUDA_DEVICE to provide a device-alternative implementation of std::fill_n. I wonder if a more generic macro like IS_OFFLOAD, IS_DEVICE, etc. looks more appropriate to people. Another alternative is to have a macro like SUPPORTS_STD_FILL_N that can be defined in flang/include/flang/Runtime/device.h based on the device toolchain. I guess this can be changed later when some commonalities are found for different offload devices, but I am open to ideas right now.
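For illustration, a guarded internal implementation might look roughly like this, reusing the HOST_DEVICE_ATTRS macro from above (the SUPPORTS_STD_FILL_N guard and the FillN name are placeholders of mine, not necessarily what the patch uses):

#include <algorithm>
#include <cstddef>

namespace Fortran::runtime {
template <typename A>
HOST_DEVICE_ATTRS void FillN(A *to, std::size_t n, const A &value) {
#if defined(SUPPORTS_STD_FILL_N)
  std::fill_n(to, n, value); // host toolchains: defer to the standard library
#else
  for (std::size_t j{0}; j < n; ++j) { // device toolchains without std::fill_n
    to[j] = value;
  }
#endif
}
} // namespace Fortran::runtime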

As you may expect, there are many unresolved references in the resulting CUDA library. For example, the methods marked with HOST_DEVICE_ATTRS in the header files do not have corresponding implementations compiled for the device. I am using -DCMAKE_CUDA_SEPARABLE_COMPILATION=ON to allow for unresolved references, and this allows me to incrementally enable parts of Flang runtime in the CUDA build (see also supported_files list in flang/runtime/CMakeLists.txt). So at this point I cannot say that the build produces a functional library, but this approach allows splitting the changes into observable pieces. Eventually, I will need to come up with a scripted check for unresolved references to guarantee that the library is linkable in all cases with the CUDA tools.

Regarding the last goal listed above, I would like to have buildbots running CUDA and other offload builds of the Flang runtime. I would also like to have an option to trigger Phabricator pre-commit testing for any Flang runtime changes. I have no experience here, so any advice/links will be appreciated.

For the time being, I will be resolving all the issues with the CUDA build introduced with new changes in Flang runtime. But once there is a buildbot, e.g. testing the build with Clang-NVPTX, would it be reasonable to make fixing the breakages the responsibility of the commit author? Is there a way to make this requirement “official” (e.g. like for OpenMP offload breakages)?

Thank you very much for reading through it up to here! I am looking forward to your comments!

Thank you,
Slava

[1] ⚙ D151173 [RFC][flang] Experimental CUDA build of Flang runtime.


This is very welcome. I hope @jansjodi @skatrak @agozillon @DominikAdamski might be interested since they are all working on OpenMP offloading.

A more generic macro might be good here. Alternatively, you can leave it for folks who want to support other devices/technologies to generalize this.

Is there a generic option similar to this in CMake?

For buildbots, we will have to make patches for the llvm-zorg repo. The following patch might serve as a reference: ⚙ D87085 Add flang out of tree buildbot
For pre-commit, I think you can make a request here.

It makes sense to have this as official policy once a buildbot is set up. There should be information provided to the developer (from the buildbot) on how to go about fixing the issue. This can be a reference to a Flang document (in flang/docs) detailing common issues and how to fix them.

I think that we should rely on the libc for GPUs project (Using libc for GPUs — The LLVM C Library) for building the Flang runtime for offload devices. Pros of libc over a custom Flang runtime:

  1. Already upstreamed and part of LLVM project.
  2. Larger community (some issues can be solved by C/C++ folks).
  3. Close integration with Clang. We can reuse some parts of the existing infrastructure.
  4. Optimizations for Clang and C code can be applied to Fortran code as well.
  5. Vendor agnostic.

Unfortunately the Using libcxx for GPUs page is missing.

The easiest way to build a library for this using Clang/LLVM would be to use the “new” CUDA driver. You could build the whole library into a single fatbinary as follows. This would require a small tweak to work, however; in its current form it would only link with CUDA, but we can make it generic.

clang++ -x cuda runtime.cpp --offload-arch=sm_60,sm_70,sm_80 -fvisibility=hidden --offload-new-driver -foffload-lto -c
llvm-ar rcs libfortran.a runtime.o

Personally I like performing direct compilation on freestanding C++, that’s what I’ve been doing to port LLVM’s C library to the GPU.

This is still a work-in-progress, but the goal is to provide all the basic system functionality that libc and libm provide. If there are any parts specific to Fortran we’d probably need to put that elsewhere. Compilation is done directly via clang --target=nvptx64-nvidia-cuda or similar. We use the same format described in Offloading Design & Internals — Clang 18.0.0git documentation to allow us to use things like LTO and static linking. As far as I understand, flang is using the same pipeline as clang for OpenMP compilation.

That’s another long-term goal. I’m hoping to apply the same approach to implement the C++ library at some point. I don’t

This feels like a step backwards. The device runtime for OpenMP was originally written in CUDA and then AMD reimplemented it in HIP. Then came a cleaned-up merger with shared architecture-independent files and differing architecture-dependent files. Then a complete revamp was done using pure OpenMP common code with function variants. This is the current state in openmp/libomptarget/DeviceRTL. OpenMP function variants are the way to go for portability and maintenance.

2nd. I realize that packaging is not part of the goal, but we are in the third iteration of heterogeneous objects in LLVM. The first did not assume a host object format and had some tag in the header. The 2nd uses the host object format with embedded binary strings. Both 1 and 2 used bundle/unbundle. The 3rd uses a new bundling tool called clang-offload-packager to package and unpackage fat binaries. I encourage convergence on this final model for managing heterogeneous objects and archives.


The device code representation can be done using OpenMP, similar to what we already did in the OpenMP device runtime. Target-dependent code can be handled by using declare variant. Using macros assumes all programming models look similar at a high level, which is probably only true for CUDA and HIP.


I would like to thank you all for the comments! It is super helpful to hear all the available options.

@kiranchandramohan, thank you very much for the links. They are very useful! I do not think CMAKE_CUDA_SEPARABLE_COMPILATION has a generic counterpart or that similar options are available for other programming models. I found this open request for HIP: HIP: Provide `HIP_SEPARABLE_COMPILATION` analogue to `CUDA_SEPARABLE_COMPILATION` (#23210) · Issues · CMake / CMake · GitLab

@jhuber6, if I understand it correctly, the “new” CUDA driver in clang++ is able to compile CUDA and package it in a way compatible with Clang OpenMP offload linking; -foffload-lto is useful for performance, but it is optional. Thank you for the command example and the links! I think CMake is currently doing “approximately” the same with my patch, e.g. the compilation/ar commands are:

clang --cuda-gpu-arch=sm_80 --cuda-path=/usr/local/cuda -x cuda -fgpu-rdc -c transformational.cpp -o CMakeFiles/obj.FortranRuntime.dir/transformational.cpp.o
llvm-ar qc libFortranRuntime.a CMakeFiles/obj.FortranRuntime.dir/transformational.cpp.o
llvm-ranlib libFortranRuntime.a

I suppose this could invoke the new driver with an additional --offload-new-driver option, but I am not sure if --cuda-gpu-arch=sm_80 populated from -DCMAKE_CUDA_ARCHITECTURES=80 list behaves the same way as --offload-arch=sm_80. I appreciate the CMake support that does all the plumbing with the minimal changes I made. If it is possible to invoke the “new” CUDA driver properly without reworking the CMake files too much that would be great.

I would like to keep the flexibility of building the Flang runtime with different compilers (g++, clang++, nvcc, nvc++), so I would like to keep the CMake code as generic as possible and let CMake figure out the details. It would be great to make the Flang runtime build less clang/llvm-centric than libomptarget/DeviceRTL. Note that we also have to be able to build it for different hosts, where clang support might be incomplete.

@greg63706, thank you for the great overview of the history of libomptarget/DeviceRTL development!
I see some benefits of using OpenMP offload for compiling the device library. For example, the automatic declare target marking can help reduce the number of explicit declare target directives needed to get the full library closure (e.g. the std::fill_n implementation may just be supported, if there is a header implementation without a libcxx dependency). Also, I agree that declare variant gives a nice way to provide target/vendor specific versions of function implementations. In addition, building such code with Clang results in library packaging compatible with the standard Clang OpenMP compilation pipeline.

At the same time, this approach seems to phase out all the other compilers that might be used to build the Flang runtime, e.g. g++ does not fully support declare variant, and neither does nvc++. If one needs to build the Flang runtime for a host with a non-clang compiler, they will have to use a compiler with the appropriate declare target and declare variant support or do some split compilation, e.g. use g++ for host and clang for device compilation, and then probably repackage the two pieces somehow. Moreover, g++ will choke on the declare variant constructs unless the related parts are commented out somehow.

I wonder if we can come up with some combined/tradeoff approach that will allow building Flang runtime with different compilers and will allow building it as CUDA, HIP, OpenMP/OpenACC offload, etc.

@shiltian, I agree that declare variant is one of the options that might be supported. I do not expect at this moment that the Flang runtime will need to introduce target-specific variants like libomptarget/DeviceRTL does, though. I think the “unsupported” functionalities in the context of the Flang runtime are more vendor/toolchain specific than target specific, e.g. some data types might be supported by nvc++ but not supported by clang and vice versa. So using implementation = {vendor(...)} device = {kind(cpu/gpu)} traits seems reasonable (although, with the above note that g++ host compilation will break on these directives).
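For illustration, such a guarded vendor/device variant might look roughly like this (the function, the returned values, and the FLANG_RT_HAS_DECLARE_VARIANT guard are hypothetical):

#include <cstddef>

// Generic implementation used by default.
std::size_t AllocationAlignment() { return 16; }

#if defined(_OPENMP) && defined(FLANG_RT_HAS_DECLARE_VARIANT)
// Variant selected only when compiling for an NVIDIA GPU with a compiler that
// supports these traits; the guard keeps g++ host builds from ever seeing it.
#pragma omp begin declare variant match(device = {kind(gpu)}, implementation = {vendor(nvidia)})
std::size_t AllocationAlignment() { return 256; }
#pragma omp end declare variant
#endif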

To summarize, I think we can support CUDA and HIP builds of Flang runtime using the proposed macros. And we can also use declare target and maybe some limited forms of declare variant with proper guards to support the OpenMP offload builds and builds with any host compiler.

I wonder if such a combined solution sounds acceptable to everyone.

--cuda-gpu-arch is an alias to --offload-arch. clang does not actually provide separable compilation without the --offload-new-driver flag, so the CMake flag is not applicable here. The “new” driver is definitely not portable w.r.t. other build systems. The idea with the new driver is to make the link steps look identical to regular compilation and thus special CMake handling is not required, although you do need to enable -fgpu-rdc. I would also highly recommend LTO if this is supposed to be a library.

I’m somewhat confused here; this is the LLVM/Flang source code, right? We should be able to reliably build it with clang in-tree. If needed we can then provide the LLVM-IR as a generic format like Nvidia does with libdevice.bc. Variants should not be a limiting factor; if support is the problem, we can trivially replace variants at this level with the following.

#if defined(__NVPTX__)
#elif defined(__AMDGPU__)
#endif

But I don’t have a whole view of the project, or even what’s required from this runtime library. This also brings up questions on how we expect to link these. The formats used by each compiler are wildly different, so unless we’re providing plain LLVM-IR it’s unlikely that will be portable.

If you want another portable language you can try freestanding C++, that’s what I’ve been using for libc for GPUs — The LLVM C Library, e.g.

clang++ --target=nvptx64-nvidia-cuda -march=sm_80 foo.cpp
clang++ --target=amdgcn-amd-amdhsa -mcpu=gfx1030 foo.cpp

But that’s just a passing suggestion, sticking with offloading languages is more likely to be tried and tested.

I understand that you want to compile Flang with the NVIDIA toolchain. I am wondering on which platforms you have only a gcc and cannot bootstrap a fresh Clang in open source?

I am starting to see the issue. If you want a Flang with CUDA Fortran, then the runtime probably has to be built as CUDA and not in OpenMP target/device mode?

I understand that you want to compile Flang with the NVIDIA toolchain. I am wondering on which platforms you have only a gcc and cannot bootstrap a fresh Clang in open source?

@tschuett, I do not have an example of such a platform right now. I have seen some feature completeness issues in clang compared to gcc, e.g. float128 support in clang has been behind gcc for some time. I suppose there may be more examples like this, but there might also be examples where clang is ahead of gcc regarding some features. Relying on just one compiler being able to build the Flang runtime seems too restrictive to me.

I am starting to see the issue. If you want a Flang with CUDA Fortran, then the runtime probably has to be built as CUDA and not in OpenMP target/device mode?

I think OpenMP target offload may also work, but I really wanted to find a way to build Flang runtime without using an intermediate OpenMP compiler (e.g. using CUDA/HIP compilers for direct device compilation).

I’m somewhat confused here, this is the LLVM/Flang source code right? We should be able to reliably build it with clang in-tree. If needed we can then provide the LLVM-IR as a generic format like Nvidia does with libdevice.bc.

This also brings up questions on how we expect to link these. The formats used by each compiler are wildly different, so unless we’re providing plain LLVM-IR it’s unlikely that will be portable.

@jhuber6, these are all great questions! I think we still need to be able to build Flang runtime for host with gcc, as a general requirement for LLVM sources. I guess anything beyond the host might be only buildable by clang, but I am not sure what LLVM requirements are in this regard.

My plan with the CUDA build was to provide a way to build the Flang runtime with clang or nvcc and get the device library either in the form of PTX or LLVM Bitcode. Then, depending on the toolchain that is using the Flang runtime, the packaging might be different and the toolchain linkers might behave differently. E.g. I may think of letting the nvfortran/nvc++ drivers link the Flang runtime in the form of LLVM Bitcode produced by nvcc. The same can be done for an LLVM Bitcode Flang runtime created by clang and linked using the clang driver (i.e. using the complete LLVM toolchain). While the two libraries might not be compatible, the build process of the Flang runtime might be uniform, i.e. the sources are treated as just CUDA code.

You are completely right that the formats used by different compilers are different, and this is true even for plain LLVM IR because of the versioning. Do you have more details about “we can then provide the LLVM-IR as a generic format like Nvidia does” - does clang have some special mode for this?

As I said, I think OpenMP target offload may work well with clang and other compilers, but I need to try it to be sure. I do not think there is a convenient way of putting omp declare target/omp end declare target with function granularity, and there will probably be cases where placing these pragmas around the whole file(s) will not be appropriate (e.g. if there are some functions that we do not want/need to enable for the device). Please let me know if I am missing something here.

I suppose we can introduce a couple of macros that expand to _Pragma("omp declare target device_type(nohost)") and _Pragma("omp end declare target") and use them around Flang runtime functions or around the whole file(s) where appropriate. We will also have to make sure that the host-only build still works with gcc, so some restrictions/guards on using other OpenMP constructs will have to be established (e.g. gcc will treat the declare variant versions as redefinitions of the same functions, and this will just not work).
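For example, such macros might look roughly like this (the macro names and the FLANG_RT_OPENMP_OFFLOAD_BUILD guard are placeholders of mine):

#if defined(_OPENMP) && defined(FLANG_RT_OPENMP_OFFLOAD_BUILD)
#define RT_OFFLOAD_GROUP_BEGIN _Pragma("omp declare target device_type(nohost)")
#define RT_OFFLOAD_GROUP_END _Pragma("omp end declare target")
#else
#define RT_OFFLOAD_GROUP_BEGIN
#define RT_OFFLOAD_GROUP_END
#endif

// Wrap individual functions or whole files that should get device versions:
RT_OFFLOAD_GROUP_BEGIN
void RTDECL(Foo)(int); // hypothetical entry point, as in the earlier sketch
RT_OFFLOAD_GROUP_END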

A big concern for me is the verbosity of the CMake code that is used for libomptarget/DeviceRTL and its dependency on the LLVM toolchain (e.g. llvm-link, llvm-extract, opt, etc.). I wonder if there is a cleaner way to compile a library with OpenMP target offload for multiple devices with clang. I suppose the users of clang OpenMP target offload just compile their libraries with -fopenmp -fopenmp-targets=... (+ -Xopenmp-target=nvptx64 --offload-arch=sm_80 or similar) and then archive the object files into the library – can this work for libomptarget/DeviceRTL and the Flang runtime?

The verbosity there is merely an artifact of a few requirements coming together, it’s not an inherent part of the process. We simply wanted to continue to provide the OpenMP device runtime as a monolithic LLVM-IR file and that required doing some of the steps manually.

OpenMP in clang handles multiple architectures natively now. Here’s the easiest way to create a library for use with OpenMP in the LLVM/Clang ecosystem for a few architectures:

$> clang++ src/library.cpp src/impl.cpp -fopenmp --offload-arch=sm_70,sm_80,sm_86,gfx908,gfx90a,gfx1030 -fvisibility=hidden -foffload-lto -fopenmp-cuda-mode -O3 -c
$> ar rcs libfoo.a library.o impl.o
$> clang++ app.cpp -fopenmp --offload-arch=sm_80 -lfoo

Clang uses a binary format similar to CUDA’s to package all the binaries into a single object. This obviously requires buying in to this format of Clang’s linking, but it is fully capable of providing library support. The downside is that we have a copy here for every architecture even when redundant, but it avoids a good amount of complexity. Also, for the DeviceRTL we compile it like a standard C++ file, but use device_type(nohost) so that the host portion of the code is a mostly empty ELF whose only purpose is to carry the .llvm.offloading section we use for linking.

What we’re doing in the DeviceRTL that you’ve pointed out is instead turning this into a bitcode library. For that we use the tools manually and can’t rely on the built-in handling. It’s a little more esoteric and you’ll need to repeat it for each architecture. Potentially you could only use common features and make a single one, but that’s some more engineering work.

$> clang++ src/library.cpp -fvisibility=hidden -foffload-lto --offload-arch=sm_70 --offload-device-only -emit-llvm -c -O3
$> clang++ src/impl.cpp -fvisibility=hidden -foffload-lto --offload-arch=sm_70 --offload-device-only -emit-llvm -c -O3
$> llvm-link impl.o library.o -o libfoo-sm_70.bc

One downside is that this generally requires an up-to-date clang to use. This is important because if we’re using LLVM bitcode we need the producer and consumer to generally be the same version. In-tree this is usually handled with an LLVM_ENABLE_RUNTIMES build, but we probably don’t want flang to do that. Alternatively, you can do what we do in openmp and just pick out the clang binary from bin/clang in the build, because we will presumably always build flang / openmp after clang.


It is a hard requirement of LLVM that everything in the monorepo can be built with:

  • Clang 5.0
  • Apple Clang 10.0
  • GCC 7.1
  • Visual Studio 2019 16.7

Offload might have higher requirements.

It seems feasible to me to annotate the f18 runtime code with multiple approaches to device code generation to accommodate the requirements of multiple targets & build compilers.


@jhuber6, thank you for the reply! Could you please also clarify what you mean by “freestanding C++”? Does it mean not using any STD C++ functionality or something else?

When I use “freestanding C++” I’m referring to compiling C++ code directly for the target GPU without any offloading runtime or existing GPU libraries. Here’s an example using clang Compiler Explorer. This is the method I’m currently using to build the LLVM C library on the GPU that has been linked earlier.

Thanks! I thought “freestanding” was related to the way the library is developed, not the way it is compiled.

You can just use clang -fopenmp --offload-arch=sm_80,gfx90a -c {in} -o {out} and then bundle the results in a static library. Part of the complication of the DeviceRTL setup is making sure we still have the separate .bc files for non-LTO builds targeting NVIDIA. Other parts ensure we run an “LTO” round of optimization on the runtime itself, rather than shipping it only “TU-wise optimized”. That is not necessary, but beneficial. The need for 4 explicit calls in CMake seems a low cost for the gain.


Wrt. the runtime design, I’d propose to actually follow the principle of the DeviceRTL. We compile everything for the device and everything is generic code if possible. If a function A::foo needs to be specialized between host and device, we provide a variant.
If B::baz is not easy to support on a device, we would specialize it with a variant that errors out, e.g., print + trap, or we could force a missing symbol (which will make non-optimizing builds painful).
The above should hold whether we want begin declare target everywhere or we just set the target triple passed to clang, as the GPU lib does.
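For example, B::baz could be stubbed out on the device roughly like this (a sketch assuming a clang OpenMP offload build; the function name and message are placeholders):

#include <cstdio>

void RuntimeBaz() { /* full host implementation */ }

#if defined(_OPENMP)
// Device variant that errors out: print a diagnostic and trap.
#pragma omp begin declare variant match(device = {kind(gpu)})
void RuntimeBaz() {
  std::printf("FortranRuntime: Baz is not supported on this device\n");
  __builtin_trap(); // fail loudly instead of silently doing the wrong thing
}
#pragma omp end declare variant
#endif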

Thank you for the comment, Johannes!

Yes, I ended up adding an OpenMP target offload build mode that produces a fatlib with the host and device parts of Flang runtime. I understand the value of the early LTO for the library itself and the ability to apply LTO to the program+library Bitcode, so I mentioned this mode in the patch under a TODO. At this point, building the fatlib should allow me to incrementally enable parts of Flang runtime as I was planning.

I do not currently see the need for declare variant, but I agree that it is a good way to produce different code versions for different devices/vendors.

Everyone, please review the updated patch at ⚙ D151173 [RFC][flang] Experimental device build of Flang runtime.