I would like to discuss enabling the Flang runtime library for offload devices. OpenACC/OpenMP compiler support is improving every day, and the dependency of Flang-generated code on the Flang runtime might become a blocking factor soon. I would like to propose an initial patch for building the Flang runtime for CUDA devices, and I want to make sure that the changes required to do this will also work, in general, for the other offload devices that LLVM/Flang community members care about. I would also like to collect a list of people from different parties who want to actively participate in making the Flang runtime offload-friendly.
Goals of this project:
- Come up with the guidelines for adjusting/disabling some functionalities that cannot be efficiently supported on all or some offload devices.
- Be able to incrementally enable parts of Flang runtime for the offload devices.
- Establish a process of continuous testing of the offload builds of Flang runtime, and a process of dealing with the breakages.
The initial patch provides a way to build a small part of the Flang runtime, using the Clang compiler with the NVPTX backend, as a CUDA static fat library. There are other distribution models for offload libraries (e.g. a PTX library for CUDA devices, or an LLVM bitcode library such as the ones used for LLVM libomptarget/DeviceRTL), but it is not a goal of this project to define the distribution model of the Flang runtime for CUDA devices. My assumption is that solving the problem of building the Flang runtime one way will let us build it all the other ways with minor modifications of the CMake files.
I chose Clang-NVPTX as the build compiler simply because it is available to the whole community. To be fully transparent, I also want to make sure that the Flang runtime builds with NVCC, but any changes required to do this will need to be justified and approved by the community. At this stage, building with Clang-NVPTX is a good path forward.
As you may see in the patch, the major change in the library source code is applying the CUDA `__host__ __device__` attributes to all declarations and definitions of the Flang runtime entry points. Adding the attributes to the entry points requires adding them to the whole call-graph closure. The attributes are added by means of a new `HOST_DEVICE_ATTRS` macro that is to be used only for function declarations and definitions. The existing `RTNAME` macro can still be used to get the mangled name of a runtime entry point. I suppose one might decide instead to build the Flang runtime as a C++ library with offload directives defining the necessary entry points as "device" functions (e.g. with OpenMP `declare target`/`end declare target` annotating particular APIs), and the current usage of the `HOST_DEVICE_ATTRS` macro at the declaration/definition points may not allow doing this cleanly. So I am open to other alternatives.
Next, there is some functionality, such as `std::fill_n`, that is not currently supported by CUDA toolchains. In the initial patch I showed that an internal implementation is one option for getting rid of such dependencies. This approach may be taken alone or be accompanied by feature requests for the device toolchains to support `std::unique_ptr`, etc. I used `CUDA_DEVICE` to provide a device-side alternative implementation of `std::fill_n`. I wonder whether a more generic macro like `IS_DEVICE` would seem more appropriate to people. Another alternative is a macro like `SUPPORTS_STD_FILL_N` that can be defined in `flang/include/flang/Runtime/device.h` based on the device toolchain. I guess this can be changed later, when some commonalities are found across offload devices, but I am open to ideas right now.
As you may expect, there are many unresolved references in the resulting CUDA library. For example, the methods marked with `HOST_DEVICE_ATTRS` in the header files do not have corresponding implementations compiled for the device. I am using `-DCMAKE_CUDA_SEPARABLE_COMPILATION=ON` to allow unresolved references, and this lets me incrementally enable parts of the Flang runtime in the CUDA build (see also the `supported_files` list in `flang/runtime/CMakeLists.txt`). So at this point I cannot say that the build produces a functional library, but this approach allows splitting the changes into reviewable pieces. Eventually, I will need to come up with a scripted check for unresolved references to guarantee that the library is linkable in all cases with the CUDA tools.
Regarding the last goal listed above, I would like to have buildbots running CUDA and other offload builds of the Flang runtime. I would also like to have an option to trigger Phabricator pre-commit testing for any Flang runtime changes. I have no experience here, so any advice/links will be appreciated.
For the time being, I will resolve any issues with the CUDA build introduced by new changes in the Flang runtime. But once there is a buildbot, e.g. one testing the build with Clang-NVPTX, would it be reasonable to make fixing breakages the responsibility of the commit author? Is there a way to make this requirement "official" (as is done, for example, for OpenMP offload breakages)?
Thank you very much for reading this far! I am looking forward to your comments!
D151173 [RFC][flang] Experimental CUDA build of Flang runtime