Compiling CUDA code fails

Hi,

I am trying to build CUDA code with Clang 14.0.0 following the instructions (Compiling CUDA with clang — LLVM 15.0.0git documentation), but I am getting a compiler error:

In file included from /home/bwibking/amrex/Src/Base/AMReX_ParmParse.cpp:2:
In file included from /home/bwibking/amrex/Src/Base/AMReX_ParmParse.H:7:
/home/bwibking/amrex/Src/Base/AMReX_TypeTraits.H:87:100: error: 'T' does not refer to a value
    struct MaybeHostDeviceRunnable<T, std::enable_if_t<__nv_is_extended_device_lambda_closure_type(T)> >
                                                                                                   ^
/home/bwibking/amrex/Src/Base/AMReX_TypeTraits.H:86:21: note: declared here
    template <class T>
                    ^

It looks like the issue is that Clang doesn’t recognize __nv_is_extended_device_lambda_closure_type. Is there a workaround for this?

Thanks,
Ben

Clang’s CUDA headers do not contain __nv_is_extended_device_lambda_closure_type, which indicates you are right about the missing support.

Workaround ideas:

  • Modify the code to pass a T() and implement __nv_is_extended_device_lambda_closure_type as a consteval function in user space.
  • Teach clang about __nv_is_extended_device_lambda_closure_type properly.
  • Specialize the template for all the types you need explicitly.

@Artem-B might have more ideas.

Based on Programming Guide :: CUDA Toolkit Documentation, I don’t see how to implement this behavior as a consteval function. Are there other type traits implemented by Clang that would allow this?

Manually replacing its usage may be the only option for now. I’m afraid I don’t understand Clang internals well enough to propose a patch for this.

Correct, clang does not implement those NVCC builtins.

Clang treats lambdas as HD functions by default. E.g. Compiler Explorer

We may be able to implement a builtin indicating whether a lambda can be executed on the device (or whether it has a host or device attribute, implicit or explicit), but it will probably not be a 1:1 equivalent of NVCC’s builtins, as clang simply does not have NVCC’s concept of “extended lambdas”. Or rather, all lambdas in clang are extended lambdas in NVCC’s terms.


Try defining those builtins as true. You may get lucky. Clang is usually more forgiving about lambdas than NVCC. Unless the code specifically looks for lambdas that intentionally can’t be executed on the GPU, there’s a good chance that it may work with clang.


Thanks. When using Clang to compile CUDA, I’ve reverted to the way the code handles this for HIP, which likewise has no concept of extended lambdas, and everything just works.

However, my code still fails to build: it complains that it does not recognize the __managed__ attribute on global variables. Is there a workaround for this second issue?

Support for __managed__ variables is not implemented in clang for CUDA. NVIDIA does not document how they are supposed to work and interact with CUDA runtime, so there’s not much we can do about that.

That’s unfortunate, but I’ve managed to work around that.

I am now getting some bizarre errors about __float128 not being supported. I am not using __float128 in my code, but it appears in the C++ standard library headers:

Consolidate compiler generated dependencies of target Test_AsyncOut_multifab
[ 71%] Building CUDA object Tests/AsyncOut/multifab/CMakeFiles/Test_AsyncOut_multifab.dir/main.cpp.o
In file included from <built-in>:1:
In file included from /g/data/jh2/bw0729/spack/opt/spack/linux-rocky8-skylake_avx512/gcc-8.5.0/llvm-14.0.0-bjek4pvxz253ed64tm66shfrwjfsaogg/lib/clang/14.0.0/include/__clang_cuda_runtime_wrapper.h:41:
In file included from /half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/cmath:47:
/half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/std_abs.h:102:7: error: __float128 is not supported on this target
  abs(__float128 __x)
      ^
/half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/std_abs.h:101:3: error: __float128 is not supported on this target
  __float128
  ^
/half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/std_abs.h:102:18: note: '__x' defined here
  abs(__float128 __x)
                 ^
In file included from /g/data/jh2/bw0729/amrex/Tests/AsyncOut/multifab/main.cpp:1:
In file included from /g/data/jh2/bw0729/amrex/Src/Base/AMReX.H:9:
In file included from /g/data/jh2/bw0729/amrex/Src/Base/AMReX_ccse-mpi.H:14:
In file included from /apps/openmpi/4.1.2/include/mpi.h:2887:
In file included from /apps/openmpi/4.1.2/include/openmpi/ompi/mpi/cxx/mpicxx.h:42:
In file included from /half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/map:60:
In file included from /half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/stl_tree.h:63:
In file included from /half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/stl_algobase.h:64:
In file included from /half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/stl_pair.h:59:
In file included from /half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/move.h:55:
/half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/type_traits:335:39: error: __float128 is not supported on this target
    struct __is_floating_point_helper<__float128>
                                      ^
3 errors generated when compiling for sm_70.
make[2]: *** [Tests/AsyncOut/multifab/CMakeFiles/Test_AsyncOut_multifab.dir/build.make:76: Tests/AsyncOut/multifab/CMakeFiles/Test_AsyncOut_multifab.dir/main.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:491: Tests/AsyncOut/multifab/CMakeFiles/Test_AsyncOut_multifab.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

The __float128 error turns out to be caused by the fact that glibc and libstdc++ use __float128 as an extension, while Clang only supports __float128 on x86, not on the NVPTX target. I can work around this by adding -D__STRICT_ANSI__ to the compiler options when building device code.

Do you know why this doesn’t cause an issue with nvcc? Is there a better way to fix this?

This issue appears to have been noticed by the CMake developers: CUDA: Clang support (!4442) · Merge requests · CMake / CMake · GitLab (kitware.com)

However, this patch suggests that the issue had been solved at one point, but it’s broken again: [CUDA] Define CUDACC before standard library headers · llvm/llvm-project@8e20516 (github.com)

Edit: OK, it looks like the issue is that libstdc++ only avoids __float128 thanks to a Debian-specific patch: debian/patches/cuda-float128.diff · 9.3.0-16 · Debian GCC Maintainers / GCC · GitLab. Since I am running on RHEL 8, I don’t get the benefit of that patch. I think a more robust solution is needed on the Clang side.

Do you know why this doesn’t cause an issue with nvcc? Is there a better way to fix this?

nvcc uses a very different compilation strategy. NVCC physically separates host and device code, which allows it to compile host-side code that uses types not supported on the GPU.

On the other hand, Clang sees both host and device code simultaneously which allows us to handle C++ compilation better than NVCC. E.g. NVCC has to jump through some hoops to deal with some templates being instantiated by the code on the other side of the compilation.

The downside of seeing both sides is that the code must be ‘reasonably’ valid for both the host and the GPU as a complete TU (unless you rely on the preprocessor and the __CUDA_ARCH__ macro).

We work around some issues in this category with delayed diagnostics: if the code that triggered a diagnostic never gets emitted (e.g. host code during GPU-side compilation), compilation succeeds. Types are trickier, though, since using a type does not necessarily generate any code, so it’s hard to tie the error to anything specific.

Depending on where __float128 pops up in the headers, you may be able to do something like this:

#if defined(__CUDA_ARCH__)
#define __STRICT_ANSI__ 1
#endif
#include <header that may use float128.h>

It may not work if that header is pulled in by the CUDA runtime wrapper header that clang itself adds via -include. It may also lead to differences in the code/types seen by the compiler during host and device compilation, which can cause further trouble if it affects data exchanged between the host and the GPU.


Ah, this explains why nvcc has so many template-related bugs…

I can live with defining __STRICT_ANSI__ in both host and device compilation for now. Is there any plan to add support for __float128 when building nvptx with Clang in the future?

Is there any plan to add support for __float128 when building nvptx with Clang in the future?

Not to my knowledge. There’s no FP128 support on existing NVIDIA GPUs, so it would be of limited practical use on the GPU. Even if we were to emulate it via float/double it would be prohibitively slow (even double is rather slow on most GPU variants).

We may be able to add storage-only support for it, but I’m not sure if we can easily add fp128 emulation using fp64/fp32.

Double is perfectly fine on V100/A100, which is presumably what most HPC codes will be running on. If this much precision is needed, I think datacenter-class cards would be used. But it might be a performance surprise for users of other models.

NVIDIA released a BSD-licensed double-double header-only library for CUDA that seems to have disappeared from their website, but it is still available, e.g., here: Version 1.2 of NVIDIA's double-double arithmetic header, distributed in accordance with its BSD License. · GitHub. Unfortunately, the libquadmath documentation doesn’t describe the implementation details, so I cannot tell whether libquadmath implements IEEE-compliant binary128 or just double-double. What does Clang do for __float128 on x86? If it’s double-double, then using double-double on the device as well would seem reasonable.

That’s true for V100, less so for A100. Cards like the A100/A30, based on the GA100 chip, do indeed have the normal 1:2 fp64/fp32 hardware ratio. However, other nominally datacenter-grade cards like the A40 and A10/A16 are based on the GA102/GA107 GPU variants, which come with 1:64 and 1:32 fp64/fp32 ratios.

What constantly irks me about NVIDIA’s GPU nomenclature is that GA102 and GA107 have the same compute capability, yet the former has only half the relative fp64 throughput. I guess it’s better than the situation with sm_35, where we had models with 1:3 and 1:24 ratios (K40 vs GTX 780), but it still makes it a bit of a pain to come up with reasonable optimization trade-offs.

AFAICT, it’s implemented as soft-float emulation of IEEE binary128 (at least that’s what GCC does on x86-64, according to the GCC 4.3 release notes).

__float128 ops in both gcc and clang call the standard library to actually do the operations: Compiler Explorer

We currently do not have the standard library on the GPU. We may be able to use the same soft-float approach once we have a way to provide GPU-side libcall implementations, which @jdoerfert has proposed. See [llvm-dev] [RFC] The `implements` attribute, or how to swap functions statically but late

IMO that would be ideal. Would be happy to test once there is a way to call standard library functions.