Compiling CUDA code fails

Hi,

I am trying to build CUDA code with Clang 14.0.0 following the instructions (Compiling CUDA with clang — LLVM 15.0.0git documentation), but I am getting a compiler error:

In file included from /home/bwibking/amrex/Src/Base/AMReX_ParmParse.cpp:2:
In file included from /home/bwibking/amrex/Src/Base/AMReX_ParmParse.H:7:
/home/bwibking/amrex/Src/Base/AMReX_TypeTraits.H:87:100: error: 'T' does not refer to a value
    struct MaybeHostDeviceRunnable<T, std::enable_if_t<__nv_is_extended_device_lambda_closure_type(T)> >
                                                                                                   ^
/home/bwibking/amrex/Src/Base/AMReX_TypeTraits.H:86:21: note: declared here
    template <class T>
                    ^

It looks like the issue is that Clang doesn’t recognize __nv_is_extended_device_lambda_closure_type. Is there a workaround for this?

Thanks,
Ben

Clang’s CUDA headers do not contain __nv_is_extended_device_lambda_closure_type, which indicates you are right about the missing support.

Workaround ideas:

  • Modify the code to pass a T() and implement __nv_is_extended_device_lambda_closure_type as a consteval function in user space.
  • Teach clang about __nv_is_extended_device_lambda_closure_type properly.
  • Specialize the template for all the types you need explicitly.

@Artem-B might have more ideas.

Based on Programming Guide :: CUDA Toolkit Documentation, I don’t see how to implement this behavior as a consteval function. Are there other type traits implemented by Clang that would allow this?

Manually replacing its usage may be the only option for now. I’m afraid I don’t understand Clang internals well enough to propose a patch for this.

Correct, clang does not implement those NVCC builtins.

Clang treats lambdas as HD functions by default. E.g. Compiler Explorer

We may be able to implement a builtin indicating whether a lambda can be executed on the device (or whether it has a host or device attribute, implicit or explicit), but it will probably not be a 1:1 equivalent of NVCC’s builtins, as clang simply does not have NVCC’s concept of “extended lambdas”. Or rather, all lambdas in clang are extended lambdas in NVCC’s terms.


Try defining those builtins as true. You may get lucky. Clang is usually more forgiving about lambdas than NVCC. Unless the code specifically looks for lambdas that intentionally can’t be executed on the GPU, there’s a good chance that it may work with clang.


Thanks. When using Clang to compile CUDA, I’ve reverted to the way the code handles this for HIP, which likewise has no concept of extended lambdas, and everything just works.

However, my code still fails to build: it complains that it does not recognize the __managed__ attribute on global variables. Is there a workaround for this second issue?

Support for __managed__ variables is not implemented in clang for CUDA. NVIDIA does not document how they are supposed to work and interact with CUDA runtime, so there’s not much we can do about that.

That’s unfortunate, but I’ve managed to work around that.

I am now getting some bizarre errors about __float128 not being supported. I am not using __float128 in my code, but it appears in the C++ standard library headers:

Consolidate compiler generated dependencies of target Test_AsyncOut_multifab
[ 71%] Building CUDA object Tests/AsyncOut/multifab/CMakeFiles/Test_AsyncOut_multifab.dir/main.cpp.o
In file included from <built-in>:1:
In file included from /g/data/jh2/bw0729/spack/opt/spack/linux-rocky8-skylake_avx512/gcc-8.5.0/llvm-14.0.0-bjek4pvxz253ed64tm66shfrwjfsaogg/lib/clang/14.0.0/include/__clang_cuda_runtime_wrapper.h:41:
In file included from /half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/cmath:47:
/half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/std_abs.h:102:7: error: __float128 is not supported on this target
  abs(__float128 __x)
      ^
/half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/std_abs.h:101:3: error: __float128 is not supported on this target
  __float128
  ^
/half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/std_abs.h:102:18: note: '__x' defined here
  abs(__float128 __x)
                 ^
In file included from /g/data/jh2/bw0729/amrex/Tests/AsyncOut/multifab/main.cpp:1:
In file included from /g/data/jh2/bw0729/amrex/Src/Base/AMReX.H:9:
In file included from /g/data/jh2/bw0729/amrex/Src/Base/AMReX_ccse-mpi.H:14:
In file included from /apps/openmpi/4.1.2/include/mpi.h:2887:
In file included from /apps/openmpi/4.1.2/include/openmpi/ompi/mpi/cxx/mpicxx.h:42:
In file included from /half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/map:60:
In file included from /half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/stl_tree.h:63:
In file included from /half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/stl_algobase.h:64:
In file included from /half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/stl_pair.h:59:
In file included from /half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/move.h:55:
/half-root/usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/type_traits:335:39: error: __float128 is not supported on this target
    struct __is_floating_point_helper<__float128>
                                      ^
3 errors generated when compiling for sm_70.
make[2]: *** [Tests/AsyncOut/multifab/CMakeFiles/Test_AsyncOut_multifab.dir/build.make:76: Tests/AsyncOut/multifab/CMakeFiles/Test_AsyncOut_multifab.dir/main.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:491: Tests/AsyncOut/multifab/CMakeFiles/Test_AsyncOut_multifab.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

The __float128 error turns out to be caused by the fact that glibc and libstdc++ use __float128 as an extension, while Clang only supports __float128 on x86, not on the NVPTX target. I can work around this by adding -D__STRICT_ANSI__ to the compiler options when building device code.

Do you know why this doesn’t cause an issue with nvcc? Is there a better way to fix this?

This issue appears to have been noticed by the CMake developers: CUDA: Clang support (!4442) · Merge requests · CMake / CMake · GitLab (kitware.com)

However, this patch suggests that the issue had been solved at one point, but it’s broken again: [CUDA] Define CUDACC before standard library headers · llvm/llvm-project@8e20516 (github.com)

Edit: OK, it looks like the issue is that libstdc++ only avoids __float128 thanks to a Debian-specific patch: debian/patches/cuda-float128.diff · 9.3.0-16 · Debian GCC Maintainers / GCC · GitLab. Since I am running on RHEL 8, I don’t get the benefit of that patch. I think a more robust solution is needed on the Clang side.

Do you know why this doesn’t cause an issue with nvcc? Is there a better way to fix this?

nvcc uses a very different compilation strategy. NVCC physically separates host and device code, which allows it to compile host-side code that uses types not supported on the GPU.

On the other hand, Clang sees both host and device code simultaneously which allows us to handle C++ compilation better than NVCC. E.g. NVCC has to jump through some hoops to deal with some templates being instantiated by the code on the other side of the compilation.

The downside of seeing both sides is that the code must be ‘reasonably’ valid for both the host and the GPU as a complete TU (unless you rely on the preprocessor and the __CUDA_ARCH__ macro).

We work around some issues in this category with delayed diagnostics: if the code that triggered a diagnostic never gets emitted (e.g. host code during GPU-side compilation), compilation succeeds. Types are trickier, though, since using a type does not necessarily generate any code, so it’s hard to tie the error to anything specific.

Depending on where __float128 pops up in the headers, you may be able to do something like this:

#if defined(__CUDA_ARCH__)
#define __STRICT_ANSI__ 1
#endif
#include <header that may use float128.h>

It may not work if that header is pulled in by the CUDA runtime wrapper header that clang itself adds via -include. It may also lead to differences in the code/types seen by the compiler during host and device compilation, which can cause further trouble if it affects data exchanged between the host and the GPU.


Ah, this explains why nvcc has so many template-related bugs…

I can live with defining __STRICT_ANSI__ in both host and device compilation for now. Is there any plan to add support for __float128 when building nvptx with Clang in the future?

Is there any plan to add support for __float128 when building nvptx with Clang in the future?

Not to my knowledge. There’s no FP128 support on existing NVIDIA GPUs, so it would be of limited practical use on the GPU. Even if we were to emulate it via float/double it would be prohibitively slow (even double is rather slow on most GPU variants).

We may be able to add storage-only support for it, but I’m not sure if we can easily add fp128 emulation using fp64/fp32.

Double is perfectly fine on V100/A100, which is presumably what most HPC codes will be running on. If this much precision is needed, I think datacenter-class cards would be used. But it might be a performance surprise for users of other models.

NVIDIA released a BSD-licensed double-double header-only library for CUDA that seems to have disappeared from their website, but it is still available, e.g., here: Version 1.2 of NVIDIA's double-double arithmetic header, distributed in accordance with its BSD License. · GitHub. Unfortunately, the libquadmath documentation doesn’t describe the implementation details, so I cannot tell whether libquadmath implements IEEE-compliant binary128 or just double-double. What does Clang do for __float128 on x86? If it’s double-double, then using double-double on the device as well would seem reasonable.

That’s true for V100, less so for A100. Cards like the A100/A30, based on the GA100 chip, do indeed have the normal 1:2 fp64/fp32 hardware ratio. However, other nominally datacenter-grade cards like the A40 and A10/A16 are based on the GA102/GA107 GPU variants, which come with 1:64 and 1:32 fp64/fp32 ratios.

What constantly irks me about NVIDIA’s GPU nomenclature is that GA102 and GA107 have the same compute capability, yet the former has only half the relative fp64 throughput. I guess it’s better than the situation with sm_35, where we had models with 1:3 and 1:24 ratios (K40 vs GTX 780), but it still makes it a bit of a pain to come up with reasonable optimization trade-offs.

AFAICT, it’s implemented as soft-float emulation of IEEE binary128 (at least that’s what GCC does on x86-64, according to the GCC 4.3 release notes).

__float128 ops in both gcc and clang call the standard library to actually do the operations: Compiler Explorer

We currently do not have the standard library on the GPU. We may be able to use the same soft-float approach once we have a way to provide GPU-side libcall implementations, which @jdoerfert has proposed. See [llvm-dev] [RFC] The `implements` attribute, or how to swap functions statically but late

IMO that would be ideal. Would be happy to test once there is a way to call standard library functions.