Compiling CUDA code fails

It’s true for V100, less so for A100. Cards like the A100/A30 that are based on the GA100 chip do indeed have the normal 1:2 fp64/fp32 hardware ratio. However, other nominally datacenter-grade cards like the A40, A10, and A16 are based on the GA102/GA107 GPU variants, and those come with 1:64 and 1:32 fp64/fp32 ratios, respectively.
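If you want to check where a particular board lands, the ratio is easy to observe empirically. Here’s a minimal microbenchmark sketch (the `fma_chain`/`time_ms` names, launch shape, and iteration count are arbitrary choices of mine, not anything from the CUDA toolkit): it times a dependent multiply-add chain instantiated for `float` and for `double`, and on a saturating launch the time ratio roughly tracks the hardware fp64/fp32 ratio.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Dependent multiply-add chain. nvcc (and clang for CUDA) contracts
// acc*x+x into FMAs by default, so a launch with enough resident warps
// is bound by the chip's fp32 or fp64 FMA throughput.
template <typename T>
__global__ void fma_chain(T *out, T x, int iters) {
  T acc = x;
  for (int i = 0; i < iters; ++i)
    acc = acc * x + x;
  out[blockIdx.x * blockDim.x + threadIdx.x] = acc;  // keep acc live
}

template <typename T>
float time_ms(T *buf, int iters) {
  cudaEvent_t beg, end;
  cudaEventCreate(&beg);
  cudaEventCreate(&end);
  fma_chain<<<1024, 256>>>(buf, T(1.0000001), iters);  // warm-up launch
  cudaEventRecord(beg);
  fma_chain<<<1024, 256>>>(buf, T(1.0000001), iters);
  cudaEventRecord(end);
  cudaEventSynchronize(end);
  float ms = 0.f;
  cudaEventElapsedTime(&ms, beg, end);
  cudaEventDestroy(beg);
  cudaEventDestroy(end);
  return ms;
}

int main() {
  void *buf;
  cudaMalloc(&buf, 1024 * 256 * sizeof(double));
  const int iters = 1 << 18;
  float f32 = time_ms(static_cast<float *>(buf), iters);
  float f64 = time_ms(static_cast<double *>(buf), iters);
  // Expect roughly 2x on GA100-class parts (1:2) and far more on GA10x.
  printf("fp64/fp32 time ratio: %.1fx\n", f64 / f32);
  cudaFree(buf);
  return 0;
}
```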

What constantly irks me about NVIDIA’s GPU nomenclature is that GA102 and GA107 have the same compute capability, but the former has only half the fp64 hardware. I guess it’s better than the situation with sm_35, where we had models with 1:3 and 1:24 ratios (K40 vs. GTX 780), but it still makes it a bit of a pain to come up with reasonable optimization trade-offs.
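The ambiguity is visible from the CUDA runtime, too. A quick sketch using the stock `cudaGetDeviceProperties` query: a GA102-based A10 and a GA107-based A16 should both report sm_86 here, so the compute capability alone won’t tell you which fp64 ratio you got.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  for (int i = 0; i < n; ++i) {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, i);
    // GA102- and GA107-based boards both print sm_86; the fp64/fp32
    // ratio has to come from a name table or a microbenchmark instead.
    printf("%s: sm_%d%d, %d SMs\n", p.name, p.major, p.minor,
           p.multiProcessorCount);
  }
  return 0;
}
```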

AFAICT, __float128 is implemented as a soft-float emulation of IEEE FP (at least that’s what GCC does on x86-64, according to the GCC 4.3 release notes).

__float128 ops in both gcc and clang call library routines (libgcc / compiler-rt) to actually do the operations: Compiler Explorer
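A one-function translation unit is enough to see it. Compiling this for x86-64 with either gcc or clang (e.g. `g++ -S` and grep the assembly) shows the addition lowered to a call to `__addtf3` rather than to inline instructions:

```cpp
// fp128 '+' becomes a libcall (__addtf3 in libgcc / compiler-rt) on
// x86-64; there is no inline hardware instruction sequence for it.
__float128 add(__float128 a, __float128 b) { return a + b; }
```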

We currently do not have the standard library on the GPU. We may be able to use the same soft-float approach once we have a way to provide GPU-side libcall implementations, which @jdoerfert has proposed. See [llvm-dev] [RFC] The `implements` attribute, or how to swap functions statically but late