NVPTX codegen for llvm.sin (and friends)

Gentle ping on this one to see if there has been any more activity. I ran into a related issue on a different path, translating from MLIR to LLVM IR, where such math operations will have already been converted to __nv_<math_func> calls before translation to LLVM IR. This leads to errors during the link step of the CUDA driver API, which is what MLIR’s gpu-to-cubin pass uses ( https://github.com/llvm/llvm-project/blob/befa8cf087dbb8159a4d9dc8fa4d6748d6d5049a/mlir/lib/Dialect/GPU/Transforms/SerializeToCubin.cpp#L121). I posted this link issue on the NVIDIA forum, thinking the resolution could/should happen after PTX generation (CUDA driver API cuLinkComplete can't find libdevice (nvvm intrinsics bitcode) - CUDA Programming and Performance - NVIDIA Developer Forums), but the discussion here is already quite advanced on the approach and solutions.

On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich t...@google.com wrote:

It all boils down to the fact that PTX does not have the standard libc/libm which LLVM could lower the calls to, nor does it have a ‘linking’ phase where we could link such library in, if we had it.

Is the comment about the ‘linking’ phase still true? While CUDA/PTX does not provide a standard library, there is a linking phase after PTX generation, which the gpu-to-cubin pass uses (CUDA Driver API :: CUDA Toolkit Documentation), so I’m wondering why cuLinkComplete can’t resolve these functions; a sketch of that link step is included below. Linking at this stage won’t, however, have the early IR-level optimization advantages mentioned upthread that the “higher-level” NVVM bitcode provides. Perhaps the loss of essential optimization and transformation opportunities (potentially chip-specific) is the reason these functions are packaged as “bitcode libraries” rather than standard libraries and are meant to be linked before translation to PTX. Having the linking support in LLVM would mean that MLIR’s gpu-to-cubin pass could include the necessary LLVM passes on its way down to PTX. (However, this path never sees the llvm.sin/cos/exp, etc. intrinsics; instead, MLIR math operations like math.exp are currently converted directly to __nv_expf calls.)
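For reference, here is a minimal sketch of that post-PTX link step as performed through the driver API (this is not the exact MLIR code; function and variable names are illustrative, most error handling is trimmed, and it assumes cuInit has been called and a current context exists). The point is that the JIT linker only consumes PTX/cubin/fatbin inputs, so it cannot pull in the NVVM bitcode from libdevice, and any __nv_* references that survive into the PTX show up as unresolved symbols at cuLinkComplete time:

```c
#include <cuda.h>
#include <stdlib.h>
#include <string.h>

// Links a PTX string into a cubin image using the CUDA driver JIT linker,
// roughly what the gpu-to-cubin (SerializeToCubin) path does. Returns a heap
// copy of the cubin (the linker-owned image is freed by cuLinkDestroy) or
// NULL on failure.
static void *linkPtxToCubin(void *ptx, size_t ptxSize, size_t *cubinSize) {
  CUlinkState state;
  void *image = NULL, *result = NULL;
  size_t imageSize = 0;

  if (cuLinkCreate(/*numOptions=*/0, NULL, NULL, &state) != CUDA_SUCCESS)
    return NULL;
  cuLinkAddData(state, CU_JIT_INPUT_PTX, ptx, ptxSize, "kernel.ptx",
                /*numOptions=*/0, NULL, NULL);
  // This is where an unresolved __nv_expf (etc.) reference surfaces: the JIT
  // linker resolves symbols against other PTX/cubin inputs only, not against
  // the NVVM bitcode shipped in libdevice.bc.
  if (cuLinkComplete(state, &image, &imageSize) == CUDA_SUCCESS) {
    result = malloc(imageSize);
    if (result) {
      memcpy(result, image, imageSize);
      *cubinSize = imageSize;
    }
  }
  cuLinkDestroy(state);
  return result;
}
```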

-Uday

There has been some activity downstream but nothing upstream. Here is the current status and plan (as I see it):

  • We have the necessary code to build a libm.a for our GPU targets (AMD and NVIDIA). This reuses the existing [c]math[.h] code in clang/lib/Headers/, which translates sin to __XY_sin depending on the architecture; see the sketch after this list. (thanks @jhuber6!)
  • We are about to clean up our code in order to upstream it into clang/lib/GPURuntimes (or similar). However, the deployment might change as the RFC towards more runtime support (libc[std][++]) on the GPU advances. Unfortunately, the RFC hasn’t been written yet.
  • We disable the header translation in clang for GPU targets and instead link in our libm.a for the GPU as well. This improves performance iff you use device-side LTO (-foffload-lto). We could keep the current code path if device-side LTO is disabled; this is not much hassle, as we need the headers anyway to build the libm.a in the first place.
  • Once we have landed the libm.a work, we can go ahead with the implemented.by idea (see thread) to translate llvm.sin to sin, or we can reuse existing codegen logic to do that; either should work. After this is done, we can enable -fno-math-errno for device compilation by default, since we don’t support errno on the device right now anyway.
  • MLIR should not enter LLVM-IR with __nv_<math> calls, IMHO. (I mean, isn’t that counter to the entire “high-level” idea?) That said, the problem you are having seems to be the missing inclusion of libdevice.bc at the LLVM-IR level. This is a “driver” issue as far as I can tell and not related to the proper handling of math functions. Maybe I misunderstand the description; feel free to elaborate.
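To illustrate the first bullet, the clang GPU math headers wrap the vendor “bitcode library” entry points behind the standard math names, roughly along these lines. This is a simplified sketch, not the actual header contents: the real headers in clang/lib/Headers/ define their own __DEVICE__ macro, carry many more overloads and fast-math variants, and guard against conflicts with the host declarations.

```cpp
// Simplified sketch of the header-based mapping from standard math names to
// the vendor device-library entry points (__XY_sin in the text above).
#define __DEVICE__ static __device__ inline

extern "C" {
__device__ double __nv_sin(double);  // NVIDIA: provided by libdevice.bc
__device__ float __nv_sinf(float);
// AMDGPU builds would instead declare e.g. __ocml_sin_f64 from the ROCm
// device libraries.
}

// The standard names simply forward to the vendor implementations.
__DEVICE__ double sin(double __x) { return __nv_sin(__x); }
__DEVICE__ float sinf(float __x) { return __nv_sinf(__x); }
```

Compiling these definitions into a per-architecture libm.a, rather than force-including the header into every TU, is what lets the (device-side LTO) link see and optimize the call sites, as described in the bullets above.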