Gentle ping on this one to see if there has been any more activity. I ran into a related issue, but on a different path: when translating from MLIR to LLVM IR, such math operations will have already been converted to `__nv_<math_func>` calls (before translation to LLVM IR), and this leads to errors during the link step of the CUDA driver API, which is what MLIR's `gpu-to-cubin` pass uses (https://github.com/llvm/llvm-project/blob/befa8cf087dbb8159a4d9dc8fa4d6748d6d5049a/mlir/lib/Dialect/GPU/Transforms/SerializeToCubin.cpp#L121). I posted this link issue on the NVIDIA forum, thinking the resolution could/should happen after PTX generation: "CUDA driver API cuLinkComplete can't find libdevice (nvvm intrinsics bitcode)" (CUDA Programming and Performance, NVIDIA Developer Forums). But the discussion here is quite advanced on the approach and solutions.

On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <t...@google.com> wrote:

> It all boils down to the fact that PTX does not have the standard libc/libm which LLVM could lower the calls to, nor does it have a 'linking' phase where we could link such a library in, if we had it.
Is the comment on the 'linking' phase still true? While CUDA/PTX does not provide a standard library, there is a linking phase after PTX generation, which the `gpu-to-cubin` pass uses (CUDA Driver API :: CUDA Toolkit Documentation). I'm wondering why `cuLinkComplete` can't complete the link to these functions. This won't, however, have the early IR optimization advantages mentioned upthread, given that "higher-level" NVVM bitcode is available. Perhaps the loss of essential optimization and transformation opportunities (potentially chip-specific) is the reason these are packaged as "bitcode libraries" instead of standard libraries and are meant to be linked before translation to PTX? Having the linking support in LLVM would mean that MLIR's `gpu-to-cubin` pass could include the necessary LLVM passes on its way down to PTX. (However, this path never sees the `llvm.sin`/`llvm.cos`/`llvm.exp`, etc. intrinsics; instead, math operations in MLIR like `math.exp` are currently converted directly to `__nv_expf` calls.)

-Uday
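
For reference, the driver-side link step that `gpu-to-cubin` performs (see the `SerializeToCubin.cpp` link above) boils down to roughly the sequence below. This is only a minimal sketch, not the pass itself: `linkPtxToCubin` and `check` are illustrative names, and context setup and error handling are simplified.

```cpp
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

// Illustrative error check, not the actual pass code.
static void check(CUresult result, const char *what) {
  if (result != CUDA_SUCCESS) {
    const char *msg = "unknown error";
    cuGetErrorString(result, &msg);
    std::fprintf(stderr, "%s failed: %s\n", what, msg);
    std::exit(1);
  }
}

// JIT-link a PTX module into a cubin with the CUDA driver API.
std::vector<char> linkPtxToCubin(const char *ptx) {
  check(cuInit(0), "cuInit");
  CUdevice device;
  check(cuDeviceGet(&device, 0), "cuDeviceGet");
  CUcontext context;
  check(cuCtxCreate(&context, 0, device), "cuCtxCreate");

  CUlinkState linkState;
  check(cuLinkCreate(0, nullptr, nullptr, &linkState), "cuLinkCreate");
  // The input added here is the generated PTX. libdevice ships as NVVM
  // bitcode, and (as discussed in this thread) this link stage does not
  // resolve leftover __nv_* references from it.
  check(cuLinkAddData(linkState, CU_JIT_INPUT_PTX, const_cast<char *>(ptx),
                      std::strlen(ptx) + 1, "kernel.ptx", 0, nullptr, nullptr),
        "cuLinkAddData");

  void *cubin = nullptr;
  size_t cubinSize = 0;
  // This is where the link errors surface if the PTX still calls __nv_expf
  // and friends.
  check(cuLinkComplete(linkState, &cubin, &cubinSize), "cuLinkComplete");

  // The cubin buffer is owned by the link state, so copy it out before
  // destroying the state.
  std::vector<char> result(static_cast<char *>(cubin),
                           static_cast<char *>(cubin) + cubinSize);
  check(cuLinkDestroy(linkState), "cuLinkDestroy");
  check(cuCtxDestroy(context), "cuCtxDestroy");
  return result;
}
```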
There has been some activity downstream but nothing upstream. Here is the current status and plan (as I see it):
- We have the necessary code to build a `libm.a` for our GPU targets (AMD and NVIDIA). This reuses the existing `[c]math[.h]` code in `clang/lib/Headers/`, which translates `sin` to `__XY_sin`, depending on the architecture (thanks @jhuber6!). A sketch of this wrapper pattern is shown after this list.
- We are about to clean up our code in order to upstream it into `clang/lib/GPURuntimes` (or similar). However, the deployment might change as the RFC towards more runtime support (`libc[std][++]`) on the GPU advances. Unfortunately, the RFC hasn't been written yet.
- We disable the header translation in clang for GPU targets and instead link in our `libm.a` for the GPU as well. This improves performance iff you use device-side LTO (`-foffload-lto`). We could keep the current code path if device-side LTO is disabled. This is not much hassle, as we need the headers anyway to build the `libm.a` in the first place.
- Once we have landed the `libm.a` stuff, we can go ahead with the `implemented.by` idea (see thread) to translate `llvm.sin` to `sin`, or we reuse existing codegen logic to do that. Either should work. After this is done, we can enable `-fno-math-errno` for device compilation by default; we don't support `errno` anyway right now.
- MLIR should not enter LLVM IR with `__nv_<math>` calls, IMHO. (I mean, isn't that counter to the entire "high-level" idea?) That said, the problem you are having seems to be the missing inclusion of `libdevice.bc` at the LLVM-IR level; see the second sketch after this list. This is a "driver" issue as far as I can tell and not related to proper handling of math functions. Maybe I misunderstand the description; feel free to elaborate.
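
Regarding the first item, the wrapper pattern in the `clang/lib/Headers/` math code looks roughly like the following for the NVIDIA case. This is a simplified sketch of the idea rather than the actual header contents; the real headers use `__DEVICE__` macros and also cover the OpenMP and HIP paths.

```cpp
// Declarations of the vendor bitcode-library entry points; the definitions
// live in libdevice.bc on NVIDIA (the AMD case uses the OCML libraries and
// __ocml_* names instead).
extern "C" {
__device__ double __nv_sin(double);
__device__ float __nv_sinf(float);
}

// The headers map the standard names onto the vendor names, so device code
// that calls sin()/sinf() ends up calling __nv_sin/__nv_sinf. The libm.a
// mentioned above is built by reusing this header code, so the standard
// names are available on the device.
inline __device__ double sin(double __x) { return __nv_sin(__x); }
inline __device__ float sinf(float __x) { return __nv_sinf(__x); }
```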
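
And regarding the last item: at the LLVM-IR level, pulling `libdevice.bc` into the module before PTX codegen is essentially a bitcode link. Below is a minimal sketch using the `llvm::Linker` and `parseIRFile` APIs; `linkLibdevice` and `libdevicePath` are illustrative names, not MLIR pass code, and a real pipeline would typically also internalize the linked-in definitions and run NVVMReflect afterwards.

```cpp
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Linker/Linker.h"
#include "llvm/Support/SourceMgr.h"
#include <memory>

using namespace llvm;

// `libdevicePath` would point at the toolkit's libdevice bitcode, e.g.
// <cuda-root>/nvvm/libdevice/libdevice.10.bc.
bool linkLibdevice(Module &module, StringRef libdevicePath) {
  SMDiagnostic err;
  std::unique_ptr<Module> libdevice =
      parseIRFile(libdevicePath, err, module.getContext());
  if (!libdevice)
    return false;
  // Only pull in the definitions the module actually references, so the
  // final PTX is not bloated with the whole library. linkModules returns
  // true on error.
  return !Linker::linkModules(module, std::move(libdevice),
                              Linker::Flags::LinkOnlyNeeded);
}
```

With the `__nv_*` calls resolved at this stage, the PTX handed to the driver-API link step sketched earlier no longer contains unresolved references, and `cuLinkComplete` can succeed.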