NVPTX codegen for llvm.sin (and friends)

Date: Wed, 28 Apr 2021 18:56:32 -0400
From: William Moses via llvm-dev <llvm-dev@lists.llvm.org>
To: Artem Belevich <tra@google.com>

Hi all,

Reviving this thread as Johannes and I recently had some time to take a
look and do some additional design work. We’d love any thoughts on the
following proposal.

Keenly interested in this. Simplification (subjective) of the metadata proposal at the end. Some extra background info first though, as GPU libm is a really interesting design space. When I did the bring-up for a different architecture ~3 years ago, IIRC I found the complete set:

  • clang lowering libm (named functions) to intrinsics
  • clang lowering intrinsic to libm functions
  • optimisation passes that transform libm and ignore intrinsics
  • optimisation passes that transform intrinsics and ignore libm
  • selectiondag represents some intrinsics as nodes
  • strength reduction, e.g. cos(double) → cosf(float) under fast-math

I then wrote some more IR passes related to opencl-style vectorisation and some combines to fill in the gaps (which have not reached upstream). So my knowledge here is out of date but clang/llvm wasn’t a totally consistent lowering framework back then.

CUDA ships an IR library containing functions similar to libm. ROCm does something similar, also as IR. We do an impedance-matching scheme in inline headers, which blocks various optimisations and poses some challenges for Fortran.

Background:
While in theory we could define the lowering of these intrinsics to be a
table which looks up the correct __nv_sqrt, this would require the
definitions of all such functions to remain or otherwise be available. As
it's undesirable for the LLVM backend to be aware of CUDA paths, etc., this
means that the original definitions brought in by merging libdevice.bc must
be maintained. Currently they are deleted if they are unused (libdevice
marks them as internal).

The deleting is its own hazard in the context of fast-math: the function can be deleted, and then a later optimisation creates a reference to it, which doesn't link. It also prevents the backend from (safely) assuming the functions are available, which is moderately annoying for lowering some SDag ISD nodes.

  1. GPU math functions aren't able to be optimized, unlike standard math functions.

This one is bad.

Design Constraints:

To remedy the problems described above we need a design that meets the
following:

  • Does not require modifying libdevice.bc or other code shipped by a
    vendor-specific installation
  • Allows llvm math intrinsics to be lowered to device-specific code
  • Keeps definitions of code used to implement intrinsics until after all
    potential relevant intrinsics (including those created by LLVM passes) have
    been lowered.

Yep, constraints sound right. Back ends can emit calls to these functions too, but I think nvptx/amdgcn do not. Perhaps they would like to be able to in places.

Initial Design:

… metadata / aliases …

The design would work and lets us continue with the header files we have now. It avoids some tedious programming, i.e. the usual back-end lowering where intrinsics / ISD nodes are emitted as named function calls. That can be mostly driven by a table lookup, as the function arity is limited, but it is (or at least was) quite tedious to program in ISel. Doing basically the same thing for SDag + GISel / ptx + gcn, with associated tests, is also unappealing.

The set of functions near libm is small and known. We would need to mark ‘sin’ as ‘implemented by’ slightly different functions for nvptx and amdgcn, and some of them need thin wrapper code (e.g. modf in amdgcn takes an argument by pointer). It would be helpful for the fortran runtime libraries effort if the implementation didn’t use inline code in headers.

There’s very close to a 1:1 mapping between the two gpu libraries, even some extensions to libm exist in both. Therefore we could write a table,
{llvm.sin.f64, “sin”, __nv_sin, __ocml_sin},
with NULL or similar for functions that aren’t available.

A function-level IR pass, run late in the pipeline, crawls the call instructions and rewrites them based on simple rules and that table; that is, it would rewrite a call to llvm.sin.f64 into a call to __ocml_sin. Exactly the same net effect as a header file containing metadata annotations, except we don't need the metadata machinery and we can use a single trivial IR pass for N architectures (by adding a column). The pass can do the odd ugly thing like impedance-matching a function type easily enough.
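To make that concrete, here is a minimal sketch of what the pass body could look like. It is entirely illustrative: the LibmEntry / Table / expandLibmIntrinsics names are made up, and a real table would also carry the f32 rows (sinf / __nv_sinf / __ocml_sinf) and the rest of libm.

```cpp
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"

using namespace llvm;

namespace {

// One row per (intrinsic, type) pair; this sketch keys on the f64 flavours
// only, so the f32 variants would get their own rows.
struct LibmEntry {
  Intrinsic::ID ID;     // e.g. Intrinsic::sin for llvm.sin.f64
  const char *LibmName; // "sin" (the generic libm column)
  const char *NVName;   // "__nv_sin", or nullptr if unavailable
  const char *OCMLName; // "__ocml_sin", or nullptr if unavailable
};

const LibmEntry Table[] = {
    {Intrinsic::sin, "sin", "__nv_sin", "__ocml_sin"},
    {Intrinsic::cos, "cos", "__nv_cos", "__ocml_cos"},
    // ... remaining rows ...
};

// Rewrite calls to the f64 libm intrinsics into calls to the vendor
// function, reusing the intrinsic's own function type (the arities match).
bool expandLibmIntrinsics(Function &F, bool IsNVPTX) {
  bool Changed = false;
  Module *M = F.getParent();
  for (Instruction &I : make_early_inc_range(instructions(F))) {
    auto *CI = dyn_cast<CallInst>(&I);
    if (!CI || !CI->getType()->isDoubleTy())
      continue;
    Function *Callee = CI->getCalledFunction();
    if (!Callee)
      continue;
    for (const LibmEntry &E : Table) {
      if (Callee->getIntrinsicID() != E.ID)
        continue;
      // A NULL column means the function isn't available on this target.
      if (const char *Name = IsNVPTX ? E.NVName : E.OCMLName) {
        FunctionCallee Repl =
            M->getOrInsertFunction(Name, Callee->getFunctionType());
        CI->setCalledFunction(Repl);
        Changed = true;
      }
      break;
    }
  }
  return Changed;
}

} // namespace
```

Run per-function just before instruction selection; adding a target is then literally adding a column.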

The other side of the problem - that functions once introduced have to hang around until we are sure they aren’t needed - is the same as in your proposal. My preference would be to introduce the libdevice functions immediately after the lowering pass above, but we can inject it early and tag them to avoid erasure instead. Kind of need that to handle the cos->cosf transform anyway.

Quite similar to the ‘in theory … table’ suggestion, which I like because I remember it being far simpler than the sdag rewrite rules.

Thanks!

Jon

+bump

Jon did respond positively to the proposal. I think the table implementation
vs the "implemented_by" implementation is something we can experiment with.
I'm in favor of the latter as it is more general and can be used in other
places more easily, e.g., by providing source annotations. That said, having
the table version first would be a big step forward too.

I'd say, if we hear some other positive voices towards this we go ahead with
patches on phab. After an end-to-end series is approved we merge it together.

That said, people should chime in if they (dis)like the approach to get math
optimizations (and similar things) working on the GPU.

~ Johannes


I do like this approach for CUDA and NVPTX. I think HIP/AMDGPU may benefit from it, too (+cc: yaxun.liu@).

This will likely also be useful for things other than math functions.
E.g. it may come in handy for sanitizer runtimes (+cc: eugenis@) that currently rely on LLVM not materializing libcalls they can't provide when building the runtime itself.

–Artem

bump.


I think we’ve got as much interest expressed (or not) as we can reasonably expect for something that most back-ends do not care about.

I vote for moving forward with the patches.

–Artem

Thanks for the ping.

The IR pass that rewrote llvm.libm intrinsics to architecture-specific ones, which I wrote years ago, was pretty trivial. I'm up for re-implementing it.

Essentially, type out a (hash) table with entries like {llvm.sin.f64, "sin", __nv_sin, __ocml_sin} and do the substitution in a pass called 'ExpandLibmIntrinsics' or similar, run somewhere before instruction selection for nvptx / amdgpu / other.

We could factor it differently if we don't like having the nv/oc names next to each other; the pass could take the corresponding lookup table as an argument.

The main benefit over the implemented-in-terms-of metadata approach is that it's trivial to implement and dead simple. Lowering in IR means doing it once instead of once in SDag and once in GISel. I'll write the pass (from scratch, annoyingly, as the last version I wrote is still closed source) if people seem in favour.
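If it helps, a hypothetical new-PM wrapper around the expandLibmIntrinsics sketch from earlier in the thread could look like the following, with the per-target table choice reduced to a flag for brevity; the real thing would take the lookup table itself as the argument, per the factoring above.

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

// From the sketch upthread (hypothetical helper).
bool expandLibmIntrinsics(Function &F, bool IsNVPTX);

// A pass each target registers late in its IR pipeline, before instruction
// selection, parameterised by which column of the table to use.
class ExpandLibmIntrinsicsPass
    : public PassInfoMixin<ExpandLibmIntrinsicsPass> {
  bool IsNVPTX;

public:
  explicit ExpandLibmIntrinsicsPass(bool IsNVPTX) : IsNVPTX(IsNVPTX) {}

  PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
    return expandLibmIntrinsics(F, IsNVPTX) ? PreservedAnalyses::none()
                                            : PreservedAnalyses::all();
  }
};
```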

Thanks all,

Jon


SGTM.
Providing a fixed set of replacements for specific intrinsics is all NVPTX needs now.
Expanding intrinsics late may miss some optimization opportunities,
so we may consider doing it earlier and/or more than once, in case we happen to materialize new intrinsics in the later passes.

–Artem


Good old phase ordering. I don’t think we’ve got any optimisations that target the nv/oc named functions and would personally prefer to never implement any.

We do have ones that target llvm.libm, and some that target extern C functions with the same names as libm. There's some code in clang that converts some libm functions into llvm intrinsics, and I think some other code in clang that converts in the other direction, possibly dependent on various math flags.

So it seems we either canonicalise libm-like code and rearrange optimisations to work on the canonical form, or we write optimisations that know there are N names for essentially the same function. I'd prefer the canonical-form approach: e.g. we could rewrite calls to __nv_sin into calls to sin early in the pipeline (or ignore them? it seems likely applications call libm functions directly), and rewrite calls to sin into __nv_sin late, with optimisations written against sin.
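A hedged sketch of the early half of that (illustrative only; the mapping would be derived from the same table, and a real pass would RAUW into an existing libm declaration rather than rely on setName, which uniquifies on a name clash):

```cpp
#include "llvm/ADT/StringMap.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Early canonicalisation: rename vendor entry points to the libm names that
// the mid-level optimisations understand; the late lowering performs the
// inverse mapping.
static bool canonicaliseVendorMath(Module &M) {
  static const StringMap<StringRef> ToLibm = {
      {"__nv_sin", "sin"}, {"__nv_cos", "cos"}, {"__nv_expf", "expf"}};
  bool Changed = false;
  for (Function &F : M.functions()) {
    auto It = ToLibm.find(F.getName());
    if (It == ToLibm.end())
      continue;
    F.setName(It->second); // see caveat above about pre-existing decls
    Changed = true;
  }
  return Changed;
}
```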

Thanks!

+1, but we may want to put it behind a clang option initially, in case it causes perf degradation.

Sam

I would like to note that there's prior (and generic!) art in this
area - ReplaceWithVeclib (D95373: "Replace vector intrinsics with call to vector library").
Presumably the NVPTX backend only needs to declare
the wanted replacements, and they should already happen.

Roman


I should've phrased it better. What I meant is that because the __nv_* functions are provided as IR, replacing intrinsics with calls to __nv_* functions may provide further IR optimization opportunities – inlining, CSE, DCE, etc. I didn't mean optimizations based on the known semantics of the functions; I agree that those should be done for canonical calls only.

–Artem

Gentle ping on this one to see if there has been any more activity. I ran into a related issue, but on a different path: translating from MLIR to LLVM IR, where such math operations will already have been converted to __nv_<math_func> calls (before translation to LLVM IR). This leads to errors during the link step of the CUDA driver API, which is what MLIR's gpu-to-cubin pass uses (SerializeToCubin.cpp in llvm/llvm-project at befa8cf087dbb8159a4d9dc8fa4d6748d6d5049a). I posted this link issue on the NVIDIA developer forums ("CUDA driver API cuLinkComplete can't find libdevice (nvvm intrinsics bitcode)"), thinking this could/should happen after PTX generation, but the discussion here is quite advanced on the approach and solutions.

On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich t...@google.com wrote:

It all boils down to the fact that PTX does not have the standard libc/libm which LLVM could lower the calls to, nor does it have a ‘linking’ phase where we could link such library in, if we had it.

Is the comment on the 'linking' phase still true? While CUDA/PTX does not provide a standard library, there is a linking phase post-PTX, which the gpu-to-cubin pass uses (the cuLink* APIs in the CUDA driver API documentation). I'm wondering why cuLinkComplete can't complete the link to these functions. This won't, however, have the early IR optimization advantages mentioned upthread, given that "higher-level" NVVM bitcode is available. Perhaps the loss of essential optimization and transformation opportunities (potentially chip-specific) is the reason these are packaged as "bitcode libraries" instead of standard libraries and are meant to be linked before translation to PTX? Having the linking support in LLVM would mean that MLIR's gpu-to-cubin pass could include the necessary LLVM passes on its way down to PTX. (However, this path never sees the llvm.sin/cos/exp, etc. intrinsics; instead, math operations in MLIR like math.exp are currently converted directly to __nv_expf calls.)
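For reference, the driver-API link step in question is roughly the sketch below (error handling mostly elided; assumes cuInit and a current context). As far as I can tell it only accepts PTX/cubin/fatbin inputs, which would explain why it can't resolve __nv_* references: libdevice is NVVM bitcode, not PTX.

```cpp
#include <cuda.h>
#include <cstddef>

// Post-PTX link step as used by gpu-to-cubin (CUDA driver API). There is no
// input type for NVVM bitcode, so unresolved __nv_* symbols fail here.
bool linkPtxToCubin(const char *Ptx, size_t PtxLen, void **Cubin,
                    size_t *CubinSize) {
  CUlinkState State;
  if (cuLinkCreate(/*numOptions=*/0, nullptr, nullptr, &State) != CUDA_SUCCESS)
    return false;
  if (cuLinkAddData(State, CU_JIT_INPUT_PTX, const_cast<char *>(Ptx), PtxLen,
                    "kernel.ptx", 0, nullptr, nullptr) != CUDA_SUCCESS) {
    cuLinkDestroy(State);
    return false;
  }
  // Resolves symbols across the inputs added so far and emits a cubin that
  // is owned by State (copy it out before cuLinkDestroy).
  return cuLinkComplete(State, Cubin, CubinSize) == CUDA_SUCCESS;
}
```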

-Uday

There has been some activity downstream but nothing upstream. Here is the current status and plan (as I see it):

  • We have the necessary code to build a libm.a for our GPU targets (AMD and NVIDIA). This reuses the existing [c]math[.h] code in clang/lib/Headers/ which translates sin to __XY_sin, depending on the architecture. (thanks @jhuber6!)
  • We are about to clean up our code in order to upstream it into clang/lib/GPURuntimes (or similar). However, the deployment might change as the RFC towards more runtime support (libc[std][++]) on the GPU advances. Unfortunately, the RFC hasn’t been written yet.
  • We disable the header translation in clang for GPU targets and instead link in our libm.a for the GPU as well. This improves performance iff you use device-side LTO (-foffload-lto). We could keep the current code path if device-side LTO is disabled; this is not much hassle, as we need the headers anyway to build libm.a in the first place.
  • Once we have landed the libm.a stuff, we can go ahead with the implemented_by idea (see thread) to translate llvm.sin to sin, or we reuse existing codegen logic to do that. Either should work. After this is done, we can enable -fno-math-errno for device compilation by default; we don't support errno anyway right now.
  • MLIR should not enter LLVM IR with __nv_<math> calls, IMHO. (I mean, isn't that counter to the entire "high-level" idea?) That said, the problem you are having seems to be the missing inclusion of libdevice.bc at the LLVM-IR level. This is a "driver" issue as far as I can tell, and not related to proper handling of math functions. Maybe I misunderstand the description; feel free to elaborate.

I didn't think it was entirely against the "high-level" idea; you still get the benefit of inlining and optimizing those bitcode library functions. I understood to some extent the benefits of actually having these appear as llvm.sin/cos/exp etc. when entering LLVM from MLIR. However, arguably, the goals behind having these llvm math intrinsics around could be accomplished with the math dialect operations in MLIR, where these appear as math.sin, math.cos, etc. – so you do have high-level semantic information. I'm not fully sure, though, of the rationale for lowering these out early to __nv_* calls in MLIR land. But from your original post, llvm.sin/cos themselves still aren't handled by NVPTX – so we'd be farther from something functional for MLIR.

That's one way to view it. But if we wanted a reusable strategy (instead of every driver having to link in the libdevice bitcode file), could the NVPTX backend link it in? I thought the driver-vs-LLVM argument was settled in favour of the latter on this thread. And in order to do that, would it be reasonable to have a CMAKE_NVVM_LIBDEVICE_BITCODE to provide LLVM/NVPTX with the location of the library?

It’s pretty straightforward for me to otherwise link in the bitcode in the MLIR gpu-to-cubin pass here (which would presumably be the driver you are referring to) while also running the right LLVM passes over there before translating to PTX.

I don’t think anything has changed and there’s still no way to ‘link’ PTX during clang compilation.

cuLinkComplete is the API provided by the GPU driver. It’s not available during compilation.

The choices are to either compile to a GPU object with -fgpu-rdc and then link GPU object files using nvlink or compile to IR, link IR from all TUs, and then generate full-program PTX (in other words - use LTO).

As for the progress, we are moving in the right direction. @jhuber6’s clang driver and related tools changes are a huge step towards making GPU-side compilation “just work”, or, as close as we can get to that state.

We don't want that:
There is little reason to fix one libdevice version per installation of LLVM, and if we don't, you'd still need the driver code and everything else regardless of the "default" libdevice the backend might link in. Further, the NVPTX backend runs too late to perform optimizations.

The right way to do this is the new driver work by @jhuber6. In essence, we link in the libdevice code (and similar things) as part of the linker step. If you enable (offload) LTO, you get the optimizations you are looking for while we avoid duplicating that library for each TU.


I've done some trials with linking libdevice late. There are a few annoyances with it that I may as well mention, but none of them are major blockers.

  • Since we can't embed libdevice.bc, we'll need to link it in manually rather than extract it from the input as we do with -lomptarget.devicertl. If we could redistribute libdevice.bc we could just wrap around it, but I'm assuming that's against the license.
  • libdevice doesn't use hidden visibility (CUDA seems to ignore visibility entirely), so LTO will by default keep every single definition alive. We'll need to manually set the visibility when we load libdevice for LTO.
  • LTO compilation got a lot slower when linking libdevice late. I'm not exactly sure about the LTO pipeline, but it seemed like a lot of functions were being optimized only to be pruned later.
  • There can be performance regressions from linking in un-optimized bitcode during the LTO pass pipeline, as it will have fewer optimizations applied to it than if it had been linked in from the start.
  • We'll need a shim library to remap sin calls to __nv_sin, for example (sketched below). This is easy with CUDA, but there isn't currently a good way to do this with OpenMP variants that doesn't involve linking, per TU in clang, a bitcode library that just remaps sin to omp_sin and then omp_sin to sin; this is really ugly and wastes compilation time, so I'd like to avoid it.
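For CUDA the shim amounts to a handful of forwarding definitions compiled once to bitcode, something like the sketch below (one definition per table row; names per libdevice):

```cpp
// Hypothetical shim translation unit, compiled once to bitcode and linked
// at (LTO) link time so that libm names resolve to libdevice entry points.
extern "C" double __nv_sin(double);
extern "C" float __nv_sinf(float);

extern "C" double sin(double X) { return __nv_sin(X); }
extern "C" float sinf(float X) { return __nv_sinf(X); }
```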

I didn't understand this part – is this post-PTX? The scenario I was referring to is the MLIR JIT (not AOT). There isn't a linker step post-PTX, and you need to know where libdevice is when linking prior to that, AFAIU.

Looks like I described it wrong. This is how one can link it in manually, and it's not too late to perform optimizations:
https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc;l=283;drc=3381da37560d64c7cb62b53879a0a931ff9036c4
I have a workaround that does exactly this in the MLIR gpu-to-cubin pass and then runs LLVM passes at a desired opt level post that.
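The workaround is essentially the following sketch (the libdevice path has to come from the user in the JIT case; the helper name is mine):

```cpp
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Linker/Linker.h"
#include "llvm/Support/SourceMgr.h"

#include <memory>

using namespace llvm;

// Link libdevice into the module before running the optimisation pipeline,
// roughly what the XLA code linked above does.
static bool linkLibdevice(Module &M, LLVMContext &Ctx, StringRef Path) {
  SMDiagnostic Err;
  std::unique_ptr<Module> Libdevice = parseIRFile(Path, Err, Ctx);
  if (!Libdevice)
    return false;
  // LinkOnlyNeeded avoids pulling in the hundreds of definitions the module
  // never references; linkModules returns true on error.
  return !Linker::linkModules(M, std::move(Libdevice),
                              Linker::Flags::LinkOnlyNeeded);
}
```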

Maybe we're talking about different things here:

I'm thinking of the LLVM-IR → PTX conversion, which is after the LLVM-IR passes and thereby too late for optimizations. If it's in MLIR land, it's fine.

No. We are working towards always embedding IR rather than PTX and then finalizing the device code at link time (or at runtime with a JIT). The expected usage is LTO, but even without LTO we could do this (and I would want to).

I think I might be missing the point. What is it you want to add/change and where?

Thanks. It's clearer to me now what you meant.

Something that could just make LLVM IR with __nv_cos/sin/expf/... calls compile and link (with the libdevice bitcode), i.e., provide the necessary passes/infrastructure to a driver generating such LLVM IR. Perhaps it's what you mention right above; the details aren't fully clear to me but are covered by the work in progress referred to. In the case of a JIT, I assume the external user would have to provide the path of the bitcode file to be linked in. It's clear to me that you would embed IR (from the bitcode file), but "finalizing the device code at link time" is unclear. If you are embedding LLVM IR, you might as well inline/optimize – then what's finalization?

Finalizing = lowering the IR to whatever we load into the driver, e.g., cubin.

This is a driver task, not a pass task.

For that to happen, something would need to provide the implementation of those functions. libdevice provides one as IR, so linking with it at the IR level is one option. It has the issues mentioned above – we're not allowed to distribute libdevice, and even if we were, we would not know which one we'll need, as libdevice is technically supplied by, and tied to, a particular CUDA version. So, for CUDA/OpenMP compilation, it's the driver's job to figure out where to find libdevice.bc and link it in. LLVM is just not suited for this.

One way out of this jam would be to provide our own implementation, which would be a superset of the libdevice implementations in all supported CUDA versions. If it could be embedded in LLVM, it may potentially make it possible to link in libdevice functions in an NVPTX-specific pass.

Ideally we should not need libdevice at all and would have a regular library to link with. Unfortunately, with PTX being the end of the LLVM-owned compilation road, we do not even have a concept of "linking" and need to use NVIDIA's tools for that. Again, we cannot easily use them from LLVM itself.

Within these constraints, what XLA and the clang --offload-new-driver are doing is pretty much the extent of what we can practically do when we JIT or compile offline.

On a somewhat related note, there's also the llvm-dev RFC "The implements attribute, or how to swap functions statically but late", which could help to get rid of libdevice altogether and replace it with a GPU-specific implementation of the standard math functions. It would still need to be linked in at the IR level, but we would at least have control over it.
