NVPTX codegen for llvm.sin (and friends)

Artem, Justin,

I am running into a problem and I'm curious if I'm missing something or if the support is simply missing.
Am I correct to assume the NVPTX backend does not deal with `llvm.sin` and friends?

This is what I see, with some variations: https://godbolt.org/z/PxsEWs

If this is missing in the backend, is there a plan to get this working? I'd really like to have the
intrinsics in the middle end rather than __nv_cos, not to mention that -ffast-math does emit intrinsics
and then crashes.

~ Johannes

Artem, Justin,

I am running into a problem and I’m curious if I’m missing something or
if the support is simply missing.
Am I correct to assume the NVPTX backend does not deal with llvm.sin
and friends?

Correct. It can’t deal with anything that may need to lower to a standard library call.

This is what I see, with some variations: https://godbolt.org/z/PxsEWs

If this is missing in the backend, is there a plan to get this working? I'd really like to have the
intrinsics in the middle end rather than __nv_cos, not to mention that
-ffast-math does emit intrinsics and then crashes.

It all boils down to the fact that PTX does not have the standard libc/libm which LLVM could lower the calls to, nor does it have a ‘linking’ phase where we could link such a library in, if we had it.

Libdevice bitcode does provide the implementations for some of the functions (though with a _nv prefix) and clang links it in in order to avoid generating IR that LLVM can’t handle, but that’s a workaround that does not help LLVM itself.

–Artem

Since clang (and arguably any other frontend that targets NVPTX) should link in libdevice, could we lower these intrinsics to the libdevice code?

For example, consider compiling the simple device function below:

// /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S -emit-llvm --cuda-path=/usr/local/cuda-11.0 -L/usr/local/cuda-11.0/lib64 --cuda-gpu-arch=sm_37
__device__ double f(double x) {
return cos(x);
}

The LLVM module for it is as follows:

...
define dso_local double @_Z1fd(double %x) #0 {
entry:
%__a.addr.i = alloca double, align 8
%x.addr = alloca double, align 8
store double %x, double* %x.addr, align 8
%0 = load double, double* %x.addr, align 8
store double %0, double* %__a.addr.i, align 8
%1 = load double, double* %__a.addr.i, align 8
%call.i = call contract double @__nv_cos(double %1) #7
ret double %call.i
}

define internal double @__nv_cos(double %a) #1 {
%q.i = alloca i32, align 4

Obviously we would need to do something to ensure these functions don’t get deleted prior to their use in lowering from intrinsic to libdevice.

Right. We could keep the definition of __nv_cos and friends
around. Right now, -ffast-math might just crash on the user,
which is arguably a bad thing. I can also see us benefiting
in various other ways from llvm.cos uses instead of __nv_cos
(assuming precision is according to the user requirements but
that is always a condition).

It could be as simple as introducing __nv_cos into
"llvm.used" and a backend matching/rewrite pass.

If the backend knew the libdevice location it could even pick
the definitions from there. Maybe we could link libdevice late
instead of eager?

Trying to figure out a good way to have the cake and eat it too.

~ Johannes

Right. We could keep the definition of __nv_cos and friends
around. Right now, -ffast-math might just crash on the user,
which is arguably a bad thing. I can also see us benefiting
in various other ways from llvm.cos uses instead of __nv_cos
(assuming precision is according to the user requirements but
that is always a condition).

It could be as simple as introducing __nv_cos into
“llvm.used” and a backend matching/rewrite pass.

If the backend knew the libdevice location it could even pick
the definitions from there. Maybe we could link libdevice late
instead of eager?

It’s possible, but it would require plumbing in CUDA SDK awareness into LLVM. While clang driver can deal with that, LLVM currently can’t. The bitcode library path would have to be provided by the user.

The standard library as bitcode raises some questions.

  • When do we want to do the linking? If we do it at the beginning, then the question is how to make sure unused functions are not eliminated before we may need them, as we don't know a priori what's going to be needed. We also do want the unused functions to be gone after we're done. Linking it in early would allow optimizing the code better at the expense of having to optimize a lot of code we'll throw away. Linking it in late has less overhead, but leaves the linked-in bitcode unoptimized, though it's probably in the ballpark of what would happen with a real library call. I.e. no inlining, etc.

  • It incorporates linking into LLVM, which is not LLVM's job. Arguably, the line should be drawn at the lowering to libcalls as it's done for other back-ends. However, we're also constrained by the need to have the linking done before we generate PTX which prevents doing it after LLVM is done generating an object file.

One thing that may work within the existing compilation model is to pre-compile the standard library into PTX and then textually embed relevant functions into the generated PTX, thus pushing the ‘linking’ phase past the end of LLVM’s compilation and make it look closer to the standard compile/link process. This way we’d only enable libcall lowering in NVPTX, assuming that the library functions will be magically available out there. Injection of PTX could be done with an external script outside of LLVM and it could be incorporated into clang driver. Bonus points for the fact that this scheme is compatible with -fgpu-rdc out of the box – assemble the PTX with ptxas -rdc and then actually link with the library, instead of injecting its PTX before invoking ptxas.

–Artem

Trying to figure out a good way to have the cake and eat it too.

~ Johannes

Since clang (and arguably any other frontend that targets NVPTX) should link in
libdevice, could we lower these intrinsics to the libdevice code?

The linking happens before LLVM gets to work on IR.
As I said, it’s a workaround, not the solution. It’s possible for LLVM to still attempt lowering something in the IR into a libcall and we would not be able to deal with that. It happens to work well enough in practice.

Do you have an example where you see the problem with -ffast-math?

Right. We could keep the definition of __nv_cos and friends
around. Right now, -ffast-math might just crash on the user,
which is arguably a bad thing. I can also see us benefiting
in various other ways from llvm.cos uses instead of __nv_cos
(assuming precision is according to the user requirements but
that is always a condition).

It could be as simple as introducing __nv_cos into
"llvm.used" and a backend matching/rewrite pass.

If the backend knew the libdevice location it could even pick
the definitions from there. Maybe we could link libdevice late
instead of eager?

It's possible, but it would require plumbing in CUDA SDK awareness into
LLVM. While clang driver can deal with that, LLVM currently can't. The
bitcode library path would have to be provided by the user.

The PTX backend could arguably be CUDA SDK aware, IMHO, it would
even be fine if the middle-end does the remapping to get inlining
and folding benefits also after __nv_cos is used. See below.

The standard library as bitcode raises some questions.

Which standard library? CUDA's libdevice is a bitcode library, right?

* When do we want to do the linking? If we do it at the beginning, then the
question is how to make sure unused functions are not eliminated before we
may need them, as we don't know a priori what's going to be needed. We also
do want the unused functions to be gone after we're done. Linking it in
early would allow optimizing the code better at the expense of having to
optimize a lot of code we'll throw away. Linking it in late has less
overhead, but leaves the linked-in bitcode unoptimized, though it's
probably in the ballpark of what would happen with a real library call.
I.e. no inlining, etc.

* It incorporates linking into LLVM, which is not LLVM's job. Arguably, the
line should be drawn at the lowering to libcalls as it's done for other
back-ends. However, we're also constrained by the need to have the
linking done before we generate PTX which prevents doing it after LLVM is
done generating an object file.

I'm confused. Clang links in libdevice.bc early. If we make sure
`__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos`
to `__nv_cos` which is available. After the lowering we can remove
the artificial uses of `__nv_XXX` functions that we used to keep the
definitions around in order to remove them from the final result.
We get the benefit of having `llvm.cos` for some of the pipeline,
we know it does not have all the bad effects while `__nv_cos` is defined
with inline assembly. We also get the benefit of inlining `__nv_cos`
and folding the implementation based on the arguments. Finally,
this should work with the existing pipeline, the linking is the same
as before, all we do is to keep the definitions alive longer and
lower `llvm.cos` to `__nv_cos` in a middle end pass.

This might be similar to the PTX solution you describe below but I feel
we get the inline benefit from this without actually changing the pipeline
at all.

~ Johannes

Right. We could keep the definition of __nv_cos and friends
around. Right now, -ffast-math might just crash on the user,
which is arguably a bad thing. I can also see us benefiting
in various other ways from llvm.cos uses instead of __nv_cos
(assuming precision is according to the user requirements but
that is always a condition).

It could be as simple as introducing __nv_cos into
“llvm.used” and a backend matching/rewrite pass.

If the backend knew the libdevice location it could even pick
the definitions from there. Maybe we could link libdevice late
instead of eager?

It’s possible, but it would require plumbing in CUDA SDK awareness into
LLVM. While clang driver can deal with that, LLVM currently can’t. The
bitcode library path would have to be provided by the user.

The PTX backend could arguably be CUDA SDK aware, IMHO, it would
even be fine if the middle-end does the remapping to get inlining
and folding benefits also after __nv_cos is used. See below.

The standard library as bitcode raises some questions.

Which standard library? CUDA's libdevice is a bitcode library, right?

It’s whatever LLVM will need to lower libcalls to. libdevice bitcode is the closest approximation of that we have at the moment.

  • When do we want to do the linking? If we do it at the beginning, then the
    question is how to make sure unused functions are not eliminated before we
    may need them, as we don't know a priori what's going to be needed. We also
    do want the unused functions to be gone after we’re done. Linking it in
    early would allow optimizing the code better at the expense of having to
    optimize a lot of code we’ll throw away. Linking it in late has less
    overhead, but leaves the linked-in bitcode unoptimized, though it's
    probably in the ballpark of what would happen with a real library call.
    I.e. no inlining, etc.

  • It incorporates linking into LLVM, which is not LLVM’s job. Arguably, the
    line should be drawn at the lowering to libcalls as it’s done for other
    back-ends. However, we're also constrained by the need to have the
    linking done before we generate PTX which prevents doing it after LLVM is
    done generating an object file.

I’m confused. Clang links in libdevice.bc early.

Yes. Because that’s where it has to happen if we want to keep LLVM unaware of CUDA SDK.

It does not have to be the case if/when LLVM can do the linking itself.

If we make sure
__nv_cos is not deleted early, we can at any point “lower” llvm.cos
to __nv_cos which is available. After the lowering we can remove
the artificial uses of __nv_XXX functions that we used to keep the
definitions around in order to remove them from the final result.

This is the ‘link early’ approach, I should’ve been explicit that it’s ‘link early everything’ as opposed to linking only what’s needed at the beginning.
It would work at the expense of having to process/optimize 500KB worth of bitcode for every compilation, whether it needs it or not.

We get the benefit of having llvm.cos for some of the pipeline,
we know it does not have all the bad effects while __nv_cos is defined
with inline assembly. We also get the benefit of inlining __nv_cos
and folding the implementation based on the arguments. Finally,
this should work with the existing pipeline, the linking is the same
as before, all we do is to keep the definitions alive longer and
lower llvm.cos to __nv_cos in a middle end pass.

Again, I agree that it is doable.

This might be similar to the PTX solution you describe below but I feel
we get the inline benefit from this without actually changing the pipeline
at all.

So, to summarize:

  • Link the library as bitcode early, add artificial placeholders for everything, compile, remove the placeholders and DCE unused stuff away.
    Pros:
      - we're already doing most of it before clang hands off IR to LLVM, so it just pushes it a bit lower in the compilation.
    Cons:
      - runtime cost of optimizing libdevice bitcode,
      - libdevice may be required for all NVPTX compilations?
  • Link the library as bitcode late.
    Pros:
      - lower runtime cost than the link-early approach.
    Cons:
      - We'll need to make sure that the NVVMReflect pass processes the library.
      - fewer optimizations on the library functions. Some of the code gets DCE'ed away after NVVMReflect and the rest could be optimized better.
      - libdevice may be required for all NVPTX compilations?
  • 'Link' with the library as PTX appended as text to LLVM's output and let ptxas do the 'linking'.
    Pros: LLVM remains agnostic of CUDA SDK installation details. All it does is allow lowering libcalls and leave their resolution to the external tools.
    Cons: Need to have the PTX library somewhere and need to integrate the 'linking' into the compilation process somehow.

Neither is particularly good. If the runtime overhead of link-early is acceptable, then it may be a winner here, by a very small margin.
link-as-PTX may be better conceptually as it keeps linking and compilation separate.

As for the practical steps, here’s what we need:

  • allow libcall lowering in NVPTX, possibly guarded by a flag. This is needed for all of the approaches above.
  • teach LLVM how to link in bitcode (and, possibly, control early/late mode)
  • teach clang driver to delegate libdevice linking to LLVM.

This will allow us to experiment with all three approaches and see what works best.

–Artem

Right. We could keep the definition of __nv_cos and friends
around. Right now, -ffast-math might just crash on the user,
which is arguably a bad thing. I can also see us benefiting
in various other ways from llvm.cos uses instead of __nv_cos
(assuming precision is according to the user requirements but
that is always a condition).

It could be as simple as introducing __nv_cos into
"llvm.used" and a backend matching/rewrite pass.

If the backend knew the libdevice location it could even pick
the definitions from there. Maybe we could link libdevice late
instead of eager?

It's possible, but it would require plumbing in CUDA SDK awareness into
LLVM. While clang driver can deal with that, LLVM currently can't. The
bitcode library path would have to be provided by the user.

The PTX backend could arguably be CUDA SDK aware, IMHO, it would
even be fine if the middle-end does the remapping to get inlining
and folding benefits also after __nv_cos is used. See below.

The standard library as bitcode raises some questions.

Which standard library? CUDA's libdevice is a bitcode library, right?

It's whatever LLVM will need to lower libcalls to. libdevice bitcode is the
closest approximation of that we have at the moment.

* When do we want to do the linking? If we do it at the beginning, then the
question is how to make sure unused functions are not eliminated before we
may need them, as we don't know a priori what's going to be needed. We also
do want the unused functions to be gone after we're done. Linking it in
early would allow optimizing the code better at the expense of having to
optimize a lot of code we'll throw away. Linking it in late has less
overhead, but leaves the linked-in bitcode unoptimized, though it's
probably in the ballpark of what would happen with a real library call.
I.e. no inlining, etc.

* It incorporates linking into LLVM, which is not LLVM's job. Arguably, the
line should be drawn at the lowering to libcalls as it's done for other
back-ends. However, we're also constrained by the need to have the
linking done before we generate PTX which prevents doing it after LLVM is
done generating an object file.

I'm confused. Clang links in libdevice.bc early.

Yes. Because that's where it has to happen if we want to keep LLVM unaware
of CUDA SDK.
It does not have to be the case if/when LLVM can do the linking itself.

If we make sure
`__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos`
to `__nv_cos` which is available. After the lowering we can remove
the artificial uses of `__nv_XXX` functions that we used to keep the
definitions around in order to remove them from the final result.

This is the 'link early' approach, I should've been explicit that it's
'link early *everything*' as opposed to linking only what's needed at the
beginning.
It would work at the expense of having to process/optimize 500KB worth of
bitcode for every compilation, whether it needs it or not.

We get the benefit of having `llvm.cos` for some of the pipeline,
we know it does not have all the bad effects while `__nv_cos` is defined
with inline assembly. We also get the benefit of inlining `__nv_cos`
and folding the implementation based on the arguments. Finally,
this should work with the existing pipeline, the linking is the same
as before, all we do is to keep the definitions alive longer and
lower `llvm.cos` to `__nv_cos` in a middle end pass.

Again, I agree that it is doable.

This might be similar to the PTX solution you describe below but I feel
we get the inline benefit from this without actually changing the pipeline
at all.

So, to summarize:
* link the library as bitcode early, add artificial placeholders for
everything, compile, remove placeholders and DCE unused stuff away.
   Pros:
      - we're already doing most of it before clang hands off IR to
LLVM, so it just pushes it a bit lower in the compilation.
   Cons:
      - runtime cost of optimizing libdevice bitcode,
      - libdevice may be required for all NVPTX compilations?

* link the library as bitcode late.
    Pros:
      - lower runtime cost than link-early approach.
    Cons:
      - We'll need to make sure that NVVMReflect pass processes the library.
      - fewer optimizations on the library functions. Some of the code gets
DCE'ed away after NVVMReflect and the rest could be optimized better.
      - libdevice may be required for all NVPTX compilations?
* 'link' with the library as PTX appended as text to LLVM's output and let
ptxas do the 'linking'
   Pros: LLVM remains agnostic of CUDA SDK installation details. All it
does is allow lowering libcalls and leave their resolution to the
external tools.
   Cons: Need to have the PTX library somewhere and need to integrate the
'linking' into the compilation process somehow.

Neither is particularly good. If the runtime overhead of link-early is
acceptable, then it may be a winner here, by a very small margin.
link-as-PTX may be better conceptually as it keeps linking and compilation
separate.

As for the practical steps, here's what we need:
- allow libcall lowering in NVPTX, possibly guarded by a flag. This is
needed for all of the approaches above.
- teach LLVM how to link in bitcode (and, possibly, control early/late mode)
- teach clang driver to delegate libdevice linking to LLVM.

This will allow us to experiment with all three approaches and see what
works best.

I think if we embed knowledge about the nv_XXX functions we can
even get away without the cons you listed for early linking above.

For early link I'm assuming an order similar to [0] but I also discuss
the case where we don't link libdevice early for a TU.

Link early:
1) clang emits module.bc and links in libdevice.bc but with the
`optnone`, `noinline`, and "used" attribute for functions in
libdevice. ("used" is not an attribute but could as well be.)
At this stage module.bc might call __nv_XXX or llvm.XXX freely
as defined by -ffast-math and friends. (A sketch of this marking
step follows right after this list.)
2) Run some optimizations in the middle end, maybe till the end of
the inliner loop, unsure.
3) Run a libcall lowering pass and another NVVMReflect pass (or the
only instance thereof). We effectively remove all llvm.XXX calls
in favor of __nv_XXX now. Note that we haven't spent (much) time
on the libdevice code as it is optnone and most passes are good
at skipping those. To me, it's unclear if the used parts should
not be optimized before we inline them anyway to avoid redoing
the optimizations over and over (per call site). That needs
measuring, I guess. Also note that we can still retain the current
behavior for direct calls to __nv_XXX if we mark the call sites
as `alwaysinline`, or at least the behavior is almost like the
current one.
4) Run an always-inliner pass on the __nv_XXX calls because it is
something we would do right now. Alternatively, remove `optnone`
and `noinline` from the __nv_XXX calls.
5) Continue with the pipeline as before.
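
A rough sketch of the marking in step 1 (the helper name is made up, and whether clang or an LLVM pass should do this is exactly the open question; it assumes libdevice.bc was just linked in):

// Sketch: right after libdevice.bc is linked in, mark its functions
// optnone/noinline so the middle end skips them, and put them on @llvm.used
// so they survive until the libcall lowering pass has run.
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/ModuleUtils.h"

using namespace llvm;

static void markLinkedLibdevice(Module &M) {
  SmallVector<GlobalValue *, 32> Keep;
  for (Function &F : M) {
    if (F.isDeclaration() || !F.getName().startswith("__nv_"))
      continue;
    F.addFnAttr(Attribute::NoInline);     // optnone requires noinline
    F.addFnAttr(Attribute::OptimizeNone);
    Keep.push_back(&F);
  }
  appendToUsed(M, Keep); // the "used" part of step 1
}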

As mentioned above, `optnone` avoids spending time on the libdevice
until we "activate" it. At that point (globals) DCE can be scheduled
to remove all unused parts right away. I don't think this is (much)
more expensive than linking libdevice early right now.

Link late, aka. translation units without libdevice:
1) clang emits module.bc but does not link in libdevice.bc, it will be
made available later. We still can mix __nv_XXX and llvm.XXX calls
freely as above.
2) Same as above.
3) Same as above.
4) Same as above but effectively a no-op, no __nv_XXX definitions are
available.
5) Same as above.

I might misunderstand something about the current pipeline but from [0]
and the experiments I run locally it looks like the above should cover all
the cases. WDYT?

~ Johannes

P.S. If the rewrite capability (aka libcall lowering) is generic we could
use the scheme for many other things as well.

[0] https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice

Right. We could keep the definition of __nv_cos and friends
around. Right now, -ffast-math might just crash on the user,
which is arguably a bad thing. I can also see us benefiting
in various other ways from llvm.cos uses instead of __nv_cos
(assuming precision is according to the user requirements but
that is always a condition).

It could be as simple as introducing __nv_cos into
“llvm.used” and a backend matching/rewrite pass.

If the backend knew the libdevice location it could even pick
the definitions from there. Maybe we could link libdevice late
instead of eager?

It’s possible, but it would require plumbing in CUDA SDK awareness into
LLVM. While clang driver can deal with that, LLVM currently can’t. The
bitcode library path would have to be provided by the user.
The PTX backend could arguably be CUDA SDK aware, IMHO, it would
even be fine if the middle-end does the remapping to get inlining
and folding benefits also after __nv_cos is used. See below.

The standard library as bitcode raises some questions.
Which standard library? CUDA's libdevice is a bitcode library, right?

It’s whatever LLVM will need to lower libcalls to. libdevice bitcode is the
closest approximation of that we have at the moment.

  • When do we want to do the linking? If we do it at the beginning, then the
    question is how to make sure unused functions are not eliminated before we
    may need them, as we don't know a priori what's going to be needed. We also
    do want the unused functions to be gone after we're done. Linking it in
    early would allow optimizing the code better at the expense of having to
    optimize a lot of code we'll throw away. Linking it in late has less
    overhead, but leaves the linked-in bitcode unoptimized, though it's
    probably in the ballpark of what would happen with a real library call.
    I.e. no inlining, etc.

  • It incorporates linking into LLVM, which is not LLVM's job. Arguably, the
    line should be drawn at the lowering to libcalls as it's done for other
    back-ends. However, we're also constrained by the need to have the
    linking done before we generate PTX which prevents doing it after LLVM is
    done generating an object file.

I'm confused. Clang links in libdevice.bc early.

Yes. Because that’s where it has to happen if we want to keep LLVM unaware
of CUDA SDK.
It does not have to be the case if/when LLVM can do the linking itself.

If we make sure
__nv_cos is not deleted early, we can at any point “lower” llvm.cos
to __nv_cos which is available. After the lowering we can remove
the artificial uses of __nv_XXX functions that we used to keep the
definitions around in order to remove them from the final result.

This is the ‘link early’ approach, I should’ve been explicit that it’s
‘link early everything’ as opposed to linking only what’s needed at the
beginning.
It would work at the expense of having to process/optimize 500KB worth of
bitcode for every compilation, whether it needs it or not.

We get the benefit of having llvm.cos for some of the pipeline,
we know it does not have all the bad effects while __nv_cos is defined
with inline assembly. We also get the benefit of inlining __nv_cos
and folding the implementation based on the arguments. Finally,
this should work with the existing pipeline, the linking is the same
as before, all we do is to keep the definitions alive longer and
lower llvm.cos to __nv_cos in a middle end pass.

Again, I agree that it is doable.

This might be similar to the PTX solution you describe below but I feel
we get the inline benefit from this without actually changing the pipeline
at all.

So, to summarize:

  • Link the library as bitcode early, add artificial placeholders for
    everything, compile, remove the placeholders and DCE unused stuff away.
    Pros:
      - we're already doing most of it before clang hands off IR to
        LLVM, so it just pushes it a bit lower in the compilation.
    Cons:
      - runtime cost of optimizing libdevice bitcode,
      - libdevice may be required for all NVPTX compilations?
  • Link the library as bitcode late.
    Pros:
      - lower runtime cost than the link-early approach.
    Cons:
      - We'll need to make sure that the NVVMReflect pass processes the library.
      - fewer optimizations on the library functions. Some of the code gets
        DCE'ed away after NVVMReflect and the rest could be optimized better.
      - libdevice may be required for all NVPTX compilations?
  • 'Link' with the library as PTX appended as text to LLVM's output and let
    ptxas do the 'linking'.
    Pros: LLVM remains agnostic of CUDA SDK installation details. All it
    does is allow lowering libcalls and leave their resolution to the
    external tools.
    Cons: Need to have the PTX library somewhere and need to integrate the
    'linking' into the compilation process somehow.

Neither is particularly good. If the runtime overhead of link-early is
acceptable, then it may be a winner here, by a very small margin.
link-as-PTX may be better conceptually as it keeps linking and compilation
separate.

As for the practical steps, here’s what we need:

  • allow libcall lowering in NVPTX, possibly guarded by a flag. This is
    needed for all of the approaches above.
  • teach LLVM how to link in bitcode (and, possibly, control early/late mode)
  • teach clang driver to delegate libdevice linking to LLVM.

This will allow us to experiment with all three approaches and see what
works best.

I think if we embed knowledge about the nv_XXX functions we can
even get away without the cons you listed for early linking above.

WDYM by embed knowledge about the nv_XXX functions? By linking those functions in? Or do you mean that we should just declare them before/instead of linking libdevice in?

For early link I’m assuming an order similar to [0] but I also discuss
the case where we don’t link libdevice early for a TU.

That link just describes the steps needed to use libdevice. It does not deal with how/where it fits in the LLVM pipeline.
The gist is that NVVMReflect replaces some conditionals with constants. libdevice uses that as a poor man's IR preprocessor, conditionally enabling different implementations and relying on DCE and constant folding to remove unused parts and eliminate the now useless branches.
While running NVVMReflect alone will make libdevice code valid and usable, it would still benefit from further optimizations. I do not know to what degree, though.
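
For context, at the source level the reflect mechanism looks roughly like this (a sketch; libdevice itself ships only as bitcode, and the two cos variants below are made-up stand-ins):

// Sketch of how libdevice-style code keys off __nvvm_reflect: the NVVMReflect
// pass folds the call to a constant, and DCE then drops the dead branch.
extern "C" __device__ int __nvvm_reflect(const char *);

extern "C" __device__ double __my_cos_ftz(double);  // made-up FTZ variant
extern "C" __device__ double __my_cos_ieee(double); // made-up IEEE variant

extern "C" __device__ double my_cos(double x) {
  if (__nvvm_reflect("__CUDA_FTZ")) // becomes a constant 0/1 after NVVMReflect
    return __my_cos_ftz(x);
  return __my_cos_ieee(x);
}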

Link early:

  1. clang emits module.bc and links in libdevice.bc but with the
    optnone, noinline, and “used” attribute for functions in
    libdevice. (“used” is not an attribute but could as well be.)
    At this stage module.bc might call __nv_XXX or llvm.XXX freely
    as defined by -ffast-math and friends.

That could work. Just carrying extra IR around would probably be OK.
We may want to do NVVMReflect as soon as we have it linked in and, maybe, allow optimizing the functions that are explicitly used already.

2) Run some optimizations in the middle end, maybe till the end of
   the inliner loop, unsure.
3) Run a libcall lowering pass and another NVVMReflect pass (or the
   only instance thereof). We effectively remove all llvm.XXX calls
   in favor of __nv_XXX now. Note that we haven't spent (much) time
   on the libdevice code as it is optnone and most passes are good
   at skipping those. To me, it's unclear if the used parts should
   not be optimized before we inline them anyway to avoid redoing
   the optimizations over and over (per call site). That needs
   measuring, I guess. Also note that we can still retain the current
   behavior for direct calls to __nv_XXX if we mark the call sites
   as alwaysinline, or at least the behavior is almost like the
   current one.
4) Run an always-inliner pass on the __nv_XXX calls because it is
   something we would do right now. Alternatively, remove optnone
   and noinline from the __nv_XXX calls.
5) Continue with the pipeline as before.

SGTM.

As mentioned above, optnone avoids spending time on the libdevice
until we “activate” it. At that point (globals) DCE can be scheduled
to remove all unused parts right away. I don’t think this is (much)
more expensive than linking libdevice early right now.

Link late, aka. translation units without libdevice:

  1. clang emits module.bc but does not link in libdevice.bc, it will be
    made available later. We still can mix __nv_XXX and llvm.XXX calls
    freely as above.
  2. Same as above.
  3. Same as above.
  4. Same as above but effectively a no-op, no __nv_XXX definitions are
    available.
  5. Same as above.

I might misunderstand something about the current pipeline but from [0]
and the experiments I run locally it looks like the above should cover all
the cases. WDYT?

The optnone trick may indeed remove much of the practical differences between the early/late approaches.
In principle it should work.

The next question is whether libdevice is sufficient to satisfy LLVM's assumptions about the standard library.
While it does provide most of the equivalents of libm functions, the set is not complete and some of the functions differ from their libm counterparts.
The differences are minor, so we should be able to deal with it by generating a few wrapper functions for the odd cases.
Here’s what clang does to provide math functions using libdevice:
https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h
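
For the odd cases, the wrappers would presumably follow the same pattern that header already uses for the 1:1 mappings, roughly (a sketch, not the actual header contents; the "odd case" function below is made up):

// Sketch of the clang-side wrapper pattern (simplified; the real header uses
// a __DEVICE__ macro and covers many more functions).
extern "C" __device__ double __nv_cos(double); // provided by libdevice

// 1:1 case: forward the libm name to its libdevice counterpart.
__device__ inline double cos(double __x) { return __nv_cos(__x); }

// Hypothetical "odd" case: a libcall with no direct libdevice counterpart
// would get a small device-side definition built from what libdevice has.
__device__ inline double cos_plus_one(double __x) { return __nv_cos(__x) + 1.0; }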

The most concerning aspect of libdevice is that we don't know how long we'll be able to keep using the libdevice bitcode. My understanding is that IR does not guarantee binary stability and at some point we may just be unable to use it. Ideally we need our own libm for GPUs.

–Artem

Right. We could keep the definition of __nv_cos and friends
around. Right now, -ffast-math might just crash on the user,
which is arguably a bad thing. I can also see us benefiting
in various other ways from llvm.cos uses instead of __nv_cos
(assuming precision is according to the user requirements but
that is always a condition).

It could be as simple as introducing __nv_cos into
"llvm.used" and a backend matching/rewrite pass.

If the backend knew the libdevice location it could even pick
the definitions from there. Maybe we could link libdevice late
instead of eager?

It's possible, but it would require plumbing in CUDA SDK awareness into
LLVM. While clang driver can deal with that, LLVM currently can't. The
bitcode library path would have to be provided by the user.

The PTX backend could arguably be CUDA SDK aware, IMHO, it would
even be fine if the middle-end does the remapping to get inlining
and folding benefits also after __nv_cos is used. See below.

The standard library as bitcode raises some questions.

Which standard library? CUDA's libdevice is a bitcode library, right?

It's whatever LLVM will need to lower libcalls to. libdevice bitcode is the
closest approximation of that we have at the moment.

* When do we want to do the linking? If we do it at the beginning, then the
question is how to make sure unused functions are not eliminated before we
may need them, as we don't know a priori what's going to be needed. We also
do want the unused functions to be gone after we're done. Linking it in
early would allow optimizing the code better at the expense of having to
optimize a lot of code we'll throw away. Linking it in late has less
overhead, but leaves the linked-in bitcode unoptimized, though it's
probably in the ballpark of what would happen with a real library call.
I.e. no inlining, etc.

* It incorporates linking into LLVM, which is not LLVM's job. Arguably, the
line should be drawn at the lowering to libcalls as it's done for other
back-ends. However, we're also constrained by the need to have the
linking done before we generate PTX which prevents doing it after LLVM is
done generating an object file.

I'm confused. Clang links in libdevice.bc early.

Yes. Because that's where it has to happen if we want to keep LLVM unaware
of CUDA SDK.
It does not have to be the case if/when LLVM can do the linking itself.

If we make sure
`__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos`
to `__nv_cos` which is available. After the lowering we can remove
the artificial uses of `__nv_XXX` functions that we used to keep the
definitions around in order to remove them from the final result.

This is the 'link early' approach, I should've been explicit that it's
'link early *everything*' as opposed to linking only what's needed at the
beginning.
It would work at the expense of having to process/optimize 500KB worth of
bitcode for every compilation, whether it needs it or not.

We get the benefit of having `llvm.cos` for some of the pipeline,
we know it does not have all the bad effects while `__nv_cos` is defined
with inline assembly. We also get the benefit of inlining `__nv_cos`
and folding the implementation based on the arguments. Finally,
this should work with the existing pipeline, the linking is the same
as before, all we do is to keep the definitions alive longer and
lower `llvm.cos` to `__nv_cos` in a middle end pass.

Again, I agree that it is doable.

This might be similar to the PTX solution you describe below but I feel
we get the inline benefit from this without actually changing the pipeline
at all.

So, to summarize:
* link the library as bitcode early, add artificial placeholders for
everything, compile, remove placeholders and DCE unused stuff away.
    Pros:
       - we're already doing most of it before clang hands off IR to
LLVM, so it just pushes it a bit lower in the compilation.
    Cons:
       - runtime cost of optimizing libdevice bitcode,
       - libdevice may be required for all NVPTX compilations?

* link the library as bitcode late.
     Pros:
       - lower runtime cost than the link-early approach.
     Cons:
       - We'll need to make sure that the NVVMReflect pass processes the library.
       - fewer optimizations on the library functions. Some of the code gets
DCE'ed away after NVVMReflect and the rest could be optimized better.
       - libdevice may be required for all NVPTX compilations?
* 'link' with the library as PTX appended as text to LLVM's output and let
ptxas do the 'linking'
    Pros: LLVM remains agnostic of CUDA SDK installation details. All it
does is allow lowering libcalls and leave their resolution to the
external tools.
    Cons: Need to have the PTX library somewhere and need to integrate the
'linking' into the compilation process somehow.

Neither is particularly good. If the runtime overhead of link-early is
acceptable, then it may be a winner here, by a very small margin.
link-as-PTX may be better conceptually as it keeps linking and compilation
separate.

As for the practical steps, here's what we need:
- allow libcall lowering in NVPTX, possibly guarded by a flag. This is
needed for all of the approaches above.
- teach LLVM how to link in bitcode (and, possibly, control early/late mode)
- teach clang driver to delegate libdevice linking to LLVM.

This will allow us to experiment with all three approaches and see what
works best.

I think if we embed knowledge about the nv_XXX functions we can
even get away without the cons you listed for early linking above.

WDYM by `embed knowledge about the nv_XXX functions`? By linking those
functions in? Or do you mean that we should just declare them
before/instead of linking libdevice in?

I mean by providing the "libcall lowering" pass. So the knowledge
that llvm.cos maps to __nv_cos.

For early link I'm assuming an order similar to [0] but I also discuss
the case where we don't link libdevice early for a TU.

That link just describes the steps needed to use libdevice. It does not
deal with how/where it fits in the LLVM pipeline.
The gist is that NVVMReflect replaces some conditionals with constants.
libdevice uses that as a poor man's IR preprocessor, conditionally enabling
different implementations and relying on DCE and constant folding to remove
unused parts and eliminate the now useless branches.
While running NVVMReflect alone will make libdevice code valid and usable, it
would still benefit from further optimizations. I do not know to what
degree, though.

Link early:
1) clang emits module.bc and links in libdevice.bc but with the
     `optnone`, `noinline`, and "used" attribute for functions in
     libdevice. ("used" is not an attribute but could as well be.)
     At this stage module.bc might call __nv_XXX or llvm.XXX freely
     as defined by -ffast-math and friends.

That could work. Just carrying extra IR around would probably be OK.
We may want to do NVVMReflect as soon as we have it linked in and, maybe,
allow optimizing the functions that are explicitly used already.

Right. NVVMReflect can be run twice and with `alwaysinline`
on the call sites of __nv_XXX functions we will actually
inline and optimize them while the definitions are just "dragged
along" in case we need them later.

2) Run some optimizations in the middle end, maybe till the end of
     the inliner loop, unsure.
3) Run a libcall lowering pass and another NVVMReflect pass (or the
     only instance thereof). We effectively remove all llvm.XXX calls

     in favor of __nv_XXX now. Note that we haven't spent (much) time

     on the libdevice code as it is optnone and most passes are good
     at skipping those. To me, it's unclear if the used parts should
     not be optimized before we inline them anyway to avoid redoing
     the optimizations over and over (per call site). That needs
     measuring I guess. Also note that we can still retain the current
     behavior for direct calls to __nv_XXX if we mark the call sites
     as `alwaysinline`, or at least the behavior is almost like the
     current one.
4) Run an always inliner pass on the __nv_XXX calls because it is
     something we would do right now. Alternatively, remove `optnone`
     and `noinline` from the __nv_XXX calls.
5) Continue with the pipeline as before.

SGTM.

As mentioned above, `optnone` avoids spending time on the libdevice
until we "activate" it. At that point (globals) DCE can be scheduled
to remove all unused parts right away. I don't think this is (much)
more expensive than linking libdevice early right now.

Link late, aka. translation units without libdevice:
1) clang emits module.bc but does not link in libdevice.bc, it will be
     made available later. We still can mix __nv_XXX and llvm.XXX calls
     freely as above.
2) Same as above.
3) Same as above.
4) Same as above but effectively a no-op, no __nv_XXX definitions are
     available.
5) Same as above.

I might misunderstand something about the current pipeline but from [0]
and the experiments I run locally it looks like the above should cover all
the cases. WDYT?

The `optnone` trick may indeed remove much of the practical differences
between the early/late approaches.
In principle it should work.

The next question is whether libdevice is sufficient to satisfy LLVM's
assumptions about the standard library.
While it does provide most of the equivalents of libm functions, the set is
not complete and some of the functions differ from their libm counterparts.
The differences are minor, so we should be able to deal with it by
generating a few wrapper functions for the odd cases.
Here's what clang does to provide math functions using libdevice:
https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h

Right now, clang will generate any LLVM intrinsic and we crash, so anything
else is probably a step in the right direction. Eventually, we should "lower"
all intrinsics that the NVPTX backend can't handle, or at least emit a nice
error message. Preferably, clang would know what we can't deal with and not
generate intrinsic calls for those in the first place.
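
For the "nice error message" part, something along these lines could work (a sketch; where such a check would live in NVPTX, and which intrinsics it should cover, is left open):

// Sketch: report llvm.* math intrinsics that NVPTX cannot lower instead of
// crashing in instruction selection.
#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Module.h"

using namespace llvm;

static void diagnoseUnloweredMathIntrinsics(Module &M) {
  for (Function &F : M)
    for (Instruction &I : instructions(F))
      if (auto *II = dyn_cast<IntrinsicInst>(&I))
        if (II->getIntrinsicID() == Intrinsic::cos ||
            II->getIntrinsicID() == Intrinsic::sin) // ... and friends
          M.getContext().diagnose(DiagnosticInfoUnsupported(
              F, "math intrinsic is not supported by the NVPTX backend",
              II->getDebugLoc()));
}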

The most concerning aspect of libdevice is that we don't know how long we'll
be able to keep using the libdevice bitcode. My understanding is that IR
does not guarantee binary stability and at some point we may just be unable
to use it. Ideally we need our own libm for GPUs.

For OpenMP I did my best to avoid writing libm (code) for GPUs by piggybacking
on CUDA and libc++ implementations, and I hope it will stay that way.
That said, if the need arises we might really have to port libc++ to the
GPUs.

Back to the problem with libdevice. I agree that the solution of NVIDIA
to ship a .bc library is suboptimal but with the existing, or an extended,
auto-upgrader we might be able to make that work reasonably well for the
foreseeable future. That problem is orthogonal to what we are discussing
above, I think.

~ Johannes

We could also consider doing something slightly broader.

For example, we could define a special attribute or metadata on the llvm.cos call/declaration that points to the actual __nv_cos function. Then, in a subsequent lowering pass, the corresponding intrinsic with the relevant attribute has its uses replaced by the actual function.

I certainly agree we should try to avoid a hard-coded mapping
in C++.

I could see something like:

__attribute__((implementation("llvm.cos")))
double __nv_cos(...) { ... }

and a pass that transforms all calls to a function with an
"implementation" to calls to that implementation. Maybe
later we attach a score/priority :wink:

I certainly agree we should try to avoid a hard-coded mapping
in C++.

I could see something like:

__attribute__((implementation("llvm.cos")))
double __nv_cos(...) { ... }

and a pass that transforms all calls to a function with an
“implementation” to calls to that implementation. Maybe
later we attach a score/priority :wink:

I’m not sure how that would work.

Where would you place that __attribute__((implementation)) ? We do not have the definitions for __nv_* as they come from NVIDIA-provided bitcode. We could add the attribute to the declaration in __clang_cuda_libdevice_declares.h.

How does LLVM handle the differences in function attributes between function declaration and definition? Will there be trouble when we link in the actual __nv_cos from the libdevice that would not have that attribute?

Another potential gotcha is that for the functions that can’t be directly mapped 1:1 to __nv_* counterparts, we’d still need to provide the implementation ourselves. We will not know whether the implementation will be used until after the substitution pass, so we’ll need to make sure it’s not DCE’d until then. It appears to be the same issue (though on a smaller scale) as with linking in libdevice directly.

Let’s take a step back and figure out what are the issues we want to solve.

The top-level goal is to provide implementation for LLVM intrinsics. For now let’s stick with libm-related ones.
What we have is the libdevice bitcode which uses different function names and provides a subset of the functionality we need.
What we miss is

  • something to connect LLVM’s libcalls to the GPU-side implementation,
  • additional code to provide implementations for the functions that are missing or different in libdevice.

Considering that we want this to work in LLVM, the additional code would have to be a bitcode and it would have to exist in addition to libdevice.
Our options for the mapping between LLVM intrinsics and the implementation are

  • intrinsic → __nv_* equivalent mapping pass
    This would still need additional bitcode for the missing/different functions.
  • lower libcalls to the standard libm APIs, implement libm → __nv_* mapping in our own bitcode.

Considering that additional bitcode is needed in both cases, I believe that the second approach makes more sense.
LLVM does not need to know or care about what’s provided by libdevice, and we’d have more flexibility, compared to what we could do in the mapping pass. It also makes it easy to substitute a different implementation, if we have or need one.

WDYT?

–Artem

I certainly agree we should try to avoid a hard-coded mapping
in C++.

I could see something like:

__attribute__((implementation("llvm.cos")))
double __nv_cos(...) { ... }

and a pass that transforms all calls to a function with an
"implementation" to calls to that implementation. Maybe
later we attach a score/priority :wink:

I'm not sure how that would work.
Where would you place that `__attribute__((implementation))` ? We do not
have the definitions for `__nv_*` as they come from NVIDIA-provided
bitcode. We could add the attribute to the declaration in
`__clang_cuda_libdevice_declares.h`.
How does LLVM handle the differences in function attributes between
function declaration and definition? Will there be trouble when we link in
the actual __nv_cos from the libdevice that would not have that attribute?

Another potential gotcha is that for the functions that can't be directly
mapped 1:1 to `__nv_*` counterparts, we'd still need to provide the
implementation ourselves. We will not know whether the implementation will
be used until after the substitution pass, so we'll need to make sure it's
not DCE'd until then. It appears to be the same issue (though on a smaller
scale) as with linking in libdevice directly.

Let's take a step back and figure out what are the issues we want to solve.

The top-level goal is to provide implementation for LLVM intrinsics. For
now let's stick with libm-related ones.
What we have is the libdevice bitcode which uses different function names
and provides a subset of the functionality we need.
What we miss is
   - something to connect LLVM's libcalls to the GPU-side implementation,
   - additional code to provide implementations for the functions that are
missing or different in libdevice.

Considering that we want this to work in LLVM, the additional code would
have to be a bitcode and it would have to exist in addition to libdevice.
Our options for the mapping between LLVM intrinsics and the implementation
are
* intrinsic -> __nv_* equivalent mapping pass
    This would still need additional bitcode for the missing/different
functions.
* lower libcalls to the standard libm APIs, implement libm -> __nv_*
mapping in our own bitcode.

Considering that additional bitcode is needed in both cases, I believe that
the second approach makes more sense.

I really hope to avoid any additional bitcode, there are too many
drawbacks and basically no benefits, IMHO.

LLVM does not need to know or care about what's provided by libdevice, and
we'd have more flexibility, compared to what we could do in the mapping
pass. It also makes it easy to substitute a different implementation, if we
have or need one.

I agree that LLVM (core) should not know about __nv_*, that's why I suggested
the `__attribute__((implements("...")))` approach. My preferred solution
is still to annotate our declarations of __nv_* and point to the
llvm.intrinsics (name) from there. If we have a missing mapping, we point to an
intrinsic from a definition that lives in the Clang headers next to the
__nv_* declarations.

This does not yet work because -mlink-builtin-bitcode (which I assume
triggers the llvm-link logic) will drop the attributes of a declaration
if a definition is found. I think that should not be the case anyway;
instead, the union of the attributes should be kept.

The benefit I see for the above is that the mapping is tied to the
declarations and doesn't live in a tablegen file far away. It works well
even if we can't map 1:1, and we could even restrict the "used" attribute
to anything that has an "implements" attribute. So:

__nv_A() { ... } // called, inlined and optimized as before, DCE'ed after.

__nv_B() { ... } // not called, DCE'ed.

__attribute__((implements("llvm.C")))
__nv_C() { ... } // calls are inlined and optimized as before, not DCE'ed
                  // though because of the attribute. Replaces llvm.C as
                  // callee in the special pass.

So "implements" gives you a way to statically replace a function declaration
or definition with another one. I could see it being used to provide other
intrinsics to platforms with backends that don't support them.
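
A sketch of the matching pass for such an attribute (the string attribute name and the pass are hypothetical, mirroring the proposal above; note that overloaded intrinsics carry a type suffix, e.g. llvm.cos.f64, so the attribute would have to name the exact symbol):

// Sketch: redirect all uses of the function named by implements("...") (for
// example the llvm.cos.f64 declaration) to the annotated definition.
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

struct ImplementsLoweringPass : PassInfoMixin<ImplementsLoweringPass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    bool Changed = false;
    for (Function &Impl : M) {
      if (!Impl.hasFnAttribute("implements"))
        continue;
      StringRef Name = Impl.getFnAttribute("implements").getValueAsString();
      Function *Target = M.getFunction(Name);
      // Only rewrite when the prototypes match; mismatches would need shims.
      if (Target && Target->getFunctionType() == Impl.getFunctionType()) {
        Target->replaceAllUsesWith(&Impl);
        Changed = true;
      }
    }
    return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
  }
};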

Does that make some sense?

~ Johannes

I certainly agree we should try to avoid a hard-coded mapping
in C++.

I could see something like:

__attribute__((implementation("llvm.cos")))
double __nv_cos(...) { ... }

and a pass that transforms all calls to a function with an
“implementation” to calls to that implementation. Maybe
later we attach a score/priority :wink:

I’m not sure how that would work.
Where would you place that __attribute__((implementation)) ? We do not
have the definitions for __nv_* as they come from NVIDIA-provided
bitcode. We could add the attribute to the declaration in
__clang_cuda_libdevice_declares.h.
How does LLVM handle the differences in function attributes between
function declaration and definition? Will there be trouble when we link in
the actual __nv_cos from the libdevice that would not have that attribute?

Another potential gotcha is that for the functions that can’t be directly
mapped 1:1 to __nv_* counterparts, we’d still need to provide the
implementation ourselves. We will not know whether the implementation will
be used until after the substitution pass, so we’ll need to make sure it’s
not DCE’d until then. It appears to be the same issue (though on a smaller
scale) as with linking in libdevice directly.

Let’s take a step back and figure out what are the issues we want to solve.

The top-level goal is to provide implementation for LLVM intrinsics. For
now let’s stick with libm-related ones.
What we have is the libdevice bitcode which uses different function names
and provides a subset of the functionality we need.
What we miss is

  • something to connect LLVM’s libcalls to the GPU-side implementation,
  • additional code to provide implementations for the functions that are
    missing or different in libdevice.

Considering that we want this to work in LLVM, the additional code would
have to be a bitcode and it would have to exist in addition to libdevice.
Our options for the mapping between LLVM intrinsics and the implementation
are

  • intrinsic → __nv_* equivalent mapping pass
    This would still need additional bitcode for the missing/different
    functions.
  • lower libcalls to the standard libm APIs, implement libm → __nv_*
    mapping in our own bitcode.

Considering that additional bitcode is needed in both cases, I believe that
the second approach makes more sense.

I really hope to avoid any additional bitcode, there are too many
drawbacks and basically no benefits, IMHO.

Could you elaborate on the drawbacks?

The fact is that we already depend on the external bitcode (libdevice in this case), though right now we're trying to keep that to clang only. The current approach is not sound in principle and is rather brittle in practice. Nor is clang the only source of IR for LLVM to compile, so it leaves LLVM-only users without a good solution. There are already a handful of JIT compilers that each do their own gluing of libdevice into the IR they want to compile for NVPTX. I think we do have a very good reason to deal with that in LLVM itself.

While I agree that additional bitcode is a hassle, I think it would be a net positive change for LLVM usability for NVPTX users.
The external bitcode would not be required for those who do not need libdevice now, so the change should not be disruptive.

LLVM does not need to know or care about what’s provided by libdevice, and
we’d have more flexibility, compared to what we could do in the mapping
pass. It also makes it easy to substitute a different implementation, if we
have or need one.

I agree that LLVM (core) should not know about __nv_*, that's why I suggested
the __attribute__((implements("..."))) approach. My preferred solution
is still to annotate our declarations of __nv_* and point to the
llvm.intrinsics (name) from there. If we have a missing mapping, we
point to an intrinsic from a definition that lives in the Clang headers
next to the __nv_* declarations.

We may have slightly different end goals in mind.
I was thinking of making the solution work for LLVM. I.e. users would be free to use llvm.sin with NVPTX back-end with a few documented steps needed to make it work (basically “pass additional -link-libm-bitcode=path/to/bitcode_libm.bc”).

Your scenario above suggests that the goal is to allow clang to generate both llvm intrinsics and the glue which would then be used by LLVM to make it work for clang, but not in general. It’s an improvement compared to what we have now, but I still think we should try a more general solution.

This does not yet work because -mlink-builtin-bitcode (which I assume
triggers the llvm-link logic) will drop the attributes of a declaration
if a definition is found. I think that should not be the case anyway
such that the union of attributes is set.

The benefit I see for the above is that the mapping is tied to the
declarations and doesn’t live in a tablegen file far away. It works well
even if we can’t map 1:1, and we could even restrict the “used” attribute
to anything that has an “implements” attribute.

I do not think we need tablegen for anything here. I was thinking of just compiling a real math library (or a wrapper on top of libdevice) from C/C++ sources.

Our approaches are not mutually exclusive. If there’s a strong opposition to providing a bitcode libm for NVPTX, implementing it somewhere closer to clang would still be an improvement, even if it’s not as general as I’d like. It should still be possible to allow LLVM to lower libcalls in NVPTX to standard libm API, enabled with a flag, and just let the end users who are interested (e.g. JITs) to provide their own implementation.

–Artem

I certainly agree we should try to avoid a hard-coded mapping
in C++.

I could see something like:

__attribute__((implementation("llvm.cos")))
double __nv_cos(...) { ... }

and a pass that transforms all calls to a function with an
"implementation" to calls to that implementation. Maybe
later we attach a score/priority :wink:

I'm not sure how that would work.
Where would you place that `__attribute__((implementation))`? We do not
have the definitions for `__nv_*` as they come from NVIDIA-provided
bitcode. We could add the attribute to the declaration in
`__clang_cuda_libdevice_declares.h`.
How does LLVM handle the differences in function attributes between
function declaration and definition? Will there be trouble when we link in
the actual __nv_cos from the libdevice that would not have that attribute?

Another potential gotcha is that for the functions that can't be directly
mapped 1:1 to `__nv_*` counterparts, we'd still need to provide the
implementation ourselves. We will not know whether the implementation will
be used until after the substitution pass, so we'll need to make sure it's
not DCE'd until then. It appears to be the same issue (though on a smaller
scale) as with linking in libdevice directly.
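
One way to address the DCE concern, sketched under the same assumption
that the mapping lives in an "implements" string attribute: a small early
step (in clang or in a pass) could pin every annotated function via
llvm.compiler.used so it survives until the substitution pass has run.
This is an illustration, not existing code:

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/ModuleUtils.h"

using namespace llvm;

// Hypothetical sketch: keep every function carrying an "implements"
// annotation alive by adding it to @llvm.compiler.used, so it cannot be
// DCE'd before the substitution pass decides whether it is needed.
static void keepImplementationsAlive(Module &M) {
  SmallVector<GlobalValue *, 8> Keep;
  for (Function &F : M)
    if (F.hasFnAttribute("implements"))
      Keep.push_back(&F);
  if (!Keep.empty())
    appendToCompilerUsed(M, Keep);
}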

Let's take a step back and figure out what are the issues we want to solve.

The top-level goal is to provide implementation for LLVM intrinsics. For
now let's stick with libm-related ones.
What we have is the libdevice bitcode which uses different function names
and provides a subset of the functionality we need.
What we miss is
    - something to connect LLVM's libcalls to the GPU-side implementation,
    - additional code to provide implementations for the functions that are
      missing or different in libdevice.

Considering that we want this to work in LLVM, the additional code would
have to be a bitcode and it would have to exist in addition to libdevice.
Our options for the mapping between LLVM intrinsics and the implementation
are
* intrinsic -> __nv_* equivalent mapping pass
     This would still need additional bitcode for the missing/different
     functions.
* lower libcalls to the standard libm APIs, implement libm -> __nv_*
  mapping in our own bitcode.

Considering that additional bitcode is needed in both cases, I believe that
the second approach makes more sense.
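
For the second option, the additional bitcode could itself be built from
trivial C++/CUDA sources. A rough sketch, where the file name and the set
of functions are illustrative only, and anything libdevice does not cover
1:1 would need real code rather than a forwarder:

// bitcode_libm.cu (illustrative): standard libm entry points the backend
// could emit libcalls to, forwarding to libdevice where possible.
extern "C" {
__device__ double __nv_cos(double);
__device__ double __nv_sin(double);
__device__ float  __nv_cosf(float);

__device__ double cos(double x) { return __nv_cos(x); }
__device__ double sin(double x) { return __nv_sin(x); }
__device__ float  cosf(float x) { return __nv_cosf(x); }

// Functions that libdevice does not provide 1:1 would need an actual
// implementation here instead of a one-line forwarder.
}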

I really hope to avoid any additional bitcode, there are too many
drawbacks and basically no benefits, IMHO.

Could you elaborate on the drawbacks?

The fact is that we already depend on the external bitcode (libdevice in
this case), though right now we're trying to keep that to clang only. The
current approach is not sound in principle and is rather brittle in
practice. Nor is clang the only source of IR for LLVM to
compile, so it leaves LLVM-only users without a good solution. There are
already a handful of JIT compilers that each do their own gluing of
libdevice into the IR they want to compile for NVPTX. I think we do have a
very good reason to deal with that in LLVM itself.

While I agree that additional bitcode is a hassle, I think it would be a
net positive change for LLVM usability for NVPTX users.
The external bitcode would not be required for those who do not need
libdevice now, so the change should not be disruptive.

Bitcode comes with all the problems libdevice itself has wrt.
compatibility. It is also hard to update and maintain. You basically
maintain IR or you maintain C(++) as I suggest. Also, bitcode is
platform specific. I can imagine building a bitcode file during the
build but shipping one means you have to know ABI and datalayout or
hope they are the same everywhere.

LLVM does not need to know or care about what's provided by libdevice, and
we'd have more flexibility, compared to what we could do in the mapping
pass. It also makes it easy to substitute a different implementation, if we
have or need one.

I agree that LLVM (core) should not know about __nv_*, that's why I
suggested
the `__attribute__((implements("...")))` approach. My preferred solution
is still to annotate our declarations of __nv_* and point to the
llvm.intrinsics (name) from there. If we have a missing mapping, we
point to an
intrinsic from a definition that lives in the Clang headers next to the
__nv_* declarations.

We may have slightly different end goals in mind.
I was thinking of making the solution work for LLVM. I.e. users would be
free to use llvm.sin with NVPTX back-end with a few documented steps needed
to make it work (basically "pass additional
-link-libm-bitcode=path/to/bitcode_libm.bc").

Your scenario above suggests that the goal is to allow clang to generate
both llvm intrinsics and the glue which would then be used by LLVM to make
it work for clang, but not in general. It's an improvement compared to what
we have now, but I still think we should try a more general solution.

My scenario doesn't disallow a bitcode approach for non-clang
frontends, nor does it disallow them to simply build the glue code
with clang and package it themselves. It does however allow us to
maintain C(++) code rather than IR, which is by itself a big win.

This does not yet work because -mlink-builtin-bitcode (which I assume
triggers the llvm-link logic) will drop the attributes of a declaration
if a definition is found. I think that should not happen in the first
place; the union of the attributes should be kept instead.

The benefit I see for the above is that the mapping is tied to the
declarations and doesn't live in a tablegen file far away. It works well
even if we can't map 1:1, and we could even restrict the "used" attribute
to anything that has an "implements" attribute.

I do not think we need tablegen for anything here. I was thinking of just
compiling a real math library (or a wrapper on top of libdevice) from C/C++
sources.

I did not understand your suggestion before. Agreed, no tablegen.

Our approaches are not mutually exclusive. If there's a strong opposition
to providing a bitcode libm for NVPTX, implementing it somewhere closer to
clang would still be an improvement, even if it's not as general as I'd
like. It should still be possible to allow LLVM to lower libcalls in NVPTX
to standard libm API, enabled with a flag, and just let the end users who
are interested (e.g. JITs) provide their own implementation.

Right. And their own implementation could be trivially created for
them as bc file:

`clang -emit-llvm-bc $clang_src/.../__clang_cuda_cmath.h -femit-all-decls`

Or am I missing something here?
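
For an LLVM-only user (e.g. a JIT), consuming such a .bc file would also be
straightforward. A rough sketch of the linking step, with paths and error
reporting elided; this is an assumption about the consumer, not existing code:

#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Linker/Linker.h"
#include "llvm/Support/SourceMgr.h"

using namespace llvm;

// Hypothetical sketch: link a prebuilt wrapper-libm bitcode into the
// module a JIT is about to compile for NVPTX.
static bool linkWrapperLibm(Module &M, LLVMContext &Ctx, StringRef Path) {
  SMDiagnostic Err;
  std::unique_ptr<Module> LibM = parseIRFile(Path, Err, Ctx);
  if (!LibM)
    return false;
  // LinkOnlyNeeded keeps the resulting module small.
  return !Linker::linkModules(M, std::move(LibM),
                              Linker::Flags::LinkOnlyNeeded);
}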

~ Johannes

I prototyped the LLVM-Core parts last night:

https://reviews.llvm.org/D98516

If this is something we support I'll write an RFC, also
for the missing clang parts.

~ Johannes

[EOM]

I certainly agree we should try to avoid a hard-coded mapping
in C++.

I could see something like:

__attribute__((implementation("llvm.cos")))
double __nv_cos(...) { ... }

and a pass that transforms all calls to a function with an
“implementation” to calls to that implementation. Maybe
later we attach a score/priority :wink:

I’m not sure how that would work.
Where would you place that __attribute__((implementation))? We do not
have the definitions for __nv_* as they come from NVIDIA-provided
bitcode. We could add the attribute to the declaration in
__clang_cuda_libdevice_declares.h.
How does LLVM handle the differences in function attributes between
function declaration and definition? Will there be trouble when we link in
the actual __nv_cos from the libdevice that would not have that attribute?

Another potential gotcha is that for the functions that can’t be directly
mapped 1:1 to __nv_* counterparts, we’d still need to provide the
implementation ourselves. We will not know whether the implementation will
be used until after the substitution pass, so we’ll need to make sure it’s
not DCE’d until then. It appears to be the same issue (though on a smaller
scale) as with linking in libdevice directly.

Let’s take a step back and figure out what are the issues we want to solve.

The top-level goal is to provide implementation for LLVM intrinsics. For
now let’s stick with libm-related ones.
What we have is the libdevice bitcode which uses different function names
and provides a subset of the functionality we need.
What we miss is

  • something to connect LLVM’s libcalls to the GPU-side implementation,
  • additional code to provide implementations for the functions that are
    missing or different in libdevice.

Considering that we want this to work in LLVM, the additional code would
have to be a bitcode and it would have to exist in addition to libdevice.
Our options for the mapping between LLVM intrinsics and the implementation
are

  • intrinsic → __nv_* equivalent mapping pass
    This would still need additional bitcode for the missing/different
    functions.
  • lower libcalls to the standard libm APIs, implement libm → __nv_*
    mapping in our own bitcode.

Considering that additional bitcode is needed in both cases, I believe that
the second approach makes more sense.

I really hope to avoid any additional bitcode, there are too many
drawbacks and basically no benefits, IMHO.

Could you elaborate on the drawbacks?

The fact is that we already depend on the external bitcode (libdevice in
this case), though right now we’re trying to keep that to clang only. The
current approach is not sound in principle and is rather brittle in
practice. Nor is clang the only source of IR for LLVM to
compile, so it leaves LLVM-only users without a good solution. There are
already a handful of JIT compilers that each do their own gluing of
libdevice into the IR they want to compile for NVPTX. I think we do have a
very good reason to deal with that in LLVM itself.

While I agree that additional bitcode is a hassle, I think it would be a
net positive change for LLVM usability for NVPTX users.
The external bitcode would not be required for those who do not need
libdevice now, so the change should not be disruptive.

Bitcode comes with all the problems libdevice itself has wrt.
compatibility.

We already have this problem, so it does not make things (much) worse than they are.

Considering that we’ll be able to keep the library in sync with LLVM, the compatibility is less of a problem as the library that would come with LLVM would be built with/for exactly that LLVM version.

It is also hard to update and maintain. You basically
maintain IR or you maintain C(++) as I suggest.

We seem to agree that the implementation of such a library would be in C/C++.

Also, bitcode is platform specific. I can imagine building a bitcode file during the
build but shipping one means you have to know ABI and datalayout or
hope they are the same everywhere.

Agreed. We will likely need multiple variants. We will compile specifically for NVPTX or AMDGPU and we will know specific ABI and the data layout for them regardless of the host we’re building on.

It appears to me that the difference vs. what we have now is that we’ll need to have the libm sources somewhere, a process to build them for particular GPUs (that may need to be done out of the tree as it may need CUDA/HIP SDKs), and a way to incorporate such libraries into the LLVM distribution.

OK. I’ll agree that that may be a bit too much for now.

LLVM does not need to know or care about what’s provided by libdevice, and
we’d have more flexibility, compared to what we could do in the mapping
pass. It also makes it easy to substitute a different implementation, if we
have or need one.

I agree that LLVM (core) should not know about __nv_*, that’s why I
suggested the __attribute__((implements("..."))) approach. My preferred
solution is still to annotate our declarations of __nv_* and point to the
llvm.intrinsics (name) from there. If we have a missing mapping, we point
to an intrinsic from a definition that lives in the Clang headers next to
the __nv_* declarations.

We may have slightly different end goals in mind.
I was thinking of making the solution work for LLVM. I.e. users would be
free to use llvm.sin with NVPTX back-end with a few documented steps needed
to make it work (basically “pass additional
-link-libm-bitcode=path/to/bitcode_libm.bc”).

Your scenario above suggests that the goal is to allow clang to generate
both llvm intrinsics and the glue which would then be used by LLVM to make
it work for clang, but not in general. It’s an improvement compared to what
we have now, but I still think we should try a more general solution.

My scenario doesn’t disallow a bitcode approach for non-clang
frontends, nor does it disallow them to simply build the glue code
with clang and package it themselves. It does however allow us to
maintain C(++) code rather than IR, which is by itself a big win.

Agreed.

This does not yet work because -mlink-builtin-bitcode (which I assume
triggers the llvm-link logic) will drop the attributes of a declaration
if a definition is found. I think that should not happen in the first
place; the union of the attributes should be kept instead.

The benefit I see for the above is that the mapping is tied to the
declarations and doesn’t live in a tablegen file far away. It works well
even if we can’t map 1:1, and we could even restrict the “used” attribute
to anything that has an “implements” attribute.

I do not think we need tablegen for anything here. I was thinking of just
compiling a real math library (or a wrapper on top of libdevice) from C/C++
sources.

I did not understand your suggestion before. Agreed, no tablegen.

Our approaches are not mutually exclusive. If there’s a strong opposition
to providing a bitcode libm for NVPTX, implementing it somewhere closer to
clang would still be an improvement, even if it’s not as general as I’d
like. It should still be possible to allow LLVM to lower libcalls in NVPTX
to standard libm API, enabled with a flag, and just let the end users who
are interested (e.g. JITs) provide their own implementation.

Right. And their own implementation could be trivially created for
them as bc file:

clang -emit-llvm-bc $clang_src/.../__clang_cuda_cmath.h -femit-all-decls

Or am I missing something here?

I think we’re on the same page. Let’s see where the attribute(implementation) gets us.

–Artem

It sounded before like you were saying the library should effectively be function aliases for standard libm names, to call __nv_* names. Isn’t it utterly trivial to generate such a bitcode file as part of the toolchain build, without requiring any external SDKs?

That’s true for most, but not all, functions provided by libdevice. We’d still need something that’s a bit more involved.

–Artem