CUDA separate compilation

Clang currently doesn’t support CUDA separate compilation, so ‘extern’ device functions and variables cannot be used.

Could someone give me any pointers where to look or what has to be done to support this? If at all possible, I’d like to see what’s missing and possibly try to tackle it.

[+Doru, Carlo]

Indeed, Clang currently treats device-side compilation of each CUDA file as a whole-program compilation, i.e. the result is a complete GPU executable.

I think clang should already be able to compile code with ‘extern device’; it’s ptxas that’s unhappy to see unresolved symbols, because it expects to see the whole program. Someone, somewhere eventually needs to turn the GPU object files into a GPU executable.
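For illustration, a minimal two-file sketch of the kind of code this is about (file and symbol names are made-up examples):

```cuda
// counter.cu -- defines a device variable and a device function.
__device__ int counter = 0;
__device__ void bump() { counter += 1; }
```

```cuda
// main.cu -- refers to symbols defined in counter.cu.
extern __device__ int counter;
extern __device__ void bump();

__global__ void kernel() {
  bump();  // resolvable only at device link time, across CUDA files
}
```

Clang can emit PTX for main.cu with the references left unresolved; it’s the subsequent ptxas step, treating the file as a whole program, that rejects them.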

It should be possible to make it work.
The first step is to tell ptxas to compile to a GPU object file rather than a GPU executable:

  • Add separate-compilation support in the driver. At the very minimum, the driver should pass appropriate flags to ptxas and warn/error if separate compilation is not supported. You may be able to get by with just “-Xcuda-ptxas -c” + an external nvlink of the CUDA files into a single partially linked .o.
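A hedged sketch of what that minimal approach might look like on the command line (file names and the sm_35 architecture are arbitrary examples; ‘-c’ is ptxas’s flag for emitting a relocatable object rather than a linked image):

```shell
# Compile each CUDA file; '-Xcuda-ptxas -c' asks ptxas for a
# relocatable GPU object instead of a whole-program GPU executable.
clang++ -c --cuda-gpu-arch=sm_35 -Xcuda-ptxas -c a.cu -o a.o
clang++ -c --cuda-gpu-arch=sm_35 -Xcuda-ptxas -c b.cu -o b.o
```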

As for who does GPU-side linking, we should perhaps consider running nvlink completely outside of clang. I.e. clang would produce GPU object files, if requested, but it would be up to the user’s build system to link them together with nvlink before the final linking of the host executable. If that’s acceptable, that’s probably all you need.
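Under that model, the build system would do something along these lines (a sketch only — how the GPU objects are extracted from and re-embedded into the host .o files is unspecified here, and the file names are hypothetical):

```shell
# Device-side link: combine the relocatable GPU objects into a
# linked GPU image with nvlink.
nvlink -arch=sm_35 a_gpu.o b_gpu.o -o gpu_linked.o

# Host-side link proceeds as usual afterwards, with the linked GPU
# code available to the host executable.
clang++ a.o b.o gpu_linked.o -o app -lcudart
```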

If you want clang to do GPU-side linking, then there are more things to do.

  • Figure out who’s supposed to run nvlink; the driver will need to be augmented to run it. The problem here is that there’s no easy way to tell whether any of the .o files given to clang during the linking phase contain GPU executables. It would therefore most likely be controlled by a global flag, which would insert another step into the compilation pipeline: invoke nvlink on the object files and pass a .o with the partially linked host+GPU executable on to the host linker.
  • Figure out whether the way GPU binaries are embedded in host .o is compatible with nvlink and implement missing bits, if necessary.
  • Figure out whether .o (or executable) produced by nvlink is something that Clang-generated init code can still work with. Fix it, if broken.

This should be about it.


Thank you for the clarification. I’ll check the driver and see if I can make compiling to GPU object files + nvlink work.