[gpucc] relationship between host and device IR for global __device__ variable?

Hi everybody!

I'm working on a pass that instruments CUDA kernels, which then require some additional "arguments" in order to write out their results.
To my knowledge, changes to the signature of device functions must be made in the frontend so host and device are in sync.
In order to avoid having to hack the front-end, we implemented this kind of stuff using global device variables which can be accessed using cudaMemcpyToSymbol.
Unfortunately that function does not allow passing the symbol name of the variable.

So here is the actual two part question:
- How does clang link global device variables on the device to its "host version" so that the CUDA runtime can use it?
- Can I duplicate the functionality in IR passes to create global variables after the frontend finishes?

Looking at the IR I can only see regular global variables in both host and device IR.
I tried just inserting global variables in IR (externally initialized, external linkage) but get an "invalid device symbol" error if I try to use it at runtime.
I was unable to find the interesting bits in clang itself since I'm not at all familiar with the codebase.

My apologies if this is not the right mailing this.
I chose this one instead of llvm-dev because I figured clang is responsible for creating and linking __device__ variables.

Thanks,
Alex

Hi everybody!

I’m working on a pass that instruments CUDA kernels, which then require some additional “arguments” in order to write out their results.
To my knowledge, changes to the signature of device functions must be made in the frontend so host and device are in sync.
In order to avoid having to hack the front-end, we implemented this kind of stuff using global device variables which can be accessed using cudaMemcpyToSymbol.
Unfortunately that function does not allow passing the symbol name of the variable.

So here is the actual two part question:

  • How does clang link global device variables on the device to its “host version” so that the CUDA runtime can use it?

​Clang creates “shadow” variable on the host side and registers {address, name}​ tuple with CUDA runtime. Whenever you need to pass a pointer to the device side via CUDA runtime, it automatically translates the address of the host-side shadow variable to the device-side address of the variable when it passes it to the kernel.

  • Can I duplicate the functionality in IR passes to create global variables after the frontend finishes?

​Probably. You can take a look at the glue IR we generate on the​ host side.
E.g. compile a simple cuda source with --keep-temps and you should find number of calls to __cudaRegister…() calls.

​Note that you need to have full CUDA compilation as clang will not generate this glue code if it has no-device-side object to include. I.e. --cuda-host-only will not work.​

​Here’s an example (look for __cuda_register_globals) : https://godbolt.org/g/ib3Hyk

​–Artem​