instrumenting device code with gpucc

Hi Jingyue,

My name is Yuanfeng Peng, I'm a PhD student at UPenn. I'm sorry to bother
you, but I'm having trouble with gpucc in my project, and I would be really
grateful for your help!

Currently we're trying to instrument CUDA code using LLVM 3.9, and I've
written a pass to insert hook functions for certain function calls and
memory accesses. For example, given a CUDA program (axpy, say), I
first compile it with

clang++ -emit-llvm -c

which gives me two bitcode files, axpy.bc and axpy-sm_20.bc. Then I use
opt to load my pass and insert the hook functions to axpy.bc, which works
fine. After inspecting the instrumented axpy.bc, I noticed that the kernel
code was not there; rather, it lived inside axpy-sm_20.bc, so I also load
my pass to instrument axpy-sm_20.bc.

Expected. axpy.bc contains host code, and axpy-sm_??.bc contains device
code. If you only want to instrument the device side, you don't need to
modify axpy.bc.
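For completeness, the compile step that produces the two bitcode files can be sketched as follows (assuming the source file is named axpy.cu and a clang 3.9 build with CUDA support; the exact flags may differ on your setup):

```shell
# One invocation compiles both sides: the host IR goes to axpy.bc and the
# device IR for sm_20 goes to axpy-sm_20.bc.
clang++ -x cuda axpy.cu --cuda-gpu-arch=sm_20 -emit-llvm -c
```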

However, after instrumenting axpy-sm_20.bc, I don't know how I could
combine the host bitcode & device bitcode into a single binary... When I
used llc to compile axpy-sm_20.bc into native code, I always got a bunch of
errors; if I only do llc axpy.bc -o axpy.s and then link axpy.s with the
necessary libraries, I got a working binary, but only the host code was
instrumented.

So what should I do to get a binary where the device code is also
instrumented?
To link the modified axpy-sm_20.bc to the final binary, you need several
extra steps:
1. Compile axpy-sm_20.bc to PTX assembly using llc: llc axpy-sm_20.bc -o
axpy-sm_20.ptx -march=<nvptx or nvptx64>
2. Compile the PTX assembly to SASS using ptxas
3. Make the SASS a fat binary using NVIDIA's fatbinary tool
4. Link the fat binary to the host code using ld.
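The four steps above can be sketched as shell commands like these (assuming sm_20, a 64-bit target, and CUDA installed under /usr/local/cuda; file names are illustrative):

```shell
# Step 1: lower the instrumented device bitcode to PTX.
llc axpy-sm_20.bc -o axpy-sm_20.ptx -march=nvptx64

# Step 2: assemble the PTX to SASS with ptxas.
/usr/local/cuda/bin/ptxas -m64 -O3 --gpu-name sm_20 \
  --output-file axpy-sm_20.o axpy-sm_20.ptx

# Step 3: wrap the SASS in a fat binary.
/usr/local/cuda/bin/fatbinary --cuda -64 --create axpy.fatbin \
  "--image=profile=sm_20,file=axpy-sm_20.o"

# Step 4: embed the fat binary when compiling the host side, then link with
# ld; the exact invocations can be copied from the `clang -###` dump.
```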

Clang does steps 2-4 by invoking subcommands. Therefore, you can use "clang
-###" to dump all the subcommands, and then find the ones for steps 2-4. For
example:

$ clang++ -### -O3 -I/usr/local/cuda/samples/common/inc
-L/usr/local/cuda/lib64 -lcudart_static -lcuda -ldl -lrt -pthread

clang version 3.9.0 (4ce165e39e7b185e394aa713d9adffd920288988)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/google/home/jingyue/Work/llvm/install/bin
"-cc1" "-triple" "nvptx64-nvidia-cuda" "-aux-triple"
"x86_64-unknown-linux-gnu" "-fcuda-target-overloads"
"-fcuda-disable-target-call-checks" "-S" "-disable-free" "-main-file-name" "" "-mrelocation-model" "static" "-mthread-model" "posix"
"-mdisable-fp-elim" "-fmath-errno" "-no-integrated-as" "-fcuda-is-device"
"-target-feature" "+ptx42" "-target-cpu" "sm_35" "-dwarf-column-info"
"-debugger-tuning=gdb" "-resource-dir"
"-I" "/usr/local/cuda/samples/common/inc" "-internal-isystem"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-internal-isystem"
"-internal-isystem" "/usr/local/cuda/include" "-include"
"__clang_cuda_runtime_wrapper.h" "-O3" "-fdeprecated-macro"
"-fno-dwarf-directory-asm" "-fdebug-compilation-dir"
"/usr/local/google/home/jingyue/Work/cuda" "-ferror-limit" "19"
"-fmessage-length" "205" "-pthread" "-fobjc-runtime=gcc" "-fcxx-exceptions"
"-fexceptions" "-fdiagnostics-show-option" "-fcolor-diagnostics"
"-vectorize-loops" "-vectorize-slp" "-o" "/tmp/axpy-a88a72.s" "-x" "cuda" ""
"/usr/local/cuda/bin/ptxas" "-m64" "-O3" "--gpu-name" "sm_35"
"--output-file" "/tmp/axpy-1dbca7.o" "/tmp/axpy-a88a72.s"
"/usr/local/cuda/bin/fatbinary" "--cuda" "-64" "--create"
"/tmp/axpy-e6057c.fatbin" "--image=profile=sm_35,file=/tmp/axpy-1dbca7.o"
"-cc1" "-triple" "x86_64-unknown-linux-gnu" "-aux-triple"
"nvptx64-nvidia-cuda" "-fcuda-target-overloads"
"-fcuda-disable-target-call-checks" "-emit-obj" "-disable-free"
"-main-file-name" "" "-mrelocation-model" "static" "-mthread-model"
"posix" "-fmath-errno" "-masm-verbose" "-mconstructor-aliases"
"-munwind-tables" "-fuse-init-array" "-target-cpu" "x86-64"
"-momit-leaf-frame-pointer" "-dwarf-column-info" "-debugger-tuning=gdb"
"-I" "/usr/local/cuda/samples/common/inc" "-internal-isystem"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-internal-isystem" "/usr/local/cuda/include" "-include"
"__clang_cuda_runtime_wrapper.h" "-O3" "-fdeprecated-macro"
"-fdebug-compilation-dir" "/usr/local/google/home/jingyue/Work/cuda"
"-ferror-limit" "19" "-fmessage-length" "205" "-pthread"
"-fobjc-runtime=gcc" "-fcxx-exceptions" "-fexceptions"
"-fdiagnostics-show-option" "-fcolor-diagnostics" "-vectorize-loops"
"-vectorize-slp" "-o" "/tmp/axpy-48f6b5.o" "-x" "cuda" ""
"-fcuda-include-gpubinary" "/tmp/axpy-e6057c.fatbin"
"/usr/bin/ld" "-z" "relro" "--hash-style=gnu" "--build-id"
"--eh-frame-hdr" "-m" "elf_x86_64" "-dynamic-linker"
"/lib64/" "-o" "a.out"
"/usr/lib/gcc/x86_64-linux-gnu/4.8/crtbegin.o" "-L/usr/local/cuda/lib64"
"-L/lib/x86_64-linux-gnu" "-L/lib/../lib64" "-L/usr/lib/x86_64-linux-gnu"
"-L/lib" "-L/usr/lib" "/tmp/axpy-48f6b5.o" "-lcudart_static" "-lcuda"
"-ldl" "-lrt" "-lstdc++" "-lm" "-lgcc_s" "-lgcc" "-lpthread" "-lc"
"-lgcc_s" "-lgcc" "/usr/lib/gcc/x86_64-linux-gnu/4.8/crtend.o"

Hi Jingyue,

Thanks for the instructions! I instrumented the device code and got a binary; however, the resulting executable always fails on the first cudaMalloc call in the host code (the kernel had not even been launched yet), with error code 30 (cudaErrorUnknown).

In my instrumentation pass, I only inserted a hook function upon each access to device memory, with the signature "__device__ void _Cool_MemRead_Hook(uint64_t addr)". I compiled these hook functions into a shared object and linked the axpy binary with it.

I’m really sorry to bother you again, but I wonder whether any of my steps was obviously wrong, or whether there’s any gpucc-specific step I need to take when instrumenting a kernel?


It’s hard to tell what is wrong without a concrete example. E.g., what is the program you are instrumenting? What is the definition of the hook function? How did you link that definition with the binary?

One thing that looks suspicious to me is that you may have linked the definition of _Cool_MemRead_Hook as a host function instead of a device function. AFAIK, PTX assembly cannot be linked. So, if you want that hook function called from your device code, you should merge the IR of the hook function and the IR of your device code into one IR (via linking or direct IR emitting) before lowering the IR to PTX.
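Concretely, that merge might look like this (a sketch; cuda_hooks-sm_20.bc is a hypothetical file holding the hook functions compiled to device bitcode):

```shell
# Link the hook IR and the instrumented device IR into one module, then
# lower the combined module to PTX.
llvm-link axpy-sm_20.bc cuda_hooks-sm_20.bc -o inst_axpy-sm_20.bc
llc inst_axpy-sm_20.bc -o inst_axpy-sm_20.ptx -march=nvptx64
```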

Hi Jingyue,

Thank you so much for the helpful response! I didn't know that PTX
assembly cannot be linked; that's likely the reason for my issue.

So I did the following as you suggested (axpy-sm_30.bc is the instrumented
bitcode, and cuda_hooks-sm_30.bc contains the hook functions):

llvm-link axpy-sm_30.bc cuda_hooks-sm_30.bc -o inst_axpy-sm_30.bc

llc inst_axpy-sm_30.bc -o axpy-sm_30.s

"/usr/local/cuda/bin/ptxas" "-m64" "-O3" -c "--gpu-name" "sm_30"
"--output-file" axpy-sm_30.o axpy-sm_30.s

However, I got the following error from ptxas:

ptxas axpy-sm_30.s, line 106; error : Duplicate definition of function
'_ZL21__nvvm_reflect_anchorv'

ptxas axpy-sm_30.s, line 106; fatal : Parsing error near '.2': syntax
error

ptxas fatal : Ptx assembly aborted due to errors

Looks like some CUDA function definitions are in both bitcode files, which
caused the duplicate definition... what am I supposed to do to resolve this



Looks like we are getting closer!

According to the examples you sent, I believe the linking issue was caused by nvvm reflection anchors. I haven’t played with that, but I guess running nvvm-reflect on an IR removes the nvvm reflect anchors. After that, you can llvm-link the two bc/ll files.

Another potential issue is that your cuda_hooks-sm_30.ll is unoptimized. This could cause the instrumented code to run super slow.

Hey Jingyue,

Though I tried opt -nvvm-reflect on both bc files, the nvvm reflect anchor didn’t go away; ptxas is still complaining about the duplicate definition of function ‘_ZL21__nvvm_reflect_anchorv’. Did I misuse the nvvm-reflect pass?


I’ve no idea. Without instrumentation, nvvm_reflect_anchor doesn’t appear in the final PTX, right? If that’s the case, some pass in llc must have deleted the anchor and you should be able to figure out which one.
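One way to look for that pass (a sketch; assumes the uninstrumented device bitcode is axpy-sm_30.bc) is to dump the IR after every codegen pass and see where the anchor disappears:

```shell
# -print-after-all writes the IR after each pass to stderr; grep shows the
# last pass after which the reflect anchor is still present.
llc axpy-sm_30.bc -o /dev/null -print-after-all 2> passes.log
grep -n "__nvvm_reflect_anchor" passes.log | tail
```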

Hey Jingyue,

Thanks for being so responsive! I finally figured out a way to resolve the issue: all I have to do is to use -only-needed when merging the device bitcodes with llvm-link.
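For reference, the merge step with -only-needed would look like this (a sketch; -only-needed tells llvm-link to import only the symbols the destination module actually references, which avoids pulling in duplicate definitions such as the reflect anchor):

```shell
llvm-link axpy-sm_30.bc cuda_hooks-sm_30.bc -only-needed -o inst_axpy-sm_30.bc
llc inst_axpy-sm_30.bc -o axpy-sm_30.s
```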

However, since we actually need to instrument the host code as well, I encountered another issue when I tried to glue the instrumented host code and the fatbin together. When I only instrumented the device code, I used the following command to do so:

“/mnt/wtf/tools/bin/clang-3.9” “-cc1” “-triple” “x86_64-unknown-linux-gnu” “-aux-triple” “nvptx64-nvidia-cuda” “-fcuda-target-overloads” “-fcuda-disable-target-call-checks” “-emit-obj” “-disable-free” “-main-file-name” “” “-mrelocation-model” “static” “-mthread-model” “posix” “-fmath-errno” “-masm-verbose” “-mconstructor-aliases” “-munwind-tables” “-fuse-init-array” “-target-cpu” “x86-64” “-momit-leaf-frame-pointer” “-dwarf-column-info” “-debugger-tuning=gdb” “-resource-dir” “/mnt/wtf/tools/bin/…/lib/clang/3.9.0” “-I” “/usr/local/cuda-7.0/samples/common/inc” “-internal-isystem” “/usr/lib/gcc/x86_64-linux-gnu/4.8/…/…/…/…/include/c++/4.8” “-internal-isystem” “/usr/lib/gcc/x86_64-linux-gnu/4.8/…/…/…/…/include/x86_64-linux-gnu/c++/4.8” “-internal-isystem” “/usr/lib/gcc/x86_64-linux-gnu/4.8/…/…/…/…/include/x86_64-linux-gnu/c++/4.8” “-internal-isystem” “/usr/lib/gcc/x86_64-linux-gnu/4.8/…/…/…/…/include/c++/4.8/backward” “-internal-isystem” “/usr/local/include” “-internal-isystem” “/mnt/wtf/tools/bin/…/lib/clang/3.9.0/include” “-internal-externc-isystem” “/usr/include/x86_64-linux-gnu” “-internal-externc-isystem” “/include” “-internal-externc-isystem” “/usr/include” “-internal-isystem” “/usr/local/cuda/include” “-include” “__clang_cuda_runtime_wrapper.h” “-O3” “-fdeprecated-macro” “-fdebug-compilation-dir” “/mnt/wtf/workspace/cuda/gpu-race-detection” “-ferror-limit” “19” “-fmessage-length” “291” “-pthread” “-fobjc-runtime=gcc” “-fcxx-exceptions” “-fexceptions” “-fdiagnostics-show-option” “-vectorize-loops” “-vectorize-slp” “-o” “axpy-host.o” “-x” “cuda” “tests/” “-fcuda-include-gpubinary” “axpy-sm_30.fatbin”

which, from my understanding, compiles the host code in tests/ and links it with axpy-sm_30.fatbin. However, now that I have instrumented the IR of the host code (axpy.bc) and run llc axpy.bc -o axpy.s, which command should I use to link axpy.s with axpy-sm_30.fatbin? I tried to use -cc1as, but the flag ‘-fcuda-include-gpubinary’ was not recognized.



Including the fatbin into the host code should be done in the frontend.

Hi Jingyue,

Sorry to ask again, but how exactly could I glue the fatbin with the instrumented host code? Or does it mean we actually cannot instrument both the host & device code at the same time?


When you generate axpy-host.bc, you should use “clang -cc1 …” with the “-fcuda-include-gpubinary” flag. “clang -cc1” invokes the frontend only.
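Following that suggestion, the host-side flow might be sketched as below (hypothetical file names; MyPass.so and -my-host-pass stand in for your instrumentation pass, and the full set of -cc1 flags should be copied from the `clang -###` dump):

```shell
# Run only the clang frontend, emitting host bitcode with the device fat
# binary already embedded; then instrument, lower, and link as usual.
clang -cc1 -triple x86_64-unknown-linux-gnu -aux-triple nvptx64-nvidia-cuda \
  -emit-llvm-bc -fcuda-include-gpubinary axpy-sm_30.fatbin \
  -o axpy-host.bc -x cuda axpy.cu
opt -load MyPass.so -my-host-pass axpy-host.bc -o axpy-host-inst.bc
llc axpy-host-inst.bc -o axpy-host.s
```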

Gotcha. Thank you sooooo much for all your invaluable help!