I am trying to generate PTX code for 'nbody' sample program's kernel
(nbody_kernel.cu) using clang/LLVM version 3.2. The nbody CUDA program is
available in Nvidia's SDK.
I replaced the global indexes, e.g. threadIdx.x, with
__builtin_ptx_read_tid_x() and the other corresponding builtins. There are
no problems generating the LLVM IR (i.e. the .ll file). The error pops up
while trying to generate PTX from the IR using llc.
Any pointers on what might be going on here? I would appreciate any help in
going forward.
The problem you are seeing is because clang is putting the global variables in address space 0, which in NVPTX means the generic address space. PTX does not allow this, so the back-end should be printing an error for you. Are you using trunk or 3.2?
Generally, clang won’t be compatible with the CUDA Toolkit headers. If you want to use the constant modifier from CUDA in Clang, define it like so:
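A minimal definition might look like the following (this is a sketch: it assumes NVPTX's address-space numbering, where 4 is the constant space, and `softeningSquared` is just a hypothetical variable):

```cuda
// Hedged sketch: map CUDA's __constant__ qualifier onto a Clang
// address-space attribute. Address space 4 is NVPTX's constant space
// (see lib/Target/NVPTX/MCTargetDesc/NVPTXBaseInfo.h).
#define __constant__ __attribute__((address_space(4)))

__constant__ float softeningSquared; // emitted into the PTX .const space
```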
Another question is
What about extern __shared__ ?
I can see that the error goes away if I replace "extern __shared__ float4
sharedPos[];" with "__shared__ float4* sharedPos;". Do I have to dynamically
allocate the shared memory by specifying its size at kernel launch? If so,
why doesn't the second use of the same statement in another function cause
the error?
I am using 3.2.
I would just do away with the toolkit headers. I may try to put together
some minimalistic headers for clang w/ nvptx at some point. Your best bet
is to just define what you need yourself for now.
Either using [] or * should work. Just be aware that you will need to
specify a shared size when you launch the kernel. You can get the address
space mapping from lib/Target/NVPTX/MCTargetDesc/NVPTXBaseInfo.h.
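As a sketch of the [] form (the kernel body and launch line are illustrative assumptions, not code from the nbody sample):

```cuda
// Dynamically-sized shared memory: declared with an empty [],
// with the actual byte count supplied as the third <<<...>>>
// launch parameter on the host side.
extern __shared__ float4 sharedPos[];

__global__ void integrateBodies(float4 *pos, int n) {
    int tid = __builtin_ptx_read_tid_x();
    if (tid < n)
        sharedPos[tid] = pos[tid]; // hypothetical use of the shared array
}

// Host side: reserve block.x * sizeof(float4) bytes of shared memory.
// integrateBodies<<<grid, block, block.x * sizeof(float4)>>>(pos, n);
```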
I will remove the toolkit headers. Just one last question (maybe): if I
do away with the toolkit headers, clang says "unknown type name '__device__'".
Does this function qualifier have an alternative, or can I just do away with
it?
Not really. Clang does not have a way to annotate device vs. kernel functions in C/C++ mode. You’re probably better off trying to use OpenCL or CUDA mode in clang.
In the clang unit tests, there is a cuda.h header that provides very basic support for these keywords: tests/SemaCUDA/cuda.h
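For reference, that header boils down to a handful of attribute macros, roughly like the following (an approximation of the clang 3.x version; check your tree for the exact contents):

```cuda
// Approximate sketch of tests/SemaCUDA/cuda.h: the CUDA keywords are
// defined in terms of clang's CUDA function/variable attributes.
#define __device__   __attribute__((device))
#define __global__   __attribute__((global))
#define __host__     __attribute__((host))
#define __shared__   __attribute__((shared))
#define __constant__ __attribute__((constant))
```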
If you compile as CUDA (use .cu extension, or “-x cuda”) and use this header, you will have basic support. You can invoke clang with something like:
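For example (the paths and exact flags here are illustrative; -fcuda-is-device is a cc1 option, hence the -Xclang prefix):

```shell
# Compile the .cu file as CUDA device code straight to PTX; the -I path
# should point at the directory containing the minimal cuda.h from the
# clang test suite.
clang -x cuda -Xclang -fcuda-is-device \
      -I /path/to/clang/tests/SemaCUDA -include cuda.h \
      -target nvptx64 -S nbody_kernel.cu -o nbody_kernel.ptx
```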