I’m looking into extending the PTX Clang target to support code generation from OpenCL/CUDA code, so I’m wondering about the current state of these two Clang implementations. As a test, I’ve implemented the AddrSpaceMap map in the PTX target in lib/Basic/Targets.cpp, but I’m not sure what other hooks are required. From what I can tell, basic functionality is working quite well! I hope to commit a small patch soon to support the AddrSpaceMap for PTX.
I’m currently investigating the following issues/concerns:
- What is the plan for language-specific functions and other constructs, such as __syncthreads/barrier, get_local_id/threadIdx, etc.? Is it up to the back-end to define compatible definitions of these, or is there a plan to introduce generic LLVM intrinsics for these? Since OpenCL has pre-defined functions that do not require header files, it may be awkward to require OpenCL to include a back-end specific header file when compiling with Clang.
- What is the status of the address space mapping? The CUDA frontend does not seem to respect the mapping (I get address-space-less alloca’s for shared arrays), and the OpenCL frontend seems to respect the address mapping but does not emit complete array definitions for locally-defined __local arrays. Does the front-end currently not support __local arrays embedded in the code? It seems to work if the __local arrays are passed as pointers to the kernel.
As an example of the OpenCL issue:
jholewinski@aquila [tests]$ cat kernel1.cl
__kernel
void foo(__global float* a) {
__local float buffer[64];
buffer[0] = a[0];
// PTX-specific intrinsic
__builtin_ptx_bar_sync(0);
a[0] = buffer[0];
}
jholewinski@aquila [tests]$ clang -ccc-host-triple ptx64 -S -emit-llvm kernel1.cl -o kernel1.ll
jholewinski@aquila [tests]$ cat kernel1.ll
; ModuleID = ‘kernel1.cl’
target datalayout = “e-p:64:64-i64:64:64-f64:64:64-n1:8:16:32:64”
target triple = “ptx64–”
@foo.buffer.0 = internal addrspace(4) unnamed_addr global float 0.000000e+00
define ptx_kernel void @foo(float* nocapture %a) nounwind {
entry:
%0 = load float* %a, align 4, !tbaa !1
store float %0, float addrspace(4)* @foo.buffer.0, align 4, !tbaa !1
tail call void @llvm.ptx.bar.sync(i32 0)
%1 = load float addrspace(4)* @foo.buffer.0, align 4, !tbaa !1
store float %1, float* %a, align 4, !tbaa !1
ret void
}
declare void @llvm.ptx.bar.sync(i32) nounwind
!opencl.kernels = !{!0}
!0 = metadata !{void (float*)* @foo}
!1 = metadata !{metadata !“float”, metadata !2}
!2 = metadata !{metadata !“omnipotent char”, metadata !3}
!3 = metadata !{metadata !“Simple C/C++ TBAA”, null}
The definition of the local array is present in the LLVM IR, but it does not provide an array size.