OpenCL/CUDA Interop with PTX Back-End

I’m looking into extending the PTX Clang target to support code generation from OpenCL/CUDA code, so I’m wondering about the current state of these two Clang implementations. As a test, I’ve implemented the AddrSpaceMap map in the PTX target in lib/Basic/Targets.cpp, but I’m not sure what other hooks are required. From what I can tell, basic functionality is working quite well! I hope to commit a small patch soon to support the AddrSpaceMap for PTX.

I’m currently investigating the following issues/concerns:

  1. What is the plan for language-specific functions and other constructs, such as __syncthreads/barrier, get_local_id/threadIdx, etc.? Is it up to the back-end to define compatible definitions of these, or is there a plan to introduce generic LLVM intrinsics for these? Since OpenCL has pre-defined functions that do not require header files, it may be awkward to require OpenCL to include a back-end specific header file when compiling with Clang.
  2. What is the status of the address space mapping? The CUDA frontend does not seem to respect the mapping (I get address-space-less allocas for __shared__ arrays), and the OpenCL frontend seems to respect the address mapping but does not emit complete array definitions for locally-defined __local arrays. Does the front-end currently not support __local arrays embedded in the code? It seems to work if the __local arrays are passed as pointers to the kernel.
    As an example of the OpenCL issue:

jholewinski@aquila [tests]$ cat kernel1.cl
__kernel
void foo(__global float* a) {
  __local float buffer[64];
  buffer[0] = a[0];
  // PTX-specific intrinsic
  __builtin_ptx_bar_sync(0);
  a[0] = buffer[0];
}

jholewinski@aquila [tests]$ clang -ccc-host-triple ptx64 -S -emit-llvm kernel1.cl -o kernel1.ll
jholewinski@aquila [tests]$ cat kernel1.ll
; ModuleID = 'kernel1.cl'
target datalayout = "e-p:64:64-i64:64:64-f64:64:64-n1:8:16:32:64"
target triple = "ptx64--"

@foo.buffer.0 = internal addrspace(4) unnamed_addr global float 0.000000e+00

define ptx_kernel void @foo(float* nocapture %a) nounwind {
entry:
%0 = load float* %a, align 4, !tbaa !1
store float %0, float addrspace(4)* @foo.buffer.0, align 4, !tbaa !1
tail call void @llvm.ptx.bar.sync(i32 0)
%1 = load float addrspace(4)* @foo.buffer.0, align 4, !tbaa !1
store float %1, float* %a, align 4, !tbaa !1
ret void
}

declare void @llvm.ptx.bar.sync(i32) nounwind

!opencl.kernels = !{!0}

!0 = metadata !{void (float*)* @foo}
!1 = metadata !{metadata !"float", metadata !2}
!2 = metadata !{metadata !"omnipotent char", metadata !3}
!3 = metadata !{metadata !"Simple C/C++ TBAA", null}

The definition of the local array is present in the LLVM IR, but it does not provide an array size.

I'm looking into extending the PTX Clang target to support code generation
from OpenCL/CUDA code, so I'm wondering about the current state of these two
Clang implementations. As a test, I've implemented the AddrSpaceMap map in
the PTX target in lib/Basic/Targets.cpp, but I'm not sure what other hooks
are required. From what I can tell, basic functionality is working quite
well! I hope to commit a small patch soon to support the AddrSpaceMap for
PTX.

The AddrSpaceMap you added for PTX seems to be correct. The main
other hook for OpenCL is the CGOpenCLRuntime class, but PTX should
be able to use the default implementation.

I'm currently investigating the following issues/concerns:

   1. What is the plan for language-specific functions and other constructs,
   such as __syncthreads/barrier, get_local_id/threadIdx, etc.? Is it up to
   the back-end to define compatible definitions of these, or is there a plan
   to introduce generic LLVM intrinsics for these? Since OpenCL has
   pre-defined functions that do not require header files, it may be awkward to
   require OpenCL to include a back-end specific header file when compiling
   with Clang.

For OpenCL, the implementation should provide definitions of
the built-in functions described in section 6.11 of the OpenCL
specification. For at least some of those functions, the definitions
would be the same for any OpenCL implementation. (FWIW, I have
developed a set of generic implementations of section 6.11 built-ins
as part of an OpenCL implementation I have been working on, which I
will be open sourcing soon.)

For the rest (e.g. work-item functions), the implementation would
need to be specific to the OpenCL implementation. For example, on
a CPU, the exact implementation details of work-item functions would
be highly dependent on how the implementation stores work-item IDs,
so it would not be appropriate to use a generic intrinsic.
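As an illustration of why a generic intrinsic would not fit, a CPU implementation might keep the current work-item IDs in thread-local storage that its work-group scheduler updates. The following is only a sketch under that assumption; the names (set_work_item, get_local_id_impl) are hypothetical and not from any existing runtime:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread work-item state; a real CPU runtime would
   populate this when it dispatches each work-item. */
static _Thread_local size_t current_ids[3];

/* Runtime-internal hook called by the work-group scheduler. */
static void set_work_item(size_t x, size_t y, size_t z) {
    current_ids[0] = x;
    current_ids[1] = y;
    current_ids[2] = z;
}

/* The section 6.11 built-in, implemented against that state.
   Out-of-range dimensions return 0, as the specification requires. */
static size_t get_local_id_impl(unsigned dim) {
    return dim < 3 ? current_ids[dim] : 0;
}
```

A GPU back-end would instead map the same built-in onto a hardware register read, which is exactly why the implementation cannot be generic.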

For CUDA, the NVIDIA header files provide appropriate declarations,
but as far as I can tell, variables such as threadIdx are handled
specially by nvcc, and functions such as __syncthreads are treated
as builtins. Clang does not currently implement the special handling
for these variables or functions.

   2. What is the status of the address space mapping? The CUDA frontend
   does not seem to respect the mapping (I get address-space-less allocas for
   __shared__ arrays)

Clang does not currently implement CUDA address spaces correctly.
The major challenge is that OpenCL (and LLVM) address spaces are, in
C/C++ language terms, type qualifiers, while CUDA address spaces are
declaration specifiers (despite being referred to as "type qualifiers"
in the documentation). This means that in CUDA, pointers lack correct
type information. nvcc performs type inference to determine whether
a pointer is to __shared__ or __device__ memory, which Clang does
not currently implement.
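The declaration-specifier problem can be modelled in plain C. Here __cuda_shared__ is a stand-in macro (not a real CUDA keyword) that expands to nothing, which makes the issue visible: the specifier vanishes, so the pointer type carries no address-space information:

```c
#include <assert.h>

/* CUDA's __shared__ is a declaration specifier, not a type qualifier;
   modelling it as an empty macro shows the consequence: it leaves no
   trace in the type system. */
#define __cuda_shared__

void kernel_body(float *out) {
    __cuda_shared__ float buf[64];  /* "shared" memory in CUDA terms */
    float *p = buf;                 /* p is an ordinary float*; nvcc must
                                       infer that it points to shared
                                       memory, which Clang does not do */
    p[0] = 1.0f;
    out[0] = p[0];
}
```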

As an alternative to implementing type inference, we could require
CUDA users to use a board with a unified address space, but this
is sub-optimal, and still depends on some support from Clang.

and the OpenCL frontend seems to respect the address
   mapping but does not emit complete array definitions for locally-defined
   __local arrays. Does the front-end currently not support __local arrays
   embedded in the code? It seems to work if the __local arrays are passed as
   pointers to the kernel.

Clang should support __local arrays, and this looks like a genuine
bug in the IR generator. I will investigate.

Thanks,

This actually seems to be an optimisation. Since only the first
element of the array is accessed, LLVM will only allocate storage for
that element. If you compile your example with -O0 (OpenCL compiles
with optimisations turned on by default), you will see that the 64
element array is created.

Thanks,

I’m looking into extending the PTX Clang target to support code generation
from OpenCL/CUDA code, so I’m wondering about the current state of these two
Clang implementations. As a test, I’ve implemented the AddrSpaceMap map in
the PTX target in lib/Basic/Targets.cpp, but I’m not sure what other hooks
are required. From what I can tell, basic functionality is working quite
well! I hope to commit a small patch soon to support the AddrSpaceMap for
PTX.

The AddrSpaceMap you added for PTX seems to be correct. The main
other hook for OpenCL is the CGOpenCLRuntime class, but PTX should
be able to use the default implementation.

I’m currently investigating the following issues/concerns:

  1. What is the plan for language-specific functions and other constructs,
     such as __syncthreads/barrier, get_local_id/threadIdx, etc.? Is it up to
     the back-end to define compatible definitions of these, or is there a plan
     to introduce generic LLVM intrinsics for these? Since OpenCL has
     pre-defined functions that do not require header files, it may be awkward
     to require OpenCL to include a back-end specific header file when
     compiling with Clang.

For OpenCL, the implementation should provide definitions of
the built-in functions described in section 6.11 of the OpenCL
specification. For at least some of those functions, the definitions
would be the same for any OpenCL implementation. (FWIW, I have
developed a set of generic implementations of section 6.11 built-ins
as part of an OpenCL implementation I have been working on, which I
will be open sourcing soon.)

For the rest (e.g. work-item functions), the implementation would
need to be specific to the OpenCL implementation. For example, on
a CPU, the exact implementation details of work-item functions would
be highly dependent on how the implementation stores work-item IDs,
so it would not be appropriate to use a generic intrinsic.

Right. I’m wondering what the implementation plan for this is with Clang. Are you going to expose the OpenCL functions as LLVM intrinsics, and let back-ends provide appropriate implementations? Right now, I’m defining these functions in terms of PTX builtin functions, but this is obviously not optimal because you need to include an additional header in OpenCL code.

For CUDA, the NVIDIA header files provide appropriate declarations,
but as far as I can tell, variables such as threadIdx are handled
specially by nvcc, and functions such as __syncthreads are treated
as builtins. Clang does not currently implement the special handling
for these variables or functions.

Are there any plans to implement any of these?

  2. What is the status of the address space mapping? The CUDA frontend
     does not seem to respect the mapping (I get address-space-less allocas for
     __shared__ arrays)

Clang does not currently implement CUDA address spaces correctly.
The major challenge is that OpenCL (and LLVM) address spaces are, in
C/C++ language terms, type qualifiers, while CUDA address spaces are
declaration specifiers (despite being referred to as “type qualifiers”
in the documentation). This means that in CUDA, pointers lack correct
type information. nvcc performs type inference to determine whether
a pointer is to __shared__ or __device__ memory, which Clang does
not currently implement.

Fair enough. OpenCL is my main interest at the moment. Ideally, I would like to create a semi-functional workflow from OpenCL → PTX with Clang before the LLVM 3.0 branch.

and the OpenCL frontend seems to respect the address
mapping but does not emit complete array definitions for locally-defined
__local arrays. Does the front-end currently not support __local arrays
embedded in the code? It seems to work if the __local arrays are passed as
pointers to the kernel.

Clang should support __local arrays, and this looks like a genuine
bug in the IR generator. I will investigate.

This actually seems to be an optimisation. Since only the first
element of the array is accessed, LLVM will only allocate storage for
that element. If you compile your example with -O0 (OpenCL compiles
with optimisations turned on by default), you will see that the 64
element array is created.

I’m not really convinced this is a legal optimization. What if you purposely allocate arrays with extra padding to prevent bank conflicts in the kernel?

> > I'm currently investigating the following issues/concerns:
> >
> > 1. What is the plan for language-specific functions and other constructs,
> > such as __syncthreads/barrier, get_local_id/threadIdx, etc.? Is it up to
> > the back-end to define compatible definitions of these, or is there a plan
> > to introduce generic LLVM intrinsics for these? Since OpenCL has
> > pre-defined functions that do not require header files, it may be
> > awkward to require OpenCL to include a back-end specific header file when
> > compiling with Clang.
>
> For OpenCL, the implementation should provide definitions of
> the built-in functions described in section 6.11 of the OpenCL
> specification. For at least some of those functions, the definitions
> would be the same for any OpenCL implementation. (FWIW, I have
> developed a set of generic implementations of section 6.11 built-ins
> as part of an OpenCL implementation I have been working on, which I
> will be open sourcing soon.)
>
> For the rest (e.g. work-item functions), the implementation would
> need to be specific to the OpenCL implementation. For example, on
> a CPU, the exact implementation details of work-item functions would
> be highly dependent on how the implementation stores work-item IDs,
> so it would not be appropriate to use a generic intrinsic.

Right. I'm wondering what the implementation plan for this is with Clang.
Are you going to expose the OpenCL functions as LLVM intrinsics, and let
back-ends provide appropriate implementations? Right now, I'm defining
these functions in terms of PTX builtin functions, but this is obviously not
optimal because you need to include an additional header in OpenCL code.

This is how I imagine the built-ins should be implemented:

The built-in functions would be declared by a header file that belongs
to an OpenCL C runtime library (not to be confused with the OpenCL
Platform Layer or OpenCL Runtime defined by sections 4 and 5 of the
OpenCL specification). The runtime library in this case would consist
of a set of header files and (optionally) a static or shared library
file which together implement section 6.11 of the OpenCL specification.
The runtime library as a project would be a separate project from Clang
(but it may be a potential LLVM sub-project).

The driver would be extended to support locating the runtime
library's main header file, which could be installed in a known
location, pre-including it using the -include command line option
to the frontend (so that the functions declared by the header file
are available to every OpenCL program), and setting linker options
so that the runtime library is linked into the final executable.

Since my implementation of OpenCL is slightly unconventional (it
is built into KLEE, a symbolic execution engine) I have not needed
to implement any of the driver functionality (KLEE calls into the
frontend directly, and the paths to the header and library files
are hardcoded paths into the KLEE source and build directories),
so I haven't thought too closely about the details.

> For CUDA, the NVIDIA header files provide appropriate declarations,
> but as far as I can tell, variables such as threadIdx are handled
> specially by nvcc, and functions such as __syncthreads are treated
> as builtins. Clang does not currently implement the special handling
> for these variables or functions.
>

Are there any plans to implement any of these?

I doubt that I will have time to implement this myself, and I am
unaware of anyone else who is willing to.

> > > and the OpenCL frontend seems to respect the address
> > > mapping but does not emit complete array definitions for locally-defined
> > > __local arrays. Does the front-end currently not support __local arrays
> > > embedded in the code? It seems to work if the __local arrays are passed
> > > as pointers to the kernel.
> >
> > Clang should support __local arrays, and this looks like a genuine
> > bug in the IR generator. I will investigate.
>
> This actually seems to be an optimisation. Since only the first
> element of the array is accessed, LLVM will only allocate storage for
> that element. If you compile your example with -O0 (OpenCL compiles
> with optimisations turned on by default), you will see that the 64
> element array is created.

I'm not really convinced this is a legal optimization. What if you
purposely allocate arrays with extra padding to prevent bank conflicts in
the kernel?

Preventing bank conflicts is a reasonable thing for one to want to do,
but allocating arrays with extra padding is not a standards-compliant
way to do it, given that (as far as I'm aware) the OpenCL specification
says nothing about how storage is allocated. If you are willing
to go outside the requirements of the specification, Clang supports
the C1X _Alignas keyword as an extension in all languages. So for
example if you know that the target bank size is 1024, you could write:

_Alignas(1024) __local float buffer[64];

Thanks,

Hi,

I can report my experience with my OpenCL runtime. First of all, consider
that I'm developing it in my spare time.

The built-in functions would be declared by a header file that belongs
to an OpenCL C runtime library (not to be confused with the OpenCL
Platform Layer or OpenCL Runtime defined by sections 4 and 5 of the
OpenCL specification). The runtime library in this case would consist
of a set of header files and (optionally) a static or shared library
file which together implement section 6.11 of the OpenCL specification.
The runtime library as a project would be a separate project from Clang
(but it may be a potential LLVM sub-project).

The first observation is that many OpenCL C library functions are
overloaded for vector types. The semantics of those functions are simply
to apply the scalar version of the function to all elements of the
vector. Thus, I described the runtime functions I need through TableGen
files, then I wrote a couple of TableGen backends. The first generates
the implementations of the vector versions from the scalar versions. The
scalar versions are coded by hand, but so far I have only needed
functions that map easily onto a math builtin (e.g. cos*, sin*, ...). The
second generates an "ocldef.h" header file, which has been added to the
clang Headers lib. That file is automatically included by clang every
time it compiles OpenCL C sources.

The driver would be extended to support locating the runtime
library's main header file, which could be installed in a known
location, pre-including it using the -include command line option
to the frontend (so that the functions declared by the header file
are available to every OpenCL program), and setting linker options
so that the runtime library is linked into the final executable.

Linking is performed at codegen time inside the OpenCL runtime. I
currently support only CPU devices, so I use the llvm JIT to generate
the kernel and link it with the OpenCL C runtime library. The library
must be in the standard llvm bitcode library path -- it is distributed
in bitcode format.

To support my runtime, I have to generate some metadata about kernels
when compiling OpenCL C. Thus I have subclassed the CGOpenCLRuntime
class provided by Peter, adding my extra functionality. The last
thing I have added to clang is fixing the size of data types -- e.g. int
must be 32 bits on all targets.

The runtime will be released as soon as it reaches a minimum level of
quality. As for the clang patches, I can send them to the ML if anyone is
interested.

Best regards,
speziale.ettore@gmail.com

I’m currently investigating the following issues/concerns:

  1. What is the plan for language-specific functions and other constructs,
     such as __syncthreads/barrier, get_local_id/threadIdx, etc.? Is it up to
     the back-end to define compatible definitions of these, or is there a plan
     to introduce generic LLVM intrinsics for these? Since OpenCL has
     pre-defined functions that do not require header files, it may be awkward
     to require OpenCL to include a back-end specific header file when
     compiling with Clang.

For OpenCL, the implementation should provide definitions of
the built-in functions described in section 6.11 of the OpenCL
specification. For at least some of those functions, the definitions
would be the same for any OpenCL implementation. (FWIW, I have
developed a set of generic implementations of section 6.11 built-ins
as part of an OpenCL implementation I have been working on, which I
will be open sourcing soon.)

For the rest (e.g. work-item functions), the implementation would
need to be specific to the OpenCL implementation. For example, on
a CPU, the exact implementation details of work-item functions would
be highly dependent on how the implementation stores work-item IDs,
so it would not be appropriate to use a generic intrinsic.

Right. I’m wondering what the implementation plan for this is with Clang.
Are you going to expose the OpenCL functions as LLVM intrinsics, and let
back-ends provide appropriate implementations? Right now, I’m defining
these functions in terms of PTX builtin functions, but this is obviously not
optimal because you need to include an additional header in OpenCL code.

This is how I imagine the built-ins should be implemented:

The built-in functions would be declared by a header file that belongs
to an OpenCL C runtime library (not to be confused with the OpenCL
Platform Layer or OpenCL Runtime defined by sections 4 and 5 of the
OpenCL specification). The runtime library in this case would consist
of a set of header files and (optionally) a static or shared library
file which together implement section 6.11 of the OpenCL specification.
The runtime library as a project would be a separate project from Clang
(but it may be a potential LLVM sub-project).

The driver would be extended to support locating the runtime
library’s main header file, which could be installed in a known
location, pre-including it using the -include command line option
to the frontend (so that the functions declared by the header file
are available to every OpenCL program), and setting linker options
so that the runtime library is linked into the final executable.

This makes sense to me. The run-time library for PTX would be fairly easy, since it would mostly just be stubs that call into PTX builtin functions.

Since my implementation of OpenCL is slightly unconventional (it
is built into KLEE, a symbolic execution engine) I have not needed
to implement any of the driver functionality (KLEE calls into the
frontend directly, and the paths to the header and library files
are hardcoded paths into the KLEE source and build directories),
so I haven’t thought too closely about the details.

For CUDA, the NVIDIA header files provide appropriate declarations,
but as far as I can tell, variables such as threadIdx are handled
specially by nvcc, and functions such as __syncthreads are treated
as builtins. Clang does not currently implement the special handling
for these variables or functions.

Are there any plans to implement any of these?

I doubt that I will have time to implement this myself, and I am
unaware of anyone else who is willing to.

I may take a look at the code to see what all would be involved.

Hi,

I can report my experience with my OpenCL runtime. First of all, consider
that I'm developing it in my spare time.

> The built-in functions would be declared by a header file that belongs
> to an OpenCL C runtime library (not to be confused with the OpenCL
> Platform Layer or OpenCL Runtime defined by sections 4 and 5 of the
> OpenCL specification). The runtime library in this case would consist
> of a set of header files and (optionally) a static or shared library
> file which together implement section 6.11 of the OpenCL specification.
> The runtime library as a project would be a separate project from Clang
> (but it may be a potential LLVM sub-project).

The first observation is that many OpenCL C library functions are
overloaded for vector types. The semantics of those functions are simply
to apply the scalar version of the function to all elements of the
vector. Thus, I described the runtime functions I need through TableGen
files, then I wrote a couple of TableGen backends. The first generates
the implementations of the vector versions from the scalar versions. The
scalar versions are coded by hand, but so far I have only needed
functions that map easily onto a math builtin (e.g. cos*, sin*, ...). The
second generates an "ocldef.h" header file, which has been added to the
clang Headers lib. That file is automatically included by clang every
time it compiles OpenCL C sources.

...

The runtime will be released as soon as it reaches a minimum level of
quality. As for the clang patches, I can send them to the ML if anyone is
interested.

I would be very interested in gaining access to your OpenCL C runtime
library, even if in an incomplete state, since I have also been working
on a runtime library, and it would be best to avoid duplicating effort.

If you like, I can provide assistance with open sourcing it. I can
also ask the LLVM project maintainers if they would be willing to
create a sub-project as I mentioned above.

Coincidentally enough, I have been working on separating the TableGen
parser from its backends. So if your runtime library project were
separated from Clang, it would be relatively trivial to maintain its
TableGen backends in the runtime library tree.

Thanks,

I’m currently investigating the following issues/concerns:

  1. What is the plan for language-specific functions and other constructs,
     such as __syncthreads/barrier, get_local_id/threadIdx, etc.? Is it up to
     the back-end to define compatible definitions of these, or is there a plan
     to introduce generic LLVM intrinsics for these? Since OpenCL has
     pre-defined functions that do not require header files, it may be awkward
     to require OpenCL to include a back-end specific header file when
     compiling with Clang.

For OpenCL, the implementation should provide definitions of
the built-in functions described in section 6.11 of the OpenCL
specification. For at least some of those functions, the definitions
would be the same for any OpenCL implementation. (FWIW, I have
developed a set of generic implementations of section 6.11 built-ins
as part of an OpenCL implementation I have been working on, which I
will be open sourcing soon.)

For the rest (e.g. work-item functions), the implementation would
need to be specific to the OpenCL implementation. For example, on
a CPU, the exact implementation details of work-item functions would
be highly dependent on how the implementation stores work-item IDs,
so it would not be appropriate to use a generic intrinsic.

Right. I’m wondering what the implementation plan for this is with Clang.
Are you going to expose the OpenCL functions as LLVM intrinsics, and let
back-ends provide appropriate implementations? Right now, I’m defining
these functions in terms of PTX builtin functions, but this is obviously not
optimal because you need to include an additional header in OpenCL code.

This is how I imagine the built-ins should be implemented:

The built-in functions would be declared by a header file that belongs
to an OpenCL C runtime library (not to be confused with the OpenCL
Platform Layer or OpenCL Runtime defined by sections 4 and 5 of the
OpenCL specification). The runtime library in this case would consist
of a set of header files and (optionally) a static or shared library
file which together implement section 6.11 of the OpenCL specification.
The runtime library as a project would be a separate project from Clang
(but it may be a potential LLVM sub-project).

The driver would be extended to support locating the runtime
library’s main header file, which could be installed in a known
location, pre-including it using the -include command line option
to the frontend (so that the functions declared by the header file
are available to every OpenCL program), and setting linker options
so that the runtime library is linked into the final executable.

This makes sense to me. The run-time library for PTX would be fairly easy, since it would mostly just be stubs that call into PTX builtin functions.

Since my implementation of OpenCL is slightly unconventional (it
is built into KLEE, a symbolic execution engine) I have not needed
to implement any of the driver functionality (KLEE calls into the
frontend directly, and the paths to the header and library files
are hardcoded paths into the KLEE source and build directories),
so I haven’t thought too closely about the details.

For CUDA, the NVIDIA header files provide appropriate declarations,
but as far as I can tell, variables such as threadIdx are handled
specially by nvcc, and functions such as __syncthreads are treated
as builtins. Clang does not currently implement the special handling
for these variables or functions.

Are there any plans to implement any of these?

I doubt that I will have time to implement this myself, and I am
unaware of anyone else who is willing to.

I may take a look at the code to see what all would be involved.

and the OpenCL frontend seems to respect the address
mapping but does not emit complete array definitions for locally-defined
__local arrays. Does the front-end currently not support __local arrays
embedded in the code? It seems to work if the __local arrays are passed as
pointers to the kernel.

Clang should support __local arrays, and this looks like a genuine
bug in the IR generator. I will investigate.

This actually seems to be an optimisation. Since only the first
element of the array is accessed, LLVM will only allocate storage for
that element. If you compile your example with -O0 (OpenCL compiles
with optimisations turned on by default), you will see that the 64
element array is created.

I’m not really convinced this is a legal optimization. What if you
purposely allocate arrays with extra padding to prevent bank conflicts in
the kernel?

Preventing bank conflicts is a reasonable thing for one to want to do,
but allocating arrays with extra padding is not a standards-compliant
way to do it, given that (as far as I’m aware) the OpenCL specification
says nothing about how storage is allocated. If you are willing
to go outside the requirements of the specification, Clang supports
the C1X _Alignas keyword as an extension in all languages. So for
example if you know that the target bank size is 1024, you could write:

_Alignas(1024) __local float buffer[64];

Peter, one more question. Is the “opencl.kernels” metadata a permanent thing, or is it a short-term hack? I ask because I’m working on how to identify kernel vs. device functions in the PTX back-end. The way I see it, I have two options:

  1. Use a pass in the back-end to assign the proper calling convention to each function, if the metadata is present.
  2. Modify Clang (maybe through an extension of the CGOpenCLRuntime class) to set the proper PTX calling convention when in OpenCL-mode.

I really don't like this metadata at all, and I'd prefer that we get
rid of it, but I think that ARM's (non-open-source) GPU backend relies
on it.

I think that the best approach is to modify Clang to set the
correct calling convention. Specifically, we can override the
SetTargetAttributes function in PTXTargetCodeGenInfo to set the
calling convention based on the presence of the attribute if we are
in OpenCL mode.

Thanks,

Peter, one more question. Is the “opencl.kernels” metadata a permanent
thing, or is it a short-term hack? I ask because I’m working on how to
identify kernel vs. device functions in the PTX back-end. The way I see it,
I have two options:

  1. Use a pass in the back-end to assign the proper calling convention to
     each function, if the metadata is present.
  2. Modify Clang (maybe through an extension of the CGOpenCLRuntime class)
     to set the proper PTX calling convention when in OpenCL-mode.

I really don’t like this metadata at all, and I’d prefer that we get
rid of it, but I think that ARM’s (non-open-source) GPU backend relies
on it.

I think that the best approach is to modify Clang to set the
correct calling convention. Specifically, we can override the
SetTargetAttributes function in PTXTargetCodeGenInfo to set the
calling convention based on the presence of the attribute if we are
in OpenCL mode.

I agree, but I wasn’t sure if the “opencl.kernels” metadata was the “proper” way to do it with Clang.

I just committed a patch for this. The only variation from what you suggested is an additional check to force device functions to be the default in PTXABIInfo. Otherwise, if kernel functions were the default (as could be the case, depending on the triple), some of the optimization logic was getting confused and replacing entire functions with llvm.trap calls due to calling convention mismatches. I think the problem had to do with device functions being marked as kernel functions initially, then changed to device functions in SetTargetAttributes.

Hi,

I would be very interested in gaining access to your OpenCL C runtime
library, even if in an incomplete state, since I have also been working
on a runtime library, and it would be best to avoid duplicating effort.

I agree.

If you like, I can provide assistance with open sourcing it. I can
also ask the LLVM project maintainers if they would be willing to
create a sub-project as I mentioned above.

There are no problems on my side, if the quality is high enough for you.
My goal was to produce, as quickly as possible, a minimal library to
compile the AMD benchmarks. I wrote the generator and checked it for the
functions I needed -- gentype foo(gentype) -- but I think the code is
general enough to be easily extended.

I attach the following files:

OCLGen.tar.gz: the TableGen backends. There are two backends, one for
generating implementations, the other for the ocldef.h file. The
backends are built inside my project, but I was not aware of your work
on refactoring TableGen, so I have re-used the old TableGen source to
build the OCLGen driver.

GenLib.tar.gz: the .td files used as TableGen inputs. More specifically,
Math.td is a partial definition of the OpenCL math library, OCLDef.td
must include all the library *.td files and is used to generate the
ocldef.h header, and OCLTypes.td should contain the OpenCL C type
definitions -- for now it includes only the types I need -- and generic
builtin declarations.

Math.inc: the subset of the math library generated by the backend.

ocldef.h: the header file generated by the backend. In order to avoid
clashes with the standard math library and to avoid major modifications
to clang/llvm, I have prefixed all OpenCL builtins with the
__builtin_ocl_ prefix. At the end of the file, I define a macro for each
builtin, allowing the undecorated name to be used -- I do not know if
this is the right approach, but it seems to work on X86 and X86_64.

Coincidentally enough, I have been working on separating the TableGen
parser from its backends. So if your runtime library project were
separated from Clang, it would be relatively trivial to maintain its
TableGen backends in the runtime library tree.

So, I think I need to update my code. Thank you for the news.

I apologize for the delay -- deadlines,
speziale.ettore@gmail.com

OCLGen.tar.gz (7.66 KB)

GenLib.tar.gz (1.43 KB)

Math.inc (18.7 KB)

ocldef.h (6.74 KB)