ANN: libclc (OpenCL C library implementation)

Hi,

This is to announce the availability of libclc, an open source, BSD
licensed implementation of the library requirements of the OpenCL C
programming language, as specified by the OpenCL 1.1 Specification.
libclc is intended to be used with Clang's OpenCL frontend.

libclc website: http://www.pcc.me.uk/~peter/libclc/

libclc is designed to be portable and extensible. To this end,
it provides generic implementations of most library requirements,
allowing the target to override the generic implementation at the
granularity of individual functions.

libclc currently only supports the PTX target, but support for more
targets is welcome.

How does this project relate to the recently announced Portable OpenCL
(POCL) project? Unlike POCL, this project is not intended to provide
an OpenCL host library (i.e. the OpenCL Platform Layer and OpenCL
Runtime specified in sections 4-5 of the OpenCL specification).
Instead, it provides only the requirements for the OpenCL C
Programming Language (section 6 et seq). It is intended to be used
with an existing host library implementation, and comply with its
ABI requirements.

An example of such a host library is NVIDIA's OpenCL host library
for PTX -- the intention is to at some point provide a mechanism
for using the NVIDIA implementation of OpenCL with Clang, libclc
and LLVM's PTX backend instead of NVIDIA's own OpenCL compiler.
Another example would be POCL's host library, and the POCL developers
have expressed an interest in using libclc as their OpenCL C library
instead of developing their own.

I hope to find time over the next few weeks to add libclc support
to the Clang driver. The intention is that compiling OpenCL C programs
to PTX would be as easy as something like:

clang -target ptx32 -S file.cl

such that the driver would automatically locate the libclc headers,
add them to the include path and pre-include the main header file.
(The libclc support will of course be optional, and a -cl-stdlib=
flag will be provided to allow for switching between OpenCL standard
library implementations.)

Thanks,

Just a followup note to say that libclc is at a very early stage
of development, and that only a few functions are implemented.
Developers wanted :)

Thanks,

Hi,

An example of such a host library is NVIDIA’s OpenCL host library
for PTX – the intention is to at some point provide a mechanism
for using the NVIDIA implementation of OpenCL with Clang, libclc
and LLVM’s PTX backend instead of NVIDIA’s own OpenCL compiler.

This is already semi-functional. You can compile OpenCL code with Clang+libclc for ptx32, and load the resulting PTX as an OpenCL binary using NVIDIA’s OpenCL platform. :)

Do we have a list of these open-source LLVM-based OpenCL projects
somewhere? Off the top of my head, we have:

libclc: http://www.pcc.me.uk/~peter/libclc/
pocl: https://launchpad.net/pocl
clover: http://cgit.freedesktop.org/~steckdenis/clover/

(I think that all of these have BSD- or MIT-style licenses).

Are there any others?

-Hal

Hi everybody,

The Compiler Design Lab at Saarland University (chair of Sebastian Hack) is also working on an LLVM-based OpenCL driver.

The project started as a use case for our "Whole-Function Vectorization" library, which transforms a function so that a single invocation computes the same result as W executions of the original code, using SIMD instructions (W = 4 for SSE/AltiVec, 8 for AVX).

The algorithm, a few details on the OpenCL driver, and some results were published at CGO'11:
http://dx.doi.org/10.1109/CGO.2011.5764682

In contrast to Clover and pocl, we aimed at maximum performance before full support of the API (which simply requires more manpower than one PhD student).
The driver was evaluated using benchmarks from the AMD APP SDK, e.g. Mandelbrot, NBody, FastWalshTransform, Histogram, BlackScholes, DCT, ...
Our measurements show that our driver outperforms both Intel's and AMD's CPU driver for almost all of the benchmarks.

We plan to release both the vectorization library as well as the OpenCL driver under a BSD-style license, but there is still significant work to be done in terms of stability and feature completeness (e.g. the ICD mechanism only works as expected under Windows).
Also, for simplicity we currently employ AMD's OpenCL-to-LLVM frontend and OpenCL library from an earlier SDK. libclc sounds like the perfect fit for us to get rid of that.

I agree that we should try to merge the different open-source OpenCL projects, but at least at first glance they do not seem to share too many design decisions.

More information on Whole-Function Vectorization and our driver can be found on the project web page:
http://www.cdl.uni-saarland.de/projects/wfv

Best regards,
Ralf

Ralf,
What version of the SDK were you using for your analysis? I don't see that in the slides/pdf.

Thanks,
Micah

Hi Micah,

The numbers from the paper were measured with the ATI Stream SDK v2.1 (it's only mentioned in the references I think).
The most recent measurements I have were done with the current v2.5.

Best,
Ralf

Are the updated numbers posted anywhere?

Thanks,
Micah

Hi Ralf,

I agree that we should try to merge the different
open-source OpenCL projects, but at least at first glance they do not
seem to share too many design decisions.

One of the primary goals of libclc is to pool efforts between OpenCL
implementations in terms of implementing the large proportion of OpenCL C
builtin functions which can be implemented generically, since this task is
tedious and hardly ever the focus of any individual implementation but is a
conformance requirement. I therefore consider libclc to be an important
step towards merging OpenCL projects.

In contrast to Clover and pocl, we aimed at maximum performance before
full support of the API (which simply requires more manpower than one
PhD student).

I was in a similar position when I decided to start the libclc
project. While my ultimate goal was not performance, the sheer
number of built-in functions was far too much for my one-man research
project to handle. What libclc represents right now are the generic
functions I have implemented so far as part of this project, plus a
few new functions.

Also, for simplicity we currently employ AMD's OpenCL-to-LLVM
frontend and OpenCL library from an earlier SDK. libclc sounds like the
perfect fit for us to get rid of that.

Great. If you do decide to get involved in libclc development,
please get in touch on the mailing list:
http://www.pcc.me.uk/cgi-bin/mailman/listinfo/libclc-dev

Thanks,

The driver was evaluated using benchmarks from the AMD APP SDK, e.g.
Mandelbrot, NBody, FastWalshTransform, Histogram, BlackScholes, DCT, ...
Our measurements show that our driver outperforms both Intel's and AMD's
CPU driver for almost all of the benchmarks.

The latest Intel OpenCL SDK with our vectorization technology is available here:

http://software.intel.com/en-us/articles/opencl-sdk/

I'd be happy to see the numbers for the comparisons that you've made.

Hi Ralf,

The project started as a use case for our “Whole-Function Vectorization”
library, which transforms a function so that a single invocation computes
the same result as W executions of the original code, using SIMD
instructions (W = 4 for SSE/AltiVec, 8 for AVX).

Quite interesting. We were planning to add “vectorization” to our passes also, but if I understood the paper correctly your approach uses full speculation, which is all right for SIMD architectures but might not be for multi-issue processors. The pocl project comes from a generalization of our work on using OpenCL as an entry language for static ILP architectures (http://dx.doi.org/10.1109/ICSAMOS.2010.5642061), so we focus not only on DLP but also on ILP (the ultimate goal of pocl is portable performance of OpenCL code across different platforms).

Our idea was to use a hybrid method, with vectorized code in the unconditional sections and replicated/looped code, as we do now, in the conditional parts of the kernels.

In contrast to Clover and pocl, we aimed at maximum performance before
full support of the API (which simply requires more manpower than one
PhD student).

That is wrong, at least for pocl. We do not (by far) support the whole API; the main new contribution of pocl is the LLVM passes that statically create the different work-items in a work-group, and the barrier handling. Our kernel runtime library is in fact currently fairly small, including just a little more than the implementation-dependent functions. We are considering merging efforts with libclc on that point.

BR,

Carlos

Hi Carlos,

The project started as a use case for our "Whole-Function Vectorization"
library, which transforms a function so that a single invocation computes
the same result as W executions of the original code, using SIMD
instructions (W = 4 for SSE/AltiVec, 8 for AVX).

Quite interesting. We were planning to add "vectorization" to our passes
also, but if I understood the paper correctly your approach uses full
speculation, which is all right for SIMD architectures but might not be
for multi-issue processors.

I don't know what you mean by "speculation" here, but other than that you are right: for best performance, we explicitly target machines with SIMD instruction sets.

In contrast to Clover and pocl, we aimed at maximum performance before
full support of the API (which simply requires more manpower than one
PhD student).

That is wrong, at least for pocl. We do not (by far) support the whole
API; the main new contribution of pocl is the LLVM passes that statically
create the different work-items in a work-group, and the barrier handling.
Our kernel runtime library is in fact currently fairly small, including
just a little more than the implementation-dependent functions. We are
considering merging efforts with libclc on that point.

Please excuse me for getting that wrong.

I think we should really put our heads together (also including Denis Steckelmacher, who implemented Clover) and somehow combine all our efforts.
Otherwise, we will probably just all solve the same problems in parallel. Additionally, no user will gain anything if he has to decide between multiple, half-baked solutions.

Best,
Ralf

Hello,

I am the developer of Clover, and so much activity about OpenCL these days is really exciting. Here is my point of view, mainly on Clover and how the projects could use each other.

Clover is made in a way that allows a certain level of modularity. Although POCL would be very difficult to merge into Clover (or Clover into POCL), as these two projects do nearly exactly the same thing (an OpenCL platform layer), Libclc and the German driver are very interesting.

I'll begin with the German driver, perhaps the easiest project to integrate with or bind to Clover. Clover uses a hardware abstraction layer: a set of virtual classes a driver has to implement in order to be usable by Clover. They are DeviceInterface, DeviceBuffer, DeviceKernel and DeviceProgram. You can see their prototypes here: http://cgit.freedesktop.org/~steckdenis/clover/tree/src/core/deviceinterface.h . Currently, I have developed a small driver, normally not too slow, that simply runs the LLVM IR produced by Clang using the LLVM JIT. The kernels are split into work-groups, which are in turn split into work-items, in such a way that multithreading is used efficiently.

What would be interesting is to try to integrate the German driver into Clover using this interface, or to have the driver built as a library against which Clover links (in case of a license problem; but Clover is BSD, and it seems the driver will be the same). I would personally be very excited to see how another driver would perform in Clover, feature-wise and performance-wise.

Libclc could also be useful to Clover, though less likely. The goal of this project is to implement all the OpenCL built-in functions. That's good, but Clover already does the same using a different technique. Libclc is very elegant (I think); it seems to use custom LLVM intrinsics, and is built around pure C macros. Clover uses a slightly more complex system, involving a Python script that "compiles" a set of built-ins into four files (an example follows later in the thread; REPL is a macro that expands to a simple for()).

Additionally, no user will gain anything if he has to decide
between multiple, half-baked solutions.

Truer words were never spoken. ;)

Greg

Hello,

I am the developer of Clover, and so much activity about OpenCL these days is really exciting. Here is my point of view, mainly on Clover and how the projects could use each other.

Clover is made in a way that allows a certain level of modularity. Although POCL would be very difficult to merge into Clover (or Clover into POCL), as these two projects do nearly exactly the same thing (an OpenCL platform layer), Libclc and the German driver are very interesting.

I’ll begin with the German driver, perhaps the easiest project to integrate with or bind to Clover. Clover uses a hardware abstraction layer: a set of virtual classes a driver has to implement in order to be usable by Clover. They are DeviceInterface, DeviceBuffer, DeviceKernel and DeviceProgram. You can see their prototypes here: http://cgit.freedesktop.org/~steckdenis/clover/tree/src/core/deviceinterface.h . Currently, I have developed a small driver, normally not too slow, that simply runs the LLVM IR produced by Clang using the LLVM JIT. The kernels are split into work-groups, which are in turn split into work-items, in such a way that multithreading is used efficiently.

What would be interesting is to try to integrate the German driver into Clover using this interface, or to have the driver built as a library against which Clover links (in case of a license problem; but Clover is BSD, and it seems the driver will be the same). I would personally be very excited to see how another driver would perform in Clover, feature-wise and performance-wise.

Libclc could also be useful to Clover, though less likely. The goal of this project is to implement all the OpenCL built-in functions. That’s good, but Clover already does the same using a different technique. Libclc is very elegant (I think); it seems to use custom LLVM intrinsics, and is built around pure C macros.

libclc only uses LLVM intrinsics (currently) for back-end specific functionality. For example, the get_local_id() function is implemented separately for each target, and uses LLVM PTX intrinsics if compiling for the PTX back-end. This is not something you could implement in a generic way without back-end hooks (at least not without dirty hacks in the back-end).

Clover uses a slightly more complex system, involving a Python script that “compiles” a set of built-ins into four files. For example, this declaration (REPL is a macro that expands to a simple for()):


def vecf : float2 float3 float4 float8 float16

native $type acospi $vecf : x:$type
REPL($vecdim)
result[i] = std::acos(x[i]) / M_PI;
end

Is compiled to these fragments, one for each vector type (float2, float3, etc.):


// In stdlib_def.h : what the OpenCL C kernel sees
float2 OVERLOAD acospi(float2 x);

// In stdlib_impl.h : what gets compiled to LLVM IR at Clover compile time, and then linked to each kernel
void __cpu_float2_acospi_float2(float *result, float *x);
float2 OVERLOAD acospi(float2 x)
{
    float2 result;

    __cpu_float2_acospi_float2((float *)&result, (float *)&x);

    return result;
}

// __cpu_float2_acospi_float2 is a function implemented in the Clover .so library, using llvm::JIT::registerLazyFunctionCreator
// In builtins_impl.h : the actual C++ implementation, included in src/core/cpu/builtins.cpp
static void float2_acospi_float2(float *result, float *x)
{
    REPL(2)
        result[i] = std::acos(x[i]) / M_PI;
}

// And then a small else if in the lazy function creator, in order to bind everything together
else if (name == "__cpu_float2_acospi_float2")
    return (void *)&float2_acospi_float2;

If the LLVM JIT picks up these functions at run-time, then there is no chance of inlining these math functions. This is not good for performance.

The system works fairly well, and I was able to implement a dozen built-in functions in only two hours. It’s very fast to simply declare “native” functions using STL or Boost math functions, and hardware drivers can simply replace the LLVM “call” statements with whatever they need to accelerate the functions on the GPU.

But then the hardware driver layer has to have GPU implementations for all of these functions.

I was hoping you would come to the collaborative, joint solution. I’ve been waiting for Clang to have a settled OpenCL implementation to start working on OpenCL.

Dealing with 3 parallel projects would be just that, a pain in the rear.

- Marc

I was hoping you would come to the collaborative, joint solution. I've been
waiting for Clang to have a settled OpenCL implementation to start working on
OpenCL.

We at pocl are looking forward to the release of the German code, as it seems
to be closest to our interest of providing a performance-portable OpenCL
implementation. Hopefully there will be some co-operation on that.

Meanwhile we keep working on our code base to make it fulfill our research
goals in the TCE project, e.g. adding multithreading at the WG level, more
flexible work-item parallelization options, and more API functions to run
more benchmarks.

Dealing with 3 parallel projects would be just that, a pain in the rear.

Just pick your favorite and contribute. I'm sure there will be a merging of
efforts at some point, and your work won't go to waste.

Hi,

libclc: http://www.pcc.me.uk/~peter/libclc/
pocl: https://launchpad.net/pocl
clover: http://cgit.freedesktop.org/~steckdenis/clover/

I have pushed our implementation on GitHub:

https://github.com/speziale-ettore/OpenCRun

I have focused on designing a modular system, because OpenCRun is
intended to be used for research purposes -- not many people, not too much
time, so a clean design is not optional. For the same reason, I have
tried to re-use as much as possible of what LLVM/Clang offers: compiler,
JIT, build system, diagnostic routines, testing infrastructure, ...

Currently I have implemented an OpenCL backend for i386 and x86-64 CPUs.
I have coded the work-item stacks by hand because, after some synthetic
benchmarks, I observed that both ucontext and setjmp/longjmp incur some
extra overhead.

Code is not fully documented -- sorry -- and some mandatory features are
still missing (e.g. local variables).

Looking at the code, you can see that there are a lot of LLVM
passes. I have implemented some features through passes because I want
to keep the code as modular as possible.

Best regards,
speziale.ettore@gmail.com

PS-1: OpenCRun requires some extra metadata to be generated by clang. In
the source root directory there is a clang patch to add support for
metadata generation.

PS-2: some parts are badly coded. I know, but I haven't yet had time
to do some refactoring.

If you have not already seen it, you (and anyone else working on
OpenCL runtimes) might be interested in this paper from AMD:
http://dl.acm.org/citation.cfm?id=1854302
In particular, Section 4 describes the implementation approach for
their x86 OpenCL runtime and mentions a number of optimizations they
applied to things like the work-item stack.

- Michael

Hi,

If you have not already seen it, you (and anyone else working on
OpenCL runtimes) might be interested in this paper from AMD:
http://dl.acm.org/citation.cfm?id=1854302
In particular, Section 4 describes the implementation approach for
their x86 OpenCL runtime and mentions a number of optimizations they
applied to things like the work-item stack.

I know, but I haven't had time to apply all the optimizations -- I have
only implemented the most trivial ones.

Best regards,
speziale.ettore@gmail.com