Backend for C and OpenCL

Hi everybody,

for a research project, I would like to use LLVM to optimize
OpenCL programs for GPUs.

Due to the lack of open-source back-ends and runtimes for
GPUs, my idea is the following:

  1. compile OpenCL C into LLVM-IR (from what I read on the mailing
    list, full support is close, or at least foreseeable),
  2. apply LLVM transformations to the bitcode,
  3. generate OpenCL C code from the optimized bitcode,
  4. use the official (Nvidia, AMD, Intel, …) OpenCL compilers
    and runtimes for the actual execution of the optimized code.
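Concretely, the four steps might look something like this on the command line. This is only a rough sketch: the exact flags and the C-backend invocation (`llc -march=c`) are assumptions for an LLVM toolchain of this era, and the file names are hypothetical.

```shell
clang -x cl -emit-llvm -c kernel.cl -o kernel.bc   # 1. OpenCL C -> LLVM-IR
opt -O3 kernel.bc -o kernel.opt.bc                 # 2. IR-level transformations
llc -march=c kernel.opt.bc -o kernel.opt.cl        # 3. C backend -> (OpenCL) C source
#   4. feed kernel.opt.cl to the vendor OpenCL compiler/runtime
```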

I know that the C backend is buggy and no longer
supported, but it still works with simple C programs.
Remember that OpenCL programs are usually quite simple
(no function pointers, etc.).

The main features to be added to the backend are:

  1. the “__kernel” keyword,
  2. the four address-space keywords,
  3. vector data types,
  4. the “half” keyword.
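A hypothetical kernel touching each of these features (my own example, not from any particular benchmark) would be:

```c
/* With cl_khr_fp16, 'half' would also be usable as a compute type. */
__kernel void scale(__global const float4 *in,   /* address-space keyword */
                    __global float4 *out,
                    __local float4 *tmp,         /* another address space */
                    float s) {
  int i = get_global_id(0);
  tmp[get_local_id(0)] = in[i];                  /* vector data type float4 */
  barrier(CLK_LOCAL_MEM_FENCE);
  out[i] = tmp[get_local_id(0)] * s;             /* vector * scalar arithmetic */
}
```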

My idea is to extensively verify the functionality of the C backend for
C programs (similar to OpenCL C ones) and possibly add the listed features.

What do you think of this? Is it feasible?

Thank you,

Alberto

At least for me, this is a very interesting project. I may consider contributing later on, once it actually gets moving.
Really optimizing OpenCL code in the backend is very challenging, because it relies heavily on hardware-specific information, which may not be disclosed.
So I fully agree with transforming the code back to source once it is optimized. But why at the IR level rather than the AST? Isn’t transforming the AST much easier and more reasonable?

If you’re interested in NVidia hardware, the OpenCL->PTX workflow is mostly implemented in LLVM ToT (will be part of LLVM 3.0). The main missing part at the moment is OpenCL work-item function calls, which have to be implemented in terms of PTX intrinsics. It’s not difficult, but I haven’t written a library to do that yet. Once you have PTX, you can use the CUDA Driver API to load and execute the kernel.

You can play around with it by using the following Clang options:

$ clang -ccc-host-triple ptx32 -S <file>.cl

You can define your work-item functions as something like the following:

__attribute__((always_inline))
int get_group_id(int dim) {
  if (dim == 0)
    return __builtin_ptx_read_ctaid_x();
  else if (dim == 1)
    return __builtin_ptx_read_ctaid_y();
  else
    return __builtin_ptx_read_ctaid_z();
}

__attribute__((always_inline))
int get_global_id(int dim) {
  if (dim == 0)
    return __builtin_ptx_read_ctaid_x()*__builtin_ptx_read_ntid_x()+__builtin_ptx_read_tid_x();
  else if (dim == 1)
    return __builtin_ptx_read_ctaid_y()*__builtin_ptx_read_ntid_y()+__builtin_ptx_read_tid_y();
  else
    return __builtin_ptx_read_ctaid_z()*__builtin_ptx_read_ntid_z()+__builtin_ptx_read_tid_z();
}

__attribute__((always_inline))
int get_local_id(int dim) {
  if (dim == 0)
    return __builtin_ptx_read_tid_x();
  else if (dim == 1)
    return __builtin_ptx_read_tid_y();
  else
    return __builtin_ptx_read_tid_z();
}

__attribute__((always_inline))
int get_global_size(int dim) {
  if (dim == 0)
    return __builtin_ptx_read_nctaid_x()*__builtin_ptx_read_ntid_x();
  else if (dim == 1)
    return __builtin_ptx_read_nctaid_y()*__builtin_ptx_read_ntid_y();
  else
    return __builtin_ptx_read_nctaid_z()*__builtin_ptx_read_ntid_z();
}

#define barrier(kind) __builtin_ptx_bar_sync(kind)

#define CLK_LOCAL_MEM_FENCE 0
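With those shims in place, an ordinary kernel like the following (my own hypothetical example) should compile to PTX via the clang invocation above, since the work-item calls inline to the PTX register reads rather than resolving to OpenCL built-ins:

```c
/* vadd.cl -- get_global_id() inlines to ctaid*ntid+tid PTX reads */
__kernel void vadd(__global float *a, __global float *b, __global float *c) {
  int i = get_global_id(0);
  c[i] = a[i] + b[i];
}
```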

Even if this works for NVidia hardware, what about the ATI series? That said, I do not quite understand the underlying motivation for implementing another (workable, optimized) GPU backend. This requires non-trivial work, and there is already very good commercial support. In addition, since such backend optimizations inherently rely on specific hardware details, it’s hard for me to be optimistic that an open-source version can work comparably well to the vendor-provided version.

I wholeheartedly agree that providing support for OpenCL and CUDA in Clang is important and useful. Working on source-level optimizations may be more interesting. I remember somebody in this community proposed an interesting idea: transform CUDA into OpenCL code and do optimizations there. I would love that kind of idea; is it unfeasible, or too trivial to implement?

Please let me know if I missed something important. Thanks.

> Even if this works for NVidia hardware, what about the ATI series? That said, I do not quite understand the underlying motivation for implementing another (workable, optimized) GPU backend. This requires non-trivial work, and there is already very good commercial support. In addition, since such backend optimizations inherently rely on specific hardware details, it’s hard for me to be optimistic that an open-source version can work comparably well to the vendor-provided version.

The idea is to harness the LLVM optimization and analysis passes to generate optimized GPU kernels. On one hand, this has nothing to do with OpenCL/CUDA; it provides a way for front-ends to directly target NVidia GPU devices without having to first convert to OpenCL or CUDA. Such a conversion puts you at the mercy of the vendor front-ends. On the other hand, it provides a way to try to go beyond what nvcc can do, in terms of optimizations. There is definitely good commercial support, but that support is in terms of black boxes that we ultimately have no control over.

> I wholeheartedly agree that providing support for OpenCL and CUDA in Clang is important and useful. Working on source-level optimizations may be more interesting. I remember somebody in this community proposed an interesting idea: transform CUDA into OpenCL code and do optimizations there. I would love that kind of idea; is it unfeasible, or too trivial to implement?

Converting CUDA to OpenCL would definitely be non-trivial, especially when you start considering the CUDA C++ support. It’s probably feasible, though not really in the scope of Clang.

Hi Alberto,

this depends on what you want to achieve and what kind of optimizations you want to apply.

Your proposal suggests you want to translate OpenCL C programs into LLVM-IR in order to apply transformations at the LLVM-IR level. What kind of LLVM-IR transformations are you planning to run? To my knowledge, at least the AMD, Intel, and Apple OpenCL implementations use LLVM internally, so the existing LLVM optimizations will not give you any benefit, as they are already run by the OpenCL compilers.

Depending on what kind of optimizations you want to perform, several approaches are possible. As Guoping Long suggested, you can use clang to create an OpenCL AST and use its rewriter capabilities to perform source-to-source transformations. This will be more of a pattern-matching approach, but for research, or for projects where you can educate people to write canonical code, this might be a good choice. It is definitely an approach where it is easy to get access to higher-level constructs like for-loops. On the other hand, analyses like scalar evolution are not available at this layer.

Another possibility is to take an approach similar to Polly [1] (a project I work on). Here we use LLVM analysis passes to recover high-level constructs from LLVM-IR and to subsequently apply higher-level transformations on the LLVM-IR.

If you want to apply your optimizations on LLVM-IR, the translation OpenCL -> LLVM-IR should be straightforward. clang's OpenCL support is pretty good, and people continue to improve it. The way back from LLVM-IR to OpenCL is more difficult. One approach you pointed out is to generate OpenCL C with the C backend. I must admit I have never looked into the C backend, but from what I know it seems to work OK for selected examples, yet has some problems that are regarded as unsolvable. People have proposed rewriting it as a regular LLVM backend. You probably need to investigate yourself how much work it would be to make it usable.

Another option is to pass LLVM-IR directly into the backends without ever regenerating OpenCL C. This is a very interesting approach, as most (all?) OpenCL implementations use LLVM, so you could skip the useless LLVM-IR -> OpenCL-C -> LLVM-IR conversion.

As Justin pointed out, LLVM includes a PTX backend that could be used to target NVIDIA hardware directly. Micah Villmow from AMD recently submitted the first patches to open-source the AMD-IL backend. As getting open-source support for AMD will take some time, and it is unknown how complete it will be, I will point you to another option. OpenCL has support for exporting and reimporting implementation-specific binaries through clGetProgramInfo(CL_PROGRAM_BINARIES) and clCreateProgramWithBinary(). The binary of the AMD OpenCL SDK is a normal ELF file that contains a section called '.llvmir'. You should be able to insert an optimized OpenCL program into the AMD SDK by exporting the binary of the OpenCL-C program, replacing the ELF '.llvmir' section with your optimized LLVM-IR, and finally reimporting the changed binary. (Let me know if you need help here.)

At the Euro-LLVM meeting it was also discussed whether it is worthwhile to standardize the injection of LLVM-IR into OpenCL backends. Several people seemed to be interested, but some problems regarding non-target-agnostic LLVM-IR were also raised.

Cheers
Tobi

P.S.: I am very interested in OpenCL optimizations. If possible, it would be nice if you could point me to information about the stuff you plan to work on.

Thanks everybody for your replies,

Guoping,
as Tobias has pointed out, many of the LLVM analysis and transformation
passes are designed to work on the IR, so I think this is the most natural choice.

Justin,
I have followed the development of the PTX backend, and the upcoming full
support is great news.

Tobias,
I don’t know yet what kind of optimizations I will work on.
It will also depend on the target architecture.
Polyhedral transformations are a big candidate, though (I am reading your PhD thesis…).
Thank you very much for the runtime trick; I will try it.
As you said, one possible drawback is that the specific ABI of the
different vendors must be taken into consideration,
e.g. how the kernels in the module are represented (with metadata or name
mangling), address spaces, etc.
My goal is to manage these aspects on my own, in the simplest possible way, with the
OpenCL -> LLVM -> OpenCL loop, taking also into consideration that, as far as I
can see, there is no agreement on how to manage address spaces.

I will keep you updated with the development of the work.

Bye,

Alberto

2011/10/5 Tobias Grosser <tobias@grosser.es>