Representing SIMT programs in LLVM

I would like to start by thanking every developer who has contributed to LLVM
for releasing such a high-quality project. It has been incredibly valuable
to several projects that I have worked on.

My name is Gregory Diamos; I am a PhD student at Georgia Tech working on
Ocelot (http://code.google.com/p/gpuocelot/). Ocelot is a dynamic binary
translator from PTX (a virtual instruction set used by NVIDIA GPUs) to
multi-core x86, and we currently use LLVM's JIT as our x86 code generator.
A finished prototype implementation that can execute most CUDA
applications using LLVM as a backend is available on our Google Code page.

Most of the program transformations currently available in LLVM are as
applicable to PTX programs as they are to single-threaded programs, since
the PTX instruction set closely resembles LLVM IR. However, PTX explicitly
deals with
single-instruction multiple-thread (SIMT) programs where many hundreds or
thousands of threads cooperatively execute a program. All threads begin
executing the same program and then can take different paths depending on
input data. During execution, threads may synchronize via barriers, votes,
and atomic operations which are made visible in the instruction set.
Furthermore, the number and synchronization domains of threads are
explicitly declared before a program is executed.

When expressing PTX programs in LLVM, we run into several problems, most
significantly the handling of synchronization instructions (barriers). In
order to handle barriers correctly with low overhead (i.e., without using
locks), we have to significantly change the control flow structure of a
program, which has implications for the types and quality of optimizations
that can be performed on translated programs. For example, we do not
currently have the ability to use LLVM passes to move invariant code
around barriers, or to fuse threads that can be proven to always take the
same path through the control flow graph between barriers.

We could, and are planning to, implement these optimizations on our end
before we translate from PTX to LLVM, but I thought that I would ask here to
see if there was anything similar that was being done within the LLVM
project. A way of representing a synchronization operation in LLVM (a
barrier across all threads) and a way of exposing the number of threads
executing the program to the optimization passes would be particularly
useful.

I have heard that several industry compilers are planning to use LLVM as an
IR for OpenCL backend compilers and it seems like these same problems will
show up again when those projects become more mature. It might make sense
to at least start thinking about them now.

Cheers

Greg D