C++AMP -> OpenCL (NVPTX) prototype

After reading about Intel's 'Shevlin Park' project to implement C++AMP in
llvm/clang, and failing to find any code for it, I decided to try to implement
something similar. I did it as an excuse to explore and hack on llvm/clang,
which I hadn't done before, but it's now at the point where it will run the
simplest matrix multiplication sample from MSDN, so I thought I might as well
share it.
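
For reference, the sample in question is the classic array_view matrix
multiply. A sketch of its shape (from memory, not copied from the test
project):

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    // Shape of the MSDN sample: product = a * b, run on the accelerator.
    void multiply(const std::vector<int> &aData,
                  const std::vector<int> &bData,
                  std::vector<int> &productData) {
        array_view<const int, 2> a(3, 2, aData);
        array_view<const int, 2> b(2, 3, bData);
        array_view<int, 2> product(3, 3, productData);

        parallel_for_each(product.extent,
            [=](index<2> idx) restrict(amp) {
                int row = idx[0];
                int col = idx[1];
                for (int inner = 0; inner < 2; inner++)
                    product[idx] += a(row, inner) * b(inner, col);
            });

        product.synchronize();
    }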

The source is in:

https://github.com/corngood/llvm.git
https://github.com/corngood/clang.git
https://github.com/corngood/compiler-rt.git [unchanged]
https://github.com/corngood/amp.git [simple test project]

It's fairly hacky, and very fragile, so don't expect anything that isn't used
in the sample to work. I also haven't tested it on large datasets, and there
are some things that definitely need fixing before I'd expect good performance
(e.g. workgroup size). It currently works only on NVIDIA GPUs, and has only
been tested on my shitty old 9600GT on amd64 Linux with the stable binary
drivers.

The compilation process currently works like this:

.cpp -> [clang++ -fc++-amp] -> .ll
  - compile non-amp code

.cpp -> [clang++ -fc++-amp -famp-is-kernel] -> .amp.ll
  - compile amp kernels only

.amp.ll -> [opt -amp-to-opencl] -> .nvvm.ll
  - create kernel wrapper to deal with buffer/const inputs
  - add nvvm annotations

.nvvm.ll -> [llc -march=nvptx] -> .ptx
  - compile kernels to NVPTX (unchanged)

.ll + .ptx -> [opt -amp-create-stubs .ptx] -> .opt.ll
  - embed ptx as array data
  - create functions to get kernel info, load inputs, etc

.opt.ll -> [llc] -> .o
  - unchanged
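
Strung together by hand (there's no driver integration yet), the whole
pipeline looks something like the script below. The pass and flag names are
taken from the summary above; the exact argument order for the custom opt
passes is a best guess.

    # sketch: building one translation unit through the prototype by hand
    clang++ -fc++-amp -S -emit-llvm test.cpp -o test.ll
    clang++ -fc++-amp -famp-is-kernel -S -emit-llvm test.cpp -o test.amp.ll
    opt -amp-to-opencl -S test.amp.ll -o test.nvvm.ll
    llc -march=nvptx test.nvvm.ll -o test.ptx
    opt -amp-create-stubs test.ptx -S test.ll -o test.opt.ll
    llc -filetype=obj test.opt.ll -o test.o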

The two clang steps differ only in codegen, so eventually they should be
combined into a single clang invocation. NVPTX is meant to be replaced with
SPIR at some point, to make the output portable, which is why I didn't
bother with textual kernel generation.

I won't go into implementation details, but if anyone is interested, or
working on something similar, feel free to get in touch.

Thanks,
Dave McFarland

Dave,

[I've copied the cfe-dev list as well.]

Thanks for sharing this! I think this sounds very interesting. I don't know much about AMP, but I do have users who are also interested in accelerator targeting, and I'd like you to share your thoughts on:

1. Does your implementation share common functionality with the 'captured statement' work that Intel is currently doing (in order to support Cilk, OpenMP, etc.)? If you're not aware of it, see: http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20130408/077615.html -- This should end up in trunk soon. I ask because if the current captured statement patches would almost, but not quite, work for you, then it would be interesting to understand why.

2. What will be necessary to eliminate the two-clang-invocations problem? If we ever grow support for embedded accelerator targeting (through AMP, OpenACC, OpenMP 4+, etc.), it sounds like this will be a common requirement, and if I had to guess, there is common interest in putting the necessary infrastructure in place.

-Hal

> 1. Does your implementation share common functionality with the 'captured
> statement' work that Intel is currently doing (in order to support Cilk,
> OpenMP, etc.)? [...]

Kernels in AMP are represented by a lambda, so I haven't had to do anything
special to capture variables. I do some work in the opt passes to marshal
certain types (buffer references so far; also textures, etc. in the future),
so maybe there's some overlap there.

Thanks for the link, I'll have to read more about it.
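
To illustrate the marshaling point with a purely hypothetical layout (this
is not the actual pass output): a captured array_view can't cross to the
device verbatim, so the wrapper reduces it to a raw buffer pointer plus its
shape.

    // Hypothetical: what an array_view<const int, 2> capture might be
    // flattened into for the generated kernel wrapper.
    struct MarshaledView {
        const int *data;        // device buffer pointer
        int extent0, extent1;   // shape of the view
    };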

> 2. What will be necessary to eliminate the two-clang-invocations
> problem? [...]

The only reason I have two clang invocations right now is because of how I
dealt with address spaces. In the Shevlin Park presentation, they mentioned
doing analysis and assigning address spaces after codegen, but I just assign
them using __attribute__((address_space(n))) for now, and zero them out for
CPU codegen with a TargetOpt. It sort of piggybacks on the OpenCL ->
NVPTX/SPIR/AMD/etc. address-space abstraction. The other differences are
similar to how CodeGenOpts.CUDAIsDevice works.

Unfortunately this won't be sufficient for a full implementation of AMP,
which (to my knowledge) doesn't specify any address-space declaration on
pointer types, but still allows pointers into buffers in various address
spaces.
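
For the curious, the attribute-based approach looks something like the
snippet below (illustrative only; the real handling lives in the repos
above). __attribute__((address_space(n))) is standard clang; the
rewrite-to-zero for CPU codegen is the prototype's addition.

    // Pointers tagged with an OpenCL-style address space; the CPU
    // codegen path rewrites the space back to 0 (generic).
    typedef __attribute__((address_space(1))) float global_float;

    void scale(global_float *buf, int n, float k) {
        for (int i = 0; i < n; ++i)
            buf[i] *= k;   // becomes global-space loads/stores on NVPTX
    }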

To be honest, I'm not crazy about the AMP specification; I just like the idea
of compiling a heterogeneous module for host/device code that can be easily
integrated into an existing C++ application. I'd be happy for it to drop the
MS-specific syntax like properties, use C++ attributes wherever possible
instead of keywords, and have explicit address spaces like CUDA/OpenCL.

I think the big problem is going to be making it robustly target two very
different architectures in one pass, most obviously supporting different
bitness for host and device. My testing was all 64-bit host / 32-bit device,
but all other combinations occur in practice.
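
A concrete instance of the hazard (an assumed example, not from the
prototype): any capture whose layout depends on pointer width disagrees
between a 64-bit host and a 32-bit device, so the capture block can't
simply be copied across.

    // sizeof(Capture) and the offset of n differ between an LP64 host
    // (16 bytes, offset 8) and a 32-bit device (8 bytes, offset 4).
    struct Capture {
        float *data;
        int n;
    };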

- Dave

Hi!

I'm quite sure there is great public interest in Clang being able to compile
C++AMP code. As for keywords versus attributes, I too would prefer the code
to look even more native, but we'll just have to go with this for the time
being. I have no clue how the compiler works under the hood, but once these
restrictions are implemented, it shouldn't be too much work to change where
they originate from in the source code.

There are a few things I don't understand, and it would really rock if
someone could explain them.

Clang compiles to LLVM IR, and LLVM then creates platform-specific (binary)
code out of it. In the present state of this feature, the IR is fed to LLVM
to create PTX, which can then be fed to the NVIDIA drivers pretty much
directly for execution. My question is: how can this be extended so that
compilation is both optimal and portable? I take it that the RadeonSI driver
developers are putting great effort into an LLVM back-end (see the Phoronix
article "GLSL 1.30 Support For AMD RadeonSI Driver With LLVM"). But how does
this fit into the bigger picture?

Clang, as a compiler, should aim at translating C++AMP-decorated code to
LLVM IR in such a way that the decorations are represented in the IR. It
should then be LLVM's job to turn this IR into OpenCL SPIR, which functions
similarly to DX bytecode in the Microsoft implementation. At some point
during execution (I can't tell where the best place would be), this SPIR
must be compiled by the chosen OpenCL platform into either PTX (by the
NVIDIA driver, not LLVM) or ISA (by the AMD driver).
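
Schematically, in the notation of Dave's pipeline above, the proposal would
be:

.cpp -> [clang++] -> .ll
  - AMP decorations preserved in the IR

.ll -> [llvm SPIR back-end] -> .spir
  - portable binary, analogous to DX bytecode

.spir -> [OpenCL driver, at run time] -> PTX (NVIDIA) or ISA (AMD)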

This is the neatest toolchain design I can think of, but I do not understand
where Gallium comes into the picture if they are working on an LLVM
back-end. Or is that solely a choice for optimizing their own shaders?

To make some corrections: there is a means of using address spaces in
C++AMP. All variables declared in amp-restricted functions are __private as
far as OpenCL is concerned, all variables declared tile_static are in the
__local address space, and the memory inside a concurrency::array<T,N> is
stored in __global. The reason these might not be available on pointers is
that the AMP spec forbids storing pointers to such types. There is a similar
restriction in OpenCL, where pointers to different address spaces cannot
mingle, and AMP restricts storing such pointers altogether.
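
In code, that mapping looks like this (a sketch; the extents and names are
illustrative):

    // C++AMP storage classes and their OpenCL address-space equivalents.
    extent<2> ext(512, 512);
    parallel_for_each(ext.tile<16, 16>(),
        [=](tiled_index<16, 16> tidx) restrict(amp) {
            tile_static float staging[16][16];  // OpenCL __local
            float acc = 0.0f;                   // OpenCL __private
            staging[tidx.local[0]][tidx.local[1]] = acc;
            // elements of a captured concurrency::array<float, 2>
            // are stored in OpenCL __global
        });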

I'm very much interested in this project, as I see that either AMP or
something very similar will be the future of GPU parallelism after OpenCL,
which will continue to evolve and will most likely serve as a back-end to
AMP. Could we get a status update on what progress has been made (or is
planned)?