GSoC 2012 Proposal: Automatic GPGPU code generation for llvm

Hi all,

I am a PhD student at Huazhong University of Science & Technology, China. The following is my GSoC 2012 proposal.
Comments are welcome!

Title: Automatic GPGPU Code Generation for LLVM

Abstract
Very often, manually developing a GPGPU application is a time-consuming, complex, error-prone and iterative process. In this project, I propose to build an automatic GPGPU code generation framework for LLVM, based on two successful LLVM (sub-)projects - Polly and the PTX backend. This can be very useful to ease the burden of the long learning curve of the various GPU programming models.

Motivation
With the broad proliferation of GPU computing, it is very important to provide an easy and automatic tool to develop or port applications to the GPU for normal developers, especially for those domain experts who want to harness the huge computing power of GPUs. Polly has implemented many transformations, such as tiling, auto-vectorization and OpenMP code generation. With the help of LLVM's PTX backend, I plan to extend Polly with GPGPU code generation.

Project Detail
In this project, we target the various parallel loops that can be described by Polly's polyhedral model. We first translate the selected SCoPs (Static Control Parts) into 4-depth loop nests with Polly's schedule optimization. Then we extract the loop body (or inner non-parallel loops) into an LLVM sub-function, tagged with the PTX_Kernel or PTX_Device calling convention. After that, we use the PTX backend to translate the sub-functions into strings of the corresponding PTX code. Finally, we provide a runtime library to generate the executable program.
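As a rough illustration of the extraction step (all names are stand-ins, and the sequential "launch" loop merely models the GPU grid; this is not actual Polly output):

```c
#include <assert.h>

#define N 8

/* Hypothetical sketch of the extraction: the body of the two outer
 * parallel loops becomes a separate "kernel" function (on the GPU this
 * would be tagged PTX_Kernel, with i/j coming from block/thread IDs).
 * The inner non-parallel reduction loop stays inside the kernel. */
static void mm_kernel(int i, int j, int A[N][N], int B[N][N], int C[N][N]) {
    int acc = 0;
    for (int k = 0; k < N; k++)   /* non-parallel inner loop */
        acc += A[i][k] * B[k][j];
    C[i][j] = acc;
}

/* Host-side stand-in for the GPU launch: walk the parallel iteration
 * space that the grid of threads would cover. */
static void mm_launch(int A[N][N], int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            mm_kernel(i, j, A, B, C);
}
```

On the real GPU path, only `mm_kernel` would be compiled by the PTX backend; `mm_launch` would be replaced by runtime-library calls.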

There are three key challenges in this project.

  1. How to find the optimal execution configuration for the GPU code.
    The execution configuration is essential to the performance of the GPU code. It is constrained by many factors, including the hardware, the source code, register usage, local store (device) usage, the original memory access patterns, and so on. We must take all of these into consideration.

  2. How to automatically insert synchronization code.
    This is very important for preserving the original semantics. We must correctly detect where it needs to be inserted.

  3. How to automatically generate the memory copy operations between host and device.
    We must transfer the input data to the GPU and copy the results back. Fortunately, Polly has implemented a very expressive way to describe memory accesses.

Timeline
May 21 ~ June 3 preliminary code generation for 1-D and 2-D parallel loops.
June 4 ~ June 11 code generation for parallel loops with non-parallel inner loops.
June 11 ~ June 24 automatic memory copy insertions.
June 25 ~ July 8 auto-tuning for GPU execution configuration.
July 9 ~ July 15 Midterm evaluation and writing documents.
July 16 ~ July 22 automatic synchronization insertion.
July 23 ~ August 3 test on polybench benchmarks.
August 4 ~ August 12 summarize and complete the final documents.

Project experience
I have participated in several projects related to binary translation (optimization) and run-time systems. I also implemented a frontend for numerical computing languages like Octave/MATLAB, following the style of Clang. Recently, I have worked closely with the Polly team, contributing patches and investigating many details of polyhedral transformation.

References

  1. Tobias Grosser, Ragesh A. Polly - First Successful Optimizations - How to proceed? LLVM Developer Meeting 2011.
  2. Muthu Manikandan Baskaran, J. Ramanujam and P. Sadayappan. Automatic C-to-CUDA Code Generation for Affine Programs. CC 2010.
  3. Soufiane Baghdadi, Armin Größlinger, and Albert Cohen. Putting Automatic Polyhedral Compilation for GPGPU to Work. In Proc. of Compilers for Parallel Computers (CPC), 2010.

Hi all,

I am a PhD student at Huazhong University of Science & Technology, China. The
following is my GSoC 2012 proposal.

Hi Yabin,

Comments are welcome!

*Title: Automatic GPGPU Code Generation for LLVM*

*Abstract*
Very often, manually developing an GPGPU application is a

                        developing a GPGPU

time-consuming, complex, error-prone and iterativeprocess. In this

                                            iterative process.

project, I propose to build an automatic GPGPU code generation framework
for LLVM, based on two successful LLVM (sub-)projects - Polly and PTX
backend. This can be very useful to ease the burden of the long learning
curve of various GPU programming model.

                                    models.
I like the idea :wink:

Please submit a first version of this proposal to the Google SoC web application. You can refine it later, but it is important that it is officially registered. That way you are on the safe side, in case something unexpected happens in the last days.

*Motivation*
With the broad proliferation of GPU computing, it is very important to
provide an easy and automatic tool to develop or port the applications
to GPU for normal developers, especially for those domain experts who
want to harness the huge computing power of GPU.
Polly has implemented
many transformations, such as tiling, auto-vectorization and openmp code
generation. With the help of LLVM's PTX backend, I plan to extend Polly
with the feature of GPGPU code generation.

*Project Detail*
In this project, we target various parallel loops which can be described
by Polly's polyhedral model. We first translate the selected SCoPs
(Static Control Parts) into 4-depth loops with Polly's schedule
optimization.
Then we extract the loop body (or inner non-parallel
loops) into an LLVM sub-function, tagged with the PTX_Kernel or PTX_Device
calling convention. After that, we use the PTX backend to translate the
sub-functions into strings of the corresponding PTX code. Finally, we
provide a runtime library to generate the executable program.

I would distinguish here between the infrastructure features that you add to Polly and the actual code generation/scheduling strategy you will follow. It should become clear that the infrastructure changes are independent of the actual code generation strategy you use.
This is especially important as automatic GPGPU code generation is a complex problem. I doubt it will be possible to implement a perfect solution within three months. Hence, I would target a (very) simple code
generation strategy that brings all the necessary infrastructure into Polly. When the infrastructure is ready and proven to work, you can start
to implement (and evaluate) more complex code generation strategies.

There are three key challenges in this project.
1. How to find the optimal execution configuration for the GPU code.
The execution configuration is essential to the performance of the GPU
code. It is constrained by many factors, including the hardware, the
source code, register usage, local store (device) usage, the original
memory access patterns, and so on. We must take all of these into consideration.

Yes and no. Don't try to solve everything within 3 months. Rather try to limit yourself to some very simple but certainly achievable goals.
I would probably go either with a very simple

2. How to automatically insert synchronization code.
This is very important for preserving the original semantics. We must
correctly detect where it needs to be inserted.

Again, distinguish here between the infrastructure of adding synchronizations and the algorithm to derive optimal synchronizations.

3. How to automatically generate the memory copy operations between host
and device.
We must transfer the input data to the GPU and copy the
results back. Fortunately, Polly has implemented a very expressive way
to describe memory accesses.

In general, I think it may be helpful to have some examples where you show what you want to do.

*Timeline*
May 21 ~ June 3 preliminary code generation for 1-D and 2-D parallel loops.
June 4 ~ June 11 code generation for parallel loops with non-parallel
inner loops.
June 11 ~ June 24 automatic memory copy insertions.
June 25 ~ July 8 auto-tuning for GPU execution configuration.

What do you mean by auto-tuning? What do you want to tune?

For me it does not seem to be essential.

Due to the short time of a GSoC, I would suggest just requiring the user to define such values and giving a little more time to the other
features. You can put it on a nice-to-have list, where you collect ideas that can be implemented after fulfilling the success criteria.

July 9 ~ July 15 Midterm evaluation and writing documents.
July 16 ~ July 22 automatic synchronization insertion.
July 23 ~ August 3 test on polybench benchmarks.
August 4 ~ August 12 summarize and complete the final documents.

An additional list with details for the individual steps would be good.

When are you planning to add what infrastructure. You may also add example codes.

*Project experience*
I have participated in several projects related to binary translation
(optimization) and run-time systems. I also implemented a frontend for
numerical computing languages like Octave/MATLAB, following the style of
Clang. Recently, I have worked closely with the Polly team, contributing
patches and investigating many details of polyhedral transformation.

You may add links to the corresponding commit messages.

*References*
1. Tobias Grosser, Ragesh A. Polly - First Successful Optimizations -
How to proceed? LLVM Developer Meeting 2011.
2. Muthu Manikandan Baskaran, J. Ramanujam and P.
Sadayappan. Automatic C-to-CUDA Code Generation for Affine Programs.
CC 2010.
3. Soufiane Baghdadi, Armin Größlinger, and Albert Cohen. Putting
Automatic Polyhedral Compilation for GPGPU to Work. In Proc. of
Compilers for Parallel Computers (CPC), 2010.

You are adding references, but don't reference them in your text. Is this intentional?

Overall, this looks interesting. Looking forward to your final submission.

Tobi

P.S. Feel free to post again to get further comments.

Hi Yabin,

Instead of compiling the LLVM IR to a PTX asm string in a ScopPass, you
could also improve llc/lli or create new tools to support code
generation for heterogeneous platforms[1], i.e. generate code for more
than one target architecture at the same time. Something like this is
not very complicated and has been implemented[2,3] by some people, but
is not available in LLVM mainstream. Implementing this could make your
GPU project more complete.

best regards
ether

[1]http://en.wikipedia.org/wiki/Heterogeneous_computing
[2]http://llvm.org/devmtg/2010-11/Villmow-OpenCL.pdf
[3]http://llvm.org/devmtg/2008-08/Sander_HW-SW-CoDesignflowWithLLVM.pdf

Hi all,

I am a PhD student at Huazhong University of Science & Technology, China. The following is my GSoC 2012 proposal.
Comments are welcome!

Title: Automatic GPGPU Code Generation for LLVM

Abstract
Very often, manually developing a GPGPU application is a time-consuming, complex, error-prone and iterative process. In this project, I propose to build an automatic GPGPU code generation framework for LLVM, based on two successful LLVM (sub-)projects - Polly and the PTX backend. This can be very useful to ease the burden of the long learning curve of the various GPU programming models.

Motivation
With the broad proliferation of GPU computing, it is very important to provide an easy and automatic tool to develop or port applications to the GPU for normal developers, especially for those domain experts who want to harness the huge computing power of GPUs. Polly has implemented many transformations, such as tiling, auto-vectorization and OpenMP code generation. With the help of LLVM's PTX backend, I plan to extend Polly with GPGPU code generation.

Very interesting! I’m quite familiar with Muthu’s work, and putting that into LLVM would be great. If done right, it could apply to any heterogeneous system, including AMD GPUs.

As the maintainer and primary developer on the PTX back-end, please feel free to contact me with any issues/suggestions you have regarding the PTX back-end!

Project Detail
In this project, we target the various parallel loops that can be described by Polly’s polyhedral model. We first translate the selected SCoPs (Static Control Parts) into 4-depth loop nests with Polly’s schedule optimization. Then we extract the loop body (or inner non-parallel loops) into an LLVM sub-function, tagged with the PTX_Kernel or PTX_Device calling convention. After that, we use the PTX backend to translate the sub-functions into strings of the corresponding PTX code. Finally, we provide a runtime library to generate the executable program.

I’m a bit confused by the wording here. What do you mean by ‘LLVM sub-function?’ I’m assuming you mean extracting the relevant code into a separate function, but I would just use the word ‘function’.

And what do you mean by a run-time library to generate the executable program? Are you proposing to side-step the LLVM code generator LLC? It seems like a reasonable approach would be to write an LLVM pass (or set of passes) that takes as input a single IR file, and produces two: (1) the GPU kernel/device code, and (2) the non-translatable IR with GPU code replaced by appropriate CUDA Driver API calls. Then, both of these can pass through the opt/llc tools with the appropriate selection for optimization passes and target back-end.

This way, you could fairly easily create a GPGPU compiler by writing a simple wrapper around Clang (or better yet, improve Clang to support multiple targets simultaneously!)

Hi Hongbin,

2012/4/3 Hongbin Zheng <etherzhhb@gmail.com>

Instead of compiling the LLVM IR to a PTX asm string in a ScopPass, you
could also improve llc/lli or create new tools to support code
generation for heterogeneous platforms[1], i.e. generate code for more
than one target architecture at the same time. Something like this is
not very complicated and has been implemented[2,3] by some people, but
is not available in LLVM mainstream. Implementing this could make your
GPU project more complete.

[1]http://en.wikipedia.org/wiki/Heterogeneous_computing
[2]http://llvm.org/devmtg/2010-11/Villmow-OpenCL.pdf
[3]http://llvm.org/devmtg/2008-08/Sander_HW-SW-CoDesignflowWithLLVM.pdf

Our original motivation for doing this is to provide a JIT compiler for our language frontend (a subset of MATLAB/Octave). I have extended lli to implement a JIT compiler (named gvm) that uses Polly dynamically. However, preliminary results show that the overhead is heavy, so I chose to offload the dynamic optimization from the JITting process. Also, putting the LLVM-to-PTX-asm-string pass into Polly can provide a kind of one-touch experience to users.

Please imagine the following user scenario. When a user opens a MATLAB source file or a folder containing source files, we can start to compile the sources statically, using Polly and opt to optimize them into optimal LLVM IR. Finally, when the user clicks Run or presses the Enter key, we just need to JIT the LLVM IR as usual, minimizing the dynamic overhead.

Thanks for recommending the references.

best regards,
Yabin.

Hi Justin,

2012/4/3 Justin Holewinski <justin.holewinski@gmail.com>

Motivation
With the broad proliferation of GPU computing, it is very important to provide an easy and automatic tool to develop or port applications to the GPU for normal developers, especially for those domain experts who want to harness the huge computing power of GPUs. Polly has implemented many transformations, such as tiling, auto-vectorization and OpenMP code generation. With the help of LLVM’s PTX backend, I plan to extend Polly with GPGPU code generation.

Very interesting! I’m quite familiar with Muthu’s work, and putting that into LLVM would be great. If done right, it could apply to any heterogeneous system, including AMD GPUs.

As the maintainer and primary developer on the PTX back-end, please feel free to contact me with any issues/suggestions you have regarding the PTX back-end!

Thanks for your interest and help.

I’m a bit confused by the wording here. What do you mean by ‘LLVM sub-function?’ I’m assuming you mean extracting the relevant code into a separate function, but I would just use the word ‘function’.

Yes, it is indeed a function. I used this word following the method-naming style of Polly’s OpenMP code generation. I will fix this.

And what do you mean by a run-time library to generate the executable program?

In my mind, the runtime library is just a wrapper around the CUDA Driver API. But we can add our own debug info and make changes in the CUDA APIs transparent to users.

Are you proposing to side-step the LLVM code generator LLC? It seems like a reasonable approach would be to write an LLVM pass (or set of passes) that takes as input a single IR file, and produces two: (1) the GPU kernel/device code, and (2) the non-translatable IR with GPU code replaced by appropriate CUDA Driver API calls. Then, both of these can pass through the opt/llc tools with the appropriate selection for optimization passes and target back-end.

This way, you could fairly easily create a GPGPU compiler by writing a simple wrapper around Clang (or better yet, improve Clang to support multiple targets simultaneously!)

Ether gave a similar suggestion on this point. Here I copy my reply to him to explain why I chose to embed the transformation pass in my implementation.

Our original motivation for doing this is to provide a JIT compiler for our language frontend (a subset of MATLAB/Octave). I have extended lli to implement a JIT compiler (named gvm) that uses Polly dynamically. However, preliminary results show that the overhead is heavy, so I chose to offload the dynamic optimization from the JITting process. Also, putting the LLVM-to-PTX-asm-string pass into Polly can provide a kind of one-touch experience to users. Please imagine the following user scenario. When a user opens a MATLAB source file or a folder containing source files, we can start to compile the sources statically, using Polly and opt to optimize them into optimal LLVM IR. Finally, when the user clicks Run or presses the Enter key, we just need to JIT the LLVM IR as usual, minimizing the dynamic overhead.

Thanks again!

best regards,
Yabin

Hi Justin,

the non-translatable IR with GPU code replaced by appropriate CUDA Driver API calls.

One of the CUDA Driver APIs (cuLaunch) needs a PTX asm string as its input. So if I want to provide a one-touch solution and not introduce any changes to tools outside Polly, I must prepare the PTX string before I can generate the correct non-translatable IR part.

As you suggest, it could be implemented by leaving an input parameter slot for the PTX string in the main method of the non-translatable IR part. Maybe I can implement both versions and let Tobi judge which one is better to integrate into Polly.

best regards,
Yabin

Hi Tobi,

I have revised the proposal here. Could you review it and give comments again? Thanks.

Abstract
Very often, developing a GPGPU application is a time-consuming, complex, error-prone and iterative process. In this project, I propose to build an automatic GPGPU code generation framework for LLVM, based on two successful LLVM (sub-)projects - Polly and the PTX backend. This can be very useful to ease the burden of the long learning curve of the various GPU programming models.

Motivation
With the broad proliferation of GPU computing, it is very important to provide an easy and automatic tool to develop or port applications to the GPU for normal developers, especially for those domain experts who want to harness the huge computing power of GPUs. Polly has implemented many transformations, such as tiling, auto-vectorization and OpenMP code generation, and GPGPU code generation has been planned in [1]. With the help of LLVM’s PTX backend, I plan to extend Polly with GPGPU code generation.

Project Detail

There are several successful projects on source-to-source automatic GPU code transformation. In this project, I will follow the method proposed by Muthu Manikandan Baskaran et al. in [2]. Since automatic GPGPU code generation is quite a complex problem, we specifically target two kinds of test cases. One is comprised of pure parallel loops, like the following:
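The listing itself appears to be missing from the archive; a representative pure parallel loop nest of this kind (illustrative, not from the original) would be:

```c
#include <assert.h>

#define N 16

/* Illustrative stand-in for the missing listing: every iteration of
 * both loops is independent, so the whole 2-D iteration space can be
 * mapped directly onto a grid of GPU threads. */
void add2d(int A[N][N], int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];
}
```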

In the other kind, all loops are parallel except the innermost one, like this:
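This listing also appears to be missing from the archive; judging from the timeline below, the intended example is presumably the classical matrix multiplication:

```c
#include <assert.h>

#define N 16

/* Illustrative stand-in for the missing listing: the i and j loops are
 * parallel, but the k loop carries a reduction over C[i][j] and is
 * therefore not parallel. */
void matmul(int A[N][N], int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
```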

The LoopBody part should be limited to instructions or function calls (intrinsics) which can be handled by LLVM’s PTX backend.

The workflow of our code generator is as follows. We first use Polly’s jscop file importer to get the desired 4-level parallel tiled code. Then we extract the loop body (or inner non-parallel loops) into an LLVM function, tagging it with the PTX_Kernel or PTX_Device calling convention. Then we use the PTX backend to translate the PTX_Kernel and PTX_Device functions into strings of the corresponding PTX code. After that, we transform the non-translatable part of the LLVM IR, inserting GPU runtime library calls. The execution configuration of the GPU is acquired from external user-specified jscop files, which Polly already supports. Finally, we provide a runtime library to generate the executable program, or run the optimized LLVM IR with a JIT compiler like lli.
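As a minimal illustration of one ingredient of such an execution configuration, the grid must be large enough to cover the whole iteration space, which comes down to a ceiling division (a hypothetical helper, not Polly code; a real configuration read from a jscop file would also include block sizes, tiling, etc.):

```c
#include <assert.h>

/* Sketch: number of thread blocks needed so that
 * grid_size * threads_per_block covers all n iterations. */
static int grid_size(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```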

There are two key challenges in this project.

  1. How to automatically insert the synchronization code.
    This is very important for preserving the original semantics. We must correctly detect where it needs to be inserted.

  2. How to automatically generate the memory copy operations between host and device.
    We must transfer the input data to the GPU and copy the results back. Fortunately, Polly has implemented a very expressive way to describe memory accesses. We will follow the taxonomy proposed by Chris Gregg et al. in [3].

Timeline

  • May 21 ~ June 11 Preliminary GPGPU Code Generation

In this stage, implement GPU code generation for 1-D and 2-D parallel loop test cases which need no host-memory copies as input, and verify that our method is workable.

  • June 12 ~ June 24 automatic memory copy insertions.

In this stage, insert memory copy operations for all the array accesses correctly, according to the Read/Write properties provided by Polly.

  • June 25 ~ July 8 Code Generation for Parallel Loops With Non-parallel Inner-most Loop.

In this stage, implement GPGPU code generation for the classical matrix multiplication test case.

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    for (k = 0; k < N; k++)
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
  }
}
  • July 9 ~ July 15 Midterm evaluation and writing documents.
  • July 16 ~ July 22 Automatic Synchronization Insertion.
    In this stage, implement Muthu’s method introduced in Section 4.3 of [2] to insert barrier synchronizations that preserve semantic equivalence.
  • July 23 ~ August 5 Test on Polybench Benchmarks and Report Results.
  • August 6 ~ August 12 Summarize and Complete the Final Documents.

Project Experience

I have participated in several projects related to binary translation (optimization) and run-time systems. I also implemented a frontend for numerical computing languages like Octave/MATLAB, following the style of Clang. Recently, I have worked closely with the Polly team, contributing patches [4] and investigating many details of polyhedral transformation.

References

  1. Tobias Grosser, Ragesh A. Polly - First Successful Optimizations - How to proceed? LLVM Developer Meeting 2011.
  2. Muthu Manikandan Baskaran, J. Ramanujam and P. Sadayappan. Automatic C-to-CUDA Code Generation for Affine Programs. International Conference on Compiler Construction (CC) 2010.
  3. Chris Gregg and Kim Hazelwood. Where is the Data? Why You Cannot Debate GPU vs. CPU Performance Without the Answer. International Symposium on Performance Analysis of Systems and Software (ISPASS) 2011.
  4. http://llvm.org/viewvc/llvm-project?view=rev&revision=153319

I agree with ether that we should ensure as much work as possible is done within generic, not Polly specific code.

In terms of heterogeneous code generation, the approach Yabin proposed seems to work, but we should discuss other approaches. For the moment,
I believe his proposal is very similar to the model of OpenCL and CUDA. He splits the code into host and kernel code. The host code is directly compiled to machine code by the existing tools (clang/llc). The kernel code is stored as a string, and only at execution time is it compiled to platform-specific code.

Are there any other approaches that could be taken? What specific heterogeneous platform support would be needed. At the moment, it seems to me we actually do not need too much additional support.

Cheers
Tobi

Hi Yabin,

Instead of compiling the LLVM IR to a PTX asm string in a ScopPass, you
could also improve llc/lli or create new tools to support code
generation for heterogeneous platforms[1], i.e. generate code for more
than one target architecture at the same time. Something like this is
not very complicated and has been implemented[2,3] by some people, but
is not available in LLVM mainstream. Implementing this could make your
GPU project more complete.

I agree with ether that we should ensure as much work as possible is
done within generic, not Polly specific code.

Right, this has the potential to impact more people than the users of Polly. By moving as much as possible to generic LLVM, that infrastructure can be leveraged by people doing work outside of the polyhedral model.

In terms of heterogeneous code generation the approach Yabin proposed
seems to work, but we should discuss other approaches. For the moment,
I believe his proposal is very similar to the model of OpenCL and CUDA. He
splits the code into host and kernel code. The host code is directly
compiled to machine code by the existing tools (clang/llc). The kernel
code is stored as a string and only at execution time it is compiled to
platform specific code.

Depending on your target, that may be the only way. If your target is OpenCL-compatible accelerators, then your only portable option is to save the kernel code as OpenCL text and let the driver JIT compile it at run-time. Any other approach is not guaranteed to be compatible across platforms or even driver versions.

In this case, the target is the CUDA Driver API, so you’re free to pass along any valid PTX assembly. You still pass the PTX code as a string to the driver, which JIT compiles it to actual GPU device code at run-time.

Are there any other approaches that could be taken? What specific
heterogeneous platform support would be needed. At the moment, it seems
to me we actually do not need too much additional support.

I could see this working without any additional support, if needed. It seems like this proposal is dealing with LLVM IR → LLVM IR code generation, so the only thing that is really needed is a way to split the IR into multiple separate IRs (one for host, and one for each accelerator target). This does not really need any supporting infrastructure, as you could imagine an opt pass processing the input IR and transforming it to the host IR, and emitting the device IR as a separate module.

Now if you’re talking about source-level support for heterogeneous platforms (e.g. C++ AMP), then you would need to adapt Clang to support emission of multiple IR modules. Basically, the AST would need to be split into host and device portions, and codegen’d appropriately. I feel that is far beyond the scope of this proposal, though.

     > Hi Yabin,
     >
     > Instead of compiling the LLVM IR to a PTX asm string in a ScopPass, you
     > could also improve llc/lli or create new tools to support code
     > generation for heterogeneous platforms[1], i.e. generate code for more
     > than one target architecture at the same time. Something like this is
     > not very complicated and has been implemented[2,3] by some people, but
     > is not available in LLVM mainstream. Implementing this could make your
     > GPU project more complete.

    I agree with ether that we should ensure as much work as possible is
    done within generic, not Polly specific code.

Right, this has the potential to impact more people than the users of
Polly. By moving as much as possible to generic LLVM, that
infrastructure can be leveraged by people doing work outside of the
polyhedral model.

To make stuff generic it is often helpful to know the other possible use cases. I consequently encourage everybody to point out such use cases or to state which exact functionality they might want to reuse. Otherwise, it may happen that we focus a little too much on the needs of Polly.

    In terms of heterogeneous code generation the approach Yabin proposed
    seems to work, but we should discuss other approaches. For the moment,
    I believe his proposal is very similar to the model of OpenCL and CUDA. He
    splits the code into host and kernel code. The host code is directly
    compiled to machine code by the existing tools (clang/llc). The kernel
    code is stored as a string and only at execution time it is compiled to
    platform specific code.

Depending on your target, that may be the only way. If your target is
OpenCL-compatible accelerators, then your only portable option is to save
the kernel code as OpenCL text and let the driver JIT compile it at
run-time. Any other approach is not guaranteed to be compatible across
platforms or even driver versions.
In this case, the target is the CUDA Driver API, so you're free to pass
along any valid PTX assembly. You still pass the PTX code as a string to
the driver, which JIT compiles it to actual GPU device code at run-time.

I would like to highlight that with the word 'string' I was not referring to 'OpenCL C code'. I don't think it is a practical approach to recover OpenCL C code, especially as the LLVM-IR C backend was recently removed.

I meant to describe that the kernel code is stored as a global variable in the host binary (in some intermediate representation such as LLVM-IR, PTX or a vendor specific OpenCLBinary) and is loaded at execution time into the OpenCL or CUDA runtime, where it is compiled down to hardware specific machine code.

    Are there any other approaches that could be taken? What specific
    heterogeneous platform support would be needed. At the moment, it seems
    to me we actually do not need too much additional support.

I could see this working without any additional support, if needed. It
seems like this proposal is dealing with LLVM IR -> LLVM IR code
generation, so the only thing that is really needed is a way to split
the IR into multiple separate IRs (one for host, and one for each
accelerator target). This does not really need any supporting
infrastructure, as you could imagine an opt pass processing the input IR
and transforming it to the host IR, and emitting the device IR as a
separate module.

Yes. And instead of saving the two modules in separate files, we can store the kernel module as a 'string' in the host module and add the necessary library calls to load it at run time. This will give a smooth user experience and requires almost no additional infrastructure.

(At the moment this will only work with NVidia, but I am confident there will be OpenCL vendor extensions that allow loading LLVM-IR kernels. AMD OpenCL can e.g. load LLVM-IR, even though it is not officially supported)
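The embedding scheme described above can be sketched on the host side as follows (all names hypothetical, and the "PTX" text is a placeholder; a real implementation would hand the string to the driver via something like cuModuleLoadData):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical layout: the compiled kernel lives in the host binary as
 * a global string (placeholder text, not real compiler output). */
static const char kernel_ptx[] =
    ".version 2.3\n"
    ".target sm_20\n"
    ".entry my_kernel { /* ... */ }\n";

/* Stand-in for the runtime-library call that would pass the embedded
 * string to the CUDA/OpenCL runtime at program startup. */
static int load_kernel(const char *ptx) {
    return ptx != NULL && strlen(ptx) > 0;   /* pretend-load succeeds */
}
```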

Now if you're talking about source-level support for heterogeneous
platforms (e.g. C++ AMP), then you would need to adapt Clang to support
emission of multiple IR modules. Basically, the AST would need to be
split into host and device portions, and codegen'd appropriately. I
feel that is far beyond the scope of this proposal, though.

Yes. No source level transformations or targeting anything else than PTX, AMDIL or LLVM-IR.

Cheers
Tobi

oops, forgot to cc the dev-list

hi tobi,

Yes. And instead of saving the two modules in separate files, we can store
the kernel module as a 'string' in the host module and add the necessary
library calls to load it at run time. This will give a smooth user
experience and requires almost no additional infrastructure.

We may lose some co-optimization opportunities if we translate the
device functions to strings too early. Instead we can mark the device
functions with a special calling convention and translate the device
functions in lli/llc.

best regards
ether