Supporting heterogeneous computing in LLVM.

Hello All,

For the last two months I have been working on the design and implementation of a heterogeneous execution engine for LLVM. I started this project as an intern at the Qualcomm Innovation Center, and I believe it can be useful to different people and use cases. I am planning to share more details and a set of patches in the coming days. However, I would first like to see if there is interest in this.

The project is about providing compiler and runtime support for the automatic and transparent offloading of loop or function workloads to accelerators.

It is composed of the following:
a) Compiler and transformation passes for extracting loops or functions for offloading.
b) A runtime library that handles scheduling, data sharing and coherency between the host and accelerator sides.
c) A modular codebase and design. Adaptors specialize the code transformations for the target accelerators. Runtime plugins manage the interaction with the different accelerator environments.
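
To make the plugin idea in (c) more concrete, here is a minimal sketch of how a runtime with pluggable accelerator back ends might be organized. It is an illustration only, not the project's actual code: the names RuntimePlugin, DebugPlugin, OffloadRuntime and Kernel are all hypothetical, and a real plugin would talk to a device driver instead of copying a vector on the host.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical signature for an extracted loop or function body:
// it operates on a buffer shared between host and accelerator.
using Kernel = std::function<void(std::vector<int> &)>;

// Hypothetical plugin interface, one implementation per accelerator environment.
class RuntimePlugin {
public:
  virtual ~RuntimePlugin() = default;
  virtual std::string name() const = 0;
  virtual bool available() const = 0;
  // Runs the kernel on the target; returns false if offloading failed.
  virtual bool launch(const Kernel &K, std::vector<int> &Data) = 0;
};

// Hypothetical debug port: "offloads" by running the kernel on the host,
// so the runtime can be exercised without any accelerator attached.
class DebugPlugin : public RuntimePlugin {
public:
  std::string name() const override { return "debug"; }
  bool available() const override { return true; }
  bool launch(const Kernel &K, std::vector<int> &Data) override {
    std::vector<int> DeviceCopy = Data; // simulate copy-in to the device
    K(DeviceCopy);                      // simulate execution on the device
    Data = std::move(DeviceCopy);       // simulate copy-out to the host
    return true;
  }
};

// Hypothetical runtime front end: picks a plugin by name and falls back
// to plain host execution when the plugin is missing or fails.
class OffloadRuntime {
public:
  void registerPlugin(std::unique_ptr<RuntimePlugin> P) {
    assert(P && "plugin must not be null");
    std::string Name = P->name(); // read the name before moving P
    Plugins[Name] = std::move(P);
  }

  // Returns true if the kernel was offloaded, false if it ran on the host.
  bool run(const std::string &Target, const Kernel &K,
           std::vector<int> &Data) {
    auto It = Plugins.find(Target);
    if (It != Plugins.end() && It->second->available() &&
        It->second->launch(K, Data))
      return true;
    K(Data); // host fallback
    return false;
  }

private:
  std::map<std::string, std::unique_ptr<RuntimePlugin>> Plugins;
};
```

With only the debug plugin registered, a call such as run("debug", K, Data) goes through the simulated offload path, while a request for an unregistered target (say "dsp") silently falls back to the host. Keeping the fallback inside the runtime is one possible design choice; it keeps the transformed code identical regardless of which accelerators are present.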

So far, this work supports the Qualcomm DSP accelerator, but I am planning to extend it to support OpenCL accelerators. I have also developed a debug port where I can test the passes and the runtime without requiring an accelerator.

The project is still in an early R&D stage, and I am looking forward to feedback and to gauging the level of interest. I am willing to continue working on this as an open source project and to bring it into the right shape so it can be merged into the LLVM tree.


P.S. I intend to join the LLVM social in the Bay Area tonight and I will be more than happy to talk about it.


I can see even the homogeneous variant of this being useful. Just having the capability of extracting loops and wrapping them into functions and/or modules could help speed up performance analysis and experiments. It would also help with testing the basic infrastructure and heterogeneous environments.
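
As a source-level illustration of what "extracting a loop and wrapping it into a function" means, here is a small sketch. It is hand-written and only mimics the shape of the transformation; the function names are made up for the example, and an actual pass would do this on LLVM IR rather than on C++ source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before extraction: the loop lives inline in its caller.
int sumSquaresInline(const std::vector<int> &In) {
  int Sum = 0;
  for (std::size_t I = 0; I < In.size(); ++I)
    Sum += In[I] * In[I];
  return Sum;
}

// After extraction: the loop is a standalone function. Values the loop
// reads (In, N) become parameters; the value it produces (Sum) is
// passed back through a pointer.
void sumSquaresLoop(const int *In, std::size_t N, int *SumOut) {
  assert(SumOut && "live-out slot must be valid");
  int Sum = *SumOut;
  for (std::size_t I = 0; I < N; ++I)
    Sum += In[I] * In[I];
  *SumOut = Sum;
}

// The original caller now just sets up the arguments and calls the
// extracted function, which could be timed, swapped out or offloaded.
int sumSquaresOutlined(const std::vector<int> &In) {
  int Sum = 0;
  sumSquaresLoop(In.data(), In.size(), &Sum);
  return Sum;
}
```

Once the loop has its own function boundary, it can be benchmarked or replaced in isolation, which is what makes this useful even when no accelerator is involved.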


This sounds really cool. I’m thinking about FPGA offloading. -Rich

Hi Chris-

Are you offloading only the vectorizable loops, or are you also looking at auto-par loops?



Hi Christos,

Your idea can certainly go very far, and the capability of extracting and executing one loop nest at a time should enable progress in multiple directions, among them fast performance evaluation of loop transformations and of the benefits of accelerators. It could also be useful for measuring and tuning offloading overhead.


Hi Gerolf,

Thanks for the interest. I agree that this project may help in different directions, and this is the reason I tried to keep it modular and independent of particular use cases.


Hi Dibyendu,

The design is quite modular. It can support offloading for serial execution, vectorized execution or parallel execution. I will provide more details soon.


Hi Richard,

Having an OpenCL plugin would simplify the use of both GPUs and FPGAs. However, I am not sure if programming FPGAs with OpenCL is mature enough.