Supporting heterogeneous computing in LLVM, follow-up.


As I first e-mailed yesterday, I have been working on a Heterogeneous Execution Engine (Hexe) that provides compiler and runtime support for the automatic and transparent offloading of loop and function workloads to accelerators.

Hexe is composed of the following:

a) Analysis and transformation passes that extract loops or functions for offloading.
b) A runtime library that handles scheduling, data sharing, and coherency between the host and accelerator sides.
c) A modular codebase and design: adaptors specialize the code transformations for the target accelerators, and runtime plugins manage the interaction with the different accelerator environments.

I have prepared a presentation that I would like to share; it provides a high-level overview of the work. You can find it here:

In the coming days I will push patches to Phabricator so that people can give detailed feedback on the code and design. This may also help with forming strategies for what needs to be done or changed.


Hi Christos,

I’ve taken a look at your slide deck and have been thinking about how to do this for a while as well. I definitely think this is a good start and am looking forward to the patches. I think getting the right compilation strategy is going to be important here, and it’s going to take quite a bit of thought to work through what you’ve got; the patches are likely going to illuminate this quite a bit more. I’m also curious about any syntactic sugar (à la CUDA, etc.) that you’re thinking about here. There are some reasonable starts in LLVM already for OpenCL/CUDA, and I’m curious how you see this extending those efforts in a more general fashion.

I’m looking forward to more work along these lines.



Hello Chris,

Your work sounds very interesting.

We are working on something quite similar. Our system, called BAAR
(Binary Acceleration at Runtime), analyzes an application in LLVM IR at
runtime, identifies hotspots, and generates parallelized and vectorized
implementations on the fly; the offloading itself is then done transparently.
In contrast to your work, we are targeting the Intel Xeon Phi accelerator
(Intel Many Integrated Core architecture, MIC). To detect suitable loops we
use Polly, and to obtain the binary for the accelerator we leverage the
Intel compilers. The overall architecture is basically client/server-style.

You can find more details in our publication. The project is open source,
available at

We are looking forward to hearing and seeing more of Hexe.


Hi Eric,

I agree that handling offloading and multiple architectures can be tricky and that the design needs careful thought. I believe my design makes some sense :) but I don’t “sell” it as a final solution; many issues need to be considered and examined carefully. Apologies for delaying the patches. I am presenting at a conference this week, so I have to prepare a presentation, etc.

Hi Heinrich,

It looks quite interesting, and I am definitely interested in auto-parallelization and auto-vectorization for Hexe. My code is on Phabricator; however, I have just organized it as two huge patches for now, and I need to provide more information about it.