Supporting heterogeneous computing in LLVM

Christos,

We would be very interested in learning more about this.

In my group, we (Prakalp Srivastava, Maria Kotsifakou, and I) have been working on LLVM extensions to make it easier to target a wide range of accelerators in a heterogeneous mobile device, such as Qualcomm's Snapdragon and other APUs. Our approach has been (a) to add better abstractions of parallelism to the LLVM instruction set that can be mapped down to a wide range of parallel hardware accelerators, and (b) to develop optimizing "back-end" translators that generate efficient code for the accelerators from the extended IR.

So far, we have been targeting GPUs and vector hardware, but semi-custom (programmable) accelerators are our next goal. We have discussed DSPs as a valuable potential target as well.

Judging from the brief information here, I'm guessing that our projects are quite complementary. We have not worked on the extraction passes, scheduling, or other run-time components you mention, and would be happy to use an existing solution for those. Our hope is that the IR extensions and translators will give your schedulers greater flexibility to retarget the extracted code components to different accelerators.

--Vikram S. Adve
Visiting Professor, School of Computer and Communication Sciences, EPFL
Professor, Department of Computer Science
University of Illinois at Urbana-Champaign
vadve@illinois.edu
http://llvm.org

Hello,

Thank you very much for the feedback. I believe that the heterogeneous engine should be strongly connected with the parallelization and vectorization efforts: most accelerators are parallel architectures, where efficient parallelization and vectorization are critical for performance.

I am interested in these efforts and I hope that my code can help you manage the offloading operations. Your LLVM instruction set extensions may require some changes in the analysis code, but I think that is going to be straightforward.

I am planning to post my code on Phabricator in the next few days.

thanks,
Chris

If you're doing the extraction at the loop and LLVM IR level - why
would you need to modify the IR? Wouldn't the target-level lowering
happen later?

How are you actually determining what to offload? Is this tied to
directives, or does it use heuristics plus some set of restrictions?

Lastly, are you handling 2 targets in the same module, or do you end up
emitting 2 modules and dealing with recombining things later?

It's not currently possible to do this with the current structure without some significant and, honestly, icky patches.

-eric

What's not possible? I agree some of our local patches and design may
not make it upstream as-is, but we are offloading to 2+ targets using
LLVM IR *today*.

IMHO you must (re)solve the problem of handling multiple targets
concurrently. That means 2 targets in a single Module, or 2 Modules
basically glued one after the other.


> What's not possible? I agree some of our local patches and design may
> not make it upstream as-is, but we are offloading to 2+ targets using
> LLVM IR *today*.

I'm not sure how much clearer I can be. It's not possible, in the same module, to handle multiple targets at the same time.

> IMHO you must (re)solve the problem of handling multiple targets
> concurrently. That means 2 targets in a single Module, or 2 Modules
> basically glued one after the other.

Patches welcome.

-eric

While I appreciate your taste in music - canned (troll) replies are
typically a waste of time.

This is uncalled for and unacceptable. I've done an immense amount of work so that we can support different subtargets in the same module and get better LTO and target features. If you have a feature above and beyond what I've been able to do (and you say you do), then a request for patches is more than acceptable as a response. I've yet to see any work from you, and a lot of talk about what other people should do.

-eric

Umm… don't get your feathers ruffled - you provided *zero* content
and I was just saying it wasn't impossible. To pop back all huffy is
just funny.

Anyway, to bring this conversation back to something technical instead
of just stupid comments: I'd agree that flipping targets back and
forth (intermixed) in the same Module *is* probably a substantial
amount of work. If the optimization passes worked at a PU (program
unit), aka function, level it wouldn't be.

Why can't you append 1 Module after another and switch?

As you point out whole program analysis/optimization will face a
similar problem - same question as above.

> Umm… don't get your feathers ruffled - you provided *zero* content
> and I was just saying it wasn't impossible. To pop back all huffy is
> just funny.

I can say the same and calling my post trolling was unacceptable.

> Anyway, to bring this conversation back to something technical instead
> of just stupid comments: I'd agree that flipping targets back and
> forth (intermixed) in the same Module *is* probably a substantial
> amount of work. If the optimization passes worked at a PU (program
> unit), aka function, level it wouldn't be.

It's just another level of indirection essentially - and a lot of work. It's much easier to do what's being proposed and outline work into another module. To do what you've said (and I've looked at it) is basically turning each function into its own little module - a la what the ORC JIT does with per-function compilation.

> Why can't you append 1 Module after another and switch?

This is, effectively, two modules, and it'll behave the same. The reasons are data transfer, module-level attributes, data layout, etc. We've still got some lingering issues at the function level, let alone at the module level with side data taking over. Akira and I are working on them as we can.

> As you point out whole program analysis/optimization will face a
> similar problem - same question as above.

> Currently - (I don't know about DSP - TI/Qualcomm), but most people in
> the industry are using custom runtimes to parse the GPU code and
> load/execute. It would be great if the linker/loader actually had
> better support for this built-in.
>
> I don't know the exact capabilities of the gnu/sun linker/loader, but
> something along the lines of mangling the function to also include
> target details, so the compiler would emit multiple mangled versions
> of foo() and the linker/loader could pick the most optimized.
>
> Something like this:
> nvc0_foo
> avx2_foo
> avx512_foo
>
> (Also I'd agree that the above would be quite hard)

There’s quite a bit of work in this direction in a lot of different ways. You can take a look at the gnu ifunc ELF extensions as a way of doing this on a per-subtarget feature level. The obvious extension of this to accelerators is something that we’ve had discussions about (GNU Tools Cauldron a couple of years ago) and I believe it’s been discussed as part of a C++ working group.
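For concreteness, here is roughly what the per-subtarget-feature case looks like with GNU ifunc on ELF - a minimal sketch, assuming GCC or Clang on an ELF target; foo and the AVX2/scalar split are made-up examples:

    // Baseline implementation, always available.
    extern "C" void foo_scalar(float *out, const float *in, int n) {
      for (int i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
    }

    // AVX2 build of the same routine (the compiler may vectorize it).
    extern "C" __attribute__((target("avx2")))
    void foo_avx2(float *out, const float *in, int n) {
      for (int i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
    }

    // The resolver runs once at load time and returns the implementation
    // that calls to foo() should bind to.
    extern "C" void (*resolve_foo(void))(float *, const float *, int) {
      __builtin_cpu_init();
      return __builtin_cpu_supports("avx2") ? foo_avx2 : foo_scalar;
    }

    // foo is an ifunc symbol: the dynamic linker runs the resolver and
    // binds the PLT entry to its result.
    extern "C" void foo(float *out, const float *in, int n)
        __attribute__((ifunc("resolve_foo")));

The accelerator analogue is harder precisely because "is the device present and able" is a driver/runtime question rather than a cpuid check, which is where the custom runtimes mentioned above come in.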

At any rate, it’s a much bigger discussion than a weekend on the mailing list, but there’s been some thought about how it’ll need to happen on each architecture/OS and, as you can tell, it’s a matter of ongoing experimentation and development. (References: CUDA work, Movidius work, etc).

-eric

> It's just another level of indirection essentially - and a lot of
> work. It's much easier to do what's being proposed and outline work
> into another module. To do what you've said (and I've looked at it) is
> basically turning each function into its own little module - a la what
> the ORC JIT does with per-function compilation.

/* Non-JIT example - the old Pro64/MIPSPro from SGI is per-PU as well.
I'm not sure what kernelgen is doing. */

I'm not sure I was clear - I'll try to elaborate.

You take the region of code, CUDA kernel, etc. being offloaded and
outline it into a separate PU (function), which goes into a new module,
which is appended to the first.

This isn't exactly the clang model today, but *if* llvm is a library,
it's easier to handle the 2 modules one after the other.
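That outlining step already exists in-tree as a utility. A minimal sketch using LLVM's CodeExtractor (llvm/Transforms/Utils/CodeExtractor.h); its constructor and extractCodeRegion signatures have shifted across releases, so treat this as illustrative:

    #include "llvm/ADT/ArrayRef.h"
    #include "llvm/IR/BasicBlock.h"
    #include "llvm/IR/Function.h"
    #include "llvm/Transforms/Utils/CodeExtractor.h"

    using namespace llvm;

    // Outline a single-entry region into a fresh function in the same
    // module. Inputs and outputs of the region are discovered
    // automatically and become parameters of the outlined function.
    Function *outlineRegion(ArrayRef<BasicBlock *> Blocks) {
      CodeExtractor Extractor(Blocks);
      if (!Extractor.isEligible())
        return nullptr;  // e.g. a region with multiple entry points
      return Extractor.extractCodeRegion();
    }

The outlined function can then be cloned into the accelerator module, which is what makes the "glue a second module after the first" model workable.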

> This is, effectively, two modules, and it'll behave the same. The
> reasons are data transfer, module-level attributes, data layout, etc.
> We've still got some lingering issues at the function level, let alone
> at the module level with side data taking over. Akira and I are
> working on them as we can.

cool - good to hear.


> There's quite a bit of work in this direction in a lot of different
> ways. You can take a look at the gnu ifunc ELF extensions as a way of
> doing this on a per-subtarget feature level. The obvious extension of
> this to accelerators is something that we've had discussions about
> (GNU Tools Cauldron a couple of years ago) and I believe it's been
> discussed as part of a C++ working group.

The ifunc stuff doesn't behave exactly as I'd like; it's sorta close.
Another example: on Solaris, at boot time, they have a check for the
system capabilities and mount over libc/libm with the most optimized
version the system is capable of. When I first saw this (many years
ago) I thought it was quite clever and cool. Doing that for
accelerators wouldn't exactly work, though, since they can hang and be
(slightly?) less reliable than the CPU (not to mention busy).

The upside to this is less work for the loader. The downside is you
have to build multiple versions of libc and friends.

> At any rate, it's a much bigger discussion than a weekend on the
> mailing list, but there's been some thought about how it'll need to
> happen on each architecture/OS and, as you can tell, it's a matter of
> ongoing experimentation and development. (References: CUDA work,
> Movidius work, etc.)

Yeah, I agree - I probably won't be sending a patch any time soon, but
I thought I could ask questions around designs that I know have
functionally worked.

Chris,

Have you seen the offloading infrastructure design proposal at
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084986.html ?
It relies on the long-standing OpenMP standard, with recent updates to
support heterogeneous computation.
Could you please review it and comment on how it fits your needs?

It's not quite clear from your proposal what source language standard
you plan to support - you just mention that OpenCL will be one of
your backends, as far as I got it. What's your plan on sources -
C/C++/Fortran?
How would you control the offloading, data transfer, scheduling, and so
on? Will it be new language constructs, similar to parallel_for in
Cilk Plus, or will it be pragma-based like OpenMP or OpenACC?

The design I mentioned above has an operable implementation for the
NVIDIA target at

https://github.com/clang-omp/llvm_trunk
https://github.com/clang-omp/clang_trunk

with runtime implemented at

https://github.com/clang-omp/libomptarget

You're welcome to try it out if you have an appropriate device.

Regards,
Sergos

Hi Sergos,

I'd like to try this on our hardware. Is there some example code that I could use to get started?

Cheers,
  Roel

Roel,

You have to check out and build llvm/clang as usual.
For runtime support you'll have to build libomptarget and write a
plugin for your target. Samuel can help you some more.
As for OpenMP examples, I can recommend
http://openmp.org/mp-documents/OpenMP4.0.0.Examples.pdf -
look into the target constructs.

Sergos
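For reference, the target constructs Sergos points to look like this - a minimal sketch based on the OpenMP 4.0 spec; the saxpy kernel is just an illustration:

    // Offload a parallel loop to the default device; the map clauses
    // describe which data moves to the device and back.
    void saxpy(int n, float a, const float *x, float *y) {
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }

With the clang-omp branches below, the loop body is compiled into a separate target image, and the map clauses become libomptarget data-transfer calls.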

Hi Roel, Chris,

This is a summary of how you can add support for a different offloading device on top of what we have in GitHub for OpenMP:

a) Download and install llvm (https://github.com/clang-omp/llvm_trunk) and clang (https://github.com/clang-omp/clang_trunk) as usual.

b) Install the official LLVM OpenMP runtime library from openmp.llvm.org. Clang will expect it to be present in your library path in order to compile OpenMP code (even if you do not need any OpenMP feature other than offloading).

c) Install https://github.com/clang-omp/libomptarget (running 'make' should do it). This library implements the API that controls offloading. It also contains, in ./RTLs, a set of plugins for the targets we are testing this with - x86_64, powerpc64 and NVPTX. You will need to implement a plugin for your target as well. The interface used by these plugins is detailed in the document proposed in http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084986.html, and you can look at the existing plugins for a hint (there is also a sketch of the required entry points after this list). In a nutshell, you have to implement code that allocates and moves data to your device, returns a table of entry points and global variables given a device library, and launches execution of a given entry point with a provided list of arguments.

d) The current implementation expects the device library to use the ELF format. There is no reason for that other than that the platforms we have tested so far use ELF. If your device does not use ELF, __tgt_register_lib() (src/omptarget.cpp) would have to be extended to understand your desired format. Otherwise you may just update src/targets_info.cpp with your ELF ID and plugin name.

e) Offloading is driven by clang, so it has to be aware of the toolchain required by your device. If your device toolchain is not implemented in clang, you would have to do that in lib/Driver/ToolChains.cpp.

f) Once everything is in place, you can compile your code by running something like "clang -fopenmp -omptargets=your-target-triple app.c". If you do separate compilation, you will see that two different files are generated for a given source file (the target file has the suffix tgt-your-target-triple).
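To give a feel for step (c), here is roughly the shape of a plugin, using entry points the existing RTLs expose. The signatures below are approximate - the proposal document and the plugins in ./RTLs are the authoritative reference:

    #include <cstdint>

    // Opaque descriptors defined by libomptarget.
    struct __tgt_device_image;  // one device binary embedded by the compiler
    struct __tgt_target_table;  // entry points/globals found in that binary

    extern "C" {

    // Device discovery and initialization.
    int32_t __tgt_rtl_number_of_devices();
    int32_t __tgt_rtl_init_device(int32_t device_id);

    // Say whether this plugin can handle a given image (e.g. check the
    // ELF ID), then load it and build the table of entry points/globals.
    int32_t __tgt_rtl_is_valid_binary(__tgt_device_image *image);
    __tgt_target_table *__tgt_rtl_load_binary(int32_t device_id,
                                              __tgt_device_image *image);

    // Data management: allocate on the device, copy to/from it, free.
    void *__tgt_rtl_data_alloc(int32_t device_id, int64_t size);
    int32_t __tgt_rtl_data_submit(int32_t device_id, void *tgt_ptr,
                                  void *hst_ptr, int64_t size);
    int32_t __tgt_rtl_data_retrieve(int32_t device_id, void *hst_ptr,
                                    void *tgt_ptr, int64_t size);
    int32_t __tgt_rtl_data_delete(int32_t device_id, void *tgt_ptr);

    // Launch one entry point with an array of device pointers as args.
    int32_t __tgt_rtl_run_target_region(int32_t device_id,
                                        void *tgt_entry_ptr,
                                        void **tgt_args, int32_t num_args);

    } // extern "C"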

I should say that, in general, OpenMP requires a runtime library for the device as well; however, if you do not use any OpenMP pragmas inside your target code you won't need that.

We have started porting the offloading code currently in GitHub to clang upstream. The driver support is currently under review in http://reviews.llvm.org/D9888. We are about to send our first offloading codegen patches as well.

I understand that what Chris is proposing is somewhat different from what we have in place, given that his transformations are intended to work on LLVM IR. However, the goal seems to be the same. I hope the summary above gives you some hints on whether your use cases can be accommodated.

Feel free to ask any questions you may have.

Thanks!

Samuel

In fact, I have two modules:
a) the Host one
b) the Accelerator one

Each one gets compiled independently. The runtime takes care of the offloading operations and loads the accelerator code. Imagine that you want to compile for amd64 and NVIDIA PTX: you cannot do it in a single module, and even if you supported it, it would get scary. How are you going to handle, in a nice way, architecture differences that affect the IR? E.g. pointer size, stack alignment, and much more…

--chris
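Chris's point about architecture differences is visible directly at the IR level: each module carries its own target triple and data layout, and those disagree between host and accelerator. A small sketch with the LLVM C++ API - the datalayout strings are abbreviated from the usual x86-64 and nvptx64 ones and should be double-checked against the backends:

    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"

    using namespace llvm;

    int main() {
      LLVMContext Ctx;

      // Host module: x86-64 Linux.
      Module Host("host", Ctx);
      Host.setTargetTriple("x86_64-unknown-linux-gnu");
      Host.setDataLayout("e-m:e-i64:64-f80:128-n8:16:32:64-S128");

      // Accelerator module: NVPTX, with its own widths/alignment rules.
      Module Accel("accel", Ctx);
      Accel.setTargetTriple("nvptx64-nvidia-cuda");
      Accel.setDataLayout("e-i64:64-v16:16-v32:32-n16:32:64");

      // Pointer size, struct padding, stack alignment, legal vector
      // widths: all are per-module properties, so a single module mixing
      // the two targets has no consistent answer to give the passes.
      return 0;
    }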

Hello,

I can see some fundamental differences between this work and my work; however, I think they are more complementary than “competitive”. My work handles code extraction for offloading (at the IR level) and offloading control with a runtime library. This design is portable and not limited to OpenMP or any other specific annotation scheme: it can handle different types of source code for offloading, e.g. sequential code, parallel loops, or even OpenCL kernels.

The runtime library is responsible for managing communication, coherency and scheduling. The library exposes a simple interface and the compiler generates calls to it. Plugins then provide support for the individual accelerator types.
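The actual patches aren't shown in the thread, so purely as an illustration, an interface of this general shape would match that description - every name below is hypothetical:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical sketch of the runtime interface described above: the
    // compiler outlines a region and emits calls like these around it.
    // None of these names come from the actual patches.
    extern "C" {

    // Ask the scheduler for a device able to run an extracted region;
    // plugins register one backend per accelerator type.
    int32_t hetero_acquire_device(uint64_t region_id);

    // Coherency management for buffers shared between host and device.
    void *hetero_map_buffer(int32_t device, void *host_ptr, size_t bytes);
    void hetero_sync_to_host(int32_t device, void *host_ptr);

    // Run the outlined entry point, falling back to the host version
    // when no suitable device is available.
    int32_t hetero_launch(int32_t device, const char *entry_name,
                          void **args, int32_t num_args);

    } // extern "C"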

The scheme you refer to could be supported on top of my infrastructure. I personally believe that code extraction and transformations for offloading should be done at the IR level, not the source level. The reason is that at the IR level you have enough information about your program (e.g. data types) and a good idea about your target architectures.

--chris

When you're detecting which region of code to offload, can you also
detect the difference between a compute-bound kernel and a
memory-bound one?

Hi Sergos and Samuel,

Thanks for the links, I've got it mostly working now.

I still have a problem with linking the code. It seems that the clang driver doesn't pass its library search path to nvlink when linking the generated CUDA code to the target library, resulting in it not correctly finding libtarget-nvptx.a. Is there some flag or environment variable that I should set here? Manually providing nvlink with a -L flag pointing to the appropriate path seems to work for the linking step.

Cheers,
  Roel

Hi Roel,

You'd have to set LIBRARY_PATH to point to where libtarget-nvptx.a lives. At this moment we are not translating the -L options for the target; they are considered to be meant for the host only. I should probably extend the documentation to explain this detail.

Thanks,
Samuel