[PROPOSAL] LLVM multi-module support

Hi,

A couple of weeks ago I discussed with Peter how to improve LLVM's support for heterogeneous computing. One weakness we (and others) have seen is the absence of multi-module support in LLVM. Peter came up with a nice idea for how to improve the situation, and I would like to put it up for discussion.

## The problem ##

LLVM-IR modules can currently only contain code for a single target architecture. However, there are multiple use cases where one translation unit could contain code for several architectures.

1) CUDA

CUDA source files can contain both host and device code. The absence of multi-module support complicates adding CUDA support to clang, as clang would need to perform multi-module compilation on top of a single-module based compiler framework.

2) C++ AMP

C++ AMP [1] contains - similarly to CUDA - both host code and device code in the same source file. Even though C++ AMP is a Microsoft extension, the use case itself is relevant to clang. It would be great if LLVM provided infrastructure such that front-ends could easily target accelerators. This would probably enable a lot of interesting experiments.

3) Optimizers

To offload computations to an accelerator fully automatically, an optimization pass needs to extract the computation kernels and schedule them as separate kernels on the device. Such kernels are normally LLVM-IR modules for different architectures. At the moment, passes have no way to create and store new LLVM-IR modules, and there is no way to reference kernel LLVM-IR modules from a host module (which is necessary to pass them to the accelerator run-time).
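
With the current infrastructure, a pass essentially has to serialize the kernel and embed it as an opaque byte string in the host module, roughly as in the sketch below (the run-time entry point @accel_launch_kernel and the string contents are hypothetical placeholders):

------------------------------------------------------------------------
; Host module for x86_64. The kernel only survives as an opaque string
; constant; @accel_launch_kernel stands in for whatever run-time call
; (CUDA, OpenCL, ...) finally receives it.
target triple = "x86_64-unknown-linux-gnu"

@kernel_ir = private unnamed_addr constant [64 x i8] c"...serialized kernel LLVM-IR or PTX..."

declare i32 @accel_launch_kernel(i8*)

define void @host_entry() {
  %kernel = getelementptr [64 x i8]* @kernel_ir, i64 0, i64 0
  %status = call i32 @accel_launch_kernel(i8* %kernel)
  ret void
}
------------------------------------------------------------------------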

## Goals ##

a) No major changes to existing tools and LLVM based applications

b) Human readable and writable LLVM-IR

c) FileCheck testability

d) Do not force a specific execution model

e) Unlimited number of embedded modules

## Detailed Goals ##

a)
  o No changes should be required if a tool does not use multi-module
    support. Each LLVM-IR file that is valid today should remain valid.

  o Major tools should support basic heterogeneous modules without large
    changes. Some of the commands that should work after minor
    adaptations:

    clang -S -emit-llvm file.c -o out.ll
    opt -O3 out.ll -o out.opt.ll
    llc out.opt.ll
    lli out.opt.ll
    bugpoint -O3 out.opt.ll

b) All (sub)modules should be directly human readable/writable.
    There should be no need to extract single modules before modifying
    them.

c) The LLVM-IR generated from a heterogeneous multi-module should
    easily be 'FileCheck'able. The same is true if a multi-module is
    the result of an optimization.

d) In CUDA/OpenCL/C++ AMP, kernels are scheduled from within the host
    code. This means arbitrary host code can decide under which
    conditions kernels are scheduled for execution. It is therefore
    necessary to reference individual sub-modules from within the host
    module.

e) CUDA/OpenCL allow compiling and scheduling an arbitrary number of
    kernels. We do not want to put an artificial limit on the number of
    modules used to represent them. This means a single embedded
    sub-module is not enough.

## Non Goals ##

o Modeling sub-architectures on a per-function basis

Functions could be specialized for a certain sub-architecture. This is useful for having certain functions optimized, e.g. with AVX2 enabled, while the overall program is compiled for a more generic architecture.
We do not address per-function annotations in this proposal.

## Proposed solution ##

To bring multi-module support to LLVM, we propose to add a new type called 'llvmir' to LLVM-IR. It can be used to embed LLVM-IR submodules
as global variables.
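
For example:

------------------------------------------------------------------------
target datalayout = ...
target triple = "x86_64-unknown-linux-gnu"

@llvm_kernel = private unnamed_addr constant llvm_kernel {
   target triple = nvptx64-unknown-unknown
   define internal ptx_kernel void @gpu_kernel(i8* %Array) {
     ...
   }
}
------------------------------------------------------------------------

By default the global will be compiled to an LLVM-IR string stored in the object file. We could also think about translating it to PTX or AMD's HSA-IL, such that e.g. PTX can be passed to a run-time library.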

Hi Tobias, I didn't really get it. Is the idea that the same bitcode is
going to be codegen'd for different architectures, or is each sub-module
going to contain different bitcode? In the latter case you may as well
just use multiple modules, perhaps in conjunction with a scheme to store
more than one module in the same file on disk as a convenience.

Ciao, Duncan.

In our project we combine regular binary code with LLVM IR code for kernels, embedded as a special data symbol of the ELF object. The kernel LLVM IR that exists at compile time is preliminary and may be optimized further at runtime (pointer analysis, Polly, etc.). During application startup, the runtime system builds an index of all kernel sources embedded in the executable. Host and kernel code interact by means of a special "launch" call, which does not just optimize, compile and execute the kernel, but first estimates whether doing so is worthwhile or whether it is better to fall back to the host-code equivalent.

The proposal made by Tobias is very elegant, but it seems to address the case where the host and sub-architecture code exist at the same time. May I kindly point out that, in our experience, the really efficient, deeply specialized sub-architecture code may simply not exist at compile time, while the generic baseline host code always can.

Best,

  - Dima.

2012/7/26 Duncan Sands <baldrick@free.fr>

I’m not convinced that having multi-module IR files is the way to go. It just seems like a lot of infrastructure/design work for little gain. Can the embedded modules have embedded modules themselves? How deep can this go? If not, then the embedded LLVM IR language is really a subset of the full LLVM IR language. How do you share variables between parent and embedded modules?

I feel that this can be better solved by just using separate IR modules. For your purposes, the pass that generates the device code can simply create a new module and the host code can refer to the generated code by name. Then, you can run each module through opt and llc individually, and then link them together somehow, like Dmitry’s use of ELF symbols/sections. This is exactly how CUDA binaries work; device code is embedded into the host binary as special ELF sections. This would be a bit more work on the part of your toolchain to make sure opt and llc are executed for each produced module, but the changes are far fewer than supporting sub-modules in a single IR file. This also has the benefit that you do not need to change LLVM at all for this to work.
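
At the IR level, such an embedding could look roughly like the sketch below (the section name, symbol name and payload are illustrative assumptions, not what the CUDA toolchain actually emits):

------------------------------------------------------------------------
; Host module: the already-compiled device code (PTX, or kernel bitcode
; as in Dmitry's setup) is stored as a byte array in a dedicated ELF
; section, where a run-time loader can find and index it at startup.
target triple = "x86_64-unknown-linux-gnu"

@gpu_kernel_image = private constant [64 x i8] c"...device PTX or bitcode...", section ".llvm.kernels"
------------------------------------------------------------------------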

Is there some particular use-case that just won’t work without sub-module support? I know you like using the example of “clang -o - | opt -o - | llc” but I’m just not convinced that retaining the ability to pipe tools like that is justification enough to change such a fundamental part of the LLVM system.

Hi Dmitry,

In our project we combine regular binary code with LLVM IR code for kernels,
embedded as a special data symbol of the ELF object. The kernel LLVM IR that
exists at compile time is preliminary and may be optimized further at runtime
(pointer analysis, Polly, etc.). During application startup, the runtime system
builds an index of all kernel sources embedded in the executable. Host and
kernel code interact by means of a special "launch" call, which does not just
optimize, compile and execute the kernel, but first estimates whether doing so
is worthwhile or whether it is better to fall back to the host-code equivalent.

in your case it doesn't sound like any modifications to what a module can hold
are needed, it's more a question of building stuff on top of the existing
infrastructure.

The proposal made by Tobias is very elegant, but it seems to address the case
where the host and sub-architecture code exist at the same time. May I kindly
point out that, in our experience, the really efficient, deeply specialized
sub-architecture code may simply not exist at compile time, while the generic
baseline host code always can.

I can't help feeling that Tobias is reinventing "tar", only upside down, and
rather than stuffing an archive inside modules he should be stuffing modules
inside an archive. But most likely I just completely failed to understand
where he's going.

Ciao, Duncan.

Tobias Grosser <tobias@grosser.es> writes:

o Modeling sub-architectures on a per-function basis

Functions could be specialized for a certain sub-architecture. This is
useful for having certain functions optimized, e.g. with AVX2 enabled, while
the overall program is compiled for a more generic architecture.
We do not address per-function annotations in this proposal.

Could this be accomplished using a separate module for the specialized
function of interest under your proposal?

## Proposed solution ##

To bring multi-module support to LLVM, we propose to add a new type
called 'llvmir' to LLVM-IR. It can be used to embed LLVM-IR submodules
as global variables.

------------------------------------------------------------------------
target datalayout = ...
target triple = "x86_64-unknown-linux-gnu"

@llvm_kernel = private unnamed_addr constant llvm_kernel {
   target triple = nvptx64-unknown-unknown
   define internal ptx_kernel void @gpu_kernel(i8* %Array) {
     ...
   }
}
------------------------------------------------------------------------

By default the global will be compiled to an LLVM-IR string stored in the
object file. We could also think about translating it to PTX or AMD's
HSA-IL, such that e.g. PTX can be passed to a run-time library.

Hmm...I'm not sure about this model. Not every accelerator execution
model out there takes code as a string. Some want natively-compiled
binaries.

From my point of view, Peter's idea lets us add multi-module
support in a way that allows us to reach the goals described above.
However, to properly design and implement it, early feedback would be
valuable.

I really don't like this at first glance. Anything that results in a
string means that we can't use normal tools to manipulate it. I
understand the string representation is desirable for some targets but
it seems to really cripple others. The object file output should at
least be configurable. Some targets might even want separate asm files
for the various architectures.

                                 -Dave

Duncan Sands <baldrick@free.fr> writes:

Hi Tobias, I didn't really get it. Is the idea that the same bitcode is
going to be codegen'd for different architectures, or is each sub-module
going to contain different bitcode? In the latter case you may as well
just use multiple modules, perhaps in conjunction with a scheme to store
more than one module in the same file on disk as a convenience.

I tend to agree. Why do we need a whole new submodule concept?

                              -Dave

"Dmitry N. Mikushin" <maemarcus@gmail.com> writes:

The proposal made by Tobias is very elegant, but it seems to address
the case where the host and sub-architecture code exist at the same time.

I don't know why that would have to be the case. Couldn't your
accelerator backend simply read in the proposed IR string and
optimize/codegen it?

May I kindly point out that, in our experience, the really efficient,
deeply specialized sub-architecture code may simply not exist at
compile time, while the generic baseline host code always can.

As I mentioned earlier, I am more concerned about the case where there
is no accelerator compiler executed at runtime. All the code for the
host and accelerator needs to be available in native format at run time.
A string representation in the object file doesn't allow that.

                                   -Dave

Couldn’t your accelerator backend simply read in the proposed IR string and optimize/codegen it?

Sure, it does, but that IR is a long way from the final target-specific IR to be specialized at runtime. And in the proposed design, both host and accelerator code seem to be intended for code generation before application execution. This is not always the case; moreover, it implicitly narrows the visible scope of use for Polly, which is much more powerful and can also work together with a JIT.

  - D.

2012/7/26 <dag@cray.com>

Hi Duncan,

thanks for your reply.

The proposal allows both: sub-modules that contain different bitcode and sub-modules that are code generated differently.

Different bitcode may arise from sub-modules that represent different program parts, but also from creating different sub-modules for a single program part, e.g. to optimize for specific hardware.

In the back-end, sub-modules could be code generated according to the
requirements of the run-time system that will load them. For NVIDIA
chips we could code generate PTX, for AMD systems AMD-IL may be an option.

You and several others (e.g. Justin) pointed out that multi-modules in LLVM-IR (or the llvm.codegen intrinsic) just reinvent the tar archive system. I can follow your thoughts here.

Thinking of how to add CUDA support to clang, a possible approach is to modify clang to emit device and host code into different modules, compile each module separately, and then add logic to clang to merge the two modules at the end. This is a very reasonable approach, and there is no doubt that adding multi-module support to LLVM just to simplify this single use case is not the right thing to do.

With multi-module support I am aiming for something else. As you know, LLVM allows optimizer plugins to be "-load"ed at run time, and every LLVM-based compiler, be it clang/ghc/dragonegg/lli/..., can take advantage of them with almost no source code changes. I believe this is a very nice feature, as it allows new optimizations to be prototyped and tested easily, without any changes to the core compilers themselves. This works not only for simple IR transformations; even autoparallelisation works well, as calls to libgomp can easily be added.

The next step we were looking into was automatically offloading some calculations to an accelerator. This is actually very similar to OpenMP parallelisation, but, instead of calls to libgomp, calls to libcuda or libopencl need to be scheduled. The only major difference is that the kernel code is not just a simple function in the host module, but an entirely new module. Hence an optimizer somehow needs to extract those modules and pass a reference to them to the CUDA or OpenCL runtime.
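
Under the proposal, the result of such an extraction could be sketched roughly as follows (the run-time entry point @cuda_launch and the lowering of an embedded sub-module to an i8* are assumptions for illustration, not part of the proposal itself):

------------------------------------------------------------------------
target triple = "x86_64-unknown-linux-gnu"

; Sub-module created by the offloading pass.
@kernel = private unnamed_addr constant llvm_kernel {
  target triple = nvptx64-unknown-unknown
  define internal ptx_kernel void @gpu_kernel(i8* %Array) {
    ...
  }
}

; Hypothetical run-time interface (in practice libcuda or libopencl).
declare i32 @cuda_launch(i8* %kernel, i8* %args)

define void @host_fn(i8* %args) {
  ; The extracted loop has been replaced by a launch of the sub-module.
  %r = call i32 @cuda_launch(i8* bitcast (llvm_kernel* @kernel to i8*), i8* %args)
  ret void
}
------------------------------------------------------------------------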

The driving motivation for my proposal was to extend LLVM such that optimization passes for heterogeneous architectures can be run in LLVM-based compilers with little or no change to the compiler source code. I think having this functionality will allow people to test new ideas more easily and will avoid the need for each project to create its own tool chain. It will also allow one optimizer to work with most tools (clang/ghc/dragonegg/lli) without the need for larger changes.

From the discussion about our last proposal, the llvm.codegen() intrinsic, I concluded that people were mostly concerned about interpreting arbitrary strings embedded into an LLVM-IR file, and that explicit LLVM-IR extensions were suggested as one possible solution. So I was hoping this proposal could address some of the previously raised concerns. However, apparently people do not really see a need for stronger support for heterogeneous compilation directly within LLVM. Or, the other way around, I fail to see how to achieve the same goals with the existing infrastructure or some of the suggestions people made. I will probably need to study some of the ideas that were pointed out.

Thanks again for your feedback

Cheers
Tobi

Hi Dmitry,

the proposal did not mean to say that all code needs to be optimized and target code generated at compile time. You may very well retain some kernels as LLVM-IR code and pass this code to your runtime system (similar to how CUDA or OpenCL currently accept kernel code).

Btw, one question I always wanted to ask: what is the benefit of having the kernel embedded as a data symbol in the ELF object, in contrast to having it as a global variable (which is then passed to the run-time)? I know Cell mainly used ELF symbols, but e.g. OpenCL reads kernels by passing a pointer to the kernel string to the run-time library. Can you point out the difference to me?

Cheers and thanks
Tobi

I'm not convinced that having multi-module IR files is the way to go.
  It just seems like a lot of infrastructure/design work for little
gain. Can the embedded modules have embedded modules themselves? How
deep can this go? If not, then the embedded LLVM IR language is really
a subset of the full LLVM IR language. How do you share variables
between parent and embedded modules?

I don't have final answers to these questions, but here are my current thoughts: I do not see a need for deeply nested modules, but I also don't see a big problem with them. Variables between parent and embedded modules are not shared. They are within separate address spaces.

I feel that this can be better solved by just using separate IR modules.
  For your purposes, the pass that generates the device code can simply
create a new module and the host code can refer to the generated code by
name. Then, you can run each module through opt and llc individually,
and then link them together somehow, like Dmitry's use of ELF
symbols/sections. This is exactly how CUDA binaries work; device code
is embedded into the host binary as special ELF sections. This would be
a bit more work on the part of your toolchain to make sure opt and llc
are executed for each produced module, but the changes are far fewer
than supporting sub-modules in a single IR file. This also has the
benefit that you do not need to change LLVM at all for this to work.

Is there some particular use-case that just won't work without
sub-module support? I know you like using the example of "clang -o - |
opt -o - | llc" but I'm just not convinced that retaining the ability to
pipe tools like that is justification enough to change such a
fundamental part of the LLVM system.

As I mentioned to Duncan, I agree with you that for a specific tool chain, the approach you mentioned is probably best. However, I am aiming for something more generic: optimizer plugins that can be used in various LLVM-based compilers without the need for larger changes to each of these compilers. Do you think that is a useful goal?

Thanks for your feedback
Tobi

Tobias Grosser <tobias@grosser.es> writes:

o Modeling sub-architectures on a per-function basis

Functions could be specialized for a certain sub-architecture. This is
useful for having certain functions optimized, e.g. with AVX2 enabled, while
the overall program is compiled for a more generic architecture.
We do not address per-function annotations in this proposal.

Could this be accomplished using a separate module for the specialized
function of interest under your proposal?

In my proposal, different modules have different address spaces. Also, I don't aim to support function calls across module boundaries. So having a separate module for this function does not seem to be a solution.

## Proposed solution ##

To bring multi-module support to LLVM, we propose to add a new type
called 'llvmir' to LLVM-IR. It can be used to embed LLVM-IR submodules
as global variables.

------------------------------------------------------------------------
target datalayout = ...
target triple = "x86_64-unknown-linux-gnu"

@llvm_kernel = private unnamed_addr constant llvm_kernel {
    target triple = nvptx64-unknown-unknown
    define internal ptx_kernel void @gpu_kernel(i8* %Array) {
      ...
    }
}
------------------------------------------------------------------------

By default the global will be compiled to an LLVM-IR string stored in the
object file. We could also think about translating it to PTX or AMD's
HSA-IL, such that e.g. PTX can be passed to a run-time library.

Hmm...I'm not sure about this model. Not every accelerator execution
model out there takes code as a string. Some want natively-compiled
binaries.

If LLVM provides an object code emitter for the relevant back-end, we could also think about emitting native binaries. Storing the assembly as a string is just a 'default' output.

  From my point of view, Peter's idea lets us add multi-module
support in a way that allows us to reach the goals described above.
However, to properly design and implement it, early feedback would be
valuable.

I really don't like this at first glance. Anything that results in a
string means that we can't use normal tools to manipulate it. I
understand the string representation is desirable for some targets but
it seems to really cripple others. The object file output should at
least be configurable. Some targets might even want separate asm files
for the various architectures.

I see the 'string' output just as a default, but I would hope we can provide other outputs as needed. Do you see any reason why we could not emit native code for some of the embedded sub-modules?

Thanks for your feedback
Tobi

Hi Tobias,

What is the benefit of having the kernel embedded as a data symbol in the ELF object, in contrast to having it as a global variable

This is for the conventional link step & LTO. During compilation we allow kernels to depend on each other, and these dependences are resolved during linking. The whole process is built on top of gcc and its existing collect2/lto1 mechanisms. As a result, we have hybrid objects/libraries/binaries containing two independent representations: the regular binary output from gcc and a set of LLVM IR kernels, each operated through its own entry point. And now the code is not in the data section, but in a special one, similar to __gnu_lto_v1 for gcc's LTO.

One question I thought of while replying to yours: do you see your team focusing more on infrastructure work or on polyhedral analysis development?

The quality of CLooG/Polly is what we ultimately rely on in the first place. All other things are a lot simpler. You will see: ecosystems, applications and testbeds will grow by themselves once the core concepts are strong. There is probably no need to spend resources on leading the way for them in engineering topics. But they may wither soon if the math does not grow at the same speed. Just an opinion.

Best,

  - D.

2012/7/29 Tobias Grosser <tobias@grosser.es>

I'm not convinced that having multi-module IR files is the way to go.
  It just seems like a lot of infrastructure/design work for little
gain. Can the embedded modules have embedded modules themselves? How
deep can this go? If not, then the embedded LLVM IR language is really
a subset of the full LLVM IR language. How do you share variables
between parent and embedded modules?

I don't have final answers to these questions, but here are my current thoughts: I do not see a need for deeply nested modules, but I also don't see a big problem with them. Variables between parent and embedded modules are not shared. They are within separate address spaces.

But some targets may allow sharing variables; how would this be implemented?

I feel that this can be better solved by just using separate IR modules.
  For your purposes, the pass that generates the device code can simply
create a new module and the host code can refer to the generated code by
name. Then, you can run each module through opt and llc individually,
and then link them together somehow, like Dmitry's use of ELF
symbols/sections. This is exactly how CUDA binaries work; device code
is embedded into the host binary as special ELF sections. This would be
a bit more work on the part of your toolchain to make sure opt and llc
are executed for each produced module, but the changes are far fewer
than supporting sub-modules in a single IR file. This also has the
benefit that you do not need to change LLVM at all for this to work.

Is there some particular use-case that just won't work without
sub-module support? I know you like using the example of "clang -o - |
opt -o - | llc" but I'm just not convinced that retaining the ability to
pipe tools like that is justification enough to change such a
fundamental part of the LLVM system.

As I mentioned to Duncan, I agree with you that for a specific tool chain, the approach you mentioned is probably best. However, I am aiming for something more generic: optimizer plugins that can be used in various LLVM-based compilers without the need for larger changes to each of these compilers. Do you think that is a useful goal?

I think that the same can be achieved using already-existing functionality, like archives. Granted, right now the command-line tools cannot directly process archives containing bit-code files, but I believe it would be more beneficial to support that than to implement nested bit-code files. In any optimizer, you would have to set up a different pass chain for different architectures anyway.

I feel that it would be reasonable to allow clang/opt to produce an archive with multiple bit-code files instead of a single module as they do today.

Part of the issue I see with nested modules is how to invoke the optimizer. To get the most performance out of the code, you'll probably have to pass different options to opt for the host and device code. So wouldn't you need to invoke opt multiple times anyway?
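
For illustration, an archive-based flow might look roughly like this (the multi-output clang step and any archive-aware tool support are hypothetical at this point, and the file names are made up; only the per-module opt/llc invocations work today):

    clang -c -emit-llvm file.c              # would emit host.bc plus kernel.bc
    llvm-ar rc file.bca host.bc kernel.bc   # bundle the per-target modules
    opt -O3 host.bc -o host.opt.bc          # separate pass pipelines per module
    opt -O3 kernel.bc -o kernel.opt.bc
    llc -march=x86-64 host.opt.bc
    llc -march=nvptx64 kernel.opt.bc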

Justin Holewinski <justin.holewinski@gmail.com> writes:

I think that the same can be achieved using already-existing
functionality, like archives. Granted, right now the command-line tools
cannot directly process archives containing bit-code files, but I
believe it would be more beneficial to support that than implementing
nested bit-code files.

+1!

                                   -Dave