[PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation

Hi guys,

Just catching up on an interesting thread :slight_smile:

I believe this can be a way worth going,
but I doubt now is the right moment for it. I don't share your opinion
that it is easy to move LLVM-IR in this direction, but I rather believe
that this is an engineering project that will take several months of
full time work.

From a philosophical perspective, there can be times when it makes sense to do something short-term to gain experience, but we try not to keep that sort of thing in for a whole release cycle, because then we have to be compatible with it forever.

Also, I know you're not saying it but the "I don't want to do the right thing, because it is too much work" sentiment grates against me: that's a perfect case for keeping a patch local and out of the llvm.org tree. Again, I know that this is not what you're trying to get at.

David wrote:

Again, we have many of the changes to make this possible. I hope to
send them for review as we upgrade to 3.1.

A vague promise to release some code that may or may not be useful is also not particularly useful.

I want clang to automatically create executables that use CUDA/OpenCL to
offload core computations (from plain C code). This should be
implemented in an external LLVM-IR optimization pass.

clang -Xclang -load -Xclang CUDAGenerator.so file.c -O3 -mllvm -offload-cuda

The very same should work for Pure, dragonegg and basically any compiler
based on LLVM. So I do not want to change clang at all (except for
possibly linking with -lcuda).

Ok, that *is* an interesting use case. It would be great for LLVM to support this kind of thing. We're clearly not set up for it out of the box right now.

In terms of the complexity. The only alternative proposal I have heard
of was making LLVM-IR multi module aware or adding multi-module support
to all LLVM-IR tools. Both of these changes are way more complex than
the codegen intrinsic. Actually, they are so complex that I doubt that
they can be implemented any time soon. What is the simpler approach you
are talking about?

I also don't like the intrinsic, but not because of security ;-). For me, it is because embedding arbitrary blobs of IR in an *instruction* doesn't make sense. The position of the instruction in the parent function doesn't necessarily have anything to do with the code attached, the intrinsic can be duplicated, deleted, moved around, etc. It is also poorly specified what is allowed and legal.

Unlike the related-but-different problem of "multi-versioning", it also doesn't make sense for PTX code to be functions in the same module as X86 IR functions. If your desire was for a module to have an SSE2, SSE3, and SSE4 version of the same function, then it *would* make sense for them to be in the same module... because there is linkage between them, and a runtime dispatcher. We don't have the infrastructure yet for per-function CPU flags, but this is something that we will almost certainly grow at some point (just need a clean design). This doesn't help you though. :slight_smile:

The design that makes sense to me for this is the multi-module approach. The PTX and X86 code *should* be in different LLVM Modules from each other. I agree that this makes a "vectorize host code to the GPU" optimization pass different than other existing passes, but I don't think that's a bad thing. Realistically, the driver compiler that this is embedded into (clang, dragonegg or whatever) will need to know about both targets to some extent to handle command line options for selecting PTX/GPU version, deciding where and how to output both chunks of code in the output file, etc.

Given that the driver has to have *some* knowledge of this anyway, it doesn't seem problematic for the second module to be passed into the pass constructor. Instead of something like:

  PM.add(new OffloadToCudaPass());

You end up doing:
  Module *CudaModule = new Module(...);
  PM.add(new OffloadToCudaPass(CudaModule));

This also means that the compiler driver is responsible for deciding what to do with the module after it is formed (and of course, it may be empty if nothing is offloaded). Based on the compiler it's embedded into, it may immediately JIT to PTX and upload to a GPU, it may write the IR out to a file, it may run the PTX code generator and output the PTX to another section of the executable, or whatever. I do agree that this makes it more awkward to work with "opt" on the command line, and that clang plugins are ideally suited for this, but opt is already suboptimal for a lot of things (e.g. anything that requires target info) and we should improve clang plugins, not work around their limitations IMO.
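
To make the ownership pattern above concrete, here is a minimal sketch that builds without LLVM. Everything in it is an assumption for illustration: ToyModule stands in for a module, ToyOffloadPass for the offload pass, driveOffload for the driver, and the "hot_" name prefix is an invented rule for choosing what to offload. It only shows that the driver creates the second module, the pass fills it, and the driver decides what happens afterwards.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for an IR module: a target name plus a list of function names.
// This is NOT llvm::Module; it only mimics the ownership pattern.
struct ToyModule {
  std::string Target;
  std::vector<std::string> Functions;
  bool empty() const { return Functions.empty(); }
};

// Hypothetical offload pass: the driver hands it the kernel module up front.
class ToyOffloadPass {
  ToyModule *Kernels; // owned by the driver, filled in by the pass

public:
  explicit ToyOffloadPass(ToyModule *KernelModule) : Kernels(KernelModule) {
    assert(KernelModule && "driver must supply a kernel module");
  }

  // Moves every function whose name starts with "hot_" (an invented rule)
  // into the kernel module and keeps the rest in the host module.
  // Returns true if anything was moved.
  bool run(ToyModule &Host) {
    std::vector<std::string> Kept;
    bool Changed = false;
    for (const std::string &F : Host.Functions) {
      if (F.rfind("hot_", 0) == 0) {
        Kernels->Functions.push_back(F);
        Changed = true;
      } else {
        Kept.push_back(F);
      }
    }
    Host.Functions = Kept;
    return Changed;
  }
};

// What a driver might do: run the pass, then decide based on the result.
inline std::string driveOffload(ToyModule &Host, ToyModule &Kernels) {
  ToyOffloadPass Pass(&Kernels);
  Pass.run(Host);
  if (Kernels.empty())
    return std::string("host-only");
  return "emit " + Kernels.Target;
}
```

In this sketch an empty kernel module simply means nothing was offloaded, and whether the driver then JITs, writes a file, or emits a second section is left entirely to the driver.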

What do you think?

-Chris

Hi guys,

I believe this can be a way worth going,
but I doubt now is the right moment for it. I don't share your opinion
that it is easy to move LLVM-IR in this direction, but I rather believe
that this is an engineering project that will take several months of
full time work.

From a philosophical perspective, there can be times when it makes sense to do something short-term to gain experience, but we try not to keep that sort of thing in for a whole release cycle, because then we have to be compatible with it forever.

Also, I know you're not saying it but the "I don't want to do the right thing, because it is too much work" sentiment grates against me: that's a perfect case for keeping a patch local and out of the llvm.org tree. Again, I know that this is not what you're trying to get at.

I was afraid it would sound like this. I previously explained why I disagree. Other people disagreed with my disagreement. :wink:

This discussion definitely helps to understand the different solutions.
The multi-module approach seems interesting, even though I am not yet convinced it is a better solution.

In terms of the complexity. The only alternative proposal I have heard
of was making LLVM-IR multi module aware or adding multi-module support
to all LLVM-IR tools. Both of these changes are way more complex than
the codegen intrinsic. Actually, they are so complex that I doubt that
they can be implemented any time soon. What is the simpler approach you
are talking about?

I also don't like the intrinsic, but not because of security ;-). For me, it is because embedding arbitrary blobs of IR in an *instruction* doesn't make sense. The position of the instruction in the parent function doesn't necessarily have anything to do with the code attached, the intrinsic can be duplicated, deleted, moved around, etc. It is also poorly specified what is allowed and legal.

The blobs are embedded as global unnamed const strings. The intrinsic
references them, if needed. This directly models how a simple OpenCL or
CUDA program would be written: the kernel code is stored as PTX code in
some globals, and functions like 'cuModuleLoadDataEx' are used to load
and compile such kernels at runtime.

The position of the instruction itself is defined by the context in
which it is used. I probably did not make it clear beforehand, but we
plan to replace a computation kernel by a heterogeneous mix of host
LLVM-IR and kernel calls. Something like this:

for (i ...) {
    for (j ...)
        if (..)
            schedule_cuda(llvm.codegen("kernel", "ptx32"))
        else if (..)
            schedule_cuda(llvm.codegen("kernel", "ptx64"))
        else
            // Fallback CPU code
    for (...)
        if (..)
            schedule_cuda(llvm.codegen("kernel", "ptx32"))
        else if (..)
            schedule_cuda(llvm.codegen("kernel", "ptx64"))
        else
            // Fallback CPU code
}

This means we have host code that performs calculations that are not
offloaded and that schedules the different kernel executions. The host
code (or a runtime library) will also take care of deciding which GPU
type we target or if fallback CPU code is needed. In case we execute on
a GPU, the host code passes the PTX string to the CUDA runtime, the CUDA
runtime JIT compiles it, and the host code caches the result for future
use.
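
The host-side logic described above (pick a target, JIT once, cache, fall back to the CPU) could look roughly like the sketch below. It is only an illustration under stated assumptions: KernelCache, CompiledKernel and schedule are hypothetical names, the Jit callback stands in for the GPU runtime's JIT entry point, and no real CUDA or OpenCL call is made.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Stand-in for the handle a GPU runtime would return after JIT compilation.
struct CompiledKernel {
  std::string Target;
  std::string Binary;
};

// Caches compiled kernels so each (kernel, target) pair is compiled once.
class KernelCache {
  using Key = std::pair<std::string, std::string>;
  std::map<Key, CompiledKernel> Cache;
  // Hypothetical hook standing in for the runtime's JIT compiler.
  std::function<std::string(const std::string &)> Jit;

public:
  int Compilations = 0; // how often the JIT hook was actually invoked

  explicit KernelCache(std::function<std::string(const std::string &)> J)
      : Jit(std::move(J)) {
    assert(Jit && "a JIT callback is required");
  }

  // Returns the cached kernel, compiling the PTX text on first use only.
  const CompiledKernel &get(const std::string &Name, const std::string &Target,
                            const std::string &Ptx) {
    Key K{Name, Target};
    auto It = Cache.find(K);
    if (It == Cache.end()) {
      ++Compilations;
      It = Cache.emplace(K, CompiledKernel{Target, Jit(Ptx)}).first;
    }
    return It->second;
  }
};

// Host-side dispatch: use a kernel for the detected GPU, else run on the CPU.
inline std::string schedule(KernelCache &C, const std::string &Gpu) {
  if (Gpu == "ptx32" || Gpu == "ptx64")
    return "gpu:" + C.get("kernel", Gpu, "// PTX text for " + Gpu).Binary;
  return "cpu-fallback";
}
```

The PTX text passed to get() plays the role of the string that llvm.codegen() would produce; kernels that are never scheduled are never compiled.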

This means the llvm.codegen() intrinsic is directly used by the host
code, which compiles and schedules the kernels. It can therefore only
be moved around with the corresponding host code. Moving and modifying
the intrinsic with the host code seems to make sense. If e.g. a code
path is provably dead, we would automatically dead-code eliminate the
kernel code with the surrounding host code. The same holds for function
versioning. If the host code is duplicated, we want to also duplicate
the intrinsic such that the kernel code is referenced from two
positions. (The kernel code is still only stored once, but it is
referenced from two places). In general, what is allowed and legal
follows the definition of an LLVM-IR function call (which can be marked
readonly). We were aiming to not require any special handling of the
intrinsic here. What do you think is not specified precisely? Maybe it
can/could be fixed.

Unlike the related-but-different problem of "multi-versioning", it also doesn't make sense for PTX code to be functions in the same module as X86 IR functions. If your desire was for a module to have an SSE2, SSE3, and SSE4 version of the same function, then it *would* make sense for them to be in the same module... because there is linkage between them, and a runtime dispatcher. We don't have the infrastructure yet for per-function CPU flags, but this is something that we will almost certainly grow at some point (just need a clean design). This doesn't help you though. :slight_smile:

I was also reasoning about combining this with multi-versioning, but I agree multi-versioning is related-but-different.

The design that makes sense to me for this is the multi-module approach. The PTX and X86 code *should* be in different LLVM Modules from each other. I agree that this makes a "vectorize host code to the GPU" optimization pass different than other existing passes, but I don't think that's a bad thing. Realistically, the driver compiler that this is embedded into (clang, dragonegg or whatever) will need to know about both targets to some extent to handle command line options for selecting PTX/GPU version, deciding where and how to output both chunks of code in the output file, etc.

Given that the driver has to have *some* knowledge of this anyway, it doesn't seem problematic for the second module to be passed into the pass constructor. Instead of something like:

   PM.add(new OffloadToCudaPass());

You end up doing:
   Module *CudaModule = new Module(...);
   PM.add(new OffloadToCudaPass(CudaModule));

This also means that the compiler driver is responsible for deciding what to do with the module after it is formed (and of course, it may be empty if nothing is offloaded). Based on the compiler it's embedded into, it may immediately JIT to PTX and upload to a GPU, it may write the IR out to a file, it may run the PTX code generator and output the PTX to another section of the executable, or whatever. I do agree that this makes it more awkward to work with "opt" on the command line, and that clang plugins are ideally suited for this, but opt is already suboptimal for a lot of things (e.g. anything that requires target info) and we should improve clang plugins, not work around their limitations IMO.

What do you think?

It seems we agree that host and kernel code should be in different
modules. That is nice. Instead of embedding the kernel modules directly
into the host module, you propose to pass empty kernel modules to the
constructor of the CUDA offload pass and to extract the CUDA kernels
into those modules.

Your approach removes the need to add File I/O to the optimization pass.
This is a very positive point.

I am still unsure about the following questions:

  o Extracting multiple kernels

    A single computation normally schedules several kernels, both to
    specialize for different hardware and to calculate different parts
    of the problem. How would you model this? Returning a list of
    modules?

  o How to reference the kernel modules from host code

    We need a way to reference the kernel modules from the host code.
    Your proposal does not specify anything here. When the kernel code
    is directly embedded in the host IR, the function calls to the
    CUDA/OpenCL runtime can directly reference it (possibly through the
    llvm.codegen() intrinsic). Are you suggesting some intrinsics to
    reference the externally stored modules?

  o How much logic to put into the driver

    Our current idea was to put the entire logic of loading, compiling
    and running kernels into the host code. This enables us to change
    this code independently of the driver and to embed complex logic
    here. The only driver change would be to ask clang to add -lcuda or
    -lopencl. This could be done with a clang plugin.

    It seems you want to put more logic into the driver. Where would
    you e.g. implement code that caches the kernels and compiles them
    just in time (and only if they are actually executed)? Would this
    be part of the driver? How would you link this with fallback host
    code?

    If different optimizer projects implement different strategies
    here, are you proposing to commit all those to the clang driver or
    to extend clang plugins to handle this? What about non-clang
    plugins like dragonegg? Would the driver changes need to be ported
    to dragonegg, too?

Thanks again for your ideas
Tobi