Hi Justin,
Thanks very much for your comments.
2012/4/28 Justin Holewinski <justin.holewinski@gmail.com>
        The attached patch adds a new Intrinsic named "llvm.codegen" to
        support embedded LLVM IR code generation. The 'llvm.codegen'
        intrinsic uses the LLVM back ends to generate code for embedded
        LLVM IR strings. The code generation target can be the same as or
        different from that of the parent module.
        The original motivation inspiring us to add this intrinsic is
        to generate code for heterogeneous platforms. A test case in the
        patch demonstrates this. In the test case, on an X86 host, we use
        this intrinsic to transform embedded LLVM IR into a string of PTX
        assembly. We can then employ a PTX execution engine (on a
        CUDA-supported GPU) to execute the newly generated assembly and
        copy back the result later.
    I have to admit, I'm not sold on this solution.  First, there is no
    clear way to pass codegen flags to the back-end.  In PTX parlance,
    how would I embed an .ll file and compile to compute_13?
We can handle this by providing a new argument (e.g. a string describing
a properly configured target machine) instead of, or in addition to, the
Arch type string argument.
I think we can discuss in general what additional information the back ends need and provide it as parameters. We may want to do this on demand, once we agree on the general usefulness of this intrinsic.
    Second, this adds a layer of obfuscation to the system.  If I look
    at an .ll file, I expect to see all of the assembly in a reasonably
    clean syntax.  If the device code is squashed into a constant array,
    it is much harder to read.
I agree with Justin. The embedded code is not readable within the constant array. For debugging purposes having the embedded module in separate files is better. I believe we can achieve this easily by adding a pass that extracts the embedded LLVM-IR code into separate files.
    Is the motivation for the intrinsic simply to preserve the ability
    to pipe LLVM commands together on the command-line, e.g. opt | llc?
      I really feel that the cleaner solution is to split the IR into
    separate files, each of which can be processed independently after
    initial generation.
Yes, it is. Preserving this ability is the main benefit we get from
this intrinsic. It means we need not implement another compiler driver
or JIT tool for our specific purpose. I agree with you that embedded
LLVM IR harms the readability of the .ll file.
I would like to add that embedding the device IR into the host IR fits very well in the LLVM code generation chain. It obviously makes running 'opt | llc' possible, but it also enables us to write optimizations that yield embedded GPU code.
To write optimizations that yield embedded GPU code, we also looked into three other approaches:
1. Directly create embedded target code (e.g. PTX)
This would mean the optimization pass extracts device code internally and directly generates the relevant target code. This approach would require our generic optimization pass to be directly linked with the specific target back end. This is an ugly layering violation and, in addition, it causes major trouble if the new optimization is to be dynamically loaded.
2. Extend the LLVM-IR files to support heterogeneous modules
This would mean we extend LLVM-IR such that IR for different targets can be stored within a single IR file. This approach could be integrated nicely into the LLVM code generation flow and would yield readable LLVM-IR even for the device code. However, it adds another level of complexity to the LLVM-IR files and requires massive changes not only in the LLVM code base, but also in compilers built on top of LLVM-IR.
3. Generate two independent LLVM-IR files and pass them around together
The host and device LLVM-IR modules could be kept in separate files. This has the benefit of being user readable and not adding complexity to the LLVM-IR files themselves. However, separate files do not provide information about how those files are related. Which files are kernel files, how/where do they need to be loaded, ...? This information could probably be put into meta-data or hard coded into the generic compiler infrastructure, but this would require significant additional code.
Another weakness of this approach is that the entire LLVM optimization chain is currently built under the assumption that a single file/module is passed around. This is most obvious with the 'opt | llc' idiom, but in general every existing tool would need to be adapted to handle multiple files and would possibly even need semantic knowledge about how to connect/use them together. Just running clang or dragonegg with -load GPGPUOptimizer.so would not be possible.
All of the previous approaches require significant changes all over the code base and would cause trouble with loadable optimization passes. The intrinsic based approach seems to address most of the previous problems.
The intrinsic based approach requires few changes, all restricted to LLVM itself. In particular, it works without changes to the established LLVM optimization chain. 'opt | llc' will work out of the box, but, more importantly, any LLVM based compiler can directly load a GPGPUOptimizer.so file to gain a GPU based accelerator. Besides the need to load some runtime library, no additional knowledge needs to be embedded in individual compiler implementations; all the logic of GPGPU code generation can remain within a single LLVM optimization pass. Another nice feature of the intrinsic is that the relation between host and device code is explicitly encoded in the LLVM-IR (with the llvm.codegen function calls). There is no need to put this information into individual tools and/or to carry it through meta-data. Instead the precise semantics are directly available through LLVM-IR.
Justin: With your proposed two-file approach, what changes would be
needed to add e.g. GPGPU code generation support to clang/dragonegg or
haskell+LLVM? Can you see a way this can be done without large changes
to each of these users?
        The usage of this intrinsic is not limited to code generation
        for heterogeneous platforms. It can also help with many (run-time)
        optimization and security problems even when the code generation
        target is the same as that of the parent module.
    How does this help run-time optimization?
We implemented this intrinsic following the implementation style of
LLVM's garbage collector related intrinsics, which support various GC
strategies. It can help if the ASMGenerator in the patch is revised to
accept various optimization strategies provided by the user of this
intrinsic. Then the intrinsic will do what the user wants to the input
code string. When running the code with lli-like JIT tools, we can
choose one optimization strategy at run-time. Though we do not support
this currently, we try to make the design as general as we can. The
essential functionality of this intrinsic is that we get an input code
string, transform it into a new target-specific one, and then replace
the call to the intrinsic with it.
There may be uses like this, but I am not sure if the llvm.codegen() intrinsic is the best way to implement them. Even though we made it generic and it can possibly be used in other ways, I suggest we focus for now on the use for heterogeneous computing. This is where it is needed today and where we can easily check if it does what we need.
        Each call to the intrinsic has two arguments. One is the LLVM IR
        string. The other is the name of the target architecture. When
        running with tools like llc, lli, etc, this intrinsic transforms
        the input LLVM IR string  to a new string of assembly code for
        the target architecture firstly. Then the call to the intrinsic
        is replaced by a pointer to the newly generated string. After
        this, we have in our module
    Is the Arch parameter to llvm.codegen really needed?  Since codegen
    happens when lowering the intrinsic, the target architecture must be
    known.  But if the target architecture is known, then it should be
    available in the triple for the embedded module.
Yes. It is better that the target data is set correctly in the embedded
module. It is the user's responsibility to do this.
OK. Why don't we require the triple to be set and remove the arch parameter again?
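To illustrate why the arch parameter looks redundant, here is a minimal sketch (not part of the patch) of how the architecture could be recovered from the embedded IR string alone. The helper names extractTargetTriple and archFromTriple are hypothetical, and the naive text scan only stands in for what a real implementation would do, namely parse the module and query its triple.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: scan textual LLVM IR for a line that starts with
//   target triple = "..."
// and return the quoted value. Returns an empty string if no such line
// exists. This is a naive text scan, not a real IR parser.
std::string extractTargetTriple(const std::string &ir) {
  const std::string key = "target triple";
  std::size_t pos = 0;
  while ((pos = ir.find(key, pos)) != std::string::npos) {
    // Only accept the directive at the start of a line.
    const bool atLineStart = (pos == 0) || (ir[pos - 1] == '\n');
    const std::size_t eol = ir.find('\n', pos);
    const std::size_t open = ir.find('"', pos + key.size());
    if (atLineStart && open != std::string::npos &&
        (eol == std::string::npos || open < eol)) {
      const std::size_t close = ir.find('"', open + 1);
      if (close != std::string::npos &&
          (eol == std::string::npos || close < eol))
        return ir.substr(open + 1, close - open - 1);
    }
    pos += key.size();
  }
  return std::string();
}

// The architecture is the first dash-separated component of the triple.
// If the triple contains no dash, the whole string is returned.
std::string archFromTriple(const std::string &triple) {
  return triple.substr(0, triple.find('-'));
}
```

If the embedded module carries a triple, the lowering code can derive the back end from it; if the helper returns an empty string, the intrinsic could simply be rejected as malformed. The triple used when trying this out is up to the user and is only an example here.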
Tobi