[PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation

Hi LLVMers,

The attached patch adds a new intrinsic named “llvm.codegen” to support embedded LLVM IR code generation. The ‘llvm.codegen’ intrinsic uses the LLVM back ends to generate code for embedded LLVM IR strings. The code generation target can be the same as or different from that of the parent module.

The original motivation for adding this intrinsic is to generate code for heterogeneous platforms. A test case in the patch demonstrates this: on an X86 host, we use the intrinsic to transform embedded LLVM IR into a string of PTX assembly. We can then employ a PTX execution engine (on a CUDA-capable GPU) to execute the newly generated assembly and copy back the result.

The usage of this intrinsic is not limited to code generation for heterogeneous platforms. It can also help with many (run-time) optimization and security problems, even when the code generation target is the same as that of the parent module.

Each call to the intrinsic has two arguments: one is the LLVM IR string, the other is the name of the target architecture. When run with tools like llc, lli, etc., the intrinsic first transforms the input LLVM IR string into a new string of assembly code for the target architecture; then the call to the intrinsic is replaced by a pointer to the newly generated string. After this, the module carries the generated assembly as a constant string.
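Roughly, the lowering does something like the following (a minimal C++ sketch written against today's LLVM API; the actual patch differs in detail, and generateAssembly() is an assumed stand-in for the back-end invocation, not the patch's real helper):

#include "llvm/Analysis/ValueTracking.h" // getConstantStringInfo
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include <string>

using namespace llvm;

// Assumed helper: runs the requested back end over the embedded IR.
std::string generateAssembly(StringRef DeviceIR, StringRef Arch);

void lowerCodegenCall(CallInst *CI) {
  StringRef DeviceIR, Arch;
  // Both operands are pointers to constant strings in the host module.
  if (!getConstantStringInfo(CI->getArgOperand(0), DeviceIR) ||
      !getConstantStringInfo(CI->getArgOperand(1), Arch))
    return; // not constant strings; leave the call alone

  // Generate target assembly for the embedded module.
  std::string Asm = generateAssembly(DeviceIR, Arch);

  // Materialize the result as a private global string and replace the
  // call with a pointer to it.
  IRBuilder<> B(CI);
  Value *AsmPtr = B.CreateGlobalStringPtr(Asm, "llvm.codegen.asm");
  CI->replaceAllUsesWith(AsmPtr);
  CI->eraseFromParent();
}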

We would like to get the community’s feedback on this so as to make sure this patch is as universally applicable as possible.

Thanks a lot!

best regards,

Yabin Hu

0001-Add-llvm.codegen-intrinsic.patch (19.2 KB)

Hi LLVMers,

The attached patch adds a new intrinsic named "llvm.codegen" to support
embedded LLVM IR code generation. The 'llvm.codegen' intrinsic uses
the LLVM back ends to generate code for embedded LLVM IR strings. The code
generation target can be the same as or different from that of the parent
module.

The original motivation for adding this intrinsic is to generate code
for heterogeneous platforms. A test case in the patch demonstrates this:
on an X86 host, we use the intrinsic to transform embedded LLVM IR into
a string of PTX assembly. We can then employ a PTX execution engine (on
a CUDA-capable GPU) to execute the newly generated assembly and copy
back the result.

I have to admit, I'm not sold on this solution. First, there is no clear
way to pass codegen flags to the back-end. In PTX parlance, how would I
embed an .ll file and compile to compute_13? Second, this adds a layer of
obfuscation to the system. If I look at an .ll file, I expect to see all
of the assembly in a reasonably clean syntax. If the device code is
squashed into a constant array, it is much harder to read.

Is the motivation for the intrinsic simply to preserve the ability to pipe
LLVM commands together on the command-line, e.g. opt | llc? I really feel
that the cleaner solution is to split the IR into separate files, each of
which can be processed independently after initial generation.

The usage of this intrinsic is not limited to code generation for
heterogeneous platforms. It can also help with many (run-time)
optimization and security problems, even when the code generation target
is the same as that of the parent module.

How does this help run-time optimization?

Each call to the intrinsic has two arguments: one is the LLVM IR string,
the other is the name of the target architecture. When run with tools
like llc, lli, etc., the intrinsic first transforms the input LLVM IR
string into a new string of assembly code for the target architecture;
then the call to the intrinsic is replaced by a pointer to the newly
generated string. After this, the module carries the generated assembly
as a constant string.

Is the Arch parameter to llvm.codegen really needed? Since codegen happens
when lowering the intrinsic, the target architecture must be known. But if
the target architecture is known, then it should be available in the triple
for the embedded module.

Hi Justin,

Thanks very much for your comments.

The attached patch adds a new intrinsic named "llvm.codegen" to support
embedded LLVM IR code generation. The 'llvm.codegen' intrinsic uses
the LLVM back ends to generate code for embedded LLVM IR strings. The code
generation target can be the same as or different from that of the parent
module.

The original motivation for adding this intrinsic is to generate code
for heterogeneous platforms. A test case in the patch demonstrates this:
on an X86 host, we use the intrinsic to transform embedded LLVM IR into
a string of PTX assembly. We can then employ a PTX execution engine (on
a CUDA-capable GPU) to execute the newly generated assembly and copy
back the result.

I have to admit, I'm not sold on this solution. First, there is no clear
way to pass codegen flags to the back-end. In PTX parlance, how would I
embed an .ll file and compile to compute_13?

We can handle this by providing a new argument (e.g. a string describing
a properly configured TargetMachine) instead of, or in addition to, the
Arch string argument.

Second, this adds a layer of obfuscation to the system. If I look at an
.ll file, I expect to see all of the assembly in a reasonably clean syntax.
If the device code is squashed into a constant array, it is much harder to
read.

Is the motivation for the intrinsic simply to preserve the ability to
pipe LLVM commands together on the command-line, e.g. opt | llc? I really
feel that the cleaner solution is to split the IR into separate files,
each of which can be processed independently after initial generation.

Yes, it is. Preserving that ability is the main benefit we get from this
intrinsic: it means we needn't implement another compiler driver or JIT
tool for our specific purpose. I agree with you that embedded LLVM IR
harms the readability of the .ll file.

The usage of this intrinsic is not limited to code generation for
heterogeneous platforms. It can also help with many (run-time)
optimization and security problems, even when the code generation target
is the same as that of the parent module.

How does this help run-time optimization?

We implemented this intrinsic following the implementation style of
LLVM's garbage-collector-related intrinsics, which support various GC
strategies. It can help if the ASMGenerator in the patch is revised to
accept various optimization strategies provided by the user of the
intrinsic; the intrinsic will then do what the user wants to the input
code string. When running the code with JIT tools like lli, we can choose
an optimization strategy at run time. Though we don't support this yet,
we have tried to make the design as general as we can. The essential
functionality of this intrinsic is that we take an input code string,
transform it into a new target-specific one, and replace the call to the
intrinsic.

Each call to the intrinsic has two arguments: one is the LLVM IR string,
the other is the name of the target architecture. When run with tools
like llc, lli, etc., the intrinsic first transforms the input LLVM IR
string into a new string of assembly code for the target architecture;
then the call to the intrinsic is replaced by a pointer to the newly
generated string. After this, the module carries the generated assembly
as a constant string.

Is the Arch parameter to llvm.codegen really needed? Since codegen
happens when lowering the intrinsic, the target architecture must be known.
But if the target architecture is known, then it should be available in
the triple for the embedded module.

Yes. It is better that the target data is set correctly in the embedded
module. It is the user's responsibility to do this.

Thanks again!

best regards,
Yabin

Hi Justin,

Thanks very much for your comments.

2012/4/28 Justin Holewinski <justin.holewinski@gmail.com>

        The attached patch adds a new intrinsic named "llvm.codegen" to
        support embedded LLVM IR code generation. The 'llvm.codegen'
        intrinsic uses the LLVM back ends to generate code for embedded
        LLVM IR strings. The code generation target can be the same as
        or different from that of the parent module.

        The original motivation for adding this intrinsic is to generate
        code for heterogeneous platforms. A test case in the patch
        demonstrates this: on an X86 host, we use the intrinsic to
        transform embedded LLVM IR into a string of PTX assembly. We can
        then employ a PTX execution engine (on a CUDA-capable GPU) to
        execute the newly generated assembly and copy back the result.

    I have to admit, I'm not sold on this solution. First, there is no
    clear way to pass codegen flags to the back-end. In PTX parlance,
    how would I embed an .ll file and compile to compute_13?

We can handle this by providing a new argument (e.g. a string describing
a properly configured TargetMachine) instead of, or in addition to, the
Arch string argument.

I think we should discuss, in general, the additional information needed by the back ends and provide it as parameters. We may want to do this on demand, once we agree on the general usefulness of this intrinsic.

    Second, this adds a layer of obfuscation to the system. If I look
    at an .ll file, I expect to see all of the assembly in a reasonably
    clean syntax. If the device code is squashed into a constant array,
    it is much harder to read.

I agree with Justin. The embedded code is not readable within the constant array. For debugging purposes having the embedded module in separate files is better. I believe we can achieve this easily by adding a pass that extracts the embedded LLVM-IR code into separate files.

    Is the motivation for the intrinsic simply to preserve the ability
    to pipe LLVM commands together on the command-line, e.g. opt | llc?
      I really feel that the cleaner solution is to split the IR into
    separate files, each of which can be processed independently after
    initial generation.

Yes, it is. Preserving that ability is the main benefit we get from this
intrinsic: it means we needn't implement another compiler driver or JIT
tool for our specific purpose. I agree with you that embedded LLVM IR
harms the readability of the .ll file.

I would like to add that embedding the device IR into the host IR fits very well in the LLVM code generation chain. It obviously makes running 'opt | llc' possible, but it also enables us to write optimizations that yield embedded GPU code.

To write optimizations that yield embedded GPU code, we also looked into three other approaches:

1. Directly create embedded target code (e.g. PTX)

This would mean the optimization pass extracts device code internally and directly generates the relevant target code. This approach would require our generic optimization pass to be linked directly against the specific target back end. That is an ugly layering violation and, in addition, it causes major trouble if the new optimization pass is to be dynamically loaded.

2. Extend the LLVM-IR files to support heterogeneous modules

This would mean we extend LLVM-IR such that IR for different targets
can be stored within a single IR file. This approach could be integrated nicely into the LLVM code generation flow and would yield readable LLVM-IR even for the device code. However, it adds another level of complexity to the LLVM-IR files and requires massive changes not only in the LLVM code base but also in compilers built on top of LLVM-IR.

3. Generate two independent LLVM-IR files and pass them around together

The host and device LLVM-IR modules could be kept in separate files. This has the benefit of being user readable and of not adding additional complexity to the LLVM-IR files themselves. However, separate files do not provide information about how those files are related. Which files are kernel files? How and where do they need to be loaded? This information could probably be put into metadata or hard-coded into the generic compiler infrastructure, but that would require significant additional code.
Another weakness of this approach is that the entire LLVM optimization chain is currently built under the assumption that a single file/module is passed around. This is most obvious with the 'opt | llc' idiom, but in general every tool that currently exists would need to be adapted to handle multiple files and would possibly even need semantic knowledge about how to connect/use them together. Just running clang or dragonegg with -load GPGPUOptimizer.so would not be possible.

All of the previous approaches require significant changes all over the code base and would cause trouble with loadable optimization passes. The intrinsic based approach seems to address most of the previous problems.

The intrinsic-based approach requires few changes, restricted to LLVM itself. It especially works without changes to the established LLVM optimization chain: 'opt | llc' will work out of the box, but, more importantly, any LLVM-based compiler can directly load a GPGPUOptimizer.so file to gain a GPU-based accelerator. Besides the need to load some runtime library, no additional knowledge needs to be embedded in individual compiler implementations; all the logic of GPGPU code generation can remain within a single LLVM optimization pass. Another nice feature of the intrinsic is that the relation between host and device code is explicitly encoded in the LLVM-IR (via the llvm.codegen calls). There is no need to put this information into individual tools or to carry it through metadata; the precise semantics are directly available in the LLVM-IR.

Justin: With your proposed two-file approach, what changes would be needed to add e.g. GPGPU code generation support to clang/dragonegg or
haskell+LLVM? Can you see a way this could be done without large changes
to each of these users?

        The usage of this intrinsic is not limited to code generation
        for heterogeneous platforms. It can also help with many
        (run-time) optimization and security problems, even when the
        code generation target is the same as that of the parent module.

    How does this help run-time optimization?

We implemented this intrinsic following the implementation style of
LLVM's garbage-collector-related intrinsics, which support various GC
strategies. It can help if the ASMGenerator in the patch is revised to
accept various optimization strategies provided by the user of the
intrinsic; the intrinsic will then do what the user wants to the input
code string. When running the code with JIT tools like lli, we can
choose an optimization strategy at run time. Though we don't support
this yet, we have tried to make the design as general as we can. The
essential functionality of this intrinsic is that we take an input code
string, transform it into a new target-specific one, and replace the
call to the intrinsic.

There may be uses like this, but I am not sure the llvm.codegen() intrinsic is the best way to implement them. Even though we made it generic and it can possibly be used in other ways, I suggest we focus for now on the use for heterogeneous computing. This is where it is needed today and where we can easily check whether it does what we need.

        Each call to the intrinsic has two arguments: one is the LLVM IR
        string, the other is the name of the target architecture. When
        run with tools like llc, lli, etc., the intrinsic first
        transforms the input LLVM IR string into a new string of
        assembly code for the target architecture; then the call to the
        intrinsic is replaced by a pointer to the newly generated
        string. After this, the module carries the generated assembly as
        a constant string.

    Is the Arch parameter to llvm.codegen really needed? Since codegen
    happens when lowering the intrinsic, the target architecture must be
    known. But if the target architecture is known, then it should be
    available in the triple for the embedded module.

Yes. It is better that the target data is set correctly in the embedded
module. It is the user's responsibility to do this.

OK. Why don't we require the triple to be set and remove the arch parameter again?

Tobi

Hi Tobi,

2012/4/28 Tobias Grosser <tobias@grosser.es>

Each call to the intrinsic has two arguments: one is the LLVM IR string,
the other is the name of the target architecture. When run with tools
like llc, lli, etc., the intrinsic first transforms the input LLVM IR
string into a new string of assembly code for the target architecture;
then the call to the intrinsic is replaced by a pointer to the newly
generated string. After this, the module carries the generated assembly
as a constant string.

Is the Arch parameter to llvm.codegen really needed? Since codegen
happens when lowering the intrinsic, the target architecture must be
known. But if the target architecture is known, then it should be
available in the triple for the embedded module.

Yes. It is better that the target data is set correctly in the embedded
module. It is the user’s responsibility to do this.

OK. Why don’t we require the triple to be set and remove the arch parameter again?

I am afraid I didn’t make this clear in the previous email, and I am sorry I didn’t get your point when you raised it before.

There are two approaches to dealing with the triple of the embedded module.

  1. The embedded LLVM IR string contains a relatively complete module, in which the target triple is properly set. This means that when a user of the intrinsic generates the embedded LLVM IR string, he must add not only the function definitions but also the target triple information. When the intrinsic extracts the string into a module, we check whether the triple is empty; if it is, we return immediately or report an error. In this case, we don't need the arch parameter.

  2. There is no triple information in the embedded LLVM IR string; we get it from the arch parameter.

With the 1st approach, we avoid some code for getting the arch string from an llvm::Value and generating the triple from it. It requires fewer changes to LLVM than the 2nd approach, so it is probably better. We should add some words to the documentation telling the user to set the target triple properly in the embedded LLVM IR string.
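As a sketch, the check for the 1st approach could look like this (spelled with today's parseIR() entry point; the 3.1-era equivalent was ParseIR, and error reporting is elided):

#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/SourceMgr.h"
#include <memory>

using namespace llvm;

// Parse the embedded IR string and insist on a non-empty target triple.
std::unique_ptr<Module> parseEmbeddedModule(StringRef DeviceIR,
                                            LLVMContext &Ctx) {
  SMDiagnostic Err;
  std::unique_ptr<Module> M =
      parseIR(MemoryBufferRef(DeviceIR, "llvm.codegen"), Err, Ctx);
  if (!M)
    return nullptr; // malformed embedded IR
  if (M->getTargetTriple().empty())
    return nullptr; // 1st approach: the user must have set the triple
  return M;
}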

best regards,
Yabin

Hi Tobi,

2012/4/28 Tobias Grosser <tobias@grosser.es>

                Each call to the intrinsic has two arguments: one is
                the LLVM IR string, the other is the name of the target
                architecture. When run with tools like llc, lli, etc.,
                the intrinsic first transforms the input LLVM IR string
                into a new string of assembly code for the target
                architecture; then the call to the intrinsic is replaced
                by a pointer to the newly generated string. After this,
                the module carries the generated assembly as a constant
                string.

            Is the Arch parameter to llvm.codegen really needed? Since
            codegen happens when lowering the intrinsic, the target
            architecture must be known. But if the target architecture
            is known, then it should be available in the triple for the
            embedded module.

        Yes. It is better that the target data is set correctly in the
        embedded module. It is the user's responsibility to do this.

    OK. Why don't we require the triple to be set and remove the arch
    parameter again?

I am afraid I didn't make this clear in the previous email, and I am
sorry I didn't get your point when you raised it before.

;-) I did not point it out quite like this, but Justin's argument seems correct: requiring a triple removes the need for the 'arch' flag. And,
as requiring a triple makes sense, we can and probably should take the chance to simplify the interface.

There are two approaches to dealing with the triple of the embedded module.
1. The embedded LLVM IR string contains a relatively *complete* module,
in which the target triple is properly set. This means that when a user
of the intrinsic generates the embedded LLVM IR string, he must add not
only the function definitions but also the target triple information.
When the intrinsic extracts the string into a module, we check whether
the triple is empty; if it is, we return immediately or report an error.
In this case, we don't need the arch parameter.

2. There is no triple information in the embedded LLVM IR string; we get
it from the arch parameter.

With the 1st approach, we avoid some code for getting the arch string
from an llvm::Value and generating the triple from it. It requires fewer
changes to LLVM than the 2nd approach, so it is probably better. We
should add some words to the documentation telling the user to set the
target triple properly in the embedded LLVM IR string.

Yes. I think the 1st approach is a good one.

Tobi

Hi Justin,

Thanks very much for your comments.

2012/4/28 Justin Holewinski <justin.holewinski@gmail.com>

       The attached patch adds a new intrinsic named "llvm.codegen" to
       support embedded LLVM IR code generation. The 'llvm.codegen'
       intrinsic uses the LLVM back ends to generate code for embedded
       LLVM IR strings. The code generation target can be the same as or
       different from that of the parent module.

       The original motivation for adding this intrinsic is to generate
       code for heterogeneous platforms. A test case in the patch
       demonstrates this: on an X86 host, we use the intrinsic to
       transform embedded LLVM IR into a string of PTX assembly. We can
       then employ a PTX execution engine (on a CUDA-capable GPU) to
       execute the newly generated assembly and copy back the result.

   I have to admit, I'm not sold on this solution. First, there is no
   clear way to pass codegen flags to the back-end. In PTX parlance,
   how would I embed an .ll file and compile to compute_13?

We can handle this by providing a new argument (e.g. a string describing
a properly configured TargetMachine) instead of, or in addition to, the
Arch string argument.

I think we should discuss, in general, the additional information needed
by the back ends and provide it as parameters. We may want to do this on
demand, once we agree on the general usefulness of this intrinsic.

Any solution would need to be able to handle Feature flags (e.g.
-mattr=+sm_20), as well as generic llc options (e.g. -regalloc=greedy).
What happens when the options conflict with the original options passed to
llc? The CodeGenIntrinsic pass would need to emulate all (most?) of llc,
but in a way that doesn't interfere with llc's global state.
Unfortunately, parameters like "regalloc=" are globals. To do this
without massive LLVM changes, you may need to spawn another instance of llc
as a separate process.

    Second, this adds a layer of obfuscation to the system. If I look
    at an .ll file, I expect to see all of the assembly in a reasonably
    clean syntax. If the device code is squashed into a constant array,
    it is much harder to read.

I agree with Justin. The embedded code is not readable within the constant
array. For debugging purposes having the embedded module in separate files
is better. I believe we can achieve this easily by adding a pass that
extracts the embedded LLVM-IR code into separate files.

    Is the motivation for the intrinsic simply to preserve the ability
    to pipe LLVM commands together on the command-line, e.g. opt | llc?
    I really feel that the cleaner solution is to split the IR into
    separate files, each of which can be processed independently after
    initial generation.

Yes, it is. Preserving that ability is the main benefit we get from this
intrinsic: it means we needn't implement another compiler driver or JIT
tool for our specific purpose. I agree with you that embedded LLVM IR
harms the readability of the .ll file.

I would like to add that embedding the device IR into the host IR fits
very well in the LLVM code generation chain. It obviously makes running
'opt | llc' possible, but it also enables us to write optimizations that
yield embedded GPU code.

To write optimizations that yield embedded GPU code, we also looked into
three other approaches:

1. Directly create embedded target code (e.g. PTX)

This would mean the optimization pass extracts device code internally and
directly generates the relevant target code. This approach would require
our generic optimization pass to be linked directly against the specific
target back end. That is an ugly layering violation and, in addition, it
causes major trouble if the new optimization pass is to be dynamically
loaded.

I agree that this isn't desirable. The optimizer should never have to
generate device code.

2. Extend the LLVM-IR files to support heterogeneous modules

This would mean we extend LLVM-IR such that IR for different targets
can be stored within a single IR file. This approach could be integrated
nicely into the LLVM code generation flow and would yield readable
LLVM-IR even for the device code. However, it adds another level of
complexity to the LLVM-IR files and requires massive changes not only in
the LLVM code base but also in compilers built on top of LLVM-IR.

3. Generate two independent LLVM-IR files and pass them around together

The host and device LLVM-IR modules could be kept in separate files. This
has the benefit of being user readable and of not adding additional
complexity to the LLVM-IR files themselves. However, separate files do
not provide information about how those files are related. Which files
are kernel files? How and where do they need to be loaded? This
information could probably be put into metadata or hard-coded into the
generic compiler infrastructure, but that would require significant
additional code.
Another weakness of this approach is that the entire LLVM optimization
chain is currently built under the assumption that a single file/module
is passed around. This is most obvious with the 'opt | llc' idiom, but in
general every tool that currently exists would need to be adapted to
handle multiple files and would possibly even need semantic knowledge
about how to connect/use them together. Just running clang or dragonegg
with -load GPGPUOptimizer.so would not be possible.

All of the previous approaches require significant changes all over the
code base and would cause trouble with loadable optimization passes. The
intrinsic based approach seems to address most of the previous problems.

The intrinsic-based approach requires few changes, restricted to LLVM
itself. It especially works without changes to the established LLVM
optimization chain: 'opt | llc' will work out of the box, but, more
importantly, any LLVM-based compiler can directly load a GPGPUOptimizer.so
file to gain a GPU-based accelerator. Besides the need to load some
runtime library, no additional knowledge needs to be embedded in
individual compiler implementations; all the logic of GPGPU code
generation can remain within a single LLVM optimization pass. Another
nice feature of the intrinsic is that the relation between host and
device code is explicitly encoded in the LLVM-IR (via the llvm.codegen
calls). There is no need to put this information into individual tools
or to carry it through metadata; the precise semantics are directly
available in the LLVM-IR.

I just worry about the scalability of this approach. Once you embed the
IR, no optimizer can touch it, so this potentially creates problems with
pass scheduling. When you generate the IR, you want it to be fully
optimized before embedding. Or, you could invoke opt+llc when lowering the
llvm.codegen intrinsic.

Justin: With your proposed two-file approach, what changes would be
needed to add e.g. GPGPU code generation support to clang/dragonegg or
haskell+LLVM? Can you see a way this could be done without large changes
to each of these users?

To be fair, I'm not necessarily advocating the two-file approach. It has
its shortcomings, too. But this is in some sense the crux of the problem.
The intrinsic approach is clearly the path of least resistance, especially
in the case of the GSoC project. However, I think a more long-term
solution involves looking at this problem from the IR level. The current
LLVM approach is "one arch in, one arch out". As far as I know, even ARM
needs separate modules for ARM vs. Thumb (please correct me if I'm
mistaken). Whether the tools are extended to support multiple outputs with
some linking information or the IR is extended to support something like
per-function target triples, that is a decision that would need to be
addressed by the entire LLVM community.

        We can handle this by providing a new argument (e.g. a string
        describing a properly configured TargetMachine) instead of, or
        in addition to, the Arch string argument.

    I think we should discuss, in general, the additional information
    needed by the back ends and provide it as parameters. We may want to
    do this on demand, once we agree on the general usefulness of this
    intrinsic.

Any solution would need to be able to handle Feature flags (e.g.
-mattr=+sm_20), as well as generic llc options (e.g. -regalloc=greedy).
  What happens when the options conflict with the original options
passed to llc? The CodeGenIntrinsic pass would need to emulate all
(most?) of llc, but in a way that doesn't interfere with llc's global
state. Unfortunately, parameters like "regalloc=" are globals. To do
this without massive LLVM changes, you may need to spawn another
instance of llc as a separate process.

I think feature flags should not be a problem. The function createTargetMachine() takes a feature string. We can get this string as a parameter of the intrinsic and use it to parametrize the target machine. If needed, we can also add parameters to define the relocation model, mcpu, the code model, the optimization level or the target options. All those parameters are not influenced by the command line options of the llc invocation and will, for now, be set to default values for the embedded code generation.
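A sketch of that parametrization (registry spellings as in current LLVM; the era's equivalents were close but not identical, and the CPU/feature values shown are only examples):

#include "llvm/MC/TargetRegistry.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Target/TargetOptions.h"
#include <optional>
#include <string>

using namespace llvm;

// CPU and Features would come straight from the new intrinsic arguments,
// e.g. CPU = "sm_20", Features = "+sm_20" for the PTX back end.
TargetMachine *createDeviceTargetMachine(StringRef TripleStr, StringRef CPU,
                                         StringRef Features) {
  std::string Error;
  const Target *T = TargetRegistry::lookupTarget(TripleStr.str(), Error);
  if (!T)
    return nullptr; // unknown or unregistered target

  // Everything the intrinsic does not parametrize keeps its default:
  // target options, relocation model, code model, optimization level.
  return T->createTargetMachine(TripleStr.str(), CPU, Features,
                                TargetOptions(), std::nullopt);
}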

We should probably add the most important options now and add others on demand. Which are the options you suggest to be added initially? I suppose we need 1) the feature string and 2) mcpu. Is there anything else you would suggest?

regalloc= is different. It is global and consequently influences both host and device code generation. However, to me it is rather a debugging option. It is never set by clang and targets provide a reasonable default based on the optimization level. I believe we can
assume that for our use case it is not set. In case it is really necessary to explicitly set the register allocator, the right solution would be to make regalloc a target option.

    The intrinsic based approach requires little changes restricted to
    LLVM itself. It especially works without changes to the established
    LLVM optimization chain. 'opt | llc' will work out of the box, but,
    more importantly, any LLVM based compiler can directly load a
    GPGPUOptimzer.so file to gain a GPU based accelerator. Besides the
    need to load some runtime library, no additional knowledge needs to
    be embedded in individual compiler implementations, but all the
    logic of GPGPU code generation can remain within a single LLVM
    optimization pass. Another nice feature of the intrinsic is that the
    relation between host and device code is explicitly encoded in the
    LLVM-IR (with the llvm.codegen function calls). There is no need to
    put this information into individual tools and/or to carry it
    through meta-data. Instead the precise semantics are directly
    available through LLVM-IR.

I just worry about the scalability of this approach. Once you embed the
IR, no optimizer can touch it, so this potentially creates problems with
pass scheduling. When you generate the IR, you want it to be fully
optimized before embedding. Or, you could invoke opt+llc when lowering
the llvm.codegen intrinsic.

Where do you see scalability problems?

I agree that the llvm.codegen intrinsic is limited to plain code generation, meaning it is an embedded llc. I do not expect any part of LLVM to be extended to reason about optimizing the embedded IR; the optimization pass that created the intrinsic is in charge of optimizing the embedded IR as needed. However, this is not a big problem: a generic LLVM-IR optimization pass can schedule the required optimizations as needed, as the sketch below illustrates.
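For instance, the creating pass could run a standard pipeline over the device module before stringifying it; a sketch using the legacy PassManagerBuilder that matches the era of this thread (newer LLVM would use PassBuilder instead):

#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"

using namespace llvm;

// Run a default -O<N>-style pipeline over the device module before it
// is serialized into the llvm.codegen string operand.
void optimizeEmbeddedModule(Module &DeviceM, unsigned OptLevel) {
  PassManagerBuilder PMB;
  PMB.OptLevel = OptLevel; // e.g. 3 for -O3

  legacy::PassManager MPM;
  PMB.populateModulePassManager(MPM);
  MPM.run(DeviceM);
}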

    Justin: With your proposed two-file approach, what changes would be
    needed to add e.g. GPGPU code generation support to clang/dragonegg
    or haskell+LLVM? Can you see a way this could be done without large
    changes to each of these users?

To be fair, I'm not necessarily advocating the two-file approach. It
has its shortcomings, too. But this is in some sense the crux of the
problem. The intrinsic approach is clearly the path of least
resistance, especially in the case of the GSoC project. However, I
think a more long-term solution involves looking at this problem from
the IR level. The current LLVM approach is "one arch in, one arch out".
  As far as I know, even ARM needs separate modules for ARM vs. Thumb
(please correct me if I'm mistaken). Whether the tools are extended to
support multiple outputs with some linking information or the IR is
extended to support something like per-function target triples, that is
a decision that would need to be addressed by the entire LLVM community.

I agree that future work can be useful here. However, before spending a large amount of time engineering a complex solution, I propose starting with the proposed lightweight approach. It is sufficient for our needs and will give us the experience and infrastructure that can help us choose and implement a more complex solution later on.

Tobi

       We can handle this by providing a new argument (e.g. a string
       describing a properly configured TargetMachine) instead of, or in
       addition to, the Arch string argument.

   I think we should discuss, in general, the additional information
   needed by the back ends and provide it as parameters. We may want to
   do this on demand, once we agree on the general usefulness of this
   intrinsic.

Any solution would need to be able to handle Feature flags (e.g.
-mattr=+sm_20), as well as generic llc options (e.g. -regalloc=greedy).
What happens when the options conflict with the original options
passed to llc? The CodeGenIntrinsic pass would need to emulate all
(most?) of llc, but in a way that doesn't interfere with llc's global
state. Unfortunately, parameters like "regalloc=" are globals. To do
this without massive LLVM changes, you may need to spawn another
instance of llc as a separate process.

I think feature flags should not be a problem. The function
createTargetMachine() takes a feature string. We can get this string as a
parameter of the intrinsic and use it to parametrize the target machine. If
needed, we can also add parameters to define the relocation model, mcpu,
the code model, the optimization level or the target options. All those
parameters are not influenced by the command line options of the llc
invocation and will, for now, be set to default values for the embedded
code generation.

We should probably add the most important options now and add others on
demand. Which are the options you suggest to be added initially? I suppose
we need 1) the feature string and 2) mcpu. Is there anything else you would
suggest?

regalloc= is different. It is global and consequently influences both host
and device code generation. However, to me it is rather a debugging option.
It is never set by clang and targets provide a reasonable default based on
the optimization level. I believe we can
assume that for our use case it is not set. In case it is really necessary
to explicitly set the register allocator, the right solution would be to
make regalloc a target option.

The regalloc= option was just an example of the types of flags that can be
passed to llc, which are handled as global options instead of target
options.

    The intrinsic-based approach requires few changes, restricted to
    LLVM itself. It especially works without changes to the established
    LLVM optimization chain: 'opt | llc' will work out of the box, but,
    more importantly, any LLVM-based compiler can directly load a
    GPGPUOptimizer.so file to gain a GPU-based accelerator. Besides the
    need to load some runtime library, no additional knowledge needs to
    be embedded in individual compiler implementations; all the logic of
    GPGPU code generation can remain within a single LLVM optimization
    pass. Another nice feature of the intrinsic is that the relation
    between host and device code is explicitly encoded in the LLVM-IR
    (via the llvm.codegen calls). There is no need to put this
    information into individual tools or to carry it through metadata;
    the precise semantics are directly available in the LLVM-IR.

I just worry about the scalability of this approach. Once you embed the
IR, no optimizer can touch it, so this potentially creates problems with
pass scheduling. When you generate the IR, you want it to be fully
optimized before embedding. Or, you could invoke opt+llc when lowering
the llvm.codegen intrinsic.

Where do you see scalability problems?

I agree that the llvm.codegen intrinsic is limited to plain code
generation, meaning it is an embedded llc. I do not expect any part of
LLVM to be extended to reason about optimizing the embedded IR; the
optimization pass that created the intrinsic is in charge of optimizing
the embedded IR as needed. However, this is not a big problem: a generic
LLVM-IR optimization pass can schedule the required optimizations as
needed.

The implicit assumption seems to be that the host code wants the device
code as assembly text. What happens when you need to link the device
binary and upload it separately? Think automatic SPU codegen on Cell. Is
it up to the host program to invoke the other target's linker?

    Justin: With your proposed two-file approach, what changes would be
    needed to add e.g. GPGPU code generation support to clang/dragonegg
    or haskell+LLVM? Can you see a way this could be done without large
    changes to each of these users?

To be fair, I'm not necessarily advocating the two-file approach. It
has its shortcomings, too. But this is in some sense the crux of the
problem. The intrinsic approach is clearly the path of least
resistance, especially in the case of the GSoC project. However, I
think a more long-term solution involves looking at this problem from
the IR level. The current LLVM approach is "one arch in, one arch out".
As far as I know, even ARM needs separate modules for ARM vs. Thumb
(please correct me if I'm mistaken). Whether the tools are extended to
support multiple outputs with some linking information or the IR is
extended to support something like per-function target triples, that is
a decision that would need to be addressed by the entire LLVM community.

I agree that future work can be useful here. However, before spending a
large amount of time engineering a complex solution, I propose starting
with the proposed lightweight approach. It is sufficient for our needs
and will give us the experience and infrastructure that can help us
choose and implement a more complex solution later on.

I agree that this approach is the best way to get short-term results,
especially for the GSoC project.

    regalloc= is different. It is global and consequently influences
    both host and device code generation. However, to me it is rather a
    debugging option. It is never set by clang and targets provide a
    reasonable default based on the optimization level. I believe we can
    assume that for our use case it is not set. In case it is really
    necessary to explicitly set the register allocator, the right
    solution would be to make regalloc a target option.

The regalloc= option was just an example of the types of flags that can
be passed to llc, which are handled as global options instead of target
options.

Yes, thanks for pointing us to this problem. For now I think we can ignore them, as they are mostly debugging options, and they can be included in the target options if needed.

The implicit assumption seems to be that the host code wants the device
code as assembly text. What happens when you need to link the device
binary and upload it separately? Think automatic SPU codegen on Cell.
  Is it up to the host program to invoke the other target's linker?

OK, I get what you mean. The intrinsic is currently targeted at the OpenCL/CUDA model, which is the most widely used. Stuff like Cell sounds interesting, but probably needs further thought. Even with OpenCL/CUDA,
this intrinsic currently works only for PTX code generation, but I hope we can gain support for other GPU devices later on.

    I agree that future work can be useful here. However, before
    spending a large amount of time engineering a complex solution, I
    propose starting with the proposed lightweight approach. It is
    sufficient for our needs and will give us the experience and
    infrastructure that can help us choose and implement a more complex
    solution later on.

I agree that this approach is the best way to get short-term results,
especially for the GSoC project.

OK, let's go ahead.

Yabin, can you update the patch with the following changes:

- Remove the Arch flag
- Document that we require a triple
- Add two new arguments that take a feature string and a mcpu
   flag (can be set to "", which means we use the default)

Cheers
Tobi

Hi,
On 2012-4-29, at 9:37 PM, Tobias Grosser wrote:

OK, I get what you mean. The intrinsic is currently targeted at the
OpenCL/CUDA model, which is the most widely used. Stuff like Cell sounds
interesting, but probably needs further thought. Even with OpenCL/CUDA,
this intrinsic currently works only for PTX code generation, but I hope
we can gain support for other GPU devices later on.

    I agree that future work can be useful here. However, before
    spending a large amount of time engineering a complex solution, I
    propose starting with the proposed lightweight approach. It is
    sufficient for our needs and will give us the experience and
    infrastructure that can help us choose and implement a more complex
    solution later on.

I agree that this approach is the best way to get short-term results,
especially for the GSoC project.

OK, let's go ahead.

Yabin, can you update the patch with the following changes:

- Remove the Arch flag
- Document that we require a triple
- Add two new arguments that take a feature string and a mcpu
  flag (can be set to "", which means we use the default)

Wait. I don't think there is enough justification for this to move forward, apart from the technical issues that have already been raised. I can also see that this introduces a safety issue, since the embedded IR code is not checked/verified at compile time. Unless Chris says otherwise, I don't see this patch being accepted on trunk.

Evan

OK, I get what you mean. The intrinsic is currently targeted at the
OpenCL/CUDA model, which is the most widely used. Stuff like Cell sounds
interesting, but probably needs further thought. Even with OpenCL/CUDA,
this intrinsic currently works only for PTX code generation, but I hope
we can gain support for other GPU devices later on.

    I agree that future work can be useful here. However, before
    spending a large amount of time engineering a complex solution, I
    propose starting with the proposed lightweight approach. It is
    sufficient for our needs and will give us the experience and
    infrastructure that can help us choose and implement a more complex
    solution later on.

I agree that this approach is the best way to get short-term results,
especially for the GSoC project.

OK, let's go ahead.

Yabin, can you update the patch with the following changes:

- Remove the Arch flag
- Document that we require a triple
- Add two new arguments that take a feature string and a mcpu
   flag (can be set to "", which means we use the default)

Wait. I don't think there is enough justification for this to move forward, apart from the technical issues that have already been raised. I can also see that this introduces a safety issue, since the embedded IR code is not checked/verified at compile time. Unless Chris says otherwise, I don't see this patch being accepted on trunk.

Hi Evan,

sure, this patch needs further discussion and an OK from some kind of global maintainer. Neither I nor Justin can give permission for a change in this area. I mainly asked Yabin to provide a version that addresses the concerns raised by Justin, to give further reviewers an up-to-date version to look at.

With your comment, you actually pointed out a bug.

Instead of:
Target->addPassesToEmitFile(PM, FOS,
                             TargetMachine::CGFT_AssemblyFile,
                             CodeGenOpt::Default)

We should use:
bool DisableVerify = false; // i.e., run the IR verifier
Target->addPassesToEmitFile(PM, FOS,
                             TargetMachine::CGFT_AssemblyFile,
                             CodeGenOpt::Default, DisableVerify)

Though, I don't think that is the problem you were talking about. Could you explain exactly what security issues you see? The embedded LLVM-IR is checked by the IR verifier just as the host IR is. Target code is generated, and consequently verified, for both LLVM-IR modules at the same time. The embedded IR is _not_ compiled later than the host IR. What did I miss?

Cheers

Hi all,

I revised the llvm.codegen intrinsic according to all your comments.
Further discussion and review are welcome.
The main changes are listed here.

  1. Remove the arch parameter.
  2. Add mcpu and features parameters (both are put to use in the sketch below).
  3. Document that users of the intrinsic should set the target triple in the LLVM IR string properly.
  4. Fix the bug in the call to addPassesToEmitFile by enabling the verifier.
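For reviewers, a sketch of how the revised lowering fits together, reusing createDeviceTargetMachine() from earlier in the thread and a module already parsed with its triple checked (stream and enum spellings follow current LLVM, so treat it as pseudocode for the actual patch):

#include "llvm/ADT/SmallString.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"
#include <memory>
#include <string>

using namespace llvm;

TargetMachine *createDeviceTargetMachine(StringRef TT, StringRef CPU,
                                         StringRef Features); // see above

// Emit assembly for the already-parsed embedded module. CPU/Features may
// be "" to request the target's defaults (change 2 above).
bool emitDeviceAssembly(Module &DeviceM, StringRef CPU, StringRef Features,
                        std::string &AsmOut) {
  std::unique_ptr<TargetMachine> TM(createDeviceTargetMachine(
      DeviceM.getTargetTriple(), CPU, Features));
  if (!TM)
    return false;

  SmallString<0> Buf;
  raw_svector_ostream OS(Buf); // codegen wants a raw_pwrite_stream
  legacy::PassManager PM;
  // DisableVerify = false: run the IR verifier (change 4 above).
  if (TM->addPassesToEmitFile(PM, OS, /*DwoOut=*/nullptr,
                              CodeGenFileType::AssemblyFile,
                              /*DisableVerify=*/false))
    return false; // this target cannot emit assembly

  PM.run(DeviceM);
  AsmOut = Buf.str().str();
  return true;
}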

Thanks a lot!

best regards,
Yabin

2012/4/30 Tobias Grosser <tobias@grosser.es>

0001-Add-llvm.codegen-intrinsic.patch (19.9 KB)

Tobias Grosser <tobias@grosser.es> writes:

To write optimizations that yield embedded GPU code, we also looked into
three other approaches:

1. Directly create embedded target code (e.g. PTX)

This would mean the optimization pass extracts device code internally
and directly generates the relevant target code. This approach would
require our generic optimization pass to be linked directly against the
specific target back end. That is an ugly layering violation and, in
addition, it causes major trouble if the new optimization pass is to be
dynamically loaded.

IMHO it's a bit unrealistic to have a target-independent optimization
layer. Almost all optimization wants to know target details at some
point. I think we can and probably should support that. We can allow
passes to gracefully fall back in the cases where target information is
not available.

2. Extend the LLVM-IR files to support heterogeneous modules

This would mean we extend LLVM-IR such that IR for different targets
can be stored within a single IR file. This approach could be integrated
nicely into the LLVM code generation flow and would yield readable
LLVM-IR even for the device code. However, it adds another level of
complexity to the LLVM-IR files and requires massive changes not only
in the LLVM code base but also in compilers built on top of LLVM-IR.

I don't think the code base changes are all that bad. We have a number
of them to support generating code one function at a time rather than a
whole module together. They've been sitting around waiting for us to
send them upstream. It would be an easy matter to simply annotate each
function with its target. We don't currently do that because we never
write out such IR files but it seems like a simple problem to solve to
me.
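As a sketch of what such an annotation might look like mechanically, one could group functions by a hypothetical "target" string attribute (no such attribute exists in stock LLVM) and hand each group to its own code generator:

#include "llvm/ADT/StringMap.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include <vector>

using namespace llvm;

// Partition functions by a hypothetical per-function "target" attribute.
StringMap<std::vector<Function *>> partitionByTarget(Module &M) {
  StringMap<std::vector<Function *>> Groups;
  for (Function &F : M) {
    // Unannotated functions belong to the module's own triple.
    StringRef TT = F.hasFnAttribute("target")
                       ? F.getFnAttribute("target").getValueAsString()
                       : StringRef(M.getTargetTriple());
    Groups[TT].push_back(&F);
  }
  return Groups;
}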

3. Generate two independent LLVM-IR files and pass them around together

The host and device LLVM-IR modules could be kept in separate files.
This has the benefit of being user readable and of not adding additional
complexity to the LLVM-IR files themselves. However, separate files do
not provide information about how those files are related. Which files
are kernel files? How and where do they need to be loaded? This
information could probably be put into metadata or hard-coded into the
generic compiler infrastructure, but that would require significant
additional code.

I don't think metadata would work because it would not satisfy the "no
semantic effects" requirement. We couldn't just drop the metadata and
expect things to work.

Another weakness of this approach is that the entire LLVM optimization
chain is currently built under the assumption that a single file/module
is passed around. This is most obvious with the 'opt | llc' idiom, but
in general every tool that currently exists would need to be adapted to
handle multiple files and would possibly even need semantic knowledge
about how to connect/use them together. Just running clang or dragonegg
with -load GPGPUOptimizer.so would not be possible.

Again, we have many of the changes to make this possible. I hope to
send them for review as we upgrade to 3.1.

All of the previous approaches require significant changes all over the
code base and would cause trouble with loadable optimization passes. The
intrinsic based approach seems to address most of the previous problems.

I'm pretty uncomfortable with the proposed intrinsic. It feels
tacked-on and not in the LLVM spirit. We should be able to extend the
IR to support multiple targets. We're going to need this kind of
support for much more than GPUs in the future. Heterogeneous computing
is here to stay.

                             -Dave

<dag@cray.com> writes:

Tobias Grosser <tobias@grosser.es> writes:

To write optimizations that yield embedded GPU code, we also looked into
three other approaches:

1. Directly create embedded target code (e.g. PTX)

This would mean the optimization pass extracts device code internally
and directly generates the relevant target code. This approach would
require our generic optimization pass to be linked directly against the
specific target back end. That is an ugly layering violation and, in
addition, it causes major trouble if the new optimization pass is to be
dynamically loaded.

IMHO it's a bit unrealistic to have a target-independent optimization
layer. Almost all optimization wants to know target details at some
point. I think we can and probably should support that. We can allow
passes to gracefully fall back in the cases where target information is
not available.

I think I misread your intent here. It is indeed a very bad layering
violation to have opt generate code. In the response above I am talking
about making target characteristics available to opt passes when they
are available. I think the latter is important to get good performance.

                                  -Dave

Tobias Grosser <tobias@grosser.es> writes:

> To write optimizations that yield embedded GPU code, we also looked into
> three other approaches:
>
> 1. Directly create embedded target code (e.g. PTX)
>
> This would mean the optimization pass extracts device code internally
> and directly generates the relevant target code. This approach would
> require our generic optimization pass to be linked directly against the
> specific target back end. That is an ugly layering violation and, in
> addition, it causes major trouble if the new optimization pass is to be
> dynamically loaded.

IMHO it's a bit unrealistic to have a target-independent optimization
layer. Almost all optimization wants to know target details at some
point. I think we can and probably should support that. We can allow
passes to gracefully fall back in the cases where target information is
not available.

> 2. Extend the LLVM-IR files to support heterogeneous modules
>
> This would mean we extend LLVM-IR such that IR for different targets
> can be stored within a single IR file. This approach could be integrated
> nicely into the LLVM code generation flow and would yield readable
> LLVM-IR even for the device code. However, it adds another level of
> complexity to the LLVM-IR files and requires massive changes not only
> in the LLVM code base but also in compilers built on top of LLVM-IR.

I don't think the code base changes are all that bad. We have a number
of them to support generating code one function at a time rather than a
whole module together. They've been sitting around waiting for us to
send them upstream. It would be an easy matter to simply annotate each
function with its target. We don't currently do that because we never
write out such IR files but it seems like a simple problem to solve to
me.

If such changes are almost ready to be up-streamed, then great! It just
seems like a fairly non-trivial task to actually implement function-level
target selection, especially when you consider function call semantics,
taking the address of a function, etc. If you have a global variable, what
target "sees" it? Does it need to be annotated along with the function?
Can functions from two different targets share this pointer? At first
glance, there seem to be many non-trivial issues that are heavily
dependent on the nature of the target. For Yabin's use-case, the X86
portions need to be compiled to assembly, or even an object file, while the
PTX portions need to be lowered to an assembly string and embedded in the
X86 source (or written to disk somewhere). If you're targeting Cell, in
contrast, you'd want to compile both down to object files.

Don't get me wrong, I think this is something we need to do and the
llvm.codegen intrinsic is a band-aid solution, but I don't see this as a
simple problem.

> 3. Generate two independent LLVM-IR files and pass them around together
>
> The host and device LLVM-IR modules could be kept in separate files.
> This has the benefit of being user readable and of not adding additional
> complexity to the LLVM-IR files themselves. However, separate files do
> not provide information about how those files are related. Which files
> are kernel files? How and where do they need to be loaded? This
> information could probably be put into metadata or hard-coded into the
> generic compiler infrastructure, but that would require significant
> additional code.

I don't think metadata would work because it would not satisfy the "no
semantic effects" requirement. We couldn't just drop the metadata and
expect things to work.

> Another weakness of this approach is that the entire LLVM optimization
> chain is currently built under the assumption that a single file/module
> is passed around. This is most obvious with the 'opt | llc' idiom, but
> in general every tool that currently exists would need to be adapted to
> handle multiple files and would possibly even need semantic knowledge
> about how to connect/use them together. Just running clang or dragonegg
> with -load GPGPUOptimizer.so would not be possible.

Again, we have many of the changes to make this possible. I hope to
send them for review as we upgrade to 3.1.

> All of the previous approaches require significant changes all over the
> code base and would cause trouble with loadable optimization passes. The
> intrinsic based approach seems to address most of the previous problems.

I'm pretty uncomfortable with the proposed intrinsic. It feels
tacked-on and not in the LLVM spirit. We should be able to extend the
IR to support multiple targets. We're going to need this kind of
support for much more than GPUs in the future. Heterogeneous computing
is here to stay.

For me, the bigger question is: do we extend the IR to support multiple
targets, or do we keep the one-target-per-module philosophy and derive some
other way of representing how the modules fit together? I can see pros and
cons for both approaches.

What if instead of per-function annotations, we implement something like
module file sections? You could organize a module file into logical
sections based on target architecture. I'm just throwing that out there.

Justin Holewinski <justin.holewinski@gmail.com> writes:

    I don't think the code base changes are all that bad. We have a number
    of them to support generating code one function at a time rather than a
    whole module together. They've been sitting around waiting for us to
    send them upstream. It would be an easy matter to simply annotate each
    function with its target. We don't currently do that because we never
    write out such IR files but it seems like a simple problem to solve to
    me.

    If such changes are almost ready to be up-streamed, then great!

Just to clarify, the current changes simply allow a function to be
completely processed (including asm generation) before the next function
is sent to codegen.

    It just seems like a fairly non-trivial task to actually implement
    function-level target selection, especially when you consider function
    call semantics, taking the address of a function, etc.

For something like PTX, runtime calls take care of the call semantics so
it is either up to the user or the frontend to set up the runtime calls
correctly. We don't need to completely solve this problem. Yet. :)
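
For example, the host IR never calls the kernel directly; the frontend
lowers a launch into runtime calls along these lines (CUDA Driver API
names, with the opaque CUmodule/CUfunction/CUstream handles lowered to
i8* purely for illustration):

; Host-side sketch: launching a PTX kernel is just runtime-API calls.
@kernel_name = private constant [12 x i8] c"kernel_main\00"

declare i32 @cuModuleLoadData(i8**, i8*)          ; load the PTX image
declare i32 @cuModuleGetFunction(i8**, i8*, i8*)  ; look up the kernel by name
declare i32 @cuLaunchKernel(i8*, i32, i32, i32, i32, i32, i32, i32, i8*, i8**, i8**)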

    If you have a global variable, what target "sees" it? Does it need to
    be annotated along with the function?

For a tool like llc, wouldn't it be simply a matter of changing
TheTarget and reconstituting the various passes? The changes we have
waiting to upstream already allow us to reconstitute passes. I
sometimes use this to turn on/off debugging on a function-level basis.

The way we've constructed our backend interface should just allow us to
switch the target and reinitialize everything. I'm sure I'm glossing
over tons of details but I don't see a fundamental architectural problem
in LLVM that would prevent this.

    Can functions from two different targets share this pointer?

Again, in the case of PTX it's the runtime's responsibility to ensure
this. I agree passing pointers around complicates things in the general
case but I also think it's a solvable problem.

    For Yabin's use-case, the X86 portions need to be compiled to
    assembly, or even an object file, while the PTX portions need to be
    lowered to an assembly string and embedded in the X86 source (or
    written to disk somewhere).

I think it's just a matter of switching to a different AsmWriter. The
PTX runtime can load objects from files. The code doesn't have to be a
string in the x86 object file.

    If you're targeting Cell, in contrast, you'd want to compile both down
    to object files.

I think we probably want to do that for PTX as well.

    For me, the bigger question is: do we extend the IR to support
    multiple targets, or do we keep the one-target-per-module philosophy
    and derive some other way of representing how the modules fit
    together? I can see pros and cons for both approaches.

Me too.

    What if instead of per-function annotations, we implement something
    like module file sections? You could organize a module file into
    logical sections based on target architecture. I'm just throwing that
    out there.

Do we allow more than one Module per file? If not, that seems like an
arbitrary limitation. If we allowed that we could have each module
specify a different target.
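
For instance, with invented top-level syntax (today llvm-as accepts
exactly one module per file):

; Invented multi-module IR file -- illustration only.
module host {
  target triple = "x86_64-unknown-linux-gnu"
  define void @host_entry() { ret void }
}
module device {
  target triple = "ptx32--"
  define ptx_kernel void @kernel() { ret void }
}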

                                 -Dave

    For something like PTX, runtime calls take care of the call semantics so
    it is either up to the user or the frontend to set up the runtime calls
    correctly. We don't need to completely solve this problem. Yet. :)

But there has to be some interface that allows an LLVM IR function from one
architecture to get at the code or name of a function from another
architecture. This could be handled in the front-end, but it seems like we
could design some abstraction.

    > If you have a global variable, what target "sees" it? Does it need to
    > be annotated along with the function?

    For a tool like llc, wouldn't it be simply a matter of changing
    TheTarget and reconstituting the various passes? The changes we have
    waiting to upstream already allow us to reconstitute passes. I
    sometimes use this to turn on/off debugging on a function-level basis.

Sorry, I meant global variables in the LLVM IR. Are they valid for only one
architecture in the IR module?

    > If you're targeting Cell, in contrast, you'd want to compile both down
    > to object files.

    I think we probably want to do that for PTX as well.

Maybe, maybe not. It may make sense to rely on run-time JIT'ing of the PTX.

    Do we allow more than one Module per file? If not, that seems like an
    arbitrary limitation. If we allowed that we could have each module
    specify a different target.

That could work.

Justin Holewinski <justin.holewinski@gmail.com> writes:

    For something like PTX, runtime calls take care of the call semantics so
    it is either up to the user or the frontend to set up the runtime calls
    correctly. We don't need to completely solve this problem. Yet. :)

    But there has to be some interface that allows an LLVM IR function
    from one architecture to get at the code or name of a function from
    another architecture. This could be handled in the front-end, but it
    seems like we could design some abstraction.

Doesn't LLVM support taking the address of a function in another address
space? If not, it probably should.

    > If you have a global variable, what target "sees" it? Does it need to
    > be annotated along with the function?
   
    Sorry, I meant global variables in the LLVM IR. Are they valid for
    only one architecture in the IR module?

Ah. It very much depends on the system architecture. Since current PTX
targets run in an entirely separate address space, globals would have to
be replicated and copied to/from the device. This might require
target-specific modules.

For a system with shared memory, I would assume the globals could simply
be shared "as usual." Otherwise, it wouldn't be shared memory. In a
target-specific module design, one or the other would be an extern
reference.
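
In IR terms, the two cases might look like this (the names and the address
space number are illustrative; the PTX backend's actual numbering may
differ):

; Discrete-memory case: two copies, with the runtime copying between them.
; -- host module --
@table = global [256 x float] zeroinitializer
; -- device module, addrspace(1) standing in for PTX global memory --
@table.dev = addrspace(1) global [256 x float] zeroinitializer

; Shared-memory case: one module defines the global, the other just
; declares it.
; -- host module --
@shared = global [256 x float] zeroinitializer
; -- device module --
@shared = external global [256 x float]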

    > If you're targeting Cell, in contrast, you'd want to compile both down
    > to object files.
   
    I think we probably want to do that for PTX as well.

    Maybe, maybe not. It may make sense to rely on run-time JIT'ing of the PTX.

That happens regardless. There is no way to produce instructions "to
the metal" for NVIDIA targets. I was referring to PTX object files
above.

    Do we allow more than one Module per file? If not, that seems like an
    arbitrary limitation. If we allowed that we could have each module
    specify a different target.

    That could work.

Given your questions about globals above, I think it might be a
requirement, unless we want to require that code for separate targets live
in separate files. I think that's too restrictive, because some opt pass
might want to extract kernels and put them on separate targets.

                              -Dave