[PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation

Hi Dave,

I highly appreciate your idea of integrating heterogeneous computing features directly into LLVM-IR. I believe this is a path worth pursuing, but I doubt now is the right moment for it. I don't share your opinion that it is easy to move LLVM-IR in this direction; rather, I believe this is an engineering project that will take several months of full-time work. Possibly not the implementation itself, but designing it, discussing it, implementing it, and ensuring that the new feature does not increase the run-time and memory footprint or reduce the maintainability of LLVM. Given the large number of changes that would be needed all over LLVM, I really think we should first get some experience in this area before we burn this feature into LLVM-IR.

The llvm.codegen intrinsic seems the perfect vehicle for building up such experience. It requires no changes to LLVM-IR itself and only very local changes to the generic back-end infrastructure. It may not be as generic as other solutions, but it is far from being an ugly hack. On the contrary, it is a close match for OpenCL-like run times and works well with the existing PTX back end.
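For readers following along, here is a hypothetical sketch of what a call site might look like. The names and the intrinsic signature below are illustrative only; the actual definition is whatever the patch proposes:

```llvm
; Hypothetical sketch (names and signature are illustrative, not the
; patch's actual definition). The device kernel lives in the host
; module as a plain LLVM-IR string; llvm.codegen marks it so that the
; back end lowers the string to PTX while compiling the host module.
@kernel_ir = private constant [16 x i8] c"; device IR ...\00"

declare i8* @llvm.codegen(i8* %ir, i8* %options)

define void @host(i8* %opts) {
entry:
  %ir  = getelementptr [16 x i8], [16 x i8]* @kernel_ir, i32 0, i32 0
  %ptx = call i8* @llvm.codegen(i8* %ir, i8* %opts)
  ; %ptx points at the generated PTX, ready to hand to a CUDA/OpenCL
  ; runtime loader.
  ret void
}
```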

Do you have definitive plans to add heterogeneous computing capabilities to LLVM-IR within the next few (3-4) months? Will these capabilities supersede the llvm.codegen intrinsic?

In case such plans do not exist, what do you think about adding the llvm.codegen() intrinsic for now? If mid-term plans exist for heterogeneous extensions to LLVM-IR, we can document them alongside the intrinsic.

Cheers
Tobi

Wait. I don't think there is enough justification for this to move
forward. Apart from the technical issues that have already been
raised, I can also see that this introduces a safety issue, since the
embedded IR code is not checked / verified at compile time. Unless
Chris says otherwise, I don't see this patch being accepted on trunk.

[...]

Could you explain exactly what security issues you see? The embedded LLVM-IR
is checked by the IR verifier just as the host IR is. For both
LLVM-IR modules, target code is generated, and consequently verified,
at the same time. The embedded IR is _not_ compiled later than the host
IR. What did I miss?

Hi Evan,

in your last mail you mentioned security issues with this intrinsic.
Could you describe them to me?

Tobi

I think we need several modules per file. Supporting AMDIL and PTX at the same time sounds more than useful.

Another question comes to mind: if we support several modules, what would the command-line options to opt look like? Do we want to make all options sub-module specific? Getting this user-friendly may be difficult. The same goes for the output of llc. At the moment llc can dump the assembly to stdout. Would you dump the assembly of the different modules to stdout, or would you support multiple -o options to specify the various output files?

The same goes for the LLVM CodeGen/Target API. It would possibly have to be changed to support the output of several modules, or the specification of different options for each module. We also face the same problems Justin pointed out for the codegen intrinsic: some llc options are globals, and they would need to become CodeGen options if we want to set them on a per-module basis.

Cheers
Tobi

Tobias Grosser <tobias@grosser.es> writes:

To write optimizations that yield embedded GPU code, we also looked into
three other approaches:

1. Directly create embedded target code (e.g. PTX)

This would mean the optimization pass extracts device code internally
and directly generates the relevant target code. This approach would
require our generic optimization pass to be linked directly against the
specific target back end. This is an ugly layering violation and, in
addition, it causes major trouble if the new optimization should
be dynamically loaded.

IMHO it's a bit unrealistic to have a target-independent optimization
layer. Almost all optimizations want to know target details at some
point. I think we can, and probably should, support that. We can allow
passes to gracefully fall back in the cases where target information is
not available.

Yes, I agree it makes sense to make target information available to the optimizers. As you noted yourself, this is different from performing target code generation in the optimizers.

2. Extend the LLVM-IR files to support heterogeneous modules

This would mean we extend LLVM-IR such that IR for different targets
can be stored within a single IR file. This approach could be integrated
nicely into the LLVM code generation flow and would yield readable
LLVM-IR even for the device code. However, it adds another level of
complexity to the LLVM-IR files and requires massive changes not only
in the LLVM code base but also in compilers built on top of
LLVM-IR.

I don't think the code base changes are all that bad. We have a number
of them to support generating code one function at a time rather than a
whole module together. They've been sitting around waiting for us to
send them upstream. It would be an easy matter to simply annotate each
function with its target. We don't currently do that because we never
write out such IR files but it seems like a simple problem to solve to
me.
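As a thought experiment, the per-function annotation Dave describes might look something like this. The `target("...")` attribute syntax is purely hypothetical; nothing like it exists in current LLVM-IR:

```llvm
; One IR file, two targets (hypothetical attribute syntax).
define void @host_entry() target("x86_64") {
  ret void
}

define void @kernel(float* %out) target("ptx32") {
  ret void
}
```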

Supporting several modules in one LLVM-IR file may not be too difficult,
but getting this in may still be controversial. The bulk of the changes I foresee are changes to the tools: at the moment, all tools expect a single module from an LLVM-IR file. I pointed out the problems in llc and the codegen examples in my other mail.

3. Generate two independent LLVM-IR files and pass them around together

The host and device LLVM-IR modules could be kept in separate files.
This has the benefit of being user-readable and of not adding additional
complexity to the LLVM-IR files themselves. However, separate files
provide no information about how they are related. Which files are
kernel files, how/where do they need to be loaded, ...? Granted, this
information could probably be put into metadata or hard-coded
into the generic compiler infrastructure, but that would require
significant additional code.

I don't think metadata would work because it would not satisfy the "no
semantic effects" requirement. We couldn't just drop the metadata and
expect things to work.

You are right; this solution requires semantic metadata, which is a non-trivial prerequisite.

Another weakness of this approach is that the entire LLVM optimization
chain is currently built under the assumption that a single file/module
is passed around. This is most obvious in the 'opt | llc' idiom, but in
general every existing tool would need to be adapted to
handle multiple files, and would possibly even need semantic knowledge
of how to connect/use them together. Just running clang or
dragonegg with -load GPGPUOptimizer.so would not be possible.

Again, we have many of the changes to make this possible. I hope to
send them for review as we upgrade to 3.1.

Could you provide a list of the changes you have in the pipeline and a reliable timeline for when you will upstream them? How much additional work from other people is required to make this a valuable replacement for the llvm.codegen intrinsic?

All of the previous approaches require significant changes all over the
code base and would cause trouble with loadable optimization passes. The
intrinsic based approach seems to address most of the previous problems.

I'm pretty uncomfortable with the proposed intrinsic. It feels
tacked-on and not in the LLVM spirit. We should be able to extend the
IR to support multiple targets. We're going to need this kind of
support for much more than GPUs in the future. Heterogeneous computing is
here to stay.

Where exactly do you see problems with this intrinsic? It is not meant to block further work on heterogeneous computing, but to allow us to gradually improve LLVM to gain such features. In particular, it provides a low-overhead solution that adds working heterogeneous compute capabilities for major GPU targets to LLVM. This working solution can prepare the ground for more closely integrated solutions.

Tobi

Tobias Grosser <tobias@grosser.es> writes:

Doesn't LLVM support taking the address of a function in another address
space? If not it probably should.

Hi Dave,

I highly appreciate your idea of integrating heterogeneous computing
features directly into LLVM-IR. I believe this is a path worth
pursuing, but I doubt now is the right moment for it. I don't share your
opinion that it is easy to move LLVM-IR in this direction; rather, I
believe this is an engineering project that will take several months
of full-time work. Possibly not the implementation itself, but
designing it, discussing it, implementing it, and ensuring that the new
feature does not increase the run-time and memory footprint or reduce
the maintainability of LLVM. Given the large number of changes that
would be needed all over LLVM, I really think we should first get some
experience in this area before we burn this feature into LLVM-IR.

I'm not advocating that we rush into this by any means. I'm well aware
that the discussions and experiments will take quite a while to plow
through. I think a small set of enhancements will go a long way. I'd
like to avoid hacking in special intrinsics like llvm.codegen that feel
so very much in opposition to the rest of the LLVM design.

The llvm.codegen intrinsic seems the perfect vehicle for building up such
experience. It requires no changes to LLVM-IR itself and only very
local changes to the generic back-end infrastructure. It may not be as
generic as other solutions, but it is far from being an ugly hack. On
the contrary, it is a close match for OpenCL-like run times and works
well with the existing PTX back end.

I'll bite my tongue on the designs of OpenCL and CUDA. :-)

But regardless, if those are your targets you don't need llvm.codegen at
all.

Do you have definitive plans to add heterogeneous computing
capabilities to LLVM-IR within the next few (3-4) months? Will
these capabilities supersede the llvm.codegen intrinsic?

No specific plans to change the IR. We have not found a need for such
changes on current architectures, as the runtimes provided with those
architectures handle the ugly details. I am thinking further into the
future and what might be needed there.

In case such plans do not exist, what do you think about adding the
llvm.codegen() intrinsic for now? If mid-term plans exist for
heterogeneous extensions to LLVM-IR, we can document them alongside the
intrinsic.

I think it's completely unnecessary if your goal is to get something
working on current hardware.

We do have certain structural/software engineering changes to the
implementation of LLVM's code generator that would be useful. The
primary one is the ability to completely process one function before
moving on to the next. This is important when dealing with heterogeneous
systems, as one has to, for example, write out different asm for the
various targets at function granularity. But that doesn't require any
IR changes whatsoever.

                                   -Dave

Tobias Grosser <tobias@grosser.es> writes:

I think we need several modules per file. Supporting AMDIL and PTX at
the same time sounds more than useful.

Yes.

Another question comes to mind: if we support several modules,
what would the command-line options to opt look like? Do we want to
make all options sub-module specific? Getting this user-friendly may
be difficult. The same goes for the output of llc. At the moment llc can
dump the assembly to stdout. Would you dump the assembly of the
different modules to stdout, or would you support multiple -o
options to specify the various output files?

I think you're making this too complicated. I think opt should continue
to work the way it does now. Apply the same flags to all modules. If
the user wants different transformations based on target either the
target characteristics should inform the optimizer or the file should be
split into multiple IR files.

The same goes for the LLVM CodeGen/Target API. It would possibly have
to be changed to support the output of several modules, or the
specification of different options for each module. We also face the
same problems Justin pointed out for the codegen intrinsic: some llc
options are globals, and they would need to become CodeGen options if
we want to set them on a per-module basis.

Can you give me some examples? What kinds of options would be
target-specific and not implied by the target attribute on the Module?

                              -Dave

Tobias Grosser <tobias@grosser.es> writes:

Would you dump the assembly of the different modules to stdout or do
you want to support multiple -o options to specify the various output
files?

I forgot to address this one. With current OpenCL and CUDA
specifications, there's no need to do multiple .o files. In my mind,
llc should output one .o (one .s, etc.). Anything else wreaks havoc on
build systems.

But Chris has the final say, I think.

                               -Dave

Tobias Grosser <tobias@grosser.es> writes:

Supporting several modules in one LLVM-IR file may not be too difficult,
but getting this in may still be controversial. The bulk of the
changes I foresee are changes to the tools: at the moment, all tools
expect a single module from an LLVM-IR file. I pointed out the
problems in llc and the codegen examples in my other mail.

I replied to that mail so I won't repeat it all here. I don't think
there's any problem given current technology. Since I don't know any
details (only speculation) about what's coming in the future, I can't
comment beyond that.

Again, we have many of the changes to make this possible. I hope to
send them for review as we upgrade to 3.1.

Could you provide a list of the changes you have in the pipeline and a
reliable timeline for when you will upstream them? How much additional
work from other people is required to make this a valuable replacement
for the llvm.codegen intrinsic?

I'll try to recall the major bits. I did this work 3-4 years ago...

I think the major issue was with the AsmPrinter. There's global state
kept around that needs to be cleared between invocations. The
initialization step needs to be re-run for each function but there are
some tricky bits that should not happen each run. That is, most of
AsmPrinter is idempotent but not all.

Label names are a big issue. A simple label counter (L0, L1, etc.) is
no longer sufficient because the counter gets reset between invocations,
and you end up with multiple labels with the same name in the .s file.
We got around this by including the (mangled) function name in the label
name.
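A sketch of the collision (illustrative assembly, not real compiler output): with a counter that resets per invocation, two functions both emit `.L0` into the same .s file, whereas folding in the mangled function name keeps labels unique:

```asm
foo:
        jmp .L0
.L0:                    # emitted while printing foo
        ret
bar:
        jmp .L0
.L0:                    # counter was reset: duplicate label, assembler error
        ret

# with function names folded in, e.g. .Lfoo_0 / .Lbar_0, no clash occurs
```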

I had to tweak the mangling code a bit so that it would generate valid
label names. I also consolidated it as there were at least two
different implementations in the ~2.5 codebase. I don't know if that's
changed.

We don't use much of opt at all. I'm sure there are some issues with
the interprocedural optimizations. We didn't deal with those. All of
our changes are in the llc/codegen piece.

As for getting it upstream, we're moving to 3.1 as soon as it's ready,
and my intention is to push as much of our customized code upstream as
possible during that transition. The above work would be a pretty high
priority, as it is a major source of conflicts for us and I'd rather just
get rid of those. :-)

So expect to start seeing something within 1-2 months. Unfortunately,
we have bureaucratic processes I have to go through here to get stuff
approved for public release.

Where exactly do you see problems with this intrinsic? It is not meant
to block further work on heterogeneous computing, but to allow us to
gradually improve LLVM to gain such features. In particular, it provides
a low-overhead solution that adds working heterogeneous compute
capabilities for major GPU targets to LLVM. This working solution can
prepare the ground for more closely integrated solutions.

It feels like a code generator bolted onto the side of opt, llc,
etc. with all of the details that involves. It seems much easier to me
to just go through the "real" code generator.

                           -Dave

Tobias Grosser <tobias@grosser.es> writes:

Doesn't LLVM support taking the address of a function in another address
space? If not it probably should.

Hi Dave,
The llvm.codegen intrinsic seems the perfect vehicle for building up such
experience. It requires no changes to LLVM-IR itself and only very
local changes to the generic back-end infrastructure. It may not be as
generic as other solutions, but it is far from being an ugly hack. On
the contrary, it is a close match for OpenCL-like run times and works
well with the existing PTX back end.

I'll bite my tongue on the designs of OpenCL and CUDA. :-)

But regardless, if those are your targets you don't need llvm.codegen at
all.

Why is it not needed? I don't see anything that could currently replace it. How can I create a loadable optimizer module that creates embedded PTX code without the llvm.codegen intrinsic?

Do you have definitive plans to add heterogeneous computing
capabilities to LLVM-IR within the next few (3-4) months? Will
these capabilities supersede the llvm.codegen intrinsic?

No specific plans to change the IR. We have not found a need for such
changes on current architectures, as the runtimes provided with those
architectures handle the ugly details. I am thinking further into the
future and what might be needed there.

OK. I am talking about something that will be available in LLVM within the next few weeks.

In case such plans do not exist, what do you think about adding the
llvm.codegen() intrinsic for now? If mid-term plans exist for
heterogeneous extensions to LLVM-IR, we can document them alongside the
intrinsic.

I think it's completely unnecessary if your goal is to get something
working on current hardware.

Again, why is it unnecessary?

We do have certain structural/software engineering changes to the
implementation of LLVM's code generator that would be useful. The
primary one is the ability to completely process one function before
moving on to the next. This is important when dealing with heterogeneous
systems, as one has to, for example, write out different asm for the
various targets at function granularity. But that doesn't require any
IR changes whatsoever.

At least for CUDA/OpenCL the modules are entirely independent. Is such a fine granularity really required?

Tobi

Yes, that's what I am advocating for. There is no need for all this complexity. Both standards store the embedded code as a string in the host module, which is exactly what the llvm.codegen intrinsic models. It requires zero further changes to the code generation back end.

In contrast, extending LLVM-IR to support heterogeneous modules requires us to add logic to the LLVM code generator that knows how to link the different sub-modules.

Tobi

It is the real code generator. It is just applied to embedded strings, which is how both OpenCL and CUDA represent embedded programs. I doubt there will be ways to model OpenCL or CUDA more closely.

Tobi

Tobias Grosser <tobias@grosser.es> writes:

But regardless, if those are your targets you don't need llvm.codegen at
all.

Why is it not needed? I don't see anything that could currently
replace it. How can I create a loadable optimizer module that creates
embedded PTX code without the llvm.codegen intrinsic?

Embed the PTX as a string in the x86 object/executable. This requires
that the AsmPrinter can be directed to multiple files but it doesn't
require any IR changes at all. Actually, since your modules are
independent it doesn't even require AsmPrinter changes.

No specific plans to change the IR. We have not found a need for such
changes on current architectures, as the runtimes provided with those
architectures handle the ugly details. I am thinking further into the
future and what might be needed there.

OK. I am talking about something that will be available in LLVM within
the next few weeks.

Then you don't need a special intrinsic.

I think it's completely unnecessary if your goal is to get something
working on current hardware.

Again, why is it unnecessary?

See above.

We do have certain structural/software engineering changes to the
implementation of LLVM's code generator that would be useful. The
primary one is the ability to completely process one function before
moving on to the next. This is important when dealing with heterogeneous
systems, as one has to, for example, write out different asm for the
various targets at function granularity. But that doesn't require any
IR changes whatsoever.

At least for CUDA/OpenCL the modules are entirely independent. Is such
a fine granularity really required?

If they're independent, no. In our case, what is to us the frontend
extracts kernels and sends them to codegen the same way it sends x86 code.

Originally I did this for scalability purposes. We could not compile
very large codes when LLVM insisted we keep all the IR around all the
time. We had to get rid of that restriction which led to the
function-at-a-time model. It just happens that it works well for
current accelerators.

                        -Dave

Tobias Grosser <tobias@grosser.es> writes:

I forgot to address this one. With current OpenCL and CUDA
specifications, there's no need to do multiple .o files. In my mind,
llc should output one .o (one .s, etc.). Anything else wreaks havoc on
build systems.

Yes, that's what I am advocating for. There is no need for all this
complexity. Both standards store the embedded code as a string in the
host module, which is exactly what the llvm.codegen intrinsic
models. It requires zero further changes to the code generation
back end.

But why do you need an intrinsic to do that? Just generate the code to
a file and suck it into a string, maybe with an external "linker" tool.

If you just want something to work, that should be sufficient. If you
want some long-term design/implementation I don't think llvm.codegen is
it.
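To make the "suck it into a string" step concrete, here is a minimal sketch of such an embedding helper. The function and symbol names are invented for illustration, and a real driver would of course first run llc to produce the PTX file:

```python
def embed_as_asm_string(ptx: str, symbol: str = "__gpu_ptx_image") -> str:
    """Emit a GAS fragment that places `ptx` in the host object as a
    NUL-terminated string under a global symbol (sketch; names invented)."""
    # Escape backslashes, quotes, and newlines so the text survives
    # inside a .asciz directive.
    escaped = (ptx.replace("\\", "\\\\")
                  .replace('"', '\\"')
                  .replace("\n", "\\n"))
    return (f"\t.section .rodata\n"
            f"\t.globl {symbol}\n"
            f"{symbol}:\n"
            f"\t.asciz \"{escaped}\"\n")

if __name__ == "__main__":
    # Hypothetical PTX text; in a real driver this would be read from
    # the file llc produced for the device module.
    fragment = embed_as_asm_string(".version 2.3\n.entry kernel { ret; }\n")
    print(fragment)
```

Assembling this fragment together with the host .s file yields a single object, matching the one-object-file constraint discussed above.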

In contrast, extending LLVM-IR to support heterogeneous modules
requires us to add logic to the LLVM code generator that knows how to
link the different sub-modules.

We already have the Linker.

                                -Dave

Sorry Tobias, I'm not in favor of this change. From what I can tell, it enables some features that can be implemented via other means. It adds all kinds of complexity to LLVM, and I'm also highly concerned about bitcode that can embed illegal (or, worse, malicious) code using this feature.

Unless Chris says otherwise, we can consider this proposal dead. Sorry.

Evan

Hi Evan,

there is no need to force this change in. I am rather trying to understand the shortcomings of my approach and to look for possibly better solutions.

That's why I was asking where you see the possibility of illegal/malicious code. You have not really explained it yet, and I would be more than happy to understand such a problem. From my point of view, embedded and host module code are both compiled at the same time and are both checked by the LLVM bitcode verifier. How could this introduce any malicious code that could not be introduced by normal LLVM-IR?

In terms of complexity: the only alternative proposals I have heard of were making LLVM-IR multi-module aware or adding multi-module support to all LLVM-IR tools. Both of these changes are far more complex than the codegen intrinsic. Actually, they are so complex that I doubt they can be implemented any time soon. What is the simpler approach you are talking about?

Maybe I completely missed the point, but if there were a good alternative, there would be no need for discussion; I would happily go ahead and implement said alternative. Even if there is none, I would keep quiet once I understood the concerns that block this proposal. For now, I don't think I have understood them yet.

Cheers
Tobi

OK, I think we are on the same track. Yes, there is no need for a lot of infrastructure; storing PTX in a string in the host module is the only thing needed.

So why the intrinsic? I want to create the PTX string from an LLVM-IR optimizer pass that can be loaded into clang, dragonegg, opt, ..
An LLVM-IR optimizer pass does not have access to the file system, and it cannot link to the LLVM back ends to directly create PTX; creating PTX in an optimizer pass would be an ugly hack. The cleaner solution is to store an LLVM-IR string in the host module and to mark it with the llvm.codegen() intrinsic. When the module is processed by the back end, the string is automatically translated to PTX. This requires no additional file writing, introduces no layering violations, and seems very simple.

I don't see a better way to translate LLVM-IR to PTX. Do you still believe introducing file writing in an optimizer module is a good and portable solution?

Cheers
Tobi

Tobias Grosser <tobias@grosser.es> writes:

I forgot to address this one. With current OpenCL and CUDA
specifications, there’s no need to do multiple .o files. In my mind,
llc should output one .o (one .s, etc.). Anything else wreaks havoc on
build systems.

Yes, that’s what I am advocating for. There is no need for all this
complexity. Both standards store the embedded code as a string in the
host module, which is exactly what the llvm.codegen intrinsic
models. It requires zero further changes to the code generation
back end.

But why do you need an intrinsic to do that? Just generate the code to
a file and suck it into a string, maybe with an external “linker” tool.

If you just want something to work, that should be sufficient. If you
want some long-term design/implementation I don’t think llvm.codegen is
it.

OK, I think we are on the same track. Yes, there is no need for a lot of infrastructure; storing PTX in a string in the host module is the only thing needed.

So why the intrinsic? I want to create the PTX string from an LLVM-IR optimizer pass that can be loaded into clang, dragonegg, opt, …
An LLVM-IR optimizer pass does not have access to the file system, and it cannot link to the LLVM back ends to directly create PTX; creating PTX in an optimizer pass would be an ugly hack. The cleaner solution is to store an LLVM-IR string in the host module and to mark it with the llvm.codegen() intrinsic. When the module is processed by the back end, the string is automatically translated to PTX. This requires no additional file writing, introduces no layering violations, and seems very simple.

I don’t see a better way to translate LLVM-IR to PTX. Do you still believe introducing file writing in an optimizer module is a good and portable solution?

Until any new infrastructure is implemented, I don’t see it being any worse a solution. Don’t get me wrong: I think the llvm.codegen() intrinsic is a fast way to get things up and running for the GSoC project, but I also agree with Dan and Evan that it’s not appropriate for LLVM mainline. There are just too many subtle details, and it really only handles the case of host code needing the device code as text assembly.

To support opt-level transforms, you could just embed the generated IR as text in the module, then invoke a separate tool to extract it into a separate module. The more I think about this, the more convinced I become that we could benefit from a module “container,” similar to a Mac fat/universal binary. Something like this probably wouldn’t be too hard to implement; the main problem I see is what llc would output, or perhaps that a single llc invocation would only process one module in the container.

Tobias Grosser <tobias@grosser.es> writes:

In terms of complexity: the only alternative proposals I have heard
of were making LLVM-IR multi-module aware or adding multi-module
support to all LLVM-IR tools.

That's simply not true. I outlined how you can accomplish your task
without any IR changes at all. IR changes are only necessary (probably)
if we want opt or some other tool to extract accelerator kernels. And
even then I'm not 100% sure we need IR changes.

                          -Dave

Tobias Grosser <tobias@grosser.es> writes:

So why the intrinsic? I want to create the PTX string from an LLVM-IR
optimizer pass, that should be loaded into clang, dragonegg, opt, ..

You want to codegen in the optimizer? I'm confused.

An LLVM-IR optimizer pass does not have access to the file system and
it can not link to the LLVM back ends to directly create PTX. Creating
PTX in an optimizer pass would be an ugly hack.

So you _don't_ want to codegen in the optimizer. Now I'm really
confused.

The cleaner solution is to store an LLVM-IR string in the host module
and to mark it with the llvm.codegen() intrinsic. When the module is
processed by the backend, the string is automatically translated to
PTX. This requires no additional file writing, introduces no layering
violations and seems to be very simple.

Why do you need to store IR in a string? It's already in the IR file or
you can put it into another file. All you need is an _external_ tool to
drive llc to process and codegen these multiple files (to multiple
targets) and then another tool to suck up the accelerator code into a
string in the host assembly file. Then you assemble into an object.

No IR changes and you end up with one object file. No changes to build
systems at all, it's all handled by a driver.

llvm.codegen is completely unnecessary.

                                 -Dave

Tobias Grosser <tobias@grosser.es> writes:

So why the intrinsic? I want to create the PTX string from an LLVM-IR
optimizer pass, that should be loaded into clang, dragonegg, opt, …

You want to codegen in the optimizer? I’m confused.

An LLVM-IR optimizer pass does not have access to the file system and
it can not link to the LLVM back ends to directly create PTX. Creating
PTX in an optimizer pass would be an ugly hack.

So you don’t want to codegen in the optimizer. Now I’m really
confused.

The device code IR would be generated in the optimization pass, and codegen’d when the host module is codegen’d.

The word “codegen” is overloaded here, as we’re talking about IR codegen during optimization, and device codegen during host codegen. Confusing, no? :-)

The cleaner solution is to store an LLVM-IR string in the host module
and to mark it with the llvm.codegen() intrinsic. When the module is
processed by the backend, the string is automatically translated to
PTX. This requires no additional file writing, introduces no layering
violations and seems to be very simple.

Why do you need to store IR in a string? It’s already in the IR file or
you can put it into another file. All you need is an external tool to
drive llc to process and codegen these multiple files (to multiple
targets) and then another tool to suck up the accelerator code into a
string in the host assembly file. Then you assemble into an object.

No IR changes and you end up with one object file. No changes to build
systems at all, it’s all handled by a driver.

llvm.codegen is completely unnecessary.

I believe the point Tobias is trying to make is that he wants to retain the ability to pipe modules between tools and not worry about the modules ever hitting disk, e.g.

opt -load GPUOptimizer.so -gpu-opt | llc -march=x86

where the module coming in to opt is just unoptimized host code, and the module coming out of opt has embedded GPU IR.

The llvm.codegen() intrinsic does solve this problem, but at the cost of too much ambiguity.