[RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Hi all,

I’d like to propose a change in the Driver implementation to support programming models that require offloading with a unified infrastructure. The goal is to have a design general enough to cover different programming models with as little programming-model-specific customization as possible. Some of this discussion already took place in http://reviews.llvm.org/D9888, but I would like to continue it here on the mailing list and try to collect as much feedback as possible.

Currently, there are two programming models supported by clang that require offloading - CUDA and OpenMP. Examples of other offloading models that could benefit from a unified driver design as they become supported in clang are SYCL (https://www.khronos.org/sycl) and OpenACC (http://www.openacc.org/). Therefore, I’ll try to keep the discussion as general as possible, but will occasionally provide examples of how it applies to CUDA and OpenMP, given that these are what people may care about more immediately.

I hope I covered all the possible implications of a general offloading implementation. Let me know if you think something is missing that should also be covered, along with your suggestions and concerns. Any feedback is very much welcome!

Thanks!

Samuel

OpenMP (Host IR has to be read by the device to determine which
declarations have to be emitted and the device binary is embedded in the
host binary at link phase through a proper linker script):

Src -> Host PP -> A

A -> HostCompile -> B

A,B -> DeviceCompile -> C

I think even for some OpenMP targets it might be better to allow using the
device preprocessor (and corresponding headers) due to target-specific
macros. For example, some function may have a target-specific version and a
host version, guarded by #ifdef's on the CPU type, which allows completely
different implementations on the host and on the target.

C -> DeviceAssembler -> D

E -> DeviceLinker -> F

I suspect that you meant D instead of E here, didn't you?
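The dataflow above can be sketched as a small dependency graph; the following is a purely illustrative Python model (step names are the hypothetical labels from the thread, and it assumes E was indeed meant to be D):

```python
# Hypothetical model of the OpenMP offload pipeline discussed above,
# with the device linker consuming D (the assembled device object),
# on the assumption that "E" was a typo for "D".
pipeline = {
    "A": ("HostPP", ["Src"]),
    "B": ("HostCompile", ["A"]),
    "C": ("DeviceCompile", ["A", "B"]),   # device compile also reads host IR
    "D": ("DeviceAssembler", ["C"]),
    "F": ("DeviceLinker", ["D"]),
    "G": ("HostAssembler", ["B"]),
    "Out": ("HostLinker", ["G", "F"]),
}

def build_order(graph, sources):
    """Return a topological order in which each step's inputs are ready."""
    done, order = set(sources), []
    pending = dict(graph)
    while pending:
        ready = [s for s, (_, ins) in pending.items()
                 if all(i in done for i in ins)]
        if not ready:
            raise ValueError("cycle or missing input")
        for s in ready:
            order.append(s)
            done.add(s)
            del pending[s]
    return order

order = build_order(pipeline, {"Src"})
```

With the E→D fix, every step's inputs are produced before they are consumed and the host link comes last.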

Hi Samuel,

Thanks for your work, I would really much like to see OpenMP offloading implemented in clang!

I can’t judge the necessary changes to the driver infrastructure but will rather focus on an end-user view…


Hi Samuel!

    > I’d like to propose a change in the Driver implementation
    > to support programming models that require offloading with a
    > unified infrastructure. The goal is to have a design that
    > is general enough to cover different programming models with
    > as little as possible customization that is
    > programming-model specific. Some of this discussion already
    > took place in http://reviews.llvm.org/D9888 but would like
    > to continue that here in he mailing list and try to collect
    > as much feedback as possible.

    > Currently, there are two programming models supported by
    > clang that require offloading - CUDA and OpenMP. Examples of
    > other offloading models that can could benefit of a unified
    > driver design as they become supported in clang are also
    > SYCL (https://www.khronos.org/sycl) and OpenACC
    > (http://www.openacc.org/).

Great proposal!

Very à propos, since I am just thinking about implementing it with Clang
in my SYCL implementation (see
https://github.com/amd/triSYCL#possible-futures for the ways I am
considering).

    > OpenMP (Host IR has to be read by the device to determine
    > which declarations have to be emitted and the device binary
    > is embedded in the host binary at link phase through a
    > proper linker script):

    > Src -> Host PP -> A

    > A -> HostCompile -> B

    > A,B -> DeviceCompile -> C

    > C -> DeviceAssembler -> D

    > E -> DeviceLinker -> F

    > B -> HostAssembler -> G

    > G,F -> HostLinker -> Out

In SYCL it would be pretty close. Something like:

Src -> Host PP -> A

A -> HostCompile -> B

B -> HostAssembler -> C

Src -> Device PP -> D

D -> DeviceCompile -> E

E -> DeviceAssembler -> F

F -> DeviceLinker -> G

C,G -> HostLinker -> Out

    > As a hypothetical example, let's assume we wanted to compile
    > code that uses both CUDA for a nvptx64 device, OpenMP for an
    > x86_64 device, and a powerpc64le host, one could invoke the
    > driver as:

    > clang -target powerpc64le-ibm-linux-gnu <more host options>

    > -target-offload=nvptx64-nvidia-cuda -fcuda -mcpu sm_35 <more
    > options for the nvptx toolchain>

    > -target-offload=x86_64-pc-linux-gnu -fopenmp <more options
    > for the x86_64 toolchain>

Just to be sure I understand: you are thinking about being able to
outline several "languages" at once, such as CUDA *and* OpenMP, right?

I think it is required for serious applications. For example, in the HPC
world, it is common to have hybrid multi-node heterogeneous applications
that use MPI+OpenMP+OpenCL for example. Since MPI and OpenCL are just
libraries, there is only OpenMP to off-load here. But if we move to
OpenCL SYCL instead with MPI+OpenMP+SYCL then both OpenMP and SYCL have
to be managed by the Clang off-loading infrastructure at the same time
and be sure they combine gracefully...

I think your second proposal about (un)bundling can already manage this.

Otherwise, what about the code outlining itself used in the off-loading
process? The code generation requires outlining the kernel code into
external functions to be compiled by the kernel compiler. Do you think it
is up to the programmer to re-use the recipes used by OpenMP and CUDA, for
example, or would it be interesting to have a third proposal that
abstracts the outliner further, making it configurable to handle OpenMP,
CUDA, SYCL... globally?

Thanks a lot,

Some very good points above and back to my broken record..

If all offloading is done in a single unified library -
a. Lowering in LLVM is greatly simplified since there's ***1***
offload API to be supported.
A region that's outlined for SYCL, CUDA or something else is
essentially the same thing. (I do realize that some transformation may
be highly target specific, but to me that's more target hw driven than
programming model driven)

b. Mixing CUDA/OMP/ACC/Foo in theory may "just work" since the same
runtime will handle them all. (With the limitation that if you want
CUDA to *talk to* OMP or something else there needs to be some glue;
I'm merely saying that 1 application can use multiple models in a way
that won't conflict.)

c. The driver doesn't need to figure out whether to link against some or a
multitude of combining/conflicting libcuda, libomp, libsomething -
it's liboffload - done.

The driver proposal and the liboffload proposal should imnsho be
tightly coupled and work together as *1*. The goals are significantly
overlapping and relevant. If you get the liboffload OMP people to make
that more agnostic - I think it simplifies the driver work.

    >> Just to be sure to understand: you are thinking about being able

    >> to outline several "languages" at once, such as CUDA *and*
    >> OpenMP, right ?
    >>
    >> I think it is required for serious applications. For example, in
    >> the HPC world, it is common to have hybrid multi-node
    >> heterogeneous applications that use MPI+OpenMP+OpenCL for
    >> example. Since MPI and OpenCL are just libraries, there is only
    >> OpenMP to off-load here. But if we move to OpenCL SYCL instead
    >> with MPI+OpenMP+SYCL then both OpenMP and SYCL have to be managed
    >> by the Clang off-loading infrastructure at the same time and be
    >> sure they combine gracefully...
    >>
    >> I think your second proposal about (un)bundling can already
    >> manage this.
    >>
    >> Otherwise, what about the code outlining itself used in the
    >> off-loading process? The code generation itself requires to
    >> outline the kernel code to some external functions to be compiled
    >> by the kernel compiler. Do you think it is up to the programmer
    >> to re-use the recipes used by OpenMP and CUDA for example or it
    >> would be interesting to have a third proposal to abstract more
    >> the outliner to be configurable to handle globally OpenMP, CUDA,
    >> SYCL...?

    > Some very good points above and back to my broken record..

    > If all offloading is done in a single unified library -
    > a. Lowering in LLVM is greatly simplified since there's ***1***
    > offload API to be supported A region that's outlined for SYCL,
    > CUDA or something else is essentially the same thing. (I do
    > realize that some transformation may be highly target specific,
    > but to me that's more target hw driven than programming model
    > driven)

    > b. Mixing CUDA/OMP/ACC/Foo in theory may "just work" since the
    > same runtime will handle them all. (With the limitation that if
    > you want CUDA to *talk to* OMP or something else there needs to
    > be some glue. I'm merely saying that 1 application with multiple
    > models in a way that won't conflict)

    > c. The driver doesn't need to figure out do I link against some
    > or a multitude of combining/conflicting libcuda, libomp,
    > libsomething - it's liboffload - done

Yes, a unified target library would help.

    > The driver proposal and the liboffload proposal should imnsho be
    > tightly coupled and work together as *1*. The goals are
    > significantly overlapping and relevant. If you get the liboffload
    > OMP people to make that more agnostic - I think it simplifies the
    > driver work.

So basically it is about introducing a fourth unification: liboffload.

A grand unification sounds great.
My only concern is that if we tie everything together, it would increase
the entry cost: all the different components would have to be ready in
lock-step.
If there is already a runtime available, it would be easier to start
with and develop the other part in the meantime.
So from a pragmatic agile point-of-view, I would prefer not to impose a
strong unification.
In the proposal of Samuel, all the parts seem independent.

    > ------ More specific to this proposal - device
    > linker vs host linker. What do you do for IPA/LTO or whole
    > program optimizations? (Outside the scope of this project.. ?)

Ouch. I did not think about it. It sounds like science-fiction for
now. :-) Probably outside the scope of this project..

Are you thinking of having LTO separately on each side independently,
host + target? Of course having LTO on host and target at the same time
seems trickier... :-) But I can see here a use case for "constant
specialization" available in SPIR-V, if we can have some simple host
LTO knowledge about constant values flowing down into device IR.

For non link-time IPA, I think it is simpler since I guess the
programming models envisioned here are all single source, so we can
apply most of the IPA *before* outlining I hope. But perhaps wild
preprocessor differences for host and device may cause havoc here?

Chris,

A unified offload library, as good as it might be to have one, is
completely orthogonal to Samuel's proposal.

He proposed unified driver support; it doesn't matter what offload
library the individual compiler components called by the driver are
targeting.

Yours,
Andrey

    >> Just to be sure to understand: you are thinking about being able
    >> to outline several "languages" at once, such as CUDA *and*
    >> OpenMP, right ?
    >>
    >> I think it is required for serious applications. For example, in
    >> the HPC world, it is common to have hybrid multi-node
    >> heterogeneous applications that use MPI+OpenMP+OpenCL for
    >> example. Since MPI and OpenCL are just libraries, there is only
    >> OpenMP to off-load here. But if we move to OpenCL SYCL instead
    >> with MPI+OpenMP+SYCL then both OpenMP and SYCL have to be managed
    >> by the Clang off-loading infrastructure at the same time and be
    >> sure they combine gracefully...
    >>
    >> I think your second proposal about (un)bundling can already
    >> manage this.
    >>
    >> Otherwise, what about the code outlining itself used in the
    >> off-loading process? The code generation itself requires to
    >> outline the kernel code to some external functions to be compiled
    >> by the kernel compiler. Do you think it is up to the programmer
    >> to re-use the recipes used by OpenMP and CUDA for example or it
    >> would be interesting to have a third proposal to abstract more
    >> the outliner to be configurable to handle globally OpenMP, CUDA,
    >> SYCL...?

    > Some very good points above and back to my broken record..

    > If all offloading is done in a single unified library -
    > a. Lowering in LLVM is greatly simplified since there's ***1***
    > offload API to be supported A region that's outlined for SYCL,
    > CUDA or something else is essentially the same thing. (I do
    > realize that some transformation may be highly target specific,
    > but to me that's more target hw driven than programming model
    > driven)

    > b. Mixing CUDA/OMP/ACC/Foo in theory may "just work" since the
    > same runtime will handle them all. (With the limitation that if
    > you want CUDA to *talk to* OMP or something else there needs to
    > be some glue. I'm merely saying that 1 application with multiple
    > models in a way that won't conflict)

    > c. The driver doesn't need to figure out do I link against some
    > or a multitude of combining/conflicting libcuda, libomp,
    > libsomething - it's liboffload - done

    > Yes, a unified target library would help.

    > The driver proposal and the liboffload proposal should imnsho be
    > tightly coupled and work together as *1*. The goals are
    > significantly overlapping and relevant. If you get the liboffload
    > OMP people to make that more agnostic - I think it simplifies the
    > driver work.

    > So basically it is about introducing a fourth unification: liboffload.

    > A great unification sounds great.
    > My only concern is that if we tie everything together, it would increase
    > the entry cost: all the different components should be ready in
    > lock-step.
    > If there is already a runtime available, it would be easier to start
    > with and develop the other part in the meantime.
    > So from a pragmatic agile point-of-view, I would prefer not to impose a
    > strong unification.

I think I may not be explaining clearly - let me elaborate by example a bit below.

    > In the proposal of Samuel, all the parts seem independent.

    > ------ More specific to this proposal - device
    > linker vs host linker. What do you do for IPA/LTO or whole
    > program optimizations? (Outside the scope of this project.. ?)

    > Ouch. I did not think about it. It sounds like science-fiction for
    > now. :-) Probably outside the scope of this project..

It should certainly not be science fiction or an after-thought. I
won't go into shameless self promotion, but there are certainly useful
things you can do when you have a "whole device kernel" perspective.

To digress into the liboffload component of this (sorry): what we have
today is basically liboffload/src/ with all source files mucked together.

What I'm proposing would look more like this

liboffload/src/common_middle_layer_glue  # to start this may be "best effort"
liboffload/src/omp      # this code should exist today, but ideally should build on top of the middle layer
liboffload/src/ptx      # this may exist today - not sure
liboffload/src/amd_gpu  # probably doesn't exist, but wouldn't/shouldn't block anything
liboffload/src/phi      # may exist in some form
liboffload/src/cuda     # may exist in some form outside of the OMP work

The end result would be liboffload.

The layers above and below the common middle-layer API are
programming-model or hardware specific. To add a new hw backend you just implement the
things the middle layer needs. To add a new programming model you
build on top of the common layer. I'm not trying to force
anyone/everyone to switch to this now - I'm hoping that by being a
squeaky wheel this isolation of design and layers is there from the
start - even if not perfect. I think it's sloppy to not consider this
actually. LLVM's code generation is clean and has a nice separation
per target (for the most part) - why should the offload library have
bad design which just needs to be refactored later. I've seen others
in the community beat up Intel to force them to have higher quality
code before inclusion... some of this may actually be just minor
refactoring to come close to the target. (No pun intended)

Hi Ronan,

Thanks for the feedback!

    >> Just to be sure to understand: you are thinking about being able
    >> to outline several "languages" at once, such as CUDA *and*
    >> OpenMP, right ?
    >>
    >> I think it is required for serious applications. For example, in
    >> the HPC world, it is common to have hybrid multi-node
    >> heterogeneous applications that use MPI+OpenMP+OpenCL for
    >> example. Since MPI and OpenCL are just libraries, there is only
    >> OpenMP to off-load here. But if we move to OpenCL SYCL instead
    >> with MPI+OpenMP+SYCL then both OpenMP and SYCL have to be managed
    >> by the Clang off-loading infrastructure at the same time and be
    >> sure they combine gracefully...
    >>
    >> I think your second proposal about (un)bundling can already
    >> manage this.
    >>
    >> Otherwise, what about the code outlining itself used in the
    >> off-loading process? The code generation itself requires to
    >> outline the kernel code to some external functions to be compiled
    >> by the kernel compiler. Do you think it is up to the programmer
    >> to re-use the recipes used by OpenMP and CUDA for example or it
    >> would be interesting to have a third proposal to abstract more
    >> the outliner to be configurable to handle globally OpenMP, CUDA,
    >> SYCL...?

    > Some very good points above and back to my broken record..

    > If all offloading is done in a single unified library -
    > a. Lowering in LLVM is greatly simplified since there's ***1***
    > offload API to be supported A region that's outlined for SYCL,
    > CUDA or something else is essentially the same thing. (I do
    > realize that some transformation may be highly target specific,
    > but to me that's more target hw driven than programming model
    > driven)

    > b. Mixing CUDA/OMP/ACC/Foo in theory may "just work" since the
    > same runtime will handle them all. (With the limitation that if
    > you want CUDA to *talk to* OMP or something else there needs to
    > be some glue. I'm merely saying that 1 application with multiple
    > models in a way that won't conflict)

    > c. The driver doesn't need to figure out do I link against some
    > or a multitude of combining/conflicting libcuda, libomp,
    > libsomething - it's liboffload - done

    > Yes, a unified target library would help.

    > The driver proposal and the liboffload proposal should imnsho be
    > tightly coupled and work together as *1*. The goals are
    > significantly overlapping and relevant. If you get the liboffload
    > OMP people to make that more agnostic - I think it simplifies the
    > driver work.

    > So basically it is about introducing a fourth unification: liboffload.

    > A great unification sounds great.
    > My only concern is that if we tie everything together, it would increase
    > the entry cost: all the different components should be ready in
    > lock-step.
    > If there is already a runtime available, it would be easier to start
    > with and develop the other part in the meantime.
    > So from a pragmatic agile point-of-view, I would prefer not to impose a
    > strong unification.
    > In the proposal of Samuel, all the parts seem independent.

I agree with Ronan. Having a unified library is a discussion that is worth
having, but the design of something like that has to be incremental. And in
order for that to happen, it has to start from what is already available
or about to be available upstream.

I don't think that the decisions about the library affect the driver much.
Specifying a different library in a given toolchain is only a one-line
change.

I'd rather have the library discussion on a different mailing list (maybe
OpenMP) because, as Andrey said, these (driver and library) are (and in my
opinion should be) two separate efforts.

    > ------ More specific to this proposal - device
    > linker vs host linker. What do you do for IPA/LTO or whole
    > program optimizations? (Outside the scope of this project.. ?)

    > Ouch. I did not think about it. It sounds like science-fiction for
    > now. :-) Probably outside the scope of this project..

    > Are you thinking of having LTO separately on each side independently,
    > host + target? Of course having LTO on host and target at the same time
    > seems trickier... :-) But I can see here a use case for "constant
    > specialization" available in SPIR-V, if we can have some simple host
    > LTO knowledge about constant values flowing down into device IR.

    > For non link-time IPA, I think it is simpler since I guess the
    > programming models envisioned here are all single source, so we can
    > apply most of the IPA *before* outlining I hope. But perhaps wild
    > preprocessor differences for host and device may cause havoc here?

LTO is coupled to the toolchain and to which plugins the linker supports. In
the OpenMP prototype implementation we have on github, we have LTO
enabled by the driver - it uses llvm-link to produce a single piece of
IR before calling the backend. We could do something similar, because
expecting a linker to have LTO plugins enabled for every possible device
seems unlikely. I guess that would be a separate proposal. :-)

I understand why we would like to have LTO before outlining (propagate info
from host to device code). However, in the way things currently are
(frontend does outlining) that would be hard. So, at least, I hope to have
clang providing the maximum information through attributes to outlined
kernels.

Thanks again,
Samuel

Hi Chris,

I agree with Andrey when he says this should be a separate discussion.

I think that aiming at a library that would support any possible programming model would take a long time, as it requires a lot of consensus, namely from those maintaining programming models already in clang (e.g. CUDA). We should try to have something incremental.

I’m happy to discuss and know more about the design and code you would like to contribute to this, but I think you should post it in a different thread.

Thanks,
Samuel

Hi, I'm one of the people working on CUDA in clang.

In general I agree that the support for CUDA today is rather ad-hoc; it can
likely be improved. However, there are many points in this proposal that I do
not understand. Inasmuch as I think I understand it, I am concerned that it's
adding new abstractions instead of fixing the existing ones, and that this
will result in a lot of additional complexity.

a) Create toolchains for host and offload devices before creating the actions.

The driver has to detect the employed programming models through the provided
options (e.g. -fcuda or -fopenmp) or file extensions. For each host and
offloading device and programming model, it should create a toolchain.

Seems sane to me.

b) Keep the generation of Actions independent of the program model.

In my view, the Actions should only depend on the compile phases requested by
the user and the file extensions of the input files. Only the way those
actions are interpreted to create jobs should be dependent on the programming
model. This would avoid complicating the actions creation with dependencies
that only make sense to some programming models, which would make the
implementation hard to scale when new programming models are to be adopted.

I don't quite understand what you're proposing here, or what you're trying to
accomplish with this change.

Perhaps it would help if you could give a concrete example of how this would
change e.g. CUDA or Mac universal binary compilation?

For example, in CUDA compilation, we have an action which says "compile
everything below here as cuda arch sm_35". sm_35 comes from a command-line
flag, so as I understand your proposal, this could not be in the action graph,
because it doesn't come from the filename or the compile phases requested by
the user. So, how will we express this notion that some actions should be
compiled for a particular arch?
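One conceivable shape for that (purely illustrative Python, not clang's actual classes) is an arch annotation on an action node that downstream actions inherit through their inputs:

```python
# Hypothetical sketch: an Action node that carries an offloading
# annotation (e.g. a CUDA GPU arch), so "compile everything below
# here as cuda arch sm_35" can live in the action graph itself.
class Action:
    def __init__(self, kind, inputs=(), offload_arch=None):
        self.kind = kind
        self.inputs = list(inputs)
        self.offload_arch = offload_arch

    def arch(self):
        """Effective arch: my own annotation, else inherited via my inputs."""
        if self.offload_arch:
            return self.offload_arch
        for inp in self.inputs:
            a = inp.arch()
            if a:
                return a
        return None

src = Action("input")
dev_compile = Action("compile", [src], offload_arch="sm_35")
dev_asm = Action("assemble", [dev_compile])   # inherits sm_35
```

The command-line flag then only determines where the annotation is placed, while the graph itself still records which subtree is device-side.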

c) Use unbundling and bundling tools agnostic of the programming model.

I propose a single change in the action creation, namely the creation of
“unbundling” and "bundling” actions whose goal is to prevent the user from
having to deal with multiple files generated from multiple toolchains (the
host toolchain and the offloading devices’ toolchains) when separate
compilation is used in the build system.

I'm not sure I understand what "separate compilation" is here. Do you mean, a
compilation strategy which outputs logically separate machine code for each
architecture, only to have this code combined at link time? (In contrast to
how we currently compile CUDA, where the device code for a file is integrated
into the host code for that file at compile time?)

If that's right, then what I understand you're proposing here is that, instead
of outputting N different object files -- one for the host, and N-1 for all our
device architectures -- we'd just output one blob which clang would understand
how to handle.

For my part, I am highly wary of introducing a new file format into clang's
output. Historically, clang (along with other compilers) does not output
proprietary blobs. Instead, we output object files in well-understood,
interoperable formats, such as ELF. This is beneficial because there are lots
of existing tools which can handle these files. It also allows e.g. code
compiled with clang to be linked with g++.

Build tools are universally awful, and I sympathize with the urge not to change
them. But I don't think this is a business we want the compiler to be in.
Instead, if a user wants this kind of "fat object file", they could obtain one
by using a simple wrapper around clang. If this wrapper's output format became
widely-used, we could then consider supporting it directly within clang, but
that's a proposition for many years in the future.
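Such a wrapper can indeed be thin. A hypothetical sketch (names illustrative, not an existing tool) that uses a plain tar archive as the "fat object" container, so standard tooling can still unbundle it:

```python
# Hypothetical bundling wrapper external to the compiler: pack one
# object per target triple into a plain tar archive, and unpack it
# again. Any tar-aware tool can inspect the result.
import io
import tarfile

def bundle(objects):
    """objects: dict mapping target triple -> object bytes. Returns tar bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for triple, data in objects.items():
            info = tarfile.TarInfo(name=triple + ".o")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def unbundle(blob):
    """Inverse of bundle: recover the per-triple objects."""
    out = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        for member in tar.getmembers():
            out[member.name[:-2]] = tar.extractfile(member).read()
    return out

fat = bundle({"powerpc64le-ibm-linux-gnu": b"HOSTOBJ",
              "nvptx64-nvidia-cuda": b"DEVOBJ"})
```

The point being: nothing here needs to live inside the driver, and the container format is one that existing tools already understand.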

d) Allow the target toolchain to request the host toolchain to be used for a given action.

Seems sane to me.

e) Use a job results cache to enable sharing results between device and host toolchains.

I don't understand why we need a cache for job results. Why can we not set up
the Action graph such that each node has the correct inputs? (You've actually
sketched exactly what I think the Action graph should look like, for CUDA and
OpenMP compilations.)
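For concreteness, the cache proposed in (e) presumably amounts to something like the following hypothetical sketch, memoizing job results by (action, toolchain) so host and device toolchains can share them:

```python
# Hypothetical sketch of the proposed job-results cache: a result is
# built at most once per (action, toolchain) pair, so a device
# toolchain can reuse a result already produced for the host.
class JobResultCache:
    def __init__(self):
        self._results = {}
        self.builds = 0  # how many results were actually built

    def get_or_build(self, action, toolchain, build):
        key = (action, toolchain)
        if key not in self._results:
            self.builds += 1
            self._results[key] = build(action, toolchain)
        return self._results[key]

cache = JobResultCache()
build = lambda action, tc: f"{action}@{tc}.o"
a = cache.get_or_build("compile:foo.c", "host", build)
b = cache.get_or_build("compile:foo.c", "host", build)      # reused, not rebuilt
c = cache.get_or_build("compile:foo.c", "nvptx64", build)   # distinct result
```

Whether this sharing is better expressed as a cache or directly as edges in the Action graph is exactly the question raised above.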

f) Intercept the jobs creation before the emission of the command.

In my view this is the only change required in the driver (apart from the
obvious toolchain changes) that would be dependent on the programming model.
A job-result post-processing function could check whether there are
offloading toolchains to be used and spawn the job creation for those
toolchains, as well as append the results from one toolchain to the results
of another, according to the needs of the programming model implementation.

Again it's not clear to me why we cannot and should not represent this in the
Action graph. It's that graph that's supposed to tell us what we're going to
do.

g) Reflect the offloading programming model in the naming of the save-temps files.

We already do this somewhat; e.g. for CUDA with save-temps, we'll output foo.s
and foo-sm_35.s. Extending this to be more robust (e.g. including the triple)
seems fine.

h) Use special options -target-offload=<triple> to specify offloading targets and delimit options meant for a toolchain.

I think I agree that we should generalize the flags we're using.

I'm not sold on the name or structure (I'm not aware of any other flags that
affect *all* flags following them?), but we can bikeshed about that separately.
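For illustration, the delimiting semantics under discussion (every option after a `-target-offload=<triple>` applies to that toolchain, options before the first one to the host) could be sketched as:

```python
# Hypothetical sketch of argv partitioning by -target-offload=<triple>:
# options are grouped per toolchain, with everything before the first
# delimiter belonging to the host toolchain.
def partition_args(argv):
    groups = {"host": []}
    current = "host"
    for arg in argv:
        if arg.startswith("-target-offload="):
            current = arg.split("=", 1)[1]
            groups.setdefault(current, [])
        else:
            groups[current].append(arg)
    return groups

argv = ["-target", "powerpc64le-ibm-linux-gnu",
        "-target-offload=nvptx64-nvidia-cuda", "-fcuda", "-mcpu", "sm_35",
        "-target-offload=x86_64-pc-linux-gnu", "-fopenmp"]
groups = partition_args(argv)
```

This is what makes the flag unusual: it changes the interpretation of *all* subsequent options rather than standing alone.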

i) Use the offload kinds in the toolchain to drive the commands generation by Tools.

I'm not sure exactly what this means, but it doesn't sound
particularly contentious. :-)

3. We are willing to help with implementation of CUDA-specific parts when
they overlap with the common infrastructure; though we expect that effort to
be driven also by other contributors specifically interested in CUDA support
that have the necessary know-how (both on CUDA itself and how it is supported
in Clang / LLVM).

Given that this is work that doesn't really help CUDA (the driver works fine
for us as-is), I am not sure we'll be able to devote significant resources to
this project. Of course we'll be available to assist with relevant code
reviews and give advice.

I think like any other change to clang, the responsibility will rest on the
authors not to break existing functionality, at the very least inasmuch as is
checked by existing unit tests.

Regards,
-Justin

Hi Justin,

It’s great to have your feedback!

This has two objectives. One is to avoid the creation of actions that are programming-model specific. The other is to remove complexity from the action creation, which would otherwise have to mix compile phases with the DAG requirements of different programming models.

As I understand this, we're saying that we'll build up an action
graph, but it is sort of a lie, in that it does not encapsulate all of
the logic we're interested in. Then, when we convert the actions into
jobs, we'll postprocess them using language-specific logic to make the
jobs do what we want.

I am not in favor of this approach, as I understand it. Although I
acknowledge that it would simplify building the Action graph itself,
it does so by moving this complexity into a "shadow Action graph" --
the DAG that *actually* describes what we're going to do (which may
never be explicitly constructed, but still exists in our minds). I
don't think this is actually a simplification.

If, as you say, building the Action graph for CUDA and OpenMP is
complicated, I think we should fix that. Then we'll be able to
continue using our existing tools to e.g. inspect the Action graph
generated by the driver.

I see the driver already as a wrapper, so I don't think it is inappropriate to use it.

You and I, being compiler hackers, understand that the driver is a
wrapper. However, to a user, the driver is the compiler. No build
system invokes clang -cc1 directly.

However, I think the creation of the blob should be done by an external tool, say, as if it were a linker.

Sure, but this isn't the difference I was getting at. What I was
trying to say is that the creation of the blob should be done by a
tool which is external to the compiler *from the perspective of the
user*. Meaning that, the driver should not invoke this tool. If the
user wants it, they can invoke it explicitly (as they might use tar to
bundle their object files).

I'd put it this way: a bundled file should work as a normal host file, regardless of what device code it embeds.

OK, but this still makes all existing tools useless if I want to
inspect device code. If you give me a .o file and tell me that it's
device code, I can inspect it, disassemble it, or whatever using
existing tools. If it's a bundle in a file format we made up here on
this list, there's very little chance existing tools are going to let
me get the device code out in a sensible way.

Again, I don't think that inventing file formats -- however simple --
is a business that we should be getting into.

Even for ELF, I agree that putting the code in some section is more elegant. I'll investigate the possibilities of implementing that.

Maybe, but unless there's a way to annotate that section and say "this
section contains code for architecture foo", then objdump isn't going
to work sensibly on that section, and I think that's basically game
over.

On the other side, we have text files. My opinion is that we should have something that is easy to read and edit. What would a bundled text file look like, in your opinion?

Similarly, this will not interoperate with any existing tools, and I
think that's job zero.

If, as you say, building the Action graph for CUDA and OpenMP is complicated, I think we should fix that.

It occurs to me that perhaps all you want is to build up the Action
graph in a non-language-specific manner, and then pass that to e.g.
CUDA-specific code that will massage the Action graph into what it
wants.

I don't know if that would be an improvement over the current
situation -- there are a lot of edge cases -- but it might.

Hi Justin,