[RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

> If, as you say, building the Action graph for CUDA and OpenMP is complicated, I think we should fix that.

It occurs to me that perhaps all you want is to build up the Action
graph in a non-language-specific manner, and then pass that to e.g.
CUDA-specific code that will massage the Action graph into what it
wants.

I don't know if that would be an improvement over the current
situation -- there are a lot of edge cases -- but it might.

That's a possible approach; it could be a good way to organize it. However,
if you have two different programming models, those transformations would
happen in a given sequence, so the one that comes last has to be aware of
the programming model that was used for the first transformation. This
wouldn't be as clean as having the host actions (which are always the same
for a given file and options) and having all the job generation orbit
around them.

Let me study the problem of doing this with actions and see all the
possible implications.

So, in your opinion, should we create an action for each programming model,
or should we have a generic one?

We currently have generic Actions, like "CompileAction". I think those should
stay? BindArch and the like add a lot of complexity; maybe there's a way to
get rid of those by merging their information into the other Actions.

Does that answer your question? I'm afraid I may be misunderstanding.
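
For concreteness, the driver can already dump the Action graph it builds;
a plain host-only compile looks roughly like this (illustrative
-ccc-print-phases output, and the exact text varies by clang version):

  $ clang -ccc-print-phases foo.c
  0: input, "foo.c", c
  1: preprocessor, {0}, cpp-output
  2: compiler, {1}, ir
  3: backend, {2}, assembler
  4: assembler, {3}, object
  5: linker, {4}, image

The question in this thread is how the device-side actions, BindArch, and
friends get woven into a graph like that.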

I have some application that I've been compiling with clang, and I usually
just run "make". Now I read somewhere that a new release of clang has
support for CUDA, and I happen to have a nice loop that I could implement
with CUDA. So, I add a new file with the new implementation, then I run
"make"; it compiles, but when I run it, it crashes. The reason it crashes is
that I was using separate compilation, and now I need to change all my
makefile rules to forward a new kind of file that I may not even recognize.

Again, I do not think that we should make up new file formats and incorporate
them into clang so that people can use new compiler features without modifying
their makefiles.

I think it is far more important that low-level tools such as ld and objdump
continue to work on the files that the compiler outputs. That likely means
we'll have to output N separate files, one for the host and one for each device
arch.

But hey, this is just my opinion, and I'm a nobody here. No offense taken if
the community decides otherwise.

I haven’t disagreed with anything you’ve said yet. :)

-eric

Hi Justin, Eric,

Thanks again for your time to discuss this.

So, if we are to use a wrapper around the driver, you would be able to pack the outputs in whatever format. What about the inputs? We would need to add options to enable passing multiple inputs for the same compilation, right? Also, processing those inputs would require replicating a lot of what the driver already does in terms of checks on the inputs, don't you think?

Thanks again,
Samuel

> In your opinion, if we add support for a new programming model, say OpenMP, should we try to convert CudaAction into something more generic (say, DeviceAction), or is adding actions for each programming model the way to go?

Oh, I don't have a strong opinion. It may or may not make sense to
combine them, depending on whether OpenMP needs different arguments to
its actions than CUDA needs.

> How would generating two separate files help ld?

Presumably you could still link all the host object files together,
and, if you have a linker for your device target, you could also link
those.

Ok, you could link each component, but then you couldn't do anything,
because the device side only works if you have that specific host code,
allocating the data and invoking the kernel. Unless you compile CUDA code,
disregard the host code, and use only the device code with some unrelated
host object, which seems a rather twisted use case.

> So, if we are to use a wrapper around the driver, you would be able to pack the outputs in whatever format. What about the inputs? We would need to add options to enable passing multiple inputs for the same compilation, right?

I see inputs as a completely different question from outputs.

In CUDA, a single input file contains both host and device code. I
presume the same is true for OpenMP? If for some reason you need to pass
in multiple input files to a single compilation (setting aside the
question of whether or not this is a good requirement to have -- it
seems like a big departure from how C++ compilation normally works),
you can just pass multiple inputs to clang. Certainly we shouldn't
expect users to bundle up multiple input files using some external
tool just to pass them to the driver?

Maybe I'm missing something here again, sorry.

Yes, for OpenMP it is the same. The problem is not when the input is source,
but when we do separate compilation. I know the current CUDA implementation
in clang doesn't support it, but let's assume I would like to build
something on top of the current implementation to make it work.

I have a.cu and b.cu, both with a CUDA kernel. Now, b.cu has a
device function that is also used in a.cu.

If I use NVCC I could pass all the sources in a single invocation (case A),
but I also could do (case B):

B: nvcc a.cu -rdc=true -c
B: nvcc b.cu -rdc=true -c
B: nvcc -rdc=true a.o b.o -o a.out

(nvcc incorporates the device code in the *.o; then at link time it extracts
it, links it, and embeds the result in the host binary)

Wouldn't it be desirable to have clang support case B as well? I don't have
statistics, but I suspect that most applications use B; I think it is not
common for users to pass all the source files at once to the compiler. Maybe
in CUDA you find several A's (kernels are explicitly outlined, so users
cared about organizing the code differently), but for OpenMP, B is going to
be the majority.

Thanks,
Samuel

Ok, you could link each component, but then you couldn't do anything, because the device side only works if you have that specific host code, allocating the data and invoking the kernel.

Sure, you'd have to do something after linking to group everything
together. Like, more reasonably, you could link together all the
device object files, and then link together the host object files plus
the one device blob using a tool which understands this blob.

Or you could just pass all of the object files to a tool which
understands the difference between host and device object files and
will DTRT.
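
For the CUDA case, this two-step link is roughly what nvcc already does
under -rdc: its device linker, nvlink, combines the relocatable device
objects into one image, and the host link then embeds that image. As a
simplified sketch, with most flags omitted:

  $ nvlink --arch=sm_35 a-device.o b-device.o -lcudadevrt -o dlink.sm_35.cubin
  $ # ...then link the host objects together with that embedded device image.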

B: nvcc -rdc=true a.o b.o -o a.out
Wouldn't it be desirable to have clang support case B as well?

Sure, yes. It's maybe worth elaborating on how we support case A
today. We compile the .cu file for device once for each device
architecture, generating N .s files and N corresponding .o files.
(The .s files are assembled by a black-box tool from nvidia.) We then
feed both the .s and .o files to another tool from nvidia, which makes
one "fat binary". We finally incorporate the fatbin into the host
object file while compiling.
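
In command form, that per-architecture pipeline looks roughly like the
following (a heavily simplified sketch; the real cc1 invocations carry many
more flags, and the sm_35/compute_35 names are just an example):

  $ clang -cc1 -triple nvptx64-nvidia-cuda -target-cpu sm_35 ... a.cu -o a-sm_35.s
  $ ptxas -arch=sm_35 a-sm_35.s -o a-sm_35.o
  $ fatbinary --create=a.fatbin --image=profile=sm_35,file=a-sm_35.o \
              --image=profile=compute_35,file=a-sm_35.s
  $ clang -cc1 -triple x86_64-unknown-linux-gnu ... \
          -fcuda-include-gpubinary a.fatbin a.cu -o a.o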

Which sounds a lot like what I was telling you I didn't want to do, I
know. :) But the reason I think it's different is that there exists
a widely-adopted one-object-file format for cuda/nvptx. So if you do
the above in the right way, which we do, all of nvidia's binary tools
(objdump, etc) just work. Moreover, there are no real alternative
tools to break by this scheme -- the ISA is proprietary, and nobody
has bothered to write such a tool, to my knowledge. If they did, I
suspect they'd make it compatible with nvidia's (and thus our) format.

Since we already have this format and it's well-supported by tools
etc., we'd probably want clang to support unbundling the CUDA code
at link time, just like nvcc.

Anyway, back to your question, where we're dealing with an ISA which
does not have a well-established bundling format. In this case, I
don't think it would be unreasonable to support

  clang a-host.o a-device.o b-host.o b-device.o -o a.out

clang could presumably figure out the architecture of each file either
from its name, from some sort of -x params, or by inspecting the file
-- all three would have good precedent.
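
For instance, either of the following could work (the -x spelling and the
file-naming convention below are purely hypothetical, just to make the idea
concrete):

  $ clang a-host.o b-host.o -x nvptx64-object a-device.o b-device.o -o a.out
  $ clang a-host.o a-device.nvptx64.o b-host.o b-device.nvptx64.o -o a.out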

The only issue is whether or not this should instead look like

  clang a.tar b.tar -o a.out

The functionality is exactly the same.

If we use tar or invent a new format, we don't necessarily have to
change build systems. But we've either opened a new can of worms by
adding a rather more expressive file format than we want into clang
(tar is the obvious choice, but it's not a great fit: no random
access, no custom metadata, lots of edge cases to handle as errors,
etc.), or we've made up a new file format with all the problems we've
discussed.

-Justin

From: "Justin Lebar via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Samuel F Antao" <sfantao@us.ibm.com>
Cc: "Alexey Bataev" <a.bataev@hotmail.com>, "C Bergström via cfe-dev" <cfe-dev@lists.llvm.org>, "John McCall"
<rjmccall@gmail.com>
Sent: Saturday, March 5, 2016 11:18:54 AM
Subject: Re: [cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

> Ok, you could link each component, but then you couldn't do
> anything, because the device side only works if you have that
> specific host code, allocating the data and invoking the kernel.

Sure, you'd have to do something after linking to group everything
together. Like, more reasonably, you could link together all the
device object files, and then link together the host object files
plus
the one device blob using a tool which understands this blob.

Or you could just pass all of the object files to a tool which
understands the difference between host and device object files and
will DTRT.

> B: nvcc -rdc=true a.o b.o -o a.out
> Wouldn't be desirable to have clang supporting case B as well?

Sure, yes. It's maybe worth elaborating on how we support case A
today. We compile the .cu file for device once for each device
architecture, generating N .s files and N corresponding .o files.
(The .s files are assembled by a black-box tool from nvidia.) We
then
feed both the .s and .o files to another tool from nvidia, which
makes
one "fat binary". We finally incorporate the fatbin into the host
object file while compiling.

Which sounds a lot like what I was telling you I didn't want to do, I
know. :slight_smile: But the reason I think it's different is that there exists
a widely-adopted one-object-file format for cuda/nvptx. So if you do
the above in the right way, which we do, all of nvidia's binary tools
(objdump, etc) just work. Moreover, there are no real alternative
tools to break by this scheme -- the ISA is proprietary, and nobody
has bothered to write such a tool, to my knowledge. If they did, I
suspect they'd make it compatible with nvidia's (and thus our)
format.

Since we already have this format and it's well-supported by tools
etc, we'd probably want to support in clang unbundling the CUDA code
at linktime, just like nvcc.

Anyway, back to your question, where we're dealing with an ISA which
does not have a well-established bundling format. In this case, I
don't think it would be unreasonable to support

  clang a-host.o a-device.o b-host.o b-device.o -o a.out

clang could presumably figure out the architecture of each file
either
from its name, from some sort of -x params, or by inspecting the file
-- all three would have good precedent.

The only issue is whether or not this should instead look like

  clang a.tar b.tar -o a.out

The functionality is exactly the same.

If we use tar or invent a new format, we don't necessarily have to
change build systems. But we've either opened a new can of worms by
adding a rather more expressive than we want file format into clang
(tar is the obvious choice, but it's not a great fit; no random
access, no custom metadata, lots of edge cases to handle as errors,
etc), or we've made up a new file format with all the problems we've
discussed.

Many of the projects that will use this feature are very large, with highly non-trivial build systems. Requiring significant changes (beyond the normal changes to compiler paths and flags) in order to use OpenMP (including with accelerator support) should be avoided wherever possible. This is much more important than ensuring tool compatibility with other compilers (although accomplishing both goals simultaneously seems even better still). Based on my reading of this thread, it seems like we have several options (these come to mind):

1. Use multiple object files. Many build-system changes can be avoided by using some scheme for guessing the name of the device object files from that of the host object file. This often won't work, however, because many build systems copy object files around, add them to static archives, etc., and the device object files would be missed in these operations.

2. Use some kind of bundling format. tar, zip, ar, etc. seem like workable options. Any user who runs 'file' on them will easily guess how to extract the data. objdump, etc., however, won't know how to handle these directly (which can also have build-system implications, although more rare than for (1)).

3. Treat the input/output object file name as a directory, and store in that directory the host and device object files. This might be effectively transparent, but also suffers from potential build-system problems (rm -f won't work, for example).

4. Store the device code in special sections of the host object file. This seems the most build-system friendly, although perhaps the most complicated to implement on our end. Also, as has been pointed out, this is the technique nvcc uses.
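
As a very rough sketch of what (4) could look like with stock binutils (the section name below is made up purely for illustration; a real implementation would also need to settle metadata, debug info, and linker behavior):

  $ objcopy --add-section .offload.nvptx64-sm_35=a-device.o a-host.o a-fat.o
  $ objcopy -O binary --only-section=.offload.nvptx64-sm_35 a-fat.o a-device.o

The first command embeds the device object as an extra (non-loaded) section of the host object; the second extracts it again at link time.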

All things considered, I think that I'd prefer (4). If we're picking an option to minimize build-system changes, which I fully support, picking the option with the smallest chance of incompatibilities seems optimal. There is also other (prior) art here, and we should find out how GCC is handling this in GCC 6 for OpenACC and/or OpenMP 4 (OpenACC - GCC Wiki). Also, we can check on PGI and/or Pathscale (for OpenACC, OpenHMPP, etc.), in addition to any relevant details of what nvcc does here.

Thanks again,
Hal

Phone reply so some formatting may get messed up.

Option 4.b
Internally we (PathScale) use unified objects with symbol name mangling for device sections. The only complication to this is that the assembler/runtime needs to know what to do when hitting these sections. I personally really don't like stuffing all the device code in a data section. It seems very hacky, but I'd agree it's way more friendly than multiple objects. In my perfect world the unoptimized, optimized, and even offload versions are just name mangling (something not too far from how glibc handles optimized versions of a function).

The way llvm would handle sse4 symbol vs avx512 symbol overlaps with this quite a bit.
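
To make that concrete: in such a unified object the same function would
simply show up more than once under target-mangled names, e.g. (the
mangling scheme below is invented purely for illustration):

  $ nm a.o
  0000000000000000 T saxpy
  0000000000000040 T saxpy.target.nvptx64_sm_35

The toolchain and runtime/loader would then pick the right definition by
name, much as glibc picks an optimized variant of a function.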

Requiring significant changes (beyond the normal changes to compiler paths and flags) in order to use OpenMP (including with accelerator support) should be avoided wherever possible.

One of the reasons I'm not convinced we should rule out creating
multiple object files is that if modifying your build system to
support this is hard, it's trivial to create a wrapper script to tar
and untar your object files:

clang-wrapper -c foo.cpp -fflags -o foo.tar
# Creates foo.tar containing foo.o, foo-sm_35.o, foo-compute_35.s.
clang-wrapper -link foo.tar bar.tar
# Untars foo.tar, bar.tar, and runs
# clang foo.o foo-sm_35.o foo-compute_35.s ...

We're talking ~100 lines of Python here, which would represent a tiny
amount of complexity atop an already highly complex build system.

If for some reason using tar isn't an option, one could write a
wrapper which basically makes a tar out of the object file, shoving
all of the non-host code into special sections of the object file,
as you've suggested. This shouldn't be substantially more complex
than creating a tar, and I think we agree that this would be very
unlikely to cause problems with a build system.

I'm not arguing here that such a wrapper is desirable, just that it's
possible and not particularly complex. This, I think, expands the
universe of possibilities available for our consideration on the
compiler side. I'd also like to have something which requires minimal
build system changes and is compatible with existing tools, even if my
priorities are inverted from yours. :)

(FWIW I think the main arguments against such a wrapper are probably
its performance impact, and perhaps that if this is something everyone
is going to use, we should just build it in by default.)

I agree that the next step should be to look at prior art. It seems
to me that we don't need to solve the general problem of multiarch
compilation here -- we just need a solution for the architectures we
care about now and in the near future. We already have an NVPTX
solution that I think is acceptable to everyone? So what other
architectures do we need to look at, and what do existing compilers
do?

Internally we (PathScale) use unified objects with symbol name mangling for device sections.

One of the most common complaints I hear about ARM is that it can
switch between the full ARM ISA and Thumb via a runtime switch. As a
result, objdump has a very difficult time figuring out how to
disassemble code that uses both ARM and Thumb.

This sounds like a path towards that Dark Side. Not quite as bad, and
maybe not as bad as stuffing everything in a data section, but still. :)

The way llvm would handle sse4 symbol vs avx512 symbol overlaps with this quite a bit.

Eh, sse4 and avx512 instructions are unambiguous, so that seems
totally sensible.

-Justin

The problem here is that OpenMP enables having devices of different kinds, so, along with GPU code we may now have PHI code, or code for an AMD GPU.

That's a good point as far as why we'd want a general solution.

Requiring each OpenMP (or CUDA, if separate compilation is used) user to have their own wrapper seems like a lot of wasted effort

We're not "requiring" anything, right? You also have the option of
fixing your build system. :) We're just saying, your build system is
fundamentally not the compiler's problem. But if you don't want to or
can't fix your build system, well, here's a hack.

Ultimately shoving data into object files using a made-up format is
also a hack. If we went with that approach, we'd be requiring anyone
who wants to use objdump to expend considerable effort in order to
unpack our hack. I am still very uncomfortable saying that, at the
compiler level, it's the right tradeoff to favor convenience for build
systems over usability of basic tools whose function is to handle the
output from compilers.

FatELF looks like someone's attempt to do this correctly, although it
appears it would not be appropriate to shove IR or assembly into one,
since, at least nominally, each entry must be an ELF file. There does
appear to be some support in binutils, though. At least note that although
FatELF attempts to be the Simplest Thing That Could Possibly Work, there is
considerably more complexity necessary here than we're intimating for our
own ad-hoc solutions.

Letting the driver output a tarball also doesn't seem like a horrible
compromise to me, although there are efficiency questions, and a lot
of edge cases around what happens if you get a weird tarball as input.
At least asking someone to extract a tarball isn't too unreasonable.

I should point out that there's an important issue of cross-compiler /
cross-linker compatibility here, as well.

If clang, icc, nvcc, gcc, and msvc all make up their own "fat object"
formats, we're going to have a heckuva time linking together files
from these different compilers. Given only Knud's description of what
Intel is currently doing, or given only my description of what nvcc
currently does, one could not create a compatible tool -- be that a
linker, objdump, or another compiler.

This is the danger of proprietary formats, which is why I've dug so
deeply into this particular hill. :)

#1 OMP allows multiple code generation, but doesn't *require* it. It
wouldn't be invalid code if you only generated for a single target at
a time - which imho isn't that unreasonable. Why?! It's unlikely that
a user is going to try to build a single binary that runs across
different supercomputers. It's not like what ANL is getting will be a
mix (today/near future) of PHI+GPU. It will be a PHI shop.. ORNL is a
GPU shop. The only burden is the user building the source twice (not
that hard and what they do today anyway)

#2 This proposed tarball hack/wrapper thingie is just yuck design
imho. I think there are better and more clean long term solutions

#3 re: "ARM ISA and Thumb via a runtime switch. As a result, objdump
has a very difficult time figuring out how to disassemble code that
uses both ARM and Thumb."

My proposed solution of prefixing/name-mangling the symbol to include
the "target" or optimization level solves this. It's almost exactly what
glibc does (I spent 15 minutes looking for this doc I've seen before,
but couldn't find it; if really needed I'll go dig in the glibc
sources for examples - the "doc" I'm looking for could be on the
loader side though)

In the meantime there's also this

https://sourceware.org/glibc/wiki/libmvec
"For x86_64 vector functions names are created based on #2.6. Vector
Function Name Mangling from Vector ABI"

https://sourceware.org/glibc/wiki/libmvec?action=AttachFile&do=view&target=VectorABI.txt

Which explicitly handles this case and explicitly mentions OMP.
"Vector Function ABI provides ABI for vector functions generated by
compiler supporting SIMD constructs of OpenMP 4.0 [1]."
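
For example, libmvec exports each vector variant of a function under a
mangled name that encodes the ISA class and vector length per that ABI
(the ISA letters b/c/d/e correspond roughly to SSE/AVX/AVX2/AVX-512):

  _ZGVbN2v_sin    # SSE, 2 lanes
  _ZGVcN4v_sin    # AVX, 4 lanes
  _ZGVdN4v_sin    # AVX2, 4 lanes
  _ZGVeN8v_sin    # AVX-512, 8 lanes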

it may also be worthwhile looking at existing solutions more closely
https://gcc.gnu.org/wiki/FunctionSpecificOpt

"The target attribute is used to specify that a function is to be
compiled with different target options than specified on the command
line. This can be used for instance to have functions compiled with a
different ISA"

From: "C Bergström via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Justin Lebar" <jlebar@google.com>
Cc: "Alexey Bataev" <a.bataev@hotmail.com>, "C Bergström via cfe-dev" <cfe-dev@lists.llvm.org>, "Samuel F Antao"
<sfantao@us.ibm.com>, "John McCall" <rjmccall@gmail.com>
Sent: Monday, March 7, 2016 6:46:57 PM
Subject: Re: [cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

> #1 OMP allows multiple code generation, but doesn't *require* it. It
> wouldn't be invalid code if you only generated for a single target at
> a time - which imho isn't that unreasonable. Why?! It's unlikely that
> a user is going to try to build a single binary that runs across
> different supercomputers. It's not like what ANL is getting will be a
> mix (today/near future) of PHI+GPU. It will be a PHI shop.. ORNL is a
> GPU shop. The only burden is the user building the source twice (not
> that hard and what they do today anyway)

I agree, but supercomputers are not the only relevant platforms. Lots of people have GPUs, and OpenMP offloading can also be used on various kinds of heterogeneous systems. I see no reason to design, at the driver level, for only a single target device type unless doing more is a significant implementation burden.

-Hal

The more I think about this, the more I'm convinced that having the
clang driver output a tar (or zip or whatever, we can bikeshed later)
makes a lot of sense, in terms of meeting all of our competing
requirements.

tar is trivially compatible with all existing tools. It's also
trivial to edit. Want to add or remove an object file from the
bundle? Go for it. Want to disassemble just one platform's code,
using a proprietary tool you don't control? No problem.

tar also preserves the single-file-out behavior of the compiler, so it
should be compatible with existing build systems with minimal changes.

There's also a very simple model for explaining how tars would work
when linking with the clang driver:

  $ clang a.tar b.tar

is exactly equivalent to extracting a.tar and b.tar and then doing

  $ clang a.file1 a.file2 b.file1 b.file2

tar has the nice property that it has an unambiguous ordering.

One final advantage of tar is that it's trivial to memory map portions
of the tarball, so extracting one arch's object file is a zero-cost
operation. And tars can be written incrementally, so if you don't do
your different architectures' compilations in parallel, each stage of
compilation could just write to the one tarball, letting you avoid a
copy.

If we instead bundle into a single object file, which seems to be the
other main proposal at the moment, we have to deal with a lot of
additional complexity:

* We'll have to come up with a scheme for all of the major object file
formats (ELF, MachO, whatever Windows uses)

* We'll basically be reinventing the wheel wrt a lot of the metadata
in an object file. For example, ELF specifies the target ISA in its
header. But our object files will contain multiple ISAs. Similarly,
we'll have to come up with a debug info scheme, and come up with a way
to point between different debuginfo sections, in potentially
different (potentially proprietary!) formats, at different code for
different archs. I'm no expert, but ELF really doesn't seem built for
this. (Thus the existence of the FatELF project.)

* Unless we choose to do nested object files (which seems horrible),
our multiarch object file is not going to contain within it N valid
object files -- the data for each arch is going to be spread out, or
there are going to be some headers which can only appear once, or
whatever. So you can't just mmap portions of our multiarch object
file to retrieve the bits you want, like you can with tar.

* We'll want everyone to agree to the specifics of this format.
That's a tall order no matter what we choose, but complexity will
stand in the way of getting something actually interoperable.

In addition, this scheme doesn't play nicely with existing tools -- we
will likely need to patch binutils rather extensively in order to get
it to play sanely with this format.

I don't think we need to force anyone to use tar. For the Intel phi,
putting everything into one object file probably makes sense, because
it's all the same ISA. For NVPTX, we may want to continue having the
option of compiling into something which looks like nvcc's format.
(Although an extracted tarball would be compatible with all of
nvidia's tools, afaik.) Maybe we'll want to support other existing
formats as well, for compatibility. And we can trivially have a flag
that tells clang not to tar its output, if you're allergic.

-Justin

This isn't bikeshedding - it's either designed properly or it's a hack. You don't see other solutions doing this? Tar isn't supported on Windows, and zip isn't guaranteed on Linux (yes, it should be easy to install, but it's not a guaranteed default)

I can't understand why you're pushing so hard on a hack that nobody else is doing

  1. Store the device code in special sections of the host object file. This seems the most build-system friendly, although perhaps the most complicated to implement on our end. Also, as has been pointed out, this is the technique nvcc uses.

Storing the device code in special (non-loadable) sections of the host object (Hal's option 4) is what Intel's compiler has implemented and what we would like to support, for the following reasons:

  • We have single source files supporting offloading to devices. That should produce single fat objects that include the device objects as well.

  • Invocation of the linker will result in a single dynamic library or executable with the device binaries embedded.

  • It will make the use of device offload transparent to users, support separate compilation, and work with existing Makefiles.

  • It ensures the host and target object dependencies are easily maintained.

Makefiles create object files and may move them during the build process. These Makefiles will have to be changed to support separate device objects. Naming conventions could also be an issue for separate device objects.

  • Static libraries can be made up of fat objects as well. When the driver invokes the target linker it knows to look for device objects in the static libraries as well.

A static library provided to users is still only one library, even with the device code embedded.

  • With support for fat executables/dynamic libraries it should be fairly straightforward to make fat objects as well.

  • Customers we have worked with have provided the feedback to generate fat objects for ease of use.

  • This does leave an open issue of how assembly files and intermediate IR files are handled, and the naming conventions for these.

Thanks,

Knud

All,