From: "Justin Lebar via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Samuel F Antao" <sfantao@us.ibm.com>
Cc: "Alexey Bataev" <a.bataev@hotmail.com>, "C Bergström via cfe-dev" <cfe-dev@lists.llvm.org>, "John McCall"
<rjmccall@gmail.com>
Sent: Saturday, March 5, 2016 11:18:54 AM
Subject: Re: [cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver
> Ok, you could link each component, but then you couldn't do
> anything, because the device side only works if you have that
> specific host code, allocating the data and invoking the kernel.
Sure, you'd have to do something after linking to group everything
together. Like, more reasonably, you could link together all the
device object files, and then link together the host object files plus
the one device blob using a tool which understands this blob.

Or you could just pass all of the object files to a tool which
understands the difference between host and device object files and
will DTRT.
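
To make that concrete, a minimal sketch of the two-step flow might look
like this (the device linker name and the blob handling are hypothetical,
just to show the shape of the flow):

  # link all of the device object files into one device blob
  # ("device-ld" is a made-up name for whatever tool does this)
  device-ld a-device.o b-device.o -o devices.blob

  # link the host object files plus the blob with a tool that
  # understands the blob (here, a clang driver that knows about it)
  clang a-host.o b-host.o devices.blob -o a.out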
> B: nvcc -rdc=true a.o b.o -o a.out
> Wouldn't be desirable to have clang supporting case B as well?
Sure, yes. It's maybe worth elaborating on how we support case A
today. We compile the .cu file for device once for each device
architecture, generating N .s files and N corresponding .o files.
(The .s files are assembled by a black-box tool from nvidia.) We then
feed both the .s and .o files to another tool from nvidia, which makes
one "fat binary". We finally incorporate the fatbin into the host
object file while compiling.
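
Roughly, for a single architecture (sm_35 here) the pipeline looks like
the following; the flag spellings are abbreviated and should be treated
as illustrative rather than the precise cc1 invocations:

  # compile the .cu file for the device, producing PTX assembly
  clang -cc1 -triple nvptx64-nvidia-cuda -target-cpu sm_35 a.cu -o a-sm_35.s

  # assemble the PTX with nvidia's black-box assembler
  ptxas -arch=sm_35 a-sm_35.s -o a-sm_35.o

  # combine the .s and .o (for all architectures) into one fat binary
  fatbinary --create a.fatbin \
            --image=profile=compute_35,file=a-sm_35.s \
            --image=profile=sm_35,file=a-sm_35.o

  # incorporate the fatbin into the host object while compiling for host
  clang -cc1 -fcuda-include-gpubinary a.fatbin ... a.cu -o a.o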
Which sounds a lot like what I was telling you I didn't want to do, I
know. But the reason I think it's different is that there exists a
widely-adopted one-object-file format for cuda/nvptx. So if you do
the above in the right way, which we do, all of nvidia's binary tools
(objdump, etc.) just work. Moreover, there are no real alternative
tools for this scheme to break -- the ISA is proprietary, and nobody
has bothered to write such a tool, to my knowledge. If they did, I
suspect they'd make it compatible with nvidia's (and thus our) format.

Since we already have this format and it's well-supported by tools
etc., we'd probably want clang to support unbundling the CUDA code at
link time, just like nvcc does.
Anyway, back to your question, where we're dealing with an ISA that
does not have a well-established bundling format. In this case, I
don't think it would be unreasonable to support

clang a-host.o a-device.o b-host.o b-device.o -o a.out

clang could presumably figure out the architecture of each file either
from its name, from some sort of -x params, or by inspecting the file
-- all three would have good precedent.
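
For example, the explicit -x variant might look something like the
following; the kind names here are made up purely to illustrate the
shape of the interface, not existing clang options:

  clang -x host-object a-host.o -x nvptx-object a-device.o \
        -x host-object b-host.o -x nvptx-object b-device.o -o a.out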
The only issue is whether or not this should instead look like
clang a.tar b.tar -o a.out
The functionality is exactly the same.
If we use tar or invent a new format, we don't necessarily have to
change build systems. But we've either opened a new can of worms by
adding a file format that's rather more expressive than we want into
clang (tar is the obvious choice, but it's not a great fit: no random
access, no custom metadata, lots of edge cases to handle as errors,
etc.), or we've made up a new file format with all the problems we've
discussed.
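
For concreteness, the bundled flow might look something like this on
the compile and link side; the bundle layout and the offloading flag
spelling are assumptions here, not anything settled:

  # compile once, emitting a bundle that holds the host object plus
  # one device object per offload target
  clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -c a.c -o a.tar
  tar tf a.tar      # would list something like a-host.o, a-nvptx64.o

  # link by handing the bundles straight back to clang
  clang a.tar b.tar -o a.out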
Many of the projects that will use this feature are very large, with highly non-trivial build systems. Requiring significant changes (beyond the normal changes to compiler paths and flags) in order to use OpenMP (including with accelerator support) should be avoided wherever possible. This is much more important than ensuring tool compatibility with other compilers (although accomplishing both goals simultaneously seems even better still). Based on my reading of this thread, it seems we have several options (these come to mind):
1. Use multiple object files. Many build-system changes can be avoided by using some scheme for guessing the name of the device object files from that of the host object file. This often won't work, however, because many build systems copy object files around, add them to static archives, etc., and the device object files would be missed in these operations.
2. Use some kind of bundling format. tar, zip, ar, etc. seem like workable options. Any user who runs 'file' on them will easily guess how to extract the data. objdump, etc., however, won't know how to handle these directly (which can also have build-system implications, although more rarely than for (1)).
3. Treat the input/output object file name as a directory, and store in that directory the host and device object files. This might be effectively transparent, but also suffers from potential build-system problems (rm -f won't work, for example).
4. Store the device code in special sections of the host object file. This seems the most build-system friendly, although perhaps the most complicated to implement on our end. Also, as has been pointed out, this is the technique nvcc uses. (A rough sketch of what this could look like follows below.)
All things considered, I think that I'd prefer (4). If we're picking an option to minimize build-system changes, which I fully support, picking the option with the smallest chance of incompatibilities seems optimal. There is also other (prior) art here, and we should find out how GCC is handling this in GCC 6 for OpenACC and/or OpenMP 4 (OpenACC - GCC Wiki). Also, we can check on PGI and/or Pathscale (for OpenACC, OpenHMPP, etc.), in addition to any relevant details of what nvcc does here.
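
To make (4) concrete: the effect is roughly what one could get today
with objcopy, modulo the section naming and the handling of multiple
device architectures (the section name below is purely illustrative):

  # embed the device image in a special section of the host object;
  # everything downstream (ar, rm, objdump, the linker) still sees an
  # ordinary host object file
  objcopy --add-section .omp_offload.image=a-device.img a-host.o a.o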
Thanks again,
Hal