Hi, I'm one of the people working on CUDA in clang.
In general I agree that the support for CUDA today is rather ad-hoc; it can
likely be improved. However, there are many points in this proposal that I do
not understand. Inasmuch as I think I understand it, I am concerned that it's
adding new abstractions instead of fixing the existing ones, and that this
will result in a lot of additional complexity.
> a) Create toolchains for host and offload devices before creating the actions.
> The driver has to detect the employed programming models through the provided
> options (e.g. -fcuda or -fopenmp) or file extensions. For each host and
> offloading device and programming model, it should create a toolchain.
Seems sane to me.
> b) Keep the generation of Actions independent of the program model.
> In my view, the Actions should only depend on the compile phases requested by
> the user and the file extensions of the input files. Only the way those
> actions are interpreted to create jobs should be dependent on the programming
> model. This would avoid complicating the actions creation with dependencies
> that only make sense to some programming models, which would make the
> implementation hard to scale when new programming models are to be adopted.
I don't quite understand what you're proposing here, or what you're trying to
accomplish with this change.
Perhaps it would help if you could give a concrete example of how this would
change e.g. CUDA or Mac universal binary compilation?
For example, in CUDA compilation, we have an action which says "compile
everything below here as cuda arch sm_35". sm_35 comes from a command-line
flag, so as I understand your proposal, this could not be in the action graph,
because it doesn't come from the filename or the compile phases requested by
the user. So, how will we express this notion that some actions should be
compiled for a particular arch?
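To make the question concrete, here is a toy model of the kind of node I mean. It is only a sketch in Python for brevity: the Action class, the "bind-arch" kind and the jobs() helper are made up for illustration and are not clang's actual driver classes. The point is that the arch is attached to a node in the graph and applies to everything below it, even though it comes from a flag rather than from the filename or the requested phases.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Action:
    """Toy stand-in for a driver action (hypothetical, not clang's class)."""
    kind: str                                   # "input", "compile" or "bind-arch"
    inputs: List["Action"] = field(default_factory=list)
    arch: Optional[str] = None                  # only set on "bind-arch" nodes
    name: Optional[str] = None                  # only set on "input" nodes


def jobs(action: Action, arch: str = "host") -> List[str]:
    """List the jobs for a graph, inputs first.

    A "bind-arch" node emits no job of its own; it only changes the arch
    that everything below it is processed for.
    """
    if action.kind == "bind-arch":
        arch = action.arch
        return [j for child in action.inputs for j in jobs(child, arch)]
    below = [j for child in action.inputs for j in jobs(child, arch)]
    label = action.name if action.kind == "input" else action.kind
    return below + [f"{label} [{arch}]"]


src = Action("input", name="foo.cu")
# Arch taken from a command-line flag in this toy setup.
device = Action("bind-arch", [Action("compile", [src])], arch="sm_35")
host = Action("compile", [src])
```

In this sketch, jobs(device) labels every job in the subtree with sm_35, while jobs(host) labels the same input with the default arch. If Actions may depend only on phases and file extensions, it is not clear to me where a node like "bind-arch" would live.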
> c) Use unbundling and bundling tools agnostic of the programming model.
> I propose a single change in the action creation and that is the creation of
> a "unbundling" and "bundling" action whose goal is to prevent the user to
> have to deal with multiple files generated from multiple toolchains (host
> toolchain and offloading devices' toolchains) if he uses separate compilation
> in his build system.
I'm not sure I understand what "separate compilation" means here. Do you mean a
compilation strategy which outputs logically separate machine code for each
architecture, only to have this code combined at link time? (In contrast to
how we currently compile CUDA, where the device code for a file is integrated
into the host code for that file at compile time?)
If that's right, then what I understand you're proposing here is that, instead
of outputting N different object files -- one for the host, and N-1 for all our
device architectures -- we'd just output one blob which clang would understand
how to handle.
For my part, I am highly wary of introducing a new file format into clang's
output. Historically, clang (along with other compilers) does not output
proprietary blobs. Instead, we output object files in well-understood,
interoperable formats, such as ELF. This is beneficial because there are lots
of existing tools which can handle these files. It also allows e.g. code
compiled with clang to be linked with g++.
Build tools are universally awful, and I sympathize with the urge not to change
them. But I don't think this is a business we want the compiler to be in.
Instead, if a user wants this kind of "fat object file", they could obtain one
by using a simple wrapper around clang. If this wrapper's output format became
widely-used, we could then consider supporting it directly within clang, but
that's a proposition for many years in the future.
> d) Allow the target toolchain to request the host toolchain to be used for a given action.
Seems sane to me.
> e) Use a job results cache to enable sharing results between device and host toolchains.
I don't understand why we need a cache for job results. Why can we not set up
the Action graph such that each node has the correct inputs? (You've actually
sketched exactly what I think the Action graph should look like, for CUDA and
OpenMP compilations.)
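As a rough illustration of "each node has the correct inputs", here is a toy dependency graph, again in Python and again with made-up node names. It assumes, as in the current CUDA flow described above, that the host compile consumes the device result for the same file. Once that edge is in the graph, a plain topological walk already produces the device result before the host job that needs it.

```python
from graphlib import TopologicalSorter

# Each key is a node; its value is the set of nodes whose output it consumes.
# Node names are illustrative only and do not correspond to real driver actions.
graph = {
    "device-compile(sm_35)": {"foo.cu"},
    "device-assemble(sm_35)": {"device-compile(sm_35)"},
    # The host compile takes the device result as an ordinary input edge.
    "host-compile": {"foo.cu", "device-assemble(sm_35)"},
    "host-assemble": {"host-compile"},
}

# static_order() yields every node after all of the nodes it depends on.
order = list(TopologicalSorter(graph).static_order())
```

With the sharing expressed as an edge, the ordering falls out of the graph itself and no side table of job results is consulted in this sketch.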
> f) Intercept the jobs creation before the emission of the command.
> In my view this is the only change required in the driver (apart from the
> obvious toolchain changes) that would be dependent on the programming model.
> A job result post-processing function could check that there are offloading
> toolchains to be used and spawn the jobs creation for those toolchains as
> well as append results from one toolchain to the results of some other
> accordingly to the programming model implementation needs.
Again, it's not clear to me why we cannot and should not represent this in the
Action graph. It's that graph that's supposed to tell us what we're going to
do.
> g) Reflect the offloading programming model in the naming of the save-temps files.
We already do this somewhat; e.g. for CUDA with save-temps, we'll output foo.s
and foo-sm_35.s. Extending this to be more robust (e.g. including the triple)
seems fine.
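For illustration only, a naming helper along these lines might look like the sketch below. The function temp_name is hypothetical, and the way the triple is spliced into the name is an assumption of mine, not something clang does today; only the foo.s and foo-sm_35.s forms come from the current behaviour.

```python
from typing import Optional


def temp_name(stem: str, ext: str,
              arch: Optional[str] = None,
              triple: Optional[str] = None) -> str:
    """Build a save-temps style file name (hypothetical helper).

    The host file keeps the plain stem; device files get the arch appended,
    and optionally the triple in front of it (assumed layout).
    """
    parts = [stem]
    if triple:
        parts.append(triple)
    if arch:
        parts.append(arch)
    return "-".join(parts) + "." + ext
```

For example, temp_name("foo", "s") gives foo.s and temp_name("foo", "s", arch="sm_35") gives foo-sm_35.s, matching what we emit now; passing a triple as well would disambiguate two offload targets that share an arch name.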
> h) Use special options -target-offload=<triple> to specify offloading targets and delimit options meant for a toolchain.
I think I agree that we should generalize the flags we're using.
I'm not sold on the name or structure (I'm not aware of any other flags that
affect *all* flags following them?), but we can bikeshed about that separately.
> i) Use the offload kinds in the toolchain to drive the commands generation by Tools.
I'm not sure exactly what this means, but it doesn't sound
particularly contentious.
> 3. We are willing to help with implementation of CUDA-specific parts when
> they overlap with the common infrastructure; though we expect that effort to
> be driven also by other contributors specifically interested in CUDA support
> that have the necessary know-how (both on CUDA itself and how it is supported
> in Clang / LLVM).
Given that this is work that doesn't really help CUDA (the driver works fine
for us as-is), I am not sure we'll be able to devote significant resources to
this project. Of course, we'll be available to assist with relevant code
reviews and to give advice.
I think that, as with any other change to clang, the responsibility will rest on the
authors not to break existing functionality, at the very least inasmuch as is
checked by existing unit tests.
Regards,
-Justin