[RFC] Offloading design for SYCL offload kind and SPIR targets

This RFC is intended to discuss proposed changes to compilation flow for
offloading SYCL kernels specifically to SPIR-based targets. Most of the changes
will be made in the clang-linker-wrapper tool.

Introduction

Traditional device offloading models are completely encapsulated within the
compiler driver, requiring the driver to perform all of the steps required for
generating the host and device compilation passes. The driver is also
responsible for initiating any of the link-time processing that occurs for each
device target.

An updated offloading model uses the new clang-linker-wrapper tool. Much of the
functionality that is performed during the link phase of the offloading
compilation is removed from the driver and moved to the clang-linker-wrapper
tool.

Below is a general representation of the overall offloading flow that is
performed during a full compilation from source to final executable. The
compiler driver is responsible for creating the multi-targeted object and the
clang-linker-wrapper tool is responsible for the general functionality that is
performed during the link. This compilation is capable of supporting multiple
device targets. Support for Intel-based device targets relies on SPIR-V code
generation. AMDGPU and NVPTX devices can be supported without relying on SPIR-V
code generation, as shown in the figure below.

[Figure: offloadflow]

Diagram 1: Overall compilation flow

Multi-targeted object generation for SYCL offload kinds using clang-offload-packager

clang-offload-packager plays a vital role during multi-targeted object generation. The
multi-targeted object in the proposed offloading model is generated during the host
compilation. The host compilation takes an additional argument which points to
the device binary which will be embedded in the final object. Generation will
be separated out to allow for potential parallelism during compilation of both
the host and target device binaries.

When dealing with multiple device binaries, an additional step is performed to
package the multiple device binaries before being added to the host object.
This additional step is performed with the clang-offload-packager taking image
inputs containing information relating to the target triple, architecture
setting and offloading kind.

The clang-offload-packager is run during 'multi-targeted object' generation regardless
of the number of device binaries being added to the conglomerate multi-targeted object.
The device binaries are contained in what is designated as an 'Offload Binary'.
These binaries can reside in a variety of binary formats including Bitcode
files, ELF objects, executables and shared objects, COFF objects, archives or
simply stored as an offload binary.

We should have the ability to package SPIR-V based device binaries in the
offload section of any given binary. These device binaries will be packaged as
normal with the packager and placed within the given section.

Example usage of the external clang-offload-packager call:

clang-offload-packager --image=file=<name>,triple=<triple>,kind=<kind>
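
As a rough illustration of how one `--image=` argument per device binary could be assembled, consider the sketch below. The helper names `image_arg` and `packager_command`, the file names, and the `sycl` kind value are assumptions made for illustration; the `arch` field is included because the text above mentions an architecture setting.

```python
def image_arg(file, triple, kind, arch=None):
    """Format a single --image= argument for clang-offload-packager."""
    fields = [("file", file), ("triple", triple)]
    if arch is not None:
        # The architecture setting is optional in this sketch.
        fields.append(("arch", arch))
    fields.append(("kind", kind))
    return "--image=" + ",".join(f"{key}={value}" for key, value in fields)


def packager_command(output, images):
    """Build an argv list with one --image= entry per device binary."""
    command = ["clang-offload-packager", "-o", output]
    command += [image_arg(**image) for image in images]
    return command


if __name__ == "__main__":
    # Hypothetical inputs: file names and the 'sycl' kind are placeholders.
    images = [
        {"file": "dev_jit.bc", "triple": "spir64-unknown-unknown", "kind": "sycl"},
        {"file": "dev_gen.bc", "triple": "spir64_gen", "kind": "sycl", "arch": "pvc"},
    ]
    print(" ".join(packager_command("packaged.out", images)))
```

Emitting one `--image=` entry per binary keeps the call shape the same whether one or several device binaries are packaged.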

In the proposed offloading model, the compiler driver is responsible for
creating the multi-targeted object.

LLVM IR is used to represent device code embedded in this multi-targeted
object. LLVM IR is known to change across releases of clang and this rate of
change poses a challenge to maintain code compatibility in static libraries and
object files which contain device code IR. We are considering an alternative
approach where static libraries and object files contain a device-specific IR
instead of LLVM IR. For Intel GPUs, we would use SPIR-V, which is more stable
than LLVM IR. We request comments on this alternate approach from the community.

clang-offload-packager will be used to embed device code into the host code.
The following changes will be made to the packager. A new offload kind (SYCL_OFK)
will be made available for SYCL offloads. In case we decide to represent device
code using SPIR-V IR, we should have the ability to package SPIR-V based device
binaries in the offload section of any given host binary. These device binaries
will be packaged as normal with the packager and placed within the given
section. New image kinds will be added to represent such binaries.

SYCL offload support in clang-linker-wrapper

The clang-linker-wrapper provides the interface to perform the needed link
steps when consuming multi-targeted binaries. The linker wrapper performs a majority of
the work involved during the link step during an offload compilation,
significantly reducing the amount of work that is occurring in the compiler
driver. From the compilation perspective, the linker wrapper replaces the
typical call to the host link. This allows for the responsibility of the
compiler driver to be nearly identical when performing a regular compilation
vs an offloading compilation.

From a high level, using the clang-linker-wrapper provides the following benefits:

  • Moves all of the device linking responsibility out of the compiler driver
  • Allows for a more direct ability to perform linking for offloading without
    requiring the use of the driver, using more linker like calls
  • Provides additional flexibility with the ability to dynamically modify the
    toolchain execution.

Example usage of the external clang-linker-wrapper call:

clang-linker-wrapper <wrapper opts> -- <linker opts>
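
The `<wrapper opts> -- <linker opts>` shape can be pictured with a small sketch that splits an argument list at the first `--`. This is only an illustration of the command-line layout shown above, not the tool's actual option parser; the function name and the sample arguments are hypothetical.

```python
def split_wrapper_args(argv):
    """Split arguments into (wrapper options, host linker options) at the first '--'."""
    if "--" in argv:
        index = argv.index("--")
        return argv[:index], argv[index + 1:]
    # Without a separator, treat everything as wrapper options in this sketch.
    return list(argv), []


if __name__ == "__main__":
    # Sample values are placeholders, not a recommended invocation.
    wrapper, linker = split_wrapper_args(
        ["--gen-tool-arg=-device pvc", "--", "-o", "a.out", "main.o"]
    )
    print("wrapper:", wrapper)
    print("linker: ", linker)
```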

The following sub-sections cover the different compilation steps invoked inside
the clang-linker-wrapper. Changes needed to add SYCL compilation support are
showcased in each sub-section.

Device code extraction and linking

During the compilation step, the device binaries are embedded in a section of
the host binary. When performing the link, this section is extracted from the
object and mapped according to the device kind. The clang-linker-wrapper is
responsible for examining all of the input binaries, grabbing the embedded
device binaries and determining any additional device linking paths that need
to be taken.
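
A minimal sketch of the "extract and map by device kind" idea follows. It assumes each extracted image carries its offload kind, triple and architecture, and groups images so that each group can take its own device link path. The `DeviceImage` type and the sample values are hypothetical and are not the wrapper's real data structures.

```python
from collections import defaultdict
from typing import NamedTuple


class DeviceImage(NamedTuple):
    kind: str      # offload kind, e.g. a SYCL kind
    triple: str    # target triple
    arch: str      # architecture setting, may be empty
    source: str    # host object the image was extracted from


def group_for_linking(images):
    """Group extracted images by (kind, triple, arch); each group is linked separately."""
    groups = defaultdict(list)
    for image in images:
        groups[(image.kind, image.triple, image.arch)].append(image)
    return dict(groups)


if __name__ == "__main__":
    extracted = [
        DeviceImage("sycl", "spir64-unknown-unknown", "", "a.o"),
        DeviceImage("sycl", "spir64-unknown-unknown", "", "b.o"),
        DeviceImage("sycl", "spir64_gen", "pvc", "a.o"),
    ]
    for key, members in group_for_linking(extracted).items():
        print(key, [m.source for m in members])
```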

A new device offload kind is made available for SYCL offloads. New device image
kinds will be added to represent SPIR-V code and AOT-compiled device code. All
input bitcode files will be linked together using the ThinLTO pass. In
addition, SYCL device library files will be provided as inputs by the driver
and will be linked with the input. A list of device libraries that need to
be linked in with user code is provided by the driver. The driver is also
responsible for letting the clang-linker-wrapper know the location of the
device libraries.

Option                                Expected Behavior
--sycl-device-libraries=<arg>         A comma-separated list of device libraries that are linked during the device link
--sycl-device-library-location=<arg>  The location in which the device libraries reside

Table: Options to pass device libraries to the clang-linker-wrapper
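
One way to read the two options together: the library list is split on commas and each name is resolved against the given location. The sketch below assumes exactly that interpretation; the function name, the library file names and the directory are placeholders.

```python
import posixpath


def resolve_device_libraries(libraries_arg, location_arg):
    """Turn the two option values into a list of full library paths."""
    # Skip empty entries such as those produced by a trailing comma.
    names = [name for name in libraries_arg.split(",") if name]
    return [posixpath.join(location_arg, name) for name in names]


if __name__ == "__main__":
    # Hypothetical library names and location.
    paths = resolve_device_libraries("libdev-a.bc,libdev-b.bc", "/opt/sycl/lib")
    for path in paths:
        print(path)
```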

Post-link and SPIR-V translation

After the device binaries are linked together, two additional steps are
performed to prepare the device binary for consumption by an offline
compilation tool for AOT or to be wrapped for JIT processing.

The sycl-post-link tool is used after the device link is performed, applying
any changes such as optimizations and code splitting before passing off to the
llvm-spirv tool, which translates the LLVM-IR to SPIR-V.

Option                          Expected Behavior
--sycl-post-link-options=<arg>  Options that will control sycl-post-link step
--llvm-spirv-options=<arg>      Options that will control llvm-spirv step

Table: Options to pass sycl-post-link and llvm-spirv options to the clang-linker-wrapper

Options that will be used by clang-linker-wrapper when invoking the sycl-post-link
tool are provided by the driver via the --sycl-post-link-options=<arg> option.
Options that will be used by clang-linker-wrapper when invoking the llvm-spirv
tool are provided by the driver via the --llvm-spirv-options=<arg> option.
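
The ordering of the two steps and the forwarding of their options can be sketched as follows. The sketch assumes the option values are whitespace-separated strings, which the RFC does not specify, and it omits input and output file arguments entirely. The function name and sample option strings are hypothetical.

```python
import shlex


def post_link_steps(post_link_options="", llvm_spirv_options=""):
    """Return the two tool invocations in the order they run after the device link."""
    return [
        # Step 1: optimizations and code splitting on the linked device code.
        ["sycl-post-link", *shlex.split(post_link_options)],
        # Step 2: translation of the resulting LLVM IR to SPIR-V.
        ["llvm-spirv", *shlex.split(llvm_spirv_options)],
    ]


if __name__ == "__main__":
    # Placeholder option strings; real values would come from the driver.
    for step in post_link_steps("-opt-a -opt-b", "-opt-c"):
        print(" ".join(step))
```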

Ahead of Time Compilation for SYCL offload

The updated offloading model will integrate the Ahead of Time (AOT) compilation
behaviors into the clang-linker-wrapper. The actions will typically take place
after the device link, post link, and LLVM-IR to SPIR-V translation steps.

Regardless of the AOT target, the flow is similar, only modifying the offline
compiler that is used to create the target device image. It is expected that
the offline compiler will also use unique command lines specific to the tool to
create the image.

New command line interfaces are needed to pass along the options supplied
through -Xsycl-target-backend, as well as the options implied by the optional
device behaviors for GPU AOT compilations.

Target  Triple         Offline Tool    Option for Additional Args
CPU     spir64_x86_64  opencl-aot      --cpu-tool-arg=<arg>
GPU     spir64_gen     ocloc           --gen-tool-arg=<arg>
FPGA    spir64_fpga    aoc/opencl-aot  --fpga-tool-arg=<arg>

Table: Ahead of Time Info
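
The table can be read as a lookup from target triple to offline tool and pass-through option. The sketch below encodes it that way. Treating a triple that is absent from the table as a JIT target is an assumption of this sketch, and the function names are hypothetical.

```python
# Rows of the "Ahead of Time Info" table, keyed by target triple.
AOT_TARGETS = {
    "spir64_x86_64": {"target": "CPU", "tools": ["opencl-aot"], "option": "--cpu-tool-arg"},
    "spir64_gen": {"target": "GPU", "tools": ["ocloc"], "option": "--gen-tool-arg"},
    "spir64_fpga": {"target": "FPGA", "tools": ["aoc", "opencl-aot"], "option": "--fpga-tool-arg"},
}


def aot_info(triple):
    """Return the table row for an AOT triple, or None (assumed here to mean JIT)."""
    return AOT_TARGETS.get(triple)


def tool_arg(triple, arg):
    """Format the wrapper option that forwards 'arg' to the offline tool."""
    info = aot_info(triple)
    if info is None:
        raise ValueError(f"{triple} is not an AOT target in this sketch")
    return f"{info['option']}={arg}"


if __name__ == "__main__":
    print(tool_arg("spir64_gen", "-device pvc"))
    print(aot_info("spir64-unknown-unknown"))
```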

To complete the support needed for the various targets using the
clang-linker-wrapper as the main interface, a few additional options will be
needed to communicate from the driver to the tool. Further details of usage are
given below.

Option Name                 Purpose
--fpga-link-type=<arg>      Tells the link step to perform 'early' or 'image' processing to create archives for FPGA
--parallel-link-sycl=<arg>  Provide the number of parallel jobs that will be used when processing split jobs

Table: Additional Options for clang-linker-wrapper

The clang-linker-wrapper already provides an option named --wrapper-jobs
that may be useful for this purpose.

spir64_gen support

Compilation behaviors involving AOT for GPU involve an additional call to
the OpenCL Offline compiler (OCLOC). This call occurs after the post-link
step performed by sycl-post-link and the SPIR-V translation step which is done
by llvm-spirv. Additional options passed by the user via the
-Xsycl-target-backend=spir64_gen <opts> command as well as the implied
options set via target options such as -fsycl-targets=intel_gpu_skl
will be processed by a new option to the wrapper, --gen-tool-arg=<arg>.

To support multiple target specifications, for instance:
-fsycl-targets=intel_gpu_skl,intel_gpu_pvc, multiple --gen-tool-arg
options can be passed on the command line. Each instance will be considered
a separate OCLOC call passing along the <args> as options to the OCLOC call.
The compiler driver will be responsible for putting together the full option
list to be passed along.

-fsycl -fsycl-targets=spir64_gen,intel_gpu_skl
-Xsycl-target-backend=spir64_gen "-device pvc -options -extraopt_pvc"
-Xsycl-target-backend=intel_gpu_skl "-options -extraopt_skl"

Example: spir64_gen enabling options

--gen-tool-arg="-device pvc -options -extraopt_pvc"
--gen-tool-arg="-device skl -options -extraopt_skl"

Example: clang-linker-wrapper options

Each OCLOC call will be represented as a separate device binary that is
individually wrapped and linked into the final executable.
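
The "one --gen-tool-arg, one OCLOC call" rule can be sketched as below. Only the forwarded options are shown: the OCLOC subcommand and the input and output arguments are left out, and splitting the option value on whitespace is an assumption of this sketch. The function name is hypothetical.

```python
import shlex

GEN_TOOL_ARG = "--gen-tool-arg="


def ocloc_invocations(wrapper_args):
    """Create one partial ocloc argv per --gen-tool-arg occurrence, in order."""
    calls = []
    for arg in wrapper_args:
        if arg.startswith(GEN_TOOL_ARG):
            forwarded = shlex.split(arg[len(GEN_TOOL_ARG):])
            calls.append(["ocloc", *forwarded])
    return calls


if __name__ == "__main__":
    args = [
        "--gen-tool-arg=-device pvc -options -extraopt_pvc",
        "--gen-tool-arg=-device skl -options -extraopt_skl",
    ]
    # Each resulting call would produce its own device binary to wrap.
    for call in ocloc_invocations(args):
        print(call)
```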

Additionally, the syntax can be expanded to enable the ability to pass specific
options to a specific device GPU target for spir64_gen. The syntax will
resemble --gen-tool-arg=<arch> <arg>. This corresponds to the existing
option syntax of -fsycl-targets=intel_gpu_arch where arch can be a fixed
set of targets.

spir64_x86_64 support

Compilation behaviors involving AOT for CPU involve an additional call to
opencl-aot. This call occurs after the post-link step performed by
sycl-post-link and the SPIR-V translation step performed by llvm-spirv.
Additional options passed by the user via the
-Xsycl-target-backend=spir64_x86_64 <opts> command will be processed by a new
option to the wrapper, --cpu-tool-arg=<arg>.

Wrapping of device images

Once the device binary is pulled out of the multi-targeted binary, the binary must be
wrapped and provided the needed entry points to be used during execution.
This is performed during the link phase and controlled by the
clang-linker-wrapper.

The SYCL offload model currently uses specialized wrapping information to wrap
device images into the host binary. The wrapping information that
clang-linker-wrapper generates around the device binary is expected to match
the wrapping information currently used for SYCL.

Host link

The final host link is also performed by the linker wrapper. This link is built
upon the full link command line as constructed by the compiler driver, including
all libraries and the linked/wrapped device binaries to complete the compilation
process. We do not expect any changes in this step.

Why do we now have 3 separate RFCs for SYCL? If you all want to propose actually implementing SYCL, we should have it all in 1 place.

That said, this component doesn't really seem to apply to the CFE and should be at the wider LLVM audience.

This was intentional so as to allow different stakeholders to participate in the discussions most relevant to them without getting lost in other conversations.

In that case, they should be placed in an appropriate location. In the case of this one, it has nothing I can see that deals with the CFE, so this should be moved in some way.

The location seems appropriate to me; we don't have enough traffic to warrant a category for "misc clang tools", and things like the linker wrapper and offload bundling are reasonably closely related to code generation. Do you have a suggestion for a better place for this to live?

Ah, Hrm, I guess we DON'T have a generic space, I figured wherever LLD and the runtimes got discussed (since that is essentially what this is), but there is not really a generic space for either.

High level points:

  • I think we need SPIRV and SYCL support for the linker-wrapper, so thanks!
  • The linker-wrapper is (supposed to be) a stopgap until we can make the linker do all this.
  • I really dislike the "-<language>-<generic>" flags. If possible, we should avoid those.
    • As an example -fsycl-targets=intel_gpu_arch is basically --offload-arch=intel_gpu_arch, though I am fine with accepting both. Related, you want to provide a tool/way to resolve --offload-arch=native, even if it just picks all potential targets, or the most generic one.
    • Other -f/Xsycl options should be removed, if possible, or aligned with other languages, IMHO. I mean, "parallel-link" is something that already has an alternative or could just be "non-sycl". Similarly, we have ways to link in device libs already, etc.
    • Long story short, clang driver options can be "branded" for users' sake, but lower-level tool options don't need to be. Reuse is kind, and on the driver level we need to ensure we support the "generic" versions for SYCL as well.
  • Last point: @jhuber6 should read through this and provide input.

Thanks for the good overview of the offloading compilation pipeline. First off, I agree with Johannes that we should attempt to re-use as many existing flags as possible. I am not intimately familiar with SPIR-V nor SYCL so I don't know exactly what problems may exist there. I would recommend adding some extra targets for --offload-arch= if such a thing is meaningful for targeting SYCL. However that uses "CUDA" architectures so we may need to abstract that further.

An important thing to consider is how the clang-linker-wrapper invokes the device toolchain. We do not encode any of that logic directly in the tool and merely rely on clang to know how to perform the appropriate steps. So, for NVIDIA for example, after the LTO pass is run we will pass all the PTXAS files to clang like clang foo.s bar.s --target=nvptx64-nvidia-cuda -march=sm_89 which will then instruct clang to produce the necessary steps to emit a linked image. In this case that would be the ptxas and nvlink tools. I am proposing that Intel supports a target like this, not just for this but because it would also make it easier to port the GPU libc I'm working on to SPIR-V as well.

There's already an option like --device-linker-args which will pass -Wl,arg to the clang invocation above, which will be forwarded to the linker. This should suffice for any linker specific options that the SPIR-V toolchain requires on some condition.

We also have some handling for cuda_path in the linker wrapper (note that ROCm/HIP uses only lld for linking so we do not need to know the path) so I'm assuming we would have a similar sycl-path option for locating the libraries.

Currently, the handling of device bitcode libraries is done per-TU using -mlink-builtin-bitcode. There's a very hacky option in the linker wrapper called --builtin-bitcode= that will invoke that in the link step but it runs optimizations again so it might slow down compilation. In the future I'm hoping to minimize the use of magic LLVM-IR libraries, potentially providing a generic architecture that can link with all input architectures. The main issue with treating existing device libraries as LTO libraries is usually non-hidden visibility preventing internalization.

Is there anything else that is not clear with the current approach?

Hi @jdoerfert

Thanks very much for your feedback. We will surely work to minimize adding new driver-level as well as tool-level options and also avoid 'branding' of low-level tool options.

Sincerely

Hi Joseph Huber,

Thanks so much for your detailed feedback here. clang-linker-wrapper is a well-constructed tool and building on top of it has been a good experience thus far. Thanks for the design!
I agree with both you and Johannes about a more prudent use of existing flags. Multiple attempts are already underway to do so.
It is an interesting point about how you call clang inside clang-linker-wrapper to handle device tools to do device-specific linking. We also have a dependency on an external tool (llvm-spirv) and we might be able to do something similar.

The current approach seems well-explained. I had a couple of questions, but I will take them offline.

Thanks again
Sincerely

Thank you for posting this RFC! It sounds like the concerns raised are being handled and this is ready to proceed. However, please be sure to include @jdoerfert and @jhuber6 on any code reviews in this area unless they say they don't feel they need to be involved.