Is this a “clang distribution” question? I’m not sure why it affects us here, and it isn’t just a matter of packaging (for which we don’t need it in-tree; it can just be handled by the script that builds the release packages).
It does not have much to do with clang, which already needs to know where to find the CUDA SDK and needs more than just libdevice from there.
It’s relevant to JITs that do not have the luxury of the clang driver and do need libdevice to implement some standard math functions. Right now we have to find libdevice somewhere on the end user’s system, and it’s fragile.
With a redistributable libdevice we have more options. We can embed it into the binary and always carry it with us, avoiding a runtime dependency. Or we can install it in a known location, if such a location exists, which may not always be the case when LLVM is linked into some other app. Each option has pluses and minuses.
I’d consider embedding to be a better option as it makes things more hermetic (we always know that the libdevice we end up using is the libdevice we’ve tested) and removes the hassle associated with shipping and finding a runtime dependency.
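To make the embedding option concrete, here is a minimal hypothetical C++ sketch, assuming the build system turns the libdevice bitcode into a C array exposed as kLibdeviceBitcode / kLibdeviceBitcodeSize (those symbols and the helper are illustrative, not existing LLVM code). The blob is parsed from memory and linked into the module about to be JIT-compiled, so nothing needs to be located on the end user’s system at runtime:

#include "llvm/Bitcode/BitcodeReader.h"
#include "llvm/IR/Module.h"
#include "llvm/Linker/Linker.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/MemoryBuffer.h"
#include <cstddef>
#include <utility>

// Hypothetical symbols produced at build time, e.g. by converting the
// libdevice bitcode file into a C array.
extern const char kLibdeviceBitcode[];
extern const std::size_t kLibdeviceBitcodeSize;

// Link the embedded libdevice into the module we are about to JIT-compile.
llvm::Error linkEmbeddedLibdevice(llvm::Module &module) {
  llvm::MemoryBufferRef buffer(
      llvm::StringRef(kLibdeviceBitcode, kLibdeviceBitcodeSize), "libdevice");
  auto libdevice = llvm::parseBitcodeFile(buffer, module.getContext());
  if (!libdevice)
    return libdevice.takeError();
  // LinkOnlyNeeded pulls in only the functions the target module actually calls.
  if (llvm::Linker::linkModules(module, std::move(*libdevice),
                                llvm::Linker::Flags::LinkOnlyNeeded))
    return llvm::createStringError(llvm::inconvertibleErrorCode(),
                                   "failed to link embedded libdevice");
  return llvm::Error::success();
}

The trade-off described above still applies: the embedded copy is the one that gets tested, at the cost of baking one specific libdevice version into the binary.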
IANAL, but with that said, note that the Linux kernel has a side repository where it preserves firmware: About Firmware
I don’t know how to resolve whether such a thing should go in tree or not from a licensing and practicality perspective, but it sure would be convenient if it were somewhere that could be generally shared for builds/releases/testing.
For what it’s worth, one place to obtain libdevice is from the cuda-nvcc packages distributed by NVIDIA: Index of /compute/cuda/repos/ubuntu2004/x86_64
I understand this as well, and:
I fully agree with this: you likely want to embed libdevice within your binary and make it available to the JIT. It’s not clear to me why it should be in-tree for LLVM though?
That is, I see two parts right now:
1. A generic way of “embedding bitcode libraries as blobs and making them available to the ExecutionEngine”, which applies regardless of the provenance of the bitcode library (users can write their own runtime similar to libdevice). I see this as a pure infrastructure component we need to set up, and it provides users with facilities that MLIR (or rather LLVM?) could/should offer with the ExecutionEngine packaging.
2. If we need libdevice in our upstream testing (for example in the end-to-end sparse codegen on GPU), then we need an option to embed libdevice when we build MLIR itself (MLIR becomes a client of the infra mentioned in (1) above). At this point this becomes more of an MLIR CMake question of plumbing it through correctly when we build MLIR: it is just a build dependency like others. We may decide to vendor it in the repo, but it’s not clear to me why this file in particular is harder to get at build time than all the other things we need to get from CUDA when we configure the JIT today?
(also the infra in (1) above allows downstream to vendor it if they want to in the context of their build-system, that’s orthogonal to the solution we pick upstream for (2) here: vendoring or not)
That shouldn’t bring “runtime dependencies” here as far as I can tell.
As an example of how IREE does it (not rocket science, just being explicit): iree/CMakeLists.txt at 4a9e22f75e14f3f0b6b0918a58843dfe8b0488c6 · openxla/iree · GitHub
If that property is explicitly provided, then the referenced libdevice file is embedded as a constant into a .o file that the compiler depends on. If not, then the CMake machinery downloads the appropriate CUDA SDK component (using the CUDA redist repository via a fork of the parse_redist.py sample) and defaults the property to the contained libdevice bitcode file. An alternative mechanism is used on presubmit bots, where we embed the SDK archive in the docker image since it can be better cached.
I can speak with some experience that, while not “hard”, all of the build plumbing is a maintenance burden for something that most users want a good default for. Just having it in-repo in some way simplifies a lot of things. For a community project that already suffers from project maintenance overhead, I’d take simple defaults over build shenanigans if possible.
Having libdevice at the “compiler” layer instead of just at the “JIT” layer increases testing coverage and the number of enabled-by-default code paths.
Again, not an expert here but certainly have battle scars.
+1 to the maintenance cost arguments highlighted by Stella. The burden is not large, but it’s not zero, either.
I listed it as one of the options. I’m not saying that it’s the only one. I should’ve made it clearer that by “making it a build-time dependency”, as an alternative, I meant “get it from outside the LLVM tree and make it available during the build”. I.e. it could be either fetched by cmake during the configuration phase, or manually supplied by the user and found by cmake.
What are the downsides of incorporating libdevice into LLVM tree?
Potential legal issues would be one. If we decide to go that way we’ll need someone qualified to approve it, but for now I’m looking for something more tangible.
I am mostly looking at it from an « I don’t want something hard-coded or specialized for one particular device runtime » angle; having it in-tree does not necessarily imply that, but it does not give me confidence that this is heading in a well-decoupled direction.
To some extent having a file in-tree likely means an extra build flow compared to a generic solution to handle this « special case », and I object to special casing without good reasons.
I think we are moving towards a generic solution. Clang’s new offloading driver now handles GPU-side linking transparently for the end user and that paves the way for having the actual GPU-side standard library. Eventually we’ll be in position to build GPU-side libm and would no longer need libdevice, at least on the clang side.
For LLVM & offloading the situation is murkier, as there’s currently nothing that would fulfill the libcalls it may need to generate. Bitcode linking with libdevice is the known working option we have now. A generic solution would require the availability of a GPU-side libm and the ability to link GPU object files from within the JIT. The former is not available yet; the latter is possible, but still relies on NVIDIA binaries – either their pre-compiled linker library or the NVIDIA driver API.
So, for the time being we’re stuck with libdevice. Making it “just work” for everyone, out of the box, and removing the necessity for every user of LLVM targeting NVPTX to reinvent the wheel of finding and dealing with libdevice sounds like a good enough reason for carrying a copy of libdevice in the tree. We already know that we do need it for important parts of functionality of the NVPTX back-end.
My general concern is compatibility. What if it changes, or has already changed? Right now, users get what they put into their “path”. I guess we could provide more than one version; do we want to?
I am not sure we’re talking exactly about the same thing here: I was specifically addressing the mechanism of “we need to have a bitcode device runtime library embedded with our compiler that can be automatically linked in the code we generate”.
That can be libdevice or something else, and I differentiate the many pieces of infra needed for this; I can imagine quite a few layers. Having it in-tree or not seems like a minor detail of the stack, but starting from “having it in tree is simpler” looks like a huge red flag to me in terms of where this is all going.
You’re mixing two things again here: 1) embedding libdevice in the binary we ship and 2) having it in-tree.
(even 1) is already overly specialized to me, as explained above: this shouldn’t be about libdevice when we build infrastructure here).
I’m not sure we’ll converge here, probably worth a doc and maybe a meeting?
What I’m trying to say is that the logistics of obtaining the libdevice we want to embed look like a good reason to keep libdevice in-tree. Yes, the other way to say it is “it’s simpler”.
As you said, the details are not that complicated, and they’ve already been rehashed here. I don’t think there’s much to add to what’s already been stated.
IMO at any given LLVM revision, there should be only one libdevice version, which would be tested to LLVM’s satisfaction. Right now LLVM has neither libdevice, nor the tests for it, so we assume that whichever version we find at runtime will be good enough. Fortunately libdevice has been pretty stable over the years, so it happens to work in practice.
Honestly, having a single version in-tree would be more stable than what we do now. We could ensure some relative bitcode compatibility rather than hoping the one we pick up from the user works. I’m not sure if there’s any prior art in LLVM for including binary blobs, however. This would also raise the question of whether we’d give the same treatment to the ROCm device library, as it is functionally identical in terms of mashing a magic bitcode file into every clang TU. But that implementation is “open source”, so we could theoretically just tell users to build it somehow. I don’t believe AMD is interested in pushing that library upstream in the same way we have libclc.
Here’s a brief recap of the conversation for discussion.
End goal:
A single robust pipeline for gpu code generation, without the shortcomings of the current pipeline, where the heavy lifting of device code compilation is performed mostly by LLVM infra.
This pipeline will be slowly built across many patches and discussions, as it involves moving certain bits from clang to llvm as well as creating some components.
Concrete proposed changes:
- The introduction of target attributes to gpu.module. This attribute will hold device target information about the module, such as whether it’s nvvm or rocdl, as well as the target triple, features, and arch. This could eventually lead to the removal of --convert-gpu-to-(nvvm|rocdl) in favor of a single gpu-to-llvm. The format for such an attribute might look like:
gpu.module @foo [nvvm.target<chip = "sm_70">] {
...
}
- gpu.launch_op will no longer be lowered by gpu-to-llvm, but by a different pass. This allows more flexible handling of this op, as there are many ways to launch a kernel (cudaLaunchKernel, cudaLaunchCooperativeKernel, etc.), and there is no 1-to-1 mapping between this op and LLVM.
- The introduction of --gpu-embed-kernel. This pass will have to be executed after gpu-to-llvm and will serialize the gpu.module to an LLVM module. Why a separate pass? To allow running passes over the full LLVM MLIR IR, i.e.:
builtin.module {
gpu.module ... {
llvm.func @device_foo ...
}
llvm.func @host_foo ...
}
- Migrate the current serialization pipelines into this gpu code gen structure, while addressing some of the shortcomings of the current serialization passes, like the lack of general device bitcode linking in trunk. This allows downstream users to use libdevice without having to patch the tree to obtain this functionality.
- Once the work on the LLVM infra side is ready, migrate all gpu MLIR code compilation into this pipeline.
No JIT or AOT functionality will be lost at any point; we’ll only gain features. Upon agreement, the first 4 items could be rolled out in the coming weeks.
Things outside this proposal that are also open for discussion:
- Migrating from the CUDA driver API to the CUDA runtime API.
Explanation patch for review:
One of the flexibilities I see with serialize-to-cubin is that you can set the maximum number of registers the CUDA JIT uses (max-reg-per-thread) on a per-GPU-kernel (gpu.func) basis. This is essential to even be able to compile correctly: as an example, you can use min(255, 64K / number of threads in a thread block), which guarantees that the number of registers a thread uses will not cause the 64K-per-thread-block limit to be exceeded and the GPU launch to fail at runtime. One can even propagate info via attributes that are inspected to provide the desired register limit to the CUDA JIT.
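For illustration, a minimal hypothetical sketch (not the actual serialize-to-cubin code; ptx and maxRegs are placeholders) of how such a register cap can be handed to the CUDA driver JIT when a PTX module is loaded, which is the knob max-reg-per-thread corresponds to:

#include <cuda.h>
#include <cstdint>

// Compile a PTX string with the driver JIT, capping registers per thread.
// CU_JIT_MAX_REGISTERS is the driver-API counterpart of ptxas' -maxrregcount.
CUmodule jitPtxWithRegisterCap(const char *ptx, unsigned maxRegs) {
  CUjit_option options[] = {CU_JIT_MAX_REGISTERS};
  void *optionValues[] = {
      reinterpret_cast<void *>(static_cast<std::uintptr_t>(maxRegs))};
  CUmodule module = nullptr;
  cuModuleLoadDataEx(&module, ptx, 1, options, optionValues); // error handling elided
  return module;
}

If each kernel is serialized into its own module, the cap can be chosen per kernel, e.g. min(255, 64K / threads per block) as described above.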
If we move the final translation to GPU binary out of the MLIR-side pipeline (make use of clang/llvm), can such customization be accomplished? (Or is it already supported?) Wouldn’t one have to hack clang to implement the necessary support? To be clear, I’m referring to PTX to cubin compilation options (on a per-GPU kernel basis).
That said, the serialize-to-cubin pass needs a patch to support what I mentioned in para 1. Otherwise, for well-optimized kernels with heavy register reuse, it’s possible you’d run out of registers with the current MLIR infra.
I’ll preface the answer to your question with a brief explanation on the latest set of patches:
In short, the patches in D154153 introduce a series of mechanisms for extending MLIR GPU compilation in a somewhat general way without having to patch the tree.
These mechanisms are: GPU target attributes, GPU object manager attributes.
- GPU target attributes describe how a GPU module can be converted into a serialized string. These attributes can be implemented by any dialect; the dialect just needs to add an attribute implementing the target attribute interface in D154113. An example of these is the NVPTX target attribute; we can add a max reg count option to it. It serializes GPU modules in the same way that the serialize-to-cubin pass does (this behavior may change in the future, and that’s a future conversation).
- GPU object manager attributes describe how to embed and launch GPU binaries in LLVM IR. As is the case with target attributes, these are open to be implemented by any dialect; they just have to implement the object manager interface in D154108. These could even be used to implement fat binary support in MLIR.
Thus patching the tree is no longer necessary, as downstream users can implement their own schemes for GPU compilation if they want something different. For example, in my work I already have a working offload target & object manager attribute emitting clang/llvm annotations.
Answer
The clang/llvm route needs work, as the infra still sits solely in clang and there are some other things left to be addressed, so an upstream version of this in MLIR is going to take some time; the above patches just open the door for this and more.
Having said that, clang does support global compilation options; however, as far as I know there’s no support for per-module/kernel options. I could be wrong, but that is something addressable, as there’s work to be done there before upstreaming. We could even decide to just use some bits and implement other things in MLIR.
I think the current patch series ending in https://reviews.llvm.org/D154153 is converging right now and should be ready to land soon. Does anyone have lingering concerns here?
The basis of the new mechanism is now in trunk. The idea is to migrate all GPU compilation to this mechanism and eventually deprecate & remove gpu-to-(cubin|hsaco). The ETA for complete removal is not yet determined; a notice will be added to Deprecations & Current Refactoring.
Documentation and a general overview can be found in gpu Dialect or in D154153.
The main idea behind this mechanism is extensibility, as attribute interfaces handle compilation, allowing any dialect to implement them. These interfaces are GPU Target Attributes and GPU Offloading LLVM Translation Attributes.
Target attributes handle serialization to a string representation, while Offloading Translation attributes handle the translation of the ops: gpu.binary & gpu.launch_func.
Together with the new gpu.binary op, these attributes can implement concepts like fat binaries and CUDA or HIP kernel-launching mechanisms.
The compilation attributes available in trunk are:
- #nvvm.target for compiling to cubin, PTX, or LLVM. Compiling to cubin requires a valid CUDA Toolkit installation, as the mechanism invokes ptxas or links against nvptxcompiler. However, the mechanism is always present as long as the NVPTX target was built; there are no hard CMake dependencies on the toolkit.
- #rocdl.target for compiling to hsaco, ISA, or LLVM. Compiling to hsaco requires a valid ROCm installation.
- #gpu.select_object for embedding a single object in LLVM IR and launching kernels as the current mechanism in GPU to LLVM does.
Currently, only compilation to cubin or hsaco generates valid executables. In a future patch, I’ll add runtime support for PTX, allowing execution and compilation without a CUDA Toolkit.
Example:
gpu.module @mymodule [#nvvm.target<O = 3, chip = "sm_90">, #nvvm.target<O = 3, chip = "sm_70">] {
}
// mlir-opt --gpu-module-to-binary
gpu.binary @mymodule [
#gpu.object<#nvvm.target<O = 3, chip = "sm_90">, "sm_90 BINARY">,
#gpu.object<#nvvm.target<O = 3, chip = "sm_70">, "sm_70 BINARY">
];
// By default gpu.binary embeds the first object, for selecting the second object:
gpu.binary @mymodule <#gpu.select_object<1>> [
#gpu.object<#nvvm.target<O = 3, chip = "sm_90">, "sm_90 BINARY">,
#gpu.object<#nvvm.target<O = 3, chip = "sm_70">, "sm_70 BINARY">
];
// Or:
gpu.binary @mymodule <#gpu.select_object<#nvvm.target<O = 3, chip = "sm_70">>> [
#gpu.object<#nvvm.target<O = 3, chip = "sm_90">, "sm_90 BINARY">,
#gpu.object<#nvvm.target<O = 3, chip = "sm_70">, "sm_70 BINARY">
];
Compilation workflow:
mlir-opt example.mlir \
--pass-pipeline="builtin.module( \
nvvm-attach-target{chip=sm_90 O=3}, \ # Attach an NVVM target to a gpu.module op.
gpu.module(convert-gpu-to-nvvm), \ # Convert GPU to NVVM.
gpu-to-llvm, \ # Convert GPU to LLVM.
gpu-module-to-binary \ # Serialize GPU modules to binaries.
)" -o example-nvvm.mlir
mlir-translate example-nvvm.mlir \
--mlir-to-llvmir \ # Obtain the translated LLVM IR.
-o example.ll
If there are any lingering concerns, a bug, or ideas on improving it, you can post them here or on Discord; also, my DMs are open.
Shoutout to @mehdi_amini for all the feedback in the reviews, as well as @krzysz00 .
Hi Fabian, thank you for your hard work, I like it!
We (at Google) are currently integrating your recent changes into our tree and there are two small issues that came up and I wanted to pick your brain how to best work around them.
- CUDA toolkit path: we don’t have the CUDA toolkit installed in standard or even static locations. The SerializeToCubinPass used the JIT provided by the driver to compile from PTX to CUBIN. The NVVMTarget uses either ptxas or the ptx-compiler lib, both of which require the CUDA toolkit to be installed. The ptx-compiler approach works reasonably well even in our setup, though, because it is dynamically linked.
- Target architecture: the SerializeToCubinPass simply ignored the target chip and compiled for the architecture of the bound CUDA context (see the sketch after this list). This is broken, but very handy for tests. The new test-lower-to-nvvm pipeline, on the other hand, targets a predetermined architecture, which requires running the tests on a specific GPU. This is doable for us.
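For reference, a small hypothetical sketch (not the actual pass code) of how the compute capability behind the currently bound CUDA context can be queried with the driver API; this is essentially the architecture SerializeToCubinPass ended up targeting:

#include <cuda.h>
#include <cstdio>

int main() {
  cuInit(0);
  CUdevice device;
  cuDeviceGet(&device, 0);
  CUcontext context;
  cuCtxCreate(&context, 0, device);   // bind a context, as the pass relied on
  CUdevice boundDevice;
  cuCtxGetDevice(&boundDevice);       // device of the currently bound context
  int major = 0, minor = 0;
  cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, boundDevice);
  cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, boundDevice);
  std::printf("sm_%d%d\n", major, minor); // e.g. sm_80; error handling elided
  cuCtxDestroy(context);
  return 0;
}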
You mentioned above that compiling only to PTX would be an option. This would nicely solve both hiccups above. Are you still planning to implement that, or do you have some thoughts on how best to go about it? If this feature becomes available in the foreseeable future (say, in a few weeks), it might be easiest to simply disable those tests internally for a bit.
Thank you for your advice!