MLIR GPU libdevice linking support

I’m trying to lower some code that uses math functions (like sqrt and exp) that are eventually lowered into libdevice functions like __nv_sqrt or __nv_exp. When invoking the GpuSerializeToCubinPass, I get errors like:

error: cuLinkComplete(linkState, &cubinData, &cubinSize) failed with error code device kernel image is invalid [error: Undefined reference to '__nv_sqrt' in 'legateMLIRKernel2_kernel']

I suspect this is because the code in lib/Dialect/GPU/Transforms/SerializeToCubin.cpp is not linking against libdevice. I searched through the NVIDIA documentation but couldn’t find how this is done either. Has anyone run into this problem before?

I believe libdevice is implemented using NVIDIA’s NVVM, which at the moment isn’t targeted by MLIR.
NVVM is based on LLVM 7, and as far as I know it won’t link with bitcode produced by later LLVM versions.

This is not true. Clang always links LLVM IR with libdevice during CUDA compilation, so the libdevice.bc shipped with pretty much all CUDA versions we care about is compatible.

Ah, you’re right, sorry: of course it works in this direction, that is, a recent LLVM can link in the old bitcode from libdevice. I was thinking about the other direction: using libNVVM to compile the code we generate…

So I guess this is more related to this work: [RFC] Extending MLIR GPU device codegen pipeline?

Yes, you won’t be able to use __nv_sqrt or any other math function on NVIDIA GPUs unless you link against libdevice.bc, which is not possible under the current serialization pipeline.

If you are in a hurry, need the functionality ASAP, and use Linux and Clang 17, then you could use this patch:
⚙ D149559 [mlir][gpu] Adds a gpu serialization pipeline for offloading GPUDialect Ops to clang compatible annotations. It will work as expected and solve this problem; I’ve used it plenty of times without any issues on NVIDIA’s V100 and A100. The only reason it’s not committed to trunk is that I realized there was a better way to approach a more general issue, abandoned the above patch, and didn’t have time for the new approach back then, but now I’m free.

I’ll submit a new patch for review early next week, taking care of this and other issues once and for all.

@fabianmc thanks! I think I can wait for your patch – do you mind posting here when you put it up? Right now I have a really hacky workaround where I compiled my system’s libdevice.bc into PTX with llc and load that in the GpuSerializeToCubinPass with cuLinkAddFile.
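
For reference, the workaround looks roughly like this (a sketch, not the exact code: the PTX path is hypothetical, and it assumes libdevice was first compiled to PTX with something like llc -march=nvptx64 -mcpu=sm_50 libdevice.10.bc -o libdevice.ptx):

    // Inside the pass, after cuLinkCreate and before cuLinkComplete: feed the
    // pre-compiled libdevice PTX into the same link job so the __nv_*
    // references in the kernel resolve.
    RETURN_ON_CUDA_ERROR(cuLinkAddFile(linkState, CU_JIT_INPUT_PTX,
                                       "/path/to/libdevice.ptx", // hypothetical
                                       0, nullptr, nullptr));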

This is completely inaccurate - please see here: NVPTX codegen for llvm.sin (and friends) - #16 by bondhugula

The pointer there is from @csigg and shows how it’s done in XLA. It takes no more than 20 lines of code in SerializeToBlob.cpp (before translating to PTX) to link in libdevice. I’m not saying it’s the best way to do it, just that it’s possible and easy.

@rohany The following will do it:

@@ ... @@ (includes needed by the snippet below)
+#include "llvm/ADT/StringSet.h"
+#include "llvm/IRReader/IRReader.h"
+#include "llvm/Linker/Linker.h"
+#include "llvm/Support/SourceMgr.h"
+#include "llvm/Transforms/IPO/Internalize.h"
+
@@ -37,11 +43,69 @@ gpu::SerializeToBlobPass::SerializeToBlobPass(TypeID passID)
 gpu::SerializeToBlobPass::SerializeToBlobPass(const SerializeToBlobPass &other)
     : OperationPass<gpu::GPUModuleOp>(other) {}
 
+/// Link a bitcode file into `llvmModule`.
+//  This code has been adapted and reused from XLA:
+//  https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc;drc=262777e9f9304c7df6b694934af819c820954ef5;l=334.
+static LogicalResult linkBitcode(StringRef filename, llvm::Module &llvmModule) {
+  llvm::SMDiagnostic diagnosticErr;
+  std::unique_ptr<llvm::Module> bitcodeModule(
+      llvm::parseIRFile(filename, diagnosticErr, llvmModule.getContext()));
+  if (!bitcodeModule) {
+    llvm::errs() << "Error loading IR module: " << filename << '\n';
+    return failure();
+  }
+
+  // Override the data layout of the module we're importing with that of the
+  // destination module. This avoids a warning from the linker.
+  bitcodeModule->setDataLayout(llvmModule.getDataLayout());
+
+  // Link only what is needed and internalize everything that was pulled in,
+  // so unused libdevice functions don't end up in the final module.
+  llvm::Linker linker(llvmModule);
+  if (linker.linkInModule(
+          std::move(bitcodeModule), llvm::Linker::Flags::LinkOnlyNeeded,
+          [](llvm::Module &m, const llvm::StringSet<> &gvs) {
+            llvm::internalizeModule(m, [&gvs](const llvm::GlobalValue &gv) {
+              return !gv.hasName() || (gvs.count(gv.getName()) == 0);
+            });
+          })) {
+    llvm::errs() << "Error linking bitcode module from " << filename << '\n';
+    return failure();
+  }
+
+  return success();
+}
+
 std::optional<std::string>
 gpu::SerializeToBlobPass::translateToISA(llvm::Module &llvmModule,
                                          llvm::TargetMachine &targetMachine) {
   llvmModule.setDataLayout(targetMachine.createDataLayout());
 
+  // Link in CUDA's libdevice bitcode file which has NVVM bitcode for common
+  // math primitives and bit-manipulation functions.
+  // TODO: Replace this hardcoded path with a cmake provided value.
+  // TODO: In the future, this should be removed in favor of any linking support
+  // that may be added to the LLVM NVPTX backend.
+  const std::string libdevicePath =
+      "/usr/local/cuda/nvvm/libdevice/libdevice.10.bc";
+  if (failed(linkBitcode(libdevicePath, llvmModule)))
+    return std::nullopt;
+
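
The two TODOs above are worth heeding. Until a cmake-provided value exists, one way to soften the hardcoded path is an environment override along these lines (a sketch; the variable name MLIR_LIBDEVICE_PATH is made up for illustration, and ::getenv needs <cstdlib>):

    // Hypothetical override so users can point at their own CUDA installation
    // instead of the hardcoded default path.
    std::string libdevicePath = "/usr/local/cuda/nvvm/libdevice/libdevice.10.bc";
    if (const char *env = ::getenv("MLIR_LIBDEVICE_PATH"))
      libdevicePath = env;
    if (failed(linkBitcode(libdevicePath, llvmModule)))
      return std::nullopt;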

It’s not inaccurate: you need a patch to make it work, thus it’s not supported.

Having said that, your patch will do the trick in some cases, as it hardcodes the location of the bitcode library.

In what cases will it not work and fail to produce a successful compilation? The hardcoding exists simply because the location isn’t passed in from the build setup. Otherwise, the driver that links has to know where the bitcode file lives in the CUDA setup, and some mechanism would need to supply it (namely, the larger system integrating such a pass into a compiler).

Sure, but your wording suggests to readers that this isn’t possible to accomplish with the current setup (as opposed to “not yet supported”).

Thanks for your patch @bondhugula, I can confirm it works. I’m wondering if there is a way to extend it so that libdevice.bc does not need to be read from the filesystem repeatedly, but the llvm::Linker APIs I can find all consume a unique_ptr<llvm::Module>, so the libdevice module needs to be either read in fresh each time or copied on use, both of which sound expensive.

In terms of efficiency for linking something like libdevice, LLVM has the ability to lazy-load bitcode, selectively materialize functions, and import just these into the current module (this is basically the way ThinLTO processing works).
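
Concretely, in the linkBitcode helper above, one could swap parseIRFile for its lazy counterpart (a sketch; llvm::getLazyIRFileModule defers materializing function bodies until the linker actually pulls them in, which pairs well with the LinkOnlyNeeded flag):

    // Lazily load libdevice: only the function bodies the linker actually
    // imports get materialized, instead of parsing the whole library up front.
    llvm::SMDiagnostic diagnosticErr;
    std::unique_ptr<llvm::Module> bitcodeModule = llvm::getLazyIRFileModule(
        libdevicePath, diagnosticErr, llvmModule.getContext());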

XLA/GPU only recently started using it ([XLA/GPU] Use lazy loading when linking LLVM bitcode library · openxla/xla@0213963 · GitHub) and we saw big improvements in compile time for the linking phase.
