With the NVPTX backend I am seeing increased times for code generation on a large number of our functions.
Although I’m not entirely certain, the criterion for whether a function compiles slowly or quickly seems to be whether the function makes use of the ‘switch’ instruction or not.
To be more precise: The codegen executes on the compute nodes of the Summit computer system at Oak Ridge National Lab (IBM Power9 CPU).
Our program generates ~300 GPU kernels at runtime. The codegen for about 100 of them is pleasantly fast; I’m seeing a few milliseconds each.
But the codegen for the other ~200 kernels takes significantly longer: each of those kernels takes more than a second (!) to compile to PTX.
So I’m seeing a two-orders-of-magnitude increase in compilation time for those kernels (likely because they use the ‘switch’ instruction).
The accumulated time for code generation exceeds 200 seconds for these ~200 functions.
This has made our runs a lot less efficient, to the point where it’s not practical anymore.
Please find attached an example (on Pastebin.com) of one of the functions/modules that takes long to compile.
It contains the ‘switch’ instruction (a nested switch, actually). There’s no original C++ version of this kernel, as it gets built directly with the IR builder at runtime, based on some expression-template magic; a rough sketch of that construction is below.
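For illustration, here is a minimal, self-contained sketch of how such a function gets assembled with the IR builder. This is not our actual generator; the function name `kernel`, the case counts, and the returned values are all made up:

```cpp
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

int main() {
  LLVMContext Ctx;
  Module M("module", Ctx);
  IRBuilder<> B(Ctx);

  // i32 kernel(i32 %outer, i32 %inner)
  auto *I32 = B.getInt32Ty();
  auto *FTy = FunctionType::get(I32, {I32, I32}, /*isVarArg=*/false);
  auto *F = Function::Create(FTy, Function::ExternalLinkage, "kernel", M);

  auto *Entry = BasicBlock::Create(Ctx, "entry", F);
  auto *Default = BasicBlock::Create(Ctx, "default", F);
  auto *Exit = BasicBlock::Create(Ctx, "exit", F);

  // Exit block: collect one value per leaf through a phi and return it.
  B.SetInsertPoint(Exit);
  auto *Phi = B.CreatePHI(I32, /*NumReservedValues=*/17, "result");
  B.CreateRet(Phi);

  B.SetInsertPoint(Default);
  B.CreateBr(Exit);
  Phi->addIncoming(B.getInt32(-1), Default);

  // Outer switch; every outer case carries its own inner switch.
  B.SetInsertPoint(Entry);
  auto *OuterSw = B.CreateSwitch(F->getArg(0), Default, /*NumCases=*/4);
  for (uint32_t I = 0; I < 4; ++I) {
    auto *CaseBB = BasicBlock::Create(Ctx, "outer.case", F);
    OuterSw->addCase(B.getInt32(I), CaseBB);
    B.SetInsertPoint(CaseBB);
    auto *InnerSw = B.CreateSwitch(F->getArg(1), Default, /*NumCases=*/4);
    for (uint32_t J = 0; J < 4; ++J) {
      auto *LeafBB = BasicBlock::Create(Ctx, "inner.case", F);
      InnerSw->addCase(B.getInt32(J), LeafBB);
      B.SetInsertPoint(LeafBB);
      B.CreateBr(Exit);
      Phi->addIncoming(B.getInt32(I * 4 + J), LeafBB);
    }
  }

  verifyModule(M, &errs());
  M.print(outs(), nullptr);
}
```

The real kernels are of course larger, but the control-flow shape (a switch whose cases each contain another switch) is the same.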
I am using LLVM 13.0.1 in ‘Release’ build with NVPTX backend.
What are my options to improve the timings for the code generation?
I am not familiar with the NVPTX backend, but I’m wondering if there’s a ‘magic number’ somewhere in the code that could be modified such that the codegen would run much faster while generating almost the same code.
if (TargetMachine->addPassesToEmitFile(PM, bos, nullptr, llvm::CGFT_AssemblyFile)) {
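In case it helps, this is roughly the setup surrounding that call, reduced to a self-contained sketch. The triple, `sm_70`, and the `emitPTX` name are placeholders rather than our exact code, and the module is assumed to already carry the NVPTX data layout:

```cpp
// Simplified sketch of the emission path (LLVM 13, legacy pass manager).
#include "llvm/ADT/SmallString.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/TargetRegistry.h"
#include "llvm/Support/TargetSelect.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"
#include <memory>
#include <string>

using namespace llvm;

std::string emitPTX(Module &M) {
  // One-time NVPTX target registration.
  LLVMInitializeNVPTXTargetInfo();
  LLVMInitializeNVPTXTarget();
  LLVMInitializeNVPTXTargetMC();
  LLVMInitializeNVPTXAsmPrinter();

  std::string Err;
  const Target *T = TargetRegistry::lookupTarget("nvptx64-nvidia-cuda", Err);
  if (!T)
    report_fatal_error(Err);

  std::unique_ptr<TargetMachine> TM(T->createTargetMachine(
      "nvptx64-nvidia-cuda", /*CPU=*/"sm_70", /*Features=*/"",
      TargetOptions(), /*RM=*/None));

  SmallString<0> PTX;
  raw_svector_ostream OS(PTX);

  legacy::PassManager PM;
  if (TM->addPassesToEmitFile(PM, OS, nullptr, llvm::CGFT_AssemblyFile))
    report_fatal_error("NVPTX backend cannot emit assembly");
  PM.run(M); // This is where the >1 s per kernel is spent.
  return PTX.str().str();
}
```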
Setting the option you suggested, would this give further insight into where in the codegen pipeline the time is spent? If so, I’d be happy to add this switch and look at the time distribution. But if it just times the codegen pass as a whole, then I have that timing already (as posted earlier).
LLVM optimization and code generation are split into a bunch of distinct “passes”: things like loop strength reduction, instruction selection, and register allocation. -time-passes shows how much time each of those parts takes.
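Since you drive codegen through the C++ API rather than llc, the same data can be collected in-process: with the legacy pass manager, -time-passes is backed by a global flag declared in llvm/IR/PassTimingInfo.h. A small sketch (the function names here are mine, not an LLVM API):

```cpp
#include "llvm/IR/PassTimingInfo.h"
#include "llvm/Support/raw_ostream.h"

// Call once before building/running the pass manager:
// the legacy PM then records per-pass timers.
void enablePassTiming() { llvm::TimePassesIsEnabled = true; }

// Call after PM.run(M) to print the per-pass timing table and reset it.
void dumpPassTimings() { llvm::reportAndResetTimings(&llvm::errs()); }
```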
I ran llc to generate the PTX for a given module (see below). Observations:
If I read the output correctly, the compilation only took 0.25 seconds, whereas when I call codegen from within our program it takes much longer (over 1 second).
I wonder what the ‘GPU Load and Store Vectorizer’ is doing. If this is indeed utilizing the GPU installed in the system, then that might pose a problem in two ways: a) there are still compute kernels running on the GPU (they execute asynchronously with respect to the CPU thread); b) there are 6 threads running on the Power9 CPU, each running its own LLVM codegen, i.e. 6 codegen processes in parallel, with the risk of all of them utilizing the same physical GPU. (There are 6 GPUs installed per CPU.)
On second thought, I don’t think this is using the physical GPU to compile the kernel.
I might have to try calling ‘llc’ from our program via a system() call to see if this speeds things up.
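Something along these lines, where the file names and the -mcpu value are placeholders:

```cpp
#include <cstdlib>

// Crude experiment: write the module out as kernel.ll beforehand, then
// shell out to llc and compare wall-clock time against the in-process
// codegen path.
int runLlc() {
  return std::system(
      "llc -march=nvptx64 -mcpu=sm_70 -O3 kernel.ll -o kernel.ptx");
}
```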
No, it isn’t: LLVM always runs on the CPU. The PTX is generated (by LLVM) before being sent to the GPU, where it may then be compiled by another, proprietary online compiler.
I doubt it. Unless you customize the optimization/codegen pipeline, llc uses the same set of passes to generate PTX.
I was looking through the list of transformation passes… Is the “GPU Load and Store Vectorizer” the same as the ‘bb-vectorizer’? I don’t see any other vectorizer in the list.
I’d like to mimic in our code what llc is doing in terms of passes. Where in the code can I find the list of passes that are run when the target is ‘nvptx’? Is that defined in llc or in the NVPTX backend?
EDIT: I see it. The list of passes is not as linear as I had hoped.
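One thing that helped me inspect the pipeline, in case it’s useful to others: -debug-pass=Structure is an ordinary cl::opt of the legacy pass manager, so (this is an assumption on my part, not something I’ve seen documented for API users) it can be enabled from inside a program too:

```cpp
#include "llvm/Support/CommandLine.h"

// Make the legacy pass manager print its pass structure, the way
// `llc -debug-pass=Structure` does, from inside our own program.
void printPassStructure() {
  const char *Args[] = {"ourprog", "-debug-pass=Structure"};
  llvm::cl::ParseCommandLineOptions(2, Args);
  // Subsequent addPassesToEmitFile / PM.run calls now dump the pipeline.
}
```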