NVPTX codegen surprisingly slow on some functions

With the NVPTX backend I am seeing surprisingly long code-generation times for a large number of our functions.
Although I’m not entirely certain, a criterion for whether a function compiles slowly or quickly seems to be whether the function makes use of the ‘switch’ instruction or not.

To be more precise: The codegen executes on the compute nodes of the Summit computer system at Oak Ridge National Lab (IBM Power9 CPU).
Our program generates ~300 GPU kernels at runtime, and the codegen for about 100 of them is pleasantly fast: I’m seeing a few milliseconds each.
But the codegen for the other ~200 kernels takes significantly longer; each of those kernels takes more than a second (!) to compile to PTX.
So I’m seeing a two-orders-of-magnitude increase in compilation time for those kernels (likely because they use the ‘switch’ instruction).
The accumulated time for code generation exceeds 200 seconds for these ~200 functions.
This has made our runs a lot less efficient, to the point where it’s not practical anymore.

Please find attached an example (a Pastebin.com link) of one of the functions/modules that takes a long time to compile.
It contains the ‘switch’ instruction (a nested switch, actually). (There’s no original C++ version of this kernel, as it was built directly with the IR builder at runtime, based on some expression-template magic.)
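Purely for illustration, not our actual generator: below is a sketch of the kind of IR-builder construction involved. The function name, signature, and toy nested-switch shape are all made up.

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

// Toy stand-in for our runtime generator: a kernel whose body is a nested
// switch built with IRBuilder (names and structure are invented for this example).
void buildToyKernel(llvm::Module &M) {
  llvm::LLVMContext &Ctx = M.getContext();
  llvm::IRBuilder<> B(Ctx);
  auto *I32 = B.getInt32Ty();
  auto *F = llvm::Function::Create(
      llvm::FunctionType::get(I32, {I32, I32}, /*isVarArg=*/false),
      llvm::Function::ExternalLinkage, "toy_kernel", M);

  auto *Entry   = llvm::BasicBlock::Create(Ctx, "entry", F);
  auto *OuterC0 = llvm::BasicBlock::Create(Ctx, "outer.case0", F);
  auto *InnerC0 = llvm::BasicBlock::Create(Ctx, "inner.case0", F);
  auto *Default = llvm::BasicBlock::Create(Ctx, "default", F);

  // Outer switch on the first argument ...
  B.SetInsertPoint(Entry);
  auto *Outer = B.CreateSwitch(F->getArg(0), Default, /*NumCases=*/1);
  Outer->addCase(B.getInt32(0), OuterC0);

  // ... with another switch nested inside its first case.
  B.SetInsertPoint(OuterC0);
  auto *Inner = B.CreateSwitch(F->getArg(1), Default, /*NumCases=*/1);
  Inner->addCase(B.getInt32(0), InnerC0);

  B.SetInsertPoint(InnerC0);
  B.CreateRet(B.getInt32(1));
  B.SetInsertPoint(Default);
  B.CreateRet(B.getInt32(0));
}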

I am using LLVM 13.0.1 in a ‘Release’ build with the NVPTX backend.

What are my options to improve the timings for the code generation?

I am not familiar with the NVPTX backend, but I’m wondering if there’s a ‘magic number’ somewhere in the code that could be modified such that the codegen would run much faster while generating almost the same code.

Thanks,
Frank

Can you get -time-passes output? It’s hard to say what would help without knowing where the time is spent.

The only pass that’s running is the codegen pass:

if (TargetMachine->addPassesToEmitFile(PM, bos, nullptr, llvm::CGFT_AssemblyFile)) {

Setting the option you suggested, would this give further insight into where within the codegen pass the time is spent? If so, I’d be happy to add this switch and see the time distribution. But if it just times the codegen pass as a whole, then I already have that timing (as posted earlier).

LLVM optimization and code generation are split into a bunch of distinct “passes”: stuff like loop strength reduction, instruction selection, register allocation. -time-passes shows how much time each of those parts takes.
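If the codegen runs embedded in your program rather than through llc, you can also turn the same report on programmatically. A minimal sketch, assuming the legacy pass manager path you showed above, and that the global llvm::TimePassesIsEnabled flag plus llvm::reportAndResetTimings() from llvm/IR/PassTimingInfo.h are what back -time-passes in your LLVM 13 build:

#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/PassTimingInfo.h"
#include "llvm/Support/raw_ostream.h"

// Mirror what `-time-passes` does on the llc command line: enable the global
// flag before running codegen, then dump the per-pass timings and reset the timers.
void runCodegenWithTimings(llvm::legacy::PassManager &PM, llvm::Module &M) {
  llvm::TimePassesIsEnabled = true;

  // ... TargetMachine->addPassesToEmitFile(PM, bos, nullptr,
  //                                        llvm::CGFT_AssemblyFile) as before ...
  PM.run(M);

  llvm::reportAndResetTimings(&llvm::errs());
}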

I ran LLC to generate the PTX for a given module (see below). Observations:

  1. If I read the output correctly, the compilation only took 0.25 seconds, whereas when I call codegen from within our program it takes much longer (over 1 second).
  2. I wonder what ‘GPU Load and Store Vectorizer’ is doing. If this is indeed utilizing the GPU installed on the system, then this might pose a problem in two ways: a) There are still compute kernels running on the GPU (they execute asynchronously to the CPU thread). b) There are 6 threads running on the Power9 CPU, each running its own LLVM codegen, so 6 codegen processes in parallel. There’s the risk of them all utilizing the same physical GPU. (There are 6 GPUs installed per CPU.)

llc -march=nvptx64 -mcpu=sm_70 -time-passes < module_evalp21.bc > module_evalp21.ptx
===-------------------------------------------------------------------------===
... Pass execution timing report ...
===-------------------------------------------------------------------------===
Total Execution Time: 0.2425 seconds (0.2426 wall clock)

---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
0.0823 ( 35.0%) 0.0004 ( 5.4%) 0.0827 ( 34.1%) 0.0827 ( 34.1%) GPU Load and Store Vectorizer
0.0573 ( 24.4%) 0.0000 ( 0.0%) 0.0573 ( 23.6%) 0.0573 ( 23.6%) Loop Strength Reduction
0.0229 ( 9.7%) 0.0024 ( 31.4%) 0.0253 ( 10.4%) 0.0253 ( 10.4%) Straight line strength reduction
0.0176 ( 7.5%) 0.0000 ( 0.0%) 0.0176 ( 7.2%) 0.0176 ( 7.2%) Nary reassociation
0.0170 ( 7.2%) 0.0000 ( 0.1%) 0.0170 ( 7.0%) 0.0170 ( 7.0%) NVPTX DAG->DAG Pattern Instruction Selection
0.0089 ( 3.8%) 0.0030 ( 40.0%) 0.0119 ( 4.9%) 0.0120 ( 4.9%) Split GEPs to a variadic base and a constant offset for better CSE
0.0088 ( 3.7%) 0.0000 ( 0.1%) 0.0088 ( 3.6%) 0.0088 ( 3.6%) NVPTX Assembly Printer
0.0048 ( 2.0%) 0.0000 ( 0.0%) 0.0048 ( 2.0%) 0.0048 ( 2.0%) Early CSE
0.0047 ( 2.0%) 0.0000 ( 0.0%) 0.0047 ( 1.9%) 0.0047 ( 1.9%) Induction Variable Users
0.0014 ( 0.6%) 0.0000 ( 0.0%) 0.0014 ( 0.6%) 0.0014 ( 0.6%) Machine Common Subexpression Elimination
0.0012 ( 0.5%) 0.0000 ( 0.0%) 0.0012 ( 0.5%) 0.0012 ( 0.5%) CodeGen Prepare
0.0000 ( 0.0%) 0.0011 ( 14.5%) 0.0011 ( 0.4%) 0.0011 ( 0.4%) Infer address spaces
0.0010 ( 0.4%) 0.0000 ( 0.0%) 0.0010 ( 0.4%) 0.0010 ( 0.4%) Early CSE #2
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Profile summary info
0.2350 (100.0%) 0.0075 (100.0%) 0.2425 (100.0%) 0.2426 (100.0%) Total

===-------------------------------------------------------------------------===
DWARF Emission
===-------------------------------------------------------------------------===
Total Execution Time: 0.0029 seconds (0.0029 wall clock)

---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
0.0029 (100.0%) 0.0000 (100.0%) 0.0029 (100.0%) 0.0029 (100.0%) Debug Info Emission
0.0029 (100.0%) 0.0000 (100.0%) 0.0029 (100.0%) 0.0029 (100.0%) Total

===-------------------------------------------------------------------------===
LLVM IR Parsing
===-------------------------------------------------------------------------===
Total Execution Time: 0.0075 seconds (0.0075 wall clock)

---User Time--- --User+System-- ---Wall Time--- --- Name ---
0.0075 (100.0%) 0.0075 (100.0%) 0.0075 (100.0%) Parse IR
0.0075 (100.0%) 0.0075 (100.0%) 0.0075 (100.0%) Total

===-------------------------------------------------------------------------===
Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
Total Execution Time: 0.0123 seconds (0.0123 wall clock)

---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
0.0035 ( 28.5%) 0.0000 ( 0.0%) 0.0035 ( 28.4%) 0.0035 ( 28.5%) DAG Combining 1
0.0030 ( 24.0%) 0.0000 ( 0.0%) 0.0030 ( 24.0%) 0.0029 ( 24.0%) DAG Combining 2
0.0023 ( 18.9%) 0.0000 (100.0%) 0.0023 ( 18.9%) 0.0023 ( 18.9%) Instruction Selection
0.0015 ( 12.4%) 0.0000 ( 0.0%) 0.0015 ( 12.4%) 0.0015 ( 12.4%) Instruction Scheduling
0.0008 ( 6.4%) 0.0000 ( 0.0%) 0.0008 ( 6.4%) 0.0008 ( 6.4%) Instruction Creation
0.0007 ( 5.3%) 0.0000 ( 0.0%) 0.0007 ( 5.3%) 0.0007 ( 5.3%) DAG Legalization
0.0004 ( 3.0%) 0.0000 ( 0.0%) 0.0004 ( 3.0%) 0.0004 ( 3.0%) Type Legalization
0.0001 ( 0.7%) 0.0000 ( 0.0%) 0.0001 ( 0.7%) 0.0001 ( 0.8%) Vector Legalization
0.0001 ( 0.6%) 0.0000 ( 0.0%) 0.0001 ( 0.6%) 0.0001 ( 0.6%) Instruction Scheduling Cleanup
0.0000 ( 0.2%) 0.0000 ( 0.0%) 0.0000 ( 0.2%) 0.0000 ( 0.2%) DAG Combining after legalize types
0.0123 (100.0%) 0.0000 (100.0%) 0.0123 (100.0%) 0.0123 (100.0%) Total

On second thought, I don’t think this is using the physical GPU to compile the kernel.
I might have to try calling ‘llc’ from our program via a system() call to see if this speeds things up.

PTX is a scalar architecture. Why is there a ‘vectorizer’ at work?

No, it wasn’t; LLVM always runs on the CPU. The PTX is generated (by LLVM) before being sent to the GPU, and is then possibly compiled further by a separate proprietary online compiler.

I doubt it. Unless you customize the optimization/codegen pipeline, I think llc is using the same set of passes to generate PTX.

I was looking through the list of transformation passes… Is the “GPU Load and Store Vectorizer” the same as the ‘bb-vectorizer’? I don’t see any other vectorizer in the list.

I’d like to mimic in our code what LLC is doing in terms of passes. Where in the code can I find the list of passes that are run when the target is ‘nvptx’? Is that defined in LLC or in the NVPTX backend?

EDIT: I see it. The list of passes is not as linear as I had hoped.

-debug-pass=Structure will show you the cumulative set of passes run.
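For instance, added to the llc invocation from earlier (the pass structure goes to stderr, so the PTX on stdout is unaffected):

llc -march=nvptx64 -mcpu=sm_70 -debug-pass=Structure < module_evalp21.bc > module_evalp21.ptx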

The LoadStoreVectorizer does have some quadratic behavior in it.