NVPTX codegen surprisingly slow on some functions

With the NVPTX backend I am seeing surprisingly long code-generation times for a large number of our functions.
Although I’m not entirely certain, a criterion for whether a function compiles slowly or quickly seems to be whether the function makes use of the ‘switch’ instruction or not.

To be more precise: The codegen executes on the compute nodes of the Summit computer system at Oak Ridge National Lab (IBM Power9 CPU).
Our program generates ~300 GPU kernels at runtime, and the codegen for about 100 of them is pleasantly fast: I’m seeing a few milliseconds each.
But the codegen for the other ~200 kernels takes significantly longer; each of those kernels takes more than a second (!) to compile to PTX.
So I’m seeing a two-orders-of-magnitude increase in compilation time for those kernels (likely because they use the ‘switch’ instruction).
The accumulated time for code generation exceeds 200 seconds for these ~200 functions.
This has made our runs a lot less efficient, to the point where it’s not practical anymore.

Please find attached an example (a Pastebin.com link) of one of the functions/modules that takes a long time to compile.
It contains the ‘switch’ instruction (a nested switch, actually). (There’s no original C++ version of this kernel, as it was built directly with the IR builder at runtime, based on some expression-template magic.)
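Purely for illustration, not our actual generator: below is a sketch of the kind of IR-builder construction involved. The function name, signature, and toy nested-switch shape are all made up.

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

// Toy stand-in for our runtime generator: a kernel whose body is a nested
// switch built with IRBuilder (names and structure are invented for this example).
void buildToyKernel(llvm::Module &M) {
  llvm::LLVMContext &Ctx = M.getContext();
  llvm::IRBuilder<> B(Ctx);
  auto *I32 = B.getInt32Ty();
  auto *F = llvm::Function::Create(
      llvm::FunctionType::get(I32, {I32, I32}, /*isVarArg=*/false),
      llvm::Function::ExternalLinkage, "toy_kernel", M);

  auto *Entry   = llvm::BasicBlock::Create(Ctx, "entry", F);
  auto *OuterC0 = llvm::BasicBlock::Create(Ctx, "outer.case0", F);
  auto *InnerC0 = llvm::BasicBlock::Create(Ctx, "inner.case0", F);
  auto *Default = llvm::BasicBlock::Create(Ctx, "default", F);

  // Outer switch on the first argument ...
  B.SetInsertPoint(Entry);
  auto *Outer = B.CreateSwitch(F->getArg(0), Default, /*NumCases=*/1);
  Outer->addCase(B.getInt32(0), OuterC0);

  // ... with another switch nested inside its first case.
  B.SetInsertPoint(OuterC0);
  auto *Inner = B.CreateSwitch(F->getArg(1), Default, /*NumCases=*/1);
  Inner->addCase(B.getInt32(0), InnerC0);

  B.SetInsertPoint(InnerC0);
  B.CreateRet(B.getInt32(1));
  B.SetInsertPoint(Default);
  B.CreateRet(B.getInt32(0));
}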

I am using LLVM 13.0.1 in a ‘Release’ build with the NVPTX backend.

What are my options to improve the timings for the code generation?

I am not familiar with the NVPTX backend, but I’m wondering if there’s a ‘magic number’ somewhere in the code that could be modified such that the codegen would run much faster while generating almost the same code.

Thanks,
Frank

Can you get -time-passes output? It’s hard to say what would help without knowing where the time is spent.

The only pass that’s running is the codegen pass:

if (TargetMachine->addPassesToEmitFile(PM, bos, nullptr, llvm::CGFT_AssemblyFile)) {

Setting the option you suggested, would this give further insight into where within the codegen pass the time is spent? If so, I’d be happy to add this switch and see the time distribution. But if it just times the codegen pass as a whole, then I already have that timing (as posted earlier).

LLVM optimization and code generation are split into a bunch of distinct “passes”: stuff like loop strength reduction, instruction selection, register allocation. -time-passes shows how much time each of those parts takes.
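If the codegen runs embedded in your program rather than through llc, you can also turn the same report on programmatically. A minimal sketch, assuming the legacy pass manager path you showed above, and that the global llvm::TimePassesIsEnabled flag plus llvm::reportAndResetTimings() from llvm/IR/PassTimingInfo.h are what back -time-passes in your LLVM 13 build:

#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/PassTimingInfo.h"
#include "llvm/Support/raw_ostream.h"

// Mirror what `-time-passes` does on the llc command line: enable the global
// flag before running codegen, then dump the per-pass timings and reset the timers.
void runCodegenWithTimings(llvm::legacy::PassManager &PM, llvm::Module &M) {
  llvm::TimePassesIsEnabled = true;

  // ... TargetMachine->addPassesToEmitFile(PM, bos, nullptr,
  //                                        llvm::CGFT_AssemblyFile) as before ...
  PM.run(M);

  llvm::reportAndResetTimings(&llvm::errs());
}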

I ran LLC to generate the PTX for a given module (see below). Observations:

  1. If I read the output correctly, the compilation only took 0.25 seconds, whereas when I call codegen from within our program it takes much longer (over 1 second).
  2. I wonder what ‘GPU Load and Store Vectorizer’ is doing. If this is indeed utilizing the GPU installed on the system, then this might pose a problem in two ways: a) There are still compute kernels running on the GPU (they execute asynchronously to the CPU thread). b) There are 6 threads running on the Power9 CPU, each running its own LLVM codegen, so 6 codegen processes in parallel. There’s the risk of them all utilizing the same physical GPU. (There are 6 GPUs installed per CPU.)

llc -march=nvptx64 -mcpu=sm_70 -time-passes < module_evalp21.bc > module_evalp21.ptx
===-------------------------------------------------------------------------===
... Pass execution timing report ...
===-------------------------------------------------------------------------===
Total Execution Time: 0.2425 seconds (0.2426 wall clock)

---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
0.0823 ( 35.0%) 0.0004 ( 5.4%) 0.0827 ( 34.1%) 0.0827 ( 34.1%) GPU Load and Store Vectorizer
0.0573 ( 24.4%) 0.0000 ( 0.0%) 0.0573 ( 23.6%) 0.0573 ( 23.6%) Loop Strength Reduction
0.0229 ( 9.7%) 0.0024 ( 31.4%) 0.0253 ( 10.4%) 0.0253 ( 10.4%) Straight line strength reduction
0.0176 ( 7.5%) 0.0000 ( 0.0%) 0.0176 ( 7.2%) 0.0176 ( 7.2%) Nary reassociation
0.0170 ( 7.2%) 0.0000 ( 0.1%) 0.0170 ( 7.0%) 0.0170 ( 7.0%) NVPTX DAG->DAG Pattern Instruction Selection
0.0089 ( 3.8%) 0.0030 ( 40.0%) 0.0119 ( 4.9%) 0.0120 ( 4.9%) Split GEPs to a variadic base and a constant offset for better CSE
0.0088 ( 3.7%) 0.0000 ( 0.1%) 0.0088 ( 3.6%) 0.0088 ( 3.6%) NVPTX Assembly Printer
0.0048 ( 2.0%) 0.0000 ( 0.0%) 0.0048 ( 2.0%) 0.0048 ( 2.0%) Early CSE
0.0047 ( 2.0%) 0.0000 ( 0.0%) 0.0047 ( 1.9%) 0.0047 ( 1.9%) Induction Variable Users
0.0014 ( 0.6%) 0.0000 ( 0.0%) 0.0014 ( 0.6%) 0.0014 ( 0.6%) Machine Common Subexpression Elimination
0.0012 ( 0.5%) 0.0000 ( 0.0%) 0.0012 ( 0.5%) 0.0012 ( 0.5%) CodeGen Prepare
0.0000 ( 0.0%) 0.0011 ( 14.5%) 0.0011 ( 0.4%) 0.0011 ( 0.4%) Infer address spaces
0.0010 ( 0.4%) 0.0000 ( 0.0%) 0.0010 ( 0.4%) 0.0010 ( 0.4%) Early CSE #2
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Profile summary info
0.2350 (100.0%) 0.0075 (100.0%) 0.2425 (100.0%) 0.2426 (100.0%) Total

===-------------------------------------------------------------------------===
DWARF Emission
===-------------------------------------------------------------------------===
Total Execution Time: 0.0029 seconds (0.0029 wall clock)

---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
0.0029 (100.0%) 0.0000 (100.0%) 0.0029 (100.0%) 0.0029 (100.0%) Debug Info Emission
0.0029 (100.0%) 0.0000 (100.0%) 0.0029 (100.0%) 0.0029 (100.0%) Total

===-------------------------------------------------------------------------===
LLVM IR Parsing
===-------------------------------------------------------------------------===
Total Execution Time: 0.0075 seconds (0.0075 wall clock)

---User Time--- --User+System-- ---Wall Time--- --- Name ---
0.0075 (100.0%) 0.0075 (100.0%) 0.0075 (100.0%) Parse IR
0.0075 (100.0%) 0.0075 (100.0%) 0.0075 (100.0%) Total

===-------------------------------------------------------------------------===
Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
Total Execution Time: 0.0123 seconds (0.0123 wall clock)

---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
0.0035 ( 28.5%) 0.0000 ( 0.0%) 0.0035 ( 28.4%) 0.0035 ( 28.5%) DAG Combining 1
0.0030 ( 24.0%) 0.0000 ( 0.0%) 0.0030 ( 24.0%) 0.0029 ( 24.0%) DAG Combining 2
0.0023 ( 18.9%) 0.0000 (100.0%) 0.0023 ( 18.9%) 0.0023 ( 18.9%) Instruction Selection
0.0015 ( 12.4%) 0.0000 ( 0.0%) 0.0015 ( 12.4%) 0.0015 ( 12.4%) Instruction Scheduling
0.0008 ( 6.4%) 0.0000 ( 0.0%) 0.0008 ( 6.4%) 0.0008 ( 6.4%) Instruction Creation
0.0007 ( 5.3%) 0.0000 ( 0.0%) 0.0007 ( 5.3%) 0.0007 ( 5.3%) DAG Legalization
0.0004 ( 3.0%) 0.0000 ( 0.0%) 0.0004 ( 3.0%) 0.0004 ( 3.0%) Type Legalization
0.0001 ( 0.7%) 0.0000 ( 0.0%) 0.0001 ( 0.7%) 0.0001 ( 0.8%) Vector Legalization
0.0001 ( 0.6%) 0.0000 ( 0.0%) 0.0001 ( 0.6%) 0.0001 ( 0.6%) Instruction Scheduling Cleanup
0.0000 ( 0.2%) 0.0000 ( 0.0%) 0.0000 ( 0.2%) 0.0000 ( 0.2%) DAG Combining after legalize types
0.0123 (100.0%) 0.0000 (100.0%) 0.0123 (100.0%) 0.0123 (100.0%) Total

On second thought, I don’t think this is using the physical GPU to compile the kernel.
I might have to try calling ‘llc’ from our program via a system() call to see if this speeds things up.

PTX is a scalar architecture. Why is there a ‘vectorizer’ at work?

No, it wasn’t; LLVM always runs on the CPU. The PTX is generated (by LLVM) before being sent to the GPU, and is then possibly compiled further by a separate proprietary online compiler.

I doubt it. Unless you customize the optimization/codegen pipeline, I think llc is using the same set of passes to generate PTX.

I was looking through the list of transformation passes… Is the “GPU Load and Store Vectorizer” the same as the ‘bb-vectorizer’? I don’t see any other vectorizer in the list.

I’d like to mimic in our code what LLC is doing in terms of passes. Where in the code can I find the list of passes that are run when the target is ‘nvptx’? Is that defined in LLC or in the NVPTX backend?

EDIT: I see it. The list of passes is not as linear as I had hoped.

-debug-pass=Structure will show you the cumulative set of passes run.
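For instance, added to the llc invocation from earlier (the pass structure goes to stderr, so the PTX on stdout is unaffected):

llc -march=nvptx64 -mcpu=sm_70 -debug-pass=Structure < module_evalp21.bc > module_evalp21.ptx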

The LoadStoreVectorizer does have some quadratic behavior in it.