Codegen slower with new PassManager

I’m currently making the move to the new pass manager and have arrived at a working version. However, the codegen takes significantly more time now than before with the legacy pass manager.
My application builds a module, links in CUDA libdevice, runs some passes and calls NVPTX codegen. Here’s the setup that I use with the legacy manager:

// build Mod, link in libdevice
llvm::legacy::PassManager PM;
PM.add( llvm::createInternalizePass( all_but_kernel_name ) );
PM.add( llvm::createNVVMReflectPass( sm ));
PM.add( llvm::createGlobalDCEPass() );
PM.run(*Mod);
// NVPTX codegen is fast

With the new pass manager I’m doing the following, and the subsequent codegen takes significantly longer (more than 10x):

// build Mod, link in libdevice
LoopAnalysisManager LAM;
FunctionAnalysisManager FAM;
CGSCCAnalysisManager CGAM;
ModuleAnalysisManager MAM;

PassBuilder PB(TargetMachine.get());

PB.registerModuleAnalyses(MAM);
PB.registerCGSCCAnalyses(CGAM);
PB.registerFunctionAnalyses(FAM);
PB.registerLoopAnalyses(LAM);
PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

ModulePassManager MPM = PB.buildPerModuleDefaultPipeline(OptimizationLevel::O2);
MPM.addPass(InternalizePass(all_but_kernel_name));
MPM.addPass(GlobalDCEPass());

MPM.run(*Mod,MAM);
// NVPTX codegen takes much more time.

I timed the MPM.run call and it’s pretty fast. But the codegen takes more time now.
Is this the closest I can get to the previous setup?

You’re running the -O2 pipeline with the new pass manager but not with the old one. That can increase code size due to optimizations like inlining, which gives codegen more work to do.

I am aware, but I don’t know how to avoid the call to PB.buildPerModuleDefaultPipeline, because this is the one that schedules the NVVM reflect pass right at the beginning. I tried setting the optimization level to O0, but that ran into problems with LLVM 16:

llvm/lib/Passes/PassBuilderPipelines.cpp:1399: llvm::ModulePassManager llvm::PassBuilder::buildPerModuleDefaultPipeline(llvm::OptimizationLevel, bool): Assertion `Level != OptimizationLevel::O0 && "Must request optimizations for the default pipeline!"' failed.

With LLVM 17 this is fine, though. Still, even with O0 set, the codegen time doesn’t go down.

How can I make sure NVVM reflect is correctly scheduled without any call to PB.buildPerModuleDefaultPipeline?
(The module I’m building doesn’t have any stack allocations (alloca) or loops, so there is no need for any standard optimizations.)

PB.buildO0DefaultPipeline(OptimizationLevel::O0) should do it

(It was a weird quirk that you had to call buildO0DefaultPipeline() instead of buildPerModuleDefaultPipeline(OptimizationLevel::O0); that was changed recently.)

Yes, that works with LLVM 16. Thanks!