[RFC][SPGO][PassPipelines] Adding InstCombinePass and SimplifyCFGPass before SampleProfileLoaderPass when building in SPGO mode

Background
In the SPGO build, guided by detailed and accurate sample counts, SampleProfileLoaderPass performs aggressive inlining of all hot functions well ahead of the general inlining passes. The current placement of SimplifyCFG and InstCombine around SampleProfileLoaderPass is as follows (without LTO):
SimplifyCFG → SROA → SampleProfileLoader → InstCombine → SimplifyCFG → Inliner
(In the general pipeline without SPGO, InstCombinePass and SimplifyCFGPass both run before the inliners.)

Issue
In this scenario, hot simple/short functions are inlined into a complicated context at a very early phase (SampleProfileLoader) without having been (fully) processed by SimplifyCFG or InstCombine. Although SimplifyCFG and InstCombine run several more times after SampleProfileLoader, both passes walk all basic blocks and all instructions in top-down order, so it is quite possible to miss or break locally optimal opportunities in the simple/short inlinee functions.
We (Intel) are doing SPGO/HWPGO performance tuning on some workloads and have run into several issues of this kind. Here is one simple example where the best optimization opportunity is missed:

@A = global i32 0

define i8 @callee(i8 %a, i8 %b) {
; A classic smax for i8: both operands are sign-extended to i32 for the
; compare, while the select stays on i8.
; With more comparisons there would be many more branches to simplify.
  %a.ext = sext i8 %a to i32
  %b.ext = sext i8 %b to i32
  %cmp = icmp sgt i32 %a.ext, %b.ext
  %max = select i1 %cmp, i8 %a, i8 %b
  ret i8 %max
}

define void @caller(ptr %p, i64 %n) {
entry:
  %val = load i32, ptr @A
  ; trunc from i32 to i8 here; callee() then sexts from i8 back to i32.
  %val.trunc = trunc i32 %val to i8
  br label %loop

loop:
  ; callee() is called in a loop, so after inlining the loop is a candidate
  ; for LoopVectorize and ends up as vectorized compare and select.
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %gep = getelementptr i8, ptr %p, i64 %i
  %elt = load i8, ptr %gep
  %max = call i8 @callee(i8 %elt, i8 %val.trunc)
  store i8 %max, ptr %gep   ; store the result so the loop is not dead
  %i.next = add i64 %i, 1
  %exit.cond = icmp eq i64 %i.next, %n
  br i1 %exit.cond, label %exit, label %loop

exit:
  ret void
}

Since callee() is inlined into caller() by SampleProfileLoader, the caller's "trunc i32 to i8" together with the inlined "sext i8 to i32" is optimized to "shl; ashr" by a later InstCombine run. That eventually produces vectorized icmp on i32 next to select on i8, which is very unfriendly to x86 vector instructions in the AVX2 set (no k-register support): the icmp width (8 x i32) differs from the select width (8 x i8), so pack or shuffle instructions have to be generated to align the widths.
If we could run InstCombine/SimplifyCFG on callee() before it is inlined, the "sext i8 to i32; sext i8 to i32; icmp i32; select i8" would be optimized to "icmp i8; select i8", and the final vector instructions are tidy and compact (cmp and select are both 32 x i8). That helps performance a lot.
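For reference, here is roughly what callee() looks like once InstCombine/SimplifyCFG have run over it before inlining; this is a hand-written sketch rather than actual pass output:

define i8 @callee(i8 %a, i8 %b) {
  ; The compare of two sexts is narrowed back to i8, so compare and select
  ; now have the same element width. (A newer InstCombine may canonicalize
  ; this pattern further into a call to the @llvm.smax.i8 intrinsic.)
  %cmp = icmp sgt i8 %a, %b
  %max = select i1 %cmp, i8 %a, i8 %b
  ret i8 %max
}

Inlining this form into the loop and vectorizing gives compare and select at the same element width, which is the compact shape described above.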

We could work around the above issue inside InstCombinePass by skipping the "trunc; sext" pattern under a specific condition, but I think adding such special-case conditions is tricky. It works like a patch: every time we hit a new case in SPGO we would have to write another patch, which bloats the LLVM code and makes it hard to read and maintain. Similar issues also show up in SimplifyCFG.
SimplifyCFG and InstCombine are already tuned to be strong and comprehensive enough to cover almost all optimization opportunities. The issue above only appears in SPGO mode and is hard to reproduce in the general mode, because there SimplifyCFG and InstCombine both run before the general inliners. So I see no need to add handling for SPGO-only special cases to the general SimplifyCFG and InstCombine passes.

Proposal
The root cause of this kind of issue is that, with SPGO enabled, inlining happens very early in SampleProfileLoader, and the later InstCombine and SimplifyCFG runs, working in top-down order, have a hard time catching locally optimal opportunities in the original callee functions. We could add specific pattern matching to SimplifyCFG or InstCombine themselves, but I do not think that is necessary, especially for cases that only show up in SPGO mode.
So I propose adding InstCombine and SimplifyCFG before SampleProfileLoaderPass, only under SPGO mode. This is consistent with the logic of the general mode, which runs InstCombine and SimplifyCFG before the inliner.

At least this specific problem looks like an InstCombine failure that needs to be fixed there rather than worked around through phase-ordering changes. Coincidentally, variants of this problem have also been mentioned recently in [InstCombine] sext(trunc(x)) should not be transformed to ashr(shl(trunc(x)))) when there is no chance to fold · Issue #116019 · llvm/llvm-project · GitHub and in Use Smallest types in IR, which also illustrates that this isn't really an SPGO-specific problem.


[InstCombine] sext(trunc(x)) should not be transformed to ashr(shl(trunc(x)))) when there is no chance to fold · Issue #116019 · llvm/llvm-project · GitHub is different from my issue. That issue is about sext(trunc(x)) becoming ashr(shl(trunc(x))), where trunc(x) is not erased. In my case, turning sext(trunc(x)) into ashr(shl(x)) is reasonable: the transform does not increase the instruction count and it removes the trunc. In SIMD code, at least on x86, a trunc corresponds to a PACK instruction, which is more expensive than shifts.
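To make the difference concrete, here is a schematic IR comparison (the widths are hypothetical; the point is only whether the trunc survives):

; Shape from issue #116019: the source of the trunc is wider than the sext
; result, so the trunc only changes width and is NOT erased:
  %t = trunc i64 %x to i32
  %s = shl i32 %t, 24
  %a = ashr i32 %s, 24        ; trunc + two shifts instead of trunc + sext

; Shape in my case: the sext result type equals the type of %y, so the trunc
; IS erased and the instruction count does not grow:
  %s2 = shl i32 %y, 24
  %a2 = ashr i32 %s2, 24      ; was: trunc i32 %y to i8; sext i8 to i32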
Use Smallest types in IR is also different from my issue. It discusses whether we could use a smaller type in the first place, which I think is another topic.

The sext(trunc(x)) is only an example, so please don't focus solely on this simple case. Let me emphasize my point:
In general mode, InstCombine and SimplifyCFG both run before the inliner, and it has been that way for a long time. So without SPGO, callee() in my example is first optimized by InstCombine and SimplifyCFG and then by the inliner. That is a very natural process: first combine/simplify the IR/BBs inside callee() itself, then inline callee() into caller(), then look for further combining/simplification opportunities in caller(callee())... So in general mode we would never even see the sext(trunc(x)) pattern from my example, because the sext has already been erased inside callee(). That is also why I describe this as an SPGO-specific problem.
Following a programmer's coding logic, combining/simplifying opportunities inside a function should have higher priority than those outside it. My proposal simply follows the general mode's logic: add InstCombine and SimplifyCFG before SampleProfileLoader's inliner in SPGO mode.

To clarify, my point here was mostly just that the sext(trunc) → ashr(shl) conversion currently causes various optimization issues. The details differ each time. The solution to this might be to not canonicalize to ashr(shl) and only convert to sext_inreg in the backend. Or it might be to improve other optimizations to recognize ashr(shl) as a special case.

It’s hard to confirm without a complete reproducer, but I think your case is quite similar to the first function in the linked Discourse thread.

Something to clarify here: There is an InstCombine and SimplifyCFG run before the inliner (the “global cleanup PM”), but it’s there for historical reasons and not a principled part of the pipeline design. We would like to get rid of it, though it’s not so easy (Remove GlobalCleanupPM from default pipelines · Issue #94257 · llvm/llvm-project · GitHub).

The way this is actually supposed to be handled is via CGSCC interleaving of the inliner and the function simplification pipeline. It’s not just InstCombine and SimplifyCFG being run once over all functions, it’s the whole function simplification pipeline running before a function gets inlined (at each level, not just once).

My general view on this is that this is, primarily, a cost-modelling consideration. We want functions to be simplified before inlining to model the inlining cost more precisely.

I don’t think that, in principle, the optimizer should rely on this for optimization quality purposes. Sure, this is always going to happen in practice to some degree (just the fact that the function pipeline may run over the same piece of IR many times across multiple inlines can paper over phase ordering issues), but this isn’t how it’s supposed to work.

In particular, keep in mind that the IR might already come to LLVM in pre-inlined form. Maybe this is because the code was generated by macros (particularly common for your max example in C). Maybe it’s because some inlining already happened at some higher level (e.g. via MIR inlining in Rust, or maybe MLIR inlining before lowering to LLVM IR). This is why I think it’s important to address optimization issues with “pre-inlined IR” like the one you highlighted.

Now, at a higher level, I do agree that it’s a problem that we can get wildly different inlining/optimization interactions depending on when a function gets inlined. You’ll see very different behavior depending on whether this happens in the main inliner, the always inliner or SamplePGO pre-inlining. However, I don’t think that just adding an extra InstCombine run is going to materially address that problem – if we want to have consistent optimization behavior, SamplePGO would need to integrate with the main inliner rather than doing its own thing.

Many thanks for your detailed answer and advice!

Suppose there are three instructions A, B, C in order. I guess InstCombine's limitation is deciding whether it should combine A with B or B with C. In my example it combines A with B but not B with C, and as the IR grows more complicated it becomes hard to tell which combination is better.
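A minimal sketch of that choice, using hypothetical names and the widths from my example:

  %t   = trunc i32 %x to i8          ; A
  %s1  = sext i8 %t to i32           ; B
  %s2  = sext i8 %y to i32           ; B'
  %cmp = icmp sgt i32 %s1, %s2       ; C

; Combining A with B rewrites %s1 as "ashr (shl i32 %x, 24), 24": the trunc
; is erased, but the compare is stuck at i32.
; Combining B/B' with C instead folds the two sexts into the compare, giving
; "icmp sgt i8 %t, %y", and the whole chain stays at i8.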

I am not familiar with CGSCC. From a quick look at the code, do you mean buildInlinerPipeline()? If so, it runs after SampleProfileLoader, so it cannot solve my problem as things stand?

Yes, we cannot prevent the frontend from inlining before LLVM. Could we run an outliner at the very beginning of the LLVM pass pipeline to solve this problem?

I do have concern that why SampleProfileLoader does inlining in it. I guess one of the reasons is to get accurate sample count for each IR, because some functions have already inlined in the first building, SampleProfileLoader at least reproduce the inlining place to compute each IR’s sample count more precisely. For example, funA has been inlined in both funB and funC in first build, so funA has two places (in funB and funC) to record its sample count in profile file. If SampleProfileLoader does not do the same inlining behavior, how it records funA’s profiling info separately? In this perspective, I think it is hard to isolate inlining from SampleProfileLoader. If postpone SampleProfileLoader into main inliner, I guess it is not good at SPGO’s effect. The earlier to get sample info, the better to do correct optimization.