Background
In the PGO build phase, SampleProfileLoaderPass uses the detailed and accurate sample counts to aggressively inline all hot functions ahead of the general inlining passes. The current pipeline order of SimplifyCFG and InstCombine around SampleProfileLoaderPass is as follows (without LTO):
SimplifyCFG → SROA → SampleProfileLoader → InstCombine → SimplifyCFG → Inliner
(In the general pass pipeline without SPGO, InstCombinePass and SimplifyCFGPass both run before the inliner.)
Issue
In this scenario, hot simple/short functions are inlined into a complicated context at a very early phase (SampleProfileLoader) before SimplifyCFG or InstCombine has (fully) run on them. Although SimplifyCFG and InstCombine run several times after SampleProfileLoader, these two passes walk all BBs and all instructions in top-down order, so they can easily miss or break the locally optimal opportunities that existed in the simple/short inlinee functions.
We (Intel) are doing SPGO/HWPGO performance tuning for some cases and have run into issues of this kind. Here is one simple example in which the best optimization opportunity is missed:
define i8 @callee(i8 %0, i8 %1){
; A classic smax for i8. It always sign-extends to i32 for the compare and selects an i8.
; With more comparisons, there would be many branches that need to be simplified.
%2 = sext i8 %0 to i32
%3 = sext i8 %1 to i32
%4 = icmp sgt i32 %2, %3
%. = select i1 %4, i8 %0, i8 %1
ret i8 %.
}
define void @caller(ptr %0, i8 %1) {
%a = load i32, ptr @A
; trunc from i32 to i8 here, then sext from i8 back to i32 inside callee().
%2 = trunc i32 %a to i8
; Pseudocode loop, not valid IR: i runs from 0 to n.
for (i = 0; i < n; i++) {
; callee() is called in a loop,
; so the loop is optimized by LoopVectorizePass
; and finally becomes vectorized compare and select instructions.
%3 = getelementptr i8, ptr %0, i64 %i
%4 = load i8, ptr %3
call i8 @callee(i8 %4, i8 %2)
}
ret void
}
Since callee() is inlined into caller() by SampleProfileLoader, the “trunc i32 to i8; sext i8 to i32” sequence on %2 is optimized to “shl; ashr” by the later InstCombinePass. This finally leads to vectorized icmp_i32 and select_i8 IR, which is very unfriendly to x86 vector instructions in the AVX2 set (no K register support): the width of the icmp (8 * i32) differs from the width of the select (8 * i8), so we have to generate pack or shuffle instructions to align their widths.
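As a rough illustration of why that rewrite is legal, here is a minimal C sketch of the arithmetic behind “trunc i32 to i8; sext i8 to i32” versus “shl 24; ashr 24”. All function names are hypothetical and exist only for this example; it models the integer semantics on plain 32-bit values and is not LLVM code.

```c
#include <stdint.h>

/* Reinterpret a 32-bit pattern as a signed value without relying on
   implementation-defined conversions. */
static int32_t to_signed32(uint32_t r) {
    return (r & 0x80000000u) ? -(int32_t)(~r) - 1 : (int32_t)r;
}

/* Model of "trunc i32 to i8" followed by "sext i8 to i32". */
static int32_t trunc_then_sext(int32_t x) {
    uint32_t low = (uint32_t)x & 0xFFu;           /* keep the low 8 bits */
    return (low & 0x80u) ? (int32_t)low - 256     /* bit 7 set: negative */
                         : (int32_t)low;
}

/* Model of "shl i32 x, 24" followed by "ashr i32 ..., 24". */
static int32_t shl_then_ashr(int32_t x) {
    uint32_t shifted = (uint32_t)x << 24;         /* shl: low byte moves to the top */
    uint32_t r = shifted >> 24;                   /* logical shift right */
    if (shifted & 0x80000000u)
        r |= ~(0xFFFFFFFFu >> 24);                /* fill with sign bits: ashr */
    return to_signed32(r);
}

/* Count disagreements over every low byte combined with a few high-bit patterns. */
static unsigned count_shift_mismatches(void) {
    static const uint32_t high[] = {
        0x00000000u, 0x7FFFFF00u, 0x80000000u, 0xFFFFFF00u, 0x12345600u
    };
    unsigned bad = 0;
    for (unsigned h = 0; h < sizeof high / sizeof high[0]; ++h) {
        for (uint32_t low = 0; low < 256; ++low) {
            int32_t x = to_signed32(high[h] | low);
            if (trunc_then_sext(x) != shl_then_ashr(x))
                ++bad;
        }
    }
    return bad;
}
```

Both forms compute the same value, so the rewrite is correct by itself. The problem described here is only that the shift form keeps the computation at i32 width once the callee body has been merged into the caller.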
If we could run InstCombine and SimplifyCFG on callee() before it is inlined, the “sext i8 to i32; sext i8 to i32; icmp i32; select i8” sequence would be optimized to “icmp i8; select i8”, and the final vector instructions would be very tidy and compact (cmp and sel are both 32 * i8). This helps performance a lot.
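The narrowing relies on the fact that a signed compare of two sign-extended i8 values gives the same answer as a signed compare done directly at 8 bits. A small C sketch can check this exhaustively. The function names are hypothetical, and the 8-bit compare is modeled on bit patterns (sign bit flipped, then unsigned compare) because C itself would promote the operands to int.

```c
#include <stdint.h>

/* callee() as written in the post: sext both operands to i32,
   compare at 32 bits, then select the original 8-bit value. */
static int8_t smax_widened(int8_t a, int8_t b) {
    int32_t wa = (int32_t)a;                /* sext i8 -> i32 */
    int32_t wb = (int32_t)b;                /* sext i8 -> i32 */
    return (wa > wb) ? a : b;               /* icmp sgt i32; select i8 */
}

/* Model of "icmp sgt i8; select i8": the compare only looks at 8-bit
   patterns. Flipping the sign bit maps signed order to unsigned order. */
static int8_t smax_narrow(int8_t a, int8_t b) {
    uint8_t ua = (uint8_t)((uint8_t)a ^ 0x80u);
    uint8_t ub = (uint8_t)((uint8_t)b ^ 0x80u);
    return (ua > ub) ? a : b;
}

/* Count disagreements over all 65536 pairs of i8 inputs. */
static unsigned count_smax_mismatches(void) {
    unsigned bad = 0;
    for (int a = -128; a <= 127; ++a) {
        for (int b = -128; b <= 127; ++b) {
            if (smax_widened((int8_t)a, (int8_t)b) !=
                smax_narrow((int8_t)a, (int8_t)b))
                ++bad;
        }
    }
    return bad;
}
```

Because the two forms agree on every input, the 8-bit form is a valid replacement. Once the compare and the select have the same element width, a 256-bit vector can hold 32 lanes of each, with no pack or shuffle needed in between.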
We could solve the above issue inside InstCombinePass by skipping the “trunc; sext” optimization under a specific condition, but I think such special-casing is fragile. It works like a patch: every time we meet a new case that needs handling in SPGO, we have to write another one. That bloats the LLVM code and makes it hard to read and maintain. Similar issues also occur in SimplifyCFG.
SimplifyCFG and InstCombine are both already tuned to be strong and comprehensive enough to cover almost all optimization opportunities. The issue above is exposed only in SPGO mode and is hard to reproduce in general mode, because there SimplifyCFG and InstCombine both run before the general inliners. So I think there is no need to add handlers for SPGO-only special cases to the general SimplifyCFG and InstCombine passes.
Proposal
The root cause of this kind of issue is that, with SPGO enabled, inlining happens very early in SampleProfileLoader, and the later InstCombine and SimplifyCFG runs work in top-down order, which makes it hard for them to catch the locally optimal opportunities that existed in the original callee functions. We could add specific pattern matching to SimplifyCFG or InstCombine to deal with this, but I think that is unnecessary, especially for cases that are exposed only in SPGO mode.
So I propose adding InstCombine and SimplifyCFG passes before SampleProfileLoaderPass, only in SPGO mode. This is consistent with the logic of the general mode, which runs InstCombine and SimplifyCFG before the inliner.