Split the hot and cold parts of a function into separate functions

Currently, GCC supports the function attribute “cold”, which can hint the compiler to split the annotated function’s callers into two separate parts: one hot and one cold.

One example is here: https://godbolt.org/z/j7sK4hd48

My question is: does Clang/LLVM have a similar capability?

Hi,

IIRC, clang/llvm has the HotColdSplit and partial inlining passes, which have similar functionality. However, these two passes are not enabled by default for various reasons.

Thanks,
Chuanqi

Are there any examples?
And does the -hot-cold-split option need profile data?

Currently, if we use --hot-cold-split, the compiler reports an unsupported option, so I can’t find a simple example on Compiler Explorer.
Here is the slides from the web: https://llvm.org/devmtg/2019-10/slides/Kumar-HotColdSplitting.pdf
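For experimentation, the IR-level pass can, if I recall correctly, be run directly through opt with the new pass manager (the pass name matches LLVM’s registration; the file names are illustrative):

```shell
# Sketch: run the hot/cold splitting IR pass via opt (assumes an LLVM install).
clang -O1 -S -emit-llvm example.c -o example.ll
opt -passes=hotcoldsplit -S example.ll -o example.split.ll
# Cold regions are outlined into separate functions marked "cold".
```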
BTW, may I ask what the motivation for the question is? Have we found a performance gap between Clang and GCC?

Thanks,
Chuanqi

Because a lot of the code in a function is error/failure-handling code, the hot path of a function is usually very thin.

By using hot-cold splitting, the hot parts can stay local and clustered together, which helps CPU i-cache hits.

I-cache misses hurt performance significantly.

I believe intra-function hot cold code splitting is in the scope of the Propeller project, which Sriram Tallam worked on. I’m not sure what the status of the feature is at this moment.

I believe that the hot cold split pass is an IR pass, which means that it outlines code at the IR level. This will prevent the register allocator from working across the boundary between hot and cold code, so I don’t believe it has as much performance potential as splitting the function during code generation. Looking at the example, I believe GCC is using this strategy; it is not calling outlined code.

I believe intra-function hot cold code splitting is in the scope of the Propeller project, which Sriram Tallam worked on. I’m not sure what the status of the feature is at this moment.

This is available in LLVM with option -fsplit-machine-functions with PGO and it uses PGO profiles to split a function’s cold basic blocks which can then be placed arbitrarily. It is tested on instrumented PGO where it shows gains of a couple of percent. With Sampled PGO, we are still working on tuning the split.

We have also added support for Propeller, which uses another round of profiling to precisely lay out basic blocks and split functions. While this is more effective than -fsplit-machine-functions, it requires another round of sampled profiling. Please see the documentation here to optimize binaries with Propeller: https://github.com/google/autofdo/blob/propeller/OptimizeClangO3WithPropeller.md

Both -fsplit-machine-functions and Propeller use the basic block sections feature to perform function splitting.
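Putting the flags from this thread together, an instrumented-PGO build with machine function splitting looks roughly like this (the file names and the profile directory are illustrative):

```shell
# Sketch: instrumented PGO + machine function splitting.
clang++ -O2 -fprofile-generate=prof app.cc -o app
./app                                   # run a representative workload
llvm-profdata merge -o app.profdata prof/*.profraw
clang++ -O2 -fprofile-use=app.profdata \
        -fsplit-machine-functions app.cc -o app.split
```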

I believe that the hot cold split pass is an IR pass, which means that it outlines code at the IR level. This will prevent the register allocator from working across the boundary between hot and cold code, so I don’t believe it has as much performance potential as splitting the function during code generation. Looking at the example, I believe GCC is using this strategy; it is not calling outlined code.

Yep, GCC also splits functions during code generation, just like -fsplit-machine-functions, and not early like hot-cold splitting. For performance, we have found this more effective.

Can we avoid depending on profiling?

Use a hint instead, like [[unlikely]]?

That would give programmers more control.



Could you please rephrase? I think you mean: could we split without profile information? We don’t support that right now, and we haven’t had much success with it performance-wise. If you are not using profile-guided builds, you are leaving a lot of performance on the table anyway.

On May 9, 2021, at 3:39 AM, Sriraman Tallam <tmsriram@google.com> wrote:


Could you please rephrase? I think you mean: could we split without profile information? We don’t support that right now, and we haven’t had much success with it performance-wise. If you are not using profile-guided builds, you are leaving a lot of performance on the table anyway.

Yes. Because profiling requires choosing a representative workload, which may be hard or infeasible for some software: it may have many workloads that all need to be supported well, and the profile must balance between them.