We would like to put forth a plan to enable machine function splitting on AArch64. Please see the details below. We very much appreciate any inputs!
Machine function splitting, a.k.a. MFS, divides a function into hot and split parts using profile information. The linker then groups all hot and all split blocks across functions together, decreasing fragmentation and improving Icache and ITLB utilization.
MFS has been tested and enabled on X86 with good speedup on workloads, this project intends to enable it for AArch64. We have observed a significant (~21% over an average of 3 runs) reduction in iTLB misses with MFS enabled while running clang as a workload.
This is initially intended to be an opt-in change and placed behind a flag. As it becomes more tested and tuned, we intend to turn it on by default for profile use builds.
There are a number of issues we feel need to be addressed before having MFS enabled on AArch64, these include branch relaxation, 128MB+ jump distance, inline assembly, jump table and compilation time.
For a more detailed discussion, please see below.
MFS assigns basic blocks into hot/split sections. AArch64 conditional branches can have a distance of ±1MB (in comparison X86 can have ±2GB), While this is usually sufficient for jumps within a function, this could fall short if MFS places hot and split blocks of a function into different sections which linker places more than 1MB apart.
Branch relaxation passes in the backend can extend branches with insufficient jump distance with ±128MB trampolines in case of AArch64 as shown.
Conditional branch and the distance to TARGET is too long: BEQ TARGET ←------------- This branch can jump ±1MB FALLTHROUGH: After Branch relaxation transformation: BEQ TRAMPOLINE B FALLTHROUGH TRAMPOLINE: B TARGET ←-------- This branch can jump ±128MB FALLTHROUGH: NOTE: This is one of the multiple ways to relax the branch
MFS can introduce cross section branches which have a high likelihood of falling short of the 1M limit. Thus we need to update the branch relaxation pass to convert all hot-split cross section branches to use this trampoline. We currently are able to compile and run the clang and some internal workloads.
After the branches are relaxed. There may still be the problem of branch distance exceeding ±128MB. e.g. in binary with large text sections or sections placed far away from each other. There are multiple ways to solve this problem, explained below. As to which path we should take depends on what kind of performance we get for each solution.
The user can guarantee this through an option. And a linker option can be provided at the same time to detect any offending uses. Based on our measurements, we expect the hot/split section for most binaries to be smaller than 128MB in total. One data point shows clang, a medium size binary, has 6.36MB and 10.1MB in .text.hot and .text.split respectively.
Linker has range-extension thunks ⚙ D39744 [LLD][ELF][AArch64] Add support for AArch64 range extension thunks. which creates a thunk to extend the branch distance of unconditional branch and branch and link instructions.
While this should fix the 128MB limitation, it has 2 problems.
It introduces 3 additional instructions. This can negatively impact the performance gain we get with MFS.
adrp x16, #268435456 add x16, x16, #288 br x16
Range extension thunks clobber x16 (and in theory are allowed to clobber x17). So we’d need to ensure that x16 and x17 are not used in the function or avoid MFS if they are used. This can be done by scanning the function for x16 and x17 uses.
We could also create multiple hot split sections and ask the linker to place them next to each other. Each hot/split section group does not exceed 128MB in size. As a base case, we can divide every module into a hot and split section. This could create a lot of fragmentation and not get the benefit of MFS. To improve upon this, we can ask the linker to put multiple modules into a group and create hot and split sections for the group.
Inline assembly by user may be unable to relax. The user should not make any assumptions regarding where the source and target basic blocks are placed.
We plan to issue a warning to the user when a basic block in the function is the target of inline assembly. This allows the user to relate downstream linker error with this issue and fix the code. A more conservative user may also promote this error to a failure.
It’s known that jump table does not work well with MFS right now. We can have a check to disallow jump table and MFS at the same time.
Branch relaxation runs iteratively till all branches that need to be relaxed are relaxed. Introducing cross-section branch relaxation makes the number of branches that need to be relaxed significantly more than before and also potentially increases the number of iterations. This leads to worse compilation time. With -ftime-report, we see time spent in the branch relaxation pass increases by 37% when compiling a large C++ file. However, the impact on overall compilation time is limited as the relaxation pass only takes a small fraction of the total compilation time.
In case this becomes a problem, we propose to do a first round of cross-section relaxation (without iteration) before running same-section relaxation. The ones relaxed due to cross section are not going to be relaxed again.