[RFC] Machine Function Splitting (MFS) on AArch64

Greetings,

We would like to put forth a plan to enable machine function splitting on AArch64. Please see the details below. We very much appreciate any inputs!

Summary:

Machine function splitting, a.k.a. MFS, divides a function into hot and split parts using profile information. The linker then groups all hot and all split blocks across functions together, decreasing fragmentation and improving Icache and ITLB utilization.

[RFC] Machine Function Splitter

MFS has been tested and enabled on X86 with good speedup on workloads, this project intends to enable it for AArch64. We have observed a significant (~21% over an average of 3 runs) reduction in iTLB misses with MFS enabled while running clang as a workload.

This is initially intended to be an opt-in change and placed behind a flag. As it becomes more tested and tuned, we intend to turn it on by default for profile use builds.

There are a number of issues we feel need to be addressed before having MFS enabled on AArch64, these include branch relaxation, 128MB+ jump distance, inline assembly, jump table and compilation time.

For a more detailed discussion, please see below.

Branch Relaxation:

MFS assigns basic blocks into hot/split sections. AArch64 conditional branches can have a distance of ±1MB (in comparison X86 can have ±2GB), While this is usually sufficient for jumps within a function, this could fall short if MFS places hot and split blocks of a function into different sections which linker places more than 1MB apart.

Branch relaxation passes in the backend can extend branches with insufficient jump distance with ±128MB trampolines in case of AArch64 as shown.

Conditional branch and the distance to TARGET is too long:
   BEQ TARGET ←------------- This branch can jump ±1MB
FALLTHROUGH:

After Branch relaxation transformation:
   BEQ TRAMPOLINE
   B FALLTHROUGH
TRAMPOLINE:
   B TARGET ←-------- This branch can jump ±128MB
FALLTHROUGH:

NOTE: This is one of the multiple ways to relax the branch

MFS can introduce cross section branches which have a high likelihood of falling short of the 1M limit. Thus we need to update the branch relaxation pass to convert all hot-split cross section branches to use this trampoline. We currently are able to compile and run the clang and some internal workloads.

128MB+ Branch Distance:

After the branches are relaxed. There may still be the problem of branch distance exceeding ±128MB. e.g. in binary with large text sections or sections placed far away from each other. There are multiple ways to solve this problem, explained below. As to which path we should take depends on what kind of performance we get for each solution.

Solution #1 - Optimistically assume Hot/Split sections are no more than 128MB apart.

The user can guarantee this through an option. And a linker option can be provided at the same time to detect any offending uses. Based on our measurements, we expect the hot/split section for most binaries to be smaller than 128MB in total. One data point shows clang, a medium size binary, has 6.36MB and 10.1MB in .text.hot and .text.split respectively.

Solution #2 - Automatically inserted range-extension thunks:

Linker has range-extension thunks ⚙ D39744 [LLD][ELF][AArch64] Add support for AArch64 range extension thunks. which creates a thunk to extend the branch distance of unconditional branch and branch and link instructions.

While this should fix the 128MB limitation, it has 2 problems.

  1. It introduces 3 additional instructions. This can negatively impact the performance gain we get with MFS.

    adrp    x16, #268435456
    add     x16, x16, #288
    br      x16
    
  2. Range extension thunks clobber x16 (and in theory are allowed to clobber x17). So we’d need to ensure that x16 and x17 are not used in the function or avoid MFS if they are used. This can be done by scanning the function for x16 and x17 uses.

Solution #3 - Multiple Hot/Split Sections:

We could also create multiple hot split sections and ask the linker to place them next to each other. Each hot/split section group does not exceed 128MB in size. As a base case, we can divide every module into a hot and split section. This could create a lot of fragmentation and not get the benefit of MFS. To improve upon this, we can ask the linker to put multiple modules into a group and create hot and split sections for the group.

Inline Assembly:

Inline assembly by user may be unable to relax. The user should not make any assumptions regarding where the source and target basic blocks are placed.

We plan to issue a warning to the user when a basic block in the function is the target of inline assembly. This allows the user to relate downstream linker error with this issue and fix the code. A more conservative user may also promote this error to a failure.

Jump Table:

It’s known that jump table does not work well with MFS right now. We can have a check to disallow jump table and MFS at the same time.

Compilation Time:

Branch relaxation runs iteratively till all branches that need to be relaxed are relaxed. Introducing cross-section branch relaxation makes the number of branches that need to be relaxed significantly more than before and also potentially increases the number of iterations. This leads to worse compilation time. With -ftime-report, we see time spent in the branch relaxation pass increases by 37% when compiling a large C++ file. However, the impact on overall compilation time is limited as the relaxation pass only takes a small fraction of the total compilation time.

In case this becomes a problem, we propose to do a first round of cross-section relaxation (without iteration) before running same-section relaxation. The ones relaxed due to cross section are not going to be relaxed again.

Thanks
-Adrian

1 Like

@snehasish @efriedma-quic @davemgreen

Thanks for sharing this Adrian. I have a couple of comments,

For the trampoline code generated –

 TRAMPOLINE:
   B TARGET ←-------- This branch can jump ±128MB
FALLTHROUGH:

This is going to add an overhead of to each hot function part that is split since it is colocated with the hot part. The number of trampoline targets will be proportional to the number of cold edges. I wonder if it makes sense to have a threshold to tune whether we split small functions where the overhead may be high.

For the compilation time, can you also report the overall end to end increase to put it into perspective?

fyi @tmsriram

Hi Adriam,

I think this is great! I agree, at the very least, that this be an easy option to set, default on for x86 (to keep current behaviour) but off for AArch64 and other targets.

Some comments inline:

How does x86 handles those?

This is one way of doing it, but it also triples the number of branches on what is likely hot code, decreasing the effectiveness of branch prediction (though, such short branches are still in the fetch window). Did you measure how that affects performance?

This kind of option works well for embedded/mobile applications, where developers have a more controlled compilation process, but it will introduce complexities (and potential random build failures) for everyone else.

While this “works”, perhaps it should be used when you can’t do anything else (and in the initial implementation).

I think this is what we should aim for, long term. It does introduce the same complexity as constant pools, but it solves most of the problems the other two solutions bring.

I see the “final” solution being a mix of #2 and #3, with the former as a fall-back when the latter can’t be made.

I think we can disable MFS in those cases, at least for the time being. It should be possible to inspect inline asm / jump tables for register use and branching patterns, but each one sounds like a whole new research topic.

Honestly, this does sound like a good default option, even once it’s more stable later on.

Perhaps the flag enabling it should have options like “off”, “shallow”, “deep”, and only mark default for targets either “off” or “shallow”. Builds that can prove the extra compile time makes a big run time difference can enable the flag for particular files, for example.

Do you also have numbers on other benchmarks, like SPEC? As for compile-time, did you measure the impact on CTMark from llvm-test-suite?

cc @barrelshifter @AndrewLitteken in case there is an interaction with the MachineInstruction/IR outliners

x86 executables using the small/medium code models are required to have less than 2GB of code, and all x86 branch instructions can encode +/-2GB offsets. So branch relaxation isn’t a thing, branches stay legal after splitting, and jumps in inline assembly are probably legal.

Not sure about x86 jump tables.

Branch relaxation is very cheap in the first place, and we can likely make it cheaper if necessary; this doesn’t seem like a significant issue.

Can we just teach the SelectionDAG switch lowering code not to emit jump tables that contain both hot and cold blocks? Or do we not know which blocks will be considered hot/cold at that point?

asm goto is rare enough that it’s probably fine to just automatically disable splitting in functions that contain one.

If the linker is aware of what’s going on, it could arrange the code automatically, yes. But existing ELF semantics have no way to express this, so it’s not something we could turn on by default anytime soon.

Maybe you could mark up the relevant branches as clobbering x16/x17 before register allocation, so we don’t completely block use of the registers. But maybe not a big deal either way; AArch64 has a lot of general-purpose registers.

This could be an useful optimization. I will try it as tune the current implementation.

I will report the end to end compilation time on SEPCInt2017 next.

Hi

I will measure and report on SPECInt2017 and CTMarknext.

Thanks

I dont think so either. I will report more numbers on compilation time.

Make sense.

This is a good idea. I will give it a try.