RFC: Shrink wrapping vs SplitCSR

Hi all,

We've seen several examples recently of performance opportunities on
POWER if we can improve the location of save/restore code for
callee-saved registers. Both Nemanja and myself have discussed this with
several people, and it seems that there are two possibilities for
improving this:

  1. Extend shrink wrapping to make the analysis of callee-saved
     registers more precise.
  2. Focus on enabling and (possibly) improving SplitCSR.

I would like opinions from people on the preferred way to proceed.

I am leaning toward improving shrink wrapping, at least as a short-term
solution. However, I fully admit that this is because I am familiar with
the shrink wrapping code and completely naive about SplitCSR and what
work would be necessary to get this working well.

My proposal would be to implement the flow sensitive analysis described
by Fred Chow (PLDI '88) and make the necessary extensions in shrink
wrapping to handle multiple save/restore points. At that point we can do
an evaluation to understand the improvements it provides and the impact
on compile time. Once we have these results, we can look at the best way
to enable it (e.g., option, target opt-in, higher opts, etc.).

Thoughts?

Kit

Hi Kit,

Hi all,

We've seen several examples recently of performance opportunities on
POWER if we can improve the location of save/restore code for
callee-saved registers. Both Nemanja and myself have discussed this with
several people, and it seems that there are two possibilities for
improving this:

1. Extend shrink wrapping to make the analysis of callee-saved
    registers more precise.
2. Focus on enabling and (possibly) improving SplitCSR.

I would like opinions from people on the preferred way to proceed.

I am leaning toward improving shrink wrapping, at least as a short-term
solution. However, I fully admit that this is because I am familiar with
the shrink wrapping code and completely naive about SplitCSR and what
work would be necessary to get this working well.

My proposal would be to implement the flow sensitive analysis described
by Fred Chow (PLDI '88) and make the necessary extensions in shrink
wrapping to handle multiple save/restore points. At that point we can do
an evaluation to understand the improvements it provides and the impact
on compile time. Once we have these results, we can look at the best way
to enable it (e.g., option, target opt-in, higher opts, etc.).

Back in 2009, there was an implementation of Fred Chow’s algorithm, that has been removed in r193749, because it was unused and untested.

I have been working on a new implementation of Fred Chow’s algorithm for a while now.

It seems that this algorithm has been avoided because of the compile time impact that was not worth compared to the performance improvement.

The main reason is that there are probably loops in the CFG. In my implementation, we decided that we never want to save / restore inside a loop, so we consider loops as a single block. We are using scc_iterators instead of MachineLoopInfo in order to handle irreducible loops as well, that are not handled by the current SW implementation. That way, we can compute the anticipation / availability attributes in linear time.

In terms of correctness, there is one test on X86 that still fails, and two others on AArch64, because of the way compact unwinding encodes register saves.

In terms of compile-time, there are no regressions on CTMark.

In terms of code-size, we get a 0.8% increase on X86_64 and AArch64, mostly because we cannot use push / pop anymore.

For now, we only worked on shrink-wrapping CSRs, but keep the stack setup in the entry block / return blocks, which can give worse results in some cases compared to the current shrink-wrapping pass. I am currently working on fixing this.

For execution-time, not many improvements showed up due to the stack setup and the transformation of push/pop -> mov $reg, (mem)/mov (mem), $reg which can be partially solved.

In terms of what the algorithm can do, and how it can outperform the current one, I got some stats based on where we save / restore, along with the block frequency (with PGO), and we can see a theoretical 8% improvement.

I put an early review here a while ago: ⚙ D30808 shrink-wrap: implement more advanced algorithm, and I will update it soon. As you can see, all the PrologEpilogInserter and (X86|AArch64)TargetFrameLowering code look horribly hacky, because so many things assume the stack setup and callee saved saves will stick together. Fixing all those assumptions is going to be the most tricky part of shrink-wrapping, not the algorithm itself.

There are some cases where the current shrink-wrapping is better, and that’s in cases like in the Fig.3 of Chow’s paper, and that can be probably solved by using something similar to what’s described in this paper: Post Register Allocation Spill Code Optimization by Christopher Lupo and Kent D. Wilken, which uses the PST to optimize saves / restores placement along with a cost model.

Cheers,

Francis Visoiu Mistrih <fvisoiumistrih@apple.com> writes:

I'm resending this as I forgot to CC llvm-dev on my reply last night....

Hi Francis,

Thanks for the detailed reply!
I took a quick look at the patch, and I'll try to take a closer look at
it tomorrow.

Could you please subscribe me to future patches? Either myself or
someone from my team can help with testing and/or integration on PPC if
you'd like. This is something we've been struggling with for a while and
are anxious to make some progress on it :slight_smile:

Kit

Just FYI:
You can use Herald (https://reviews.llvm.org/herald/) to configure rules that auto-add you as a subscriber/reviewer to various changes.
Pretty much anything you can query you can do :slight_smile: