Question regarding basic-block placement optimization

Ok, I think I have a working pass that is based much more on what we’ve talked about here. The patch is attached. I’d love to commit it soon-ish and then start tweaking it based on feedback from you, others, and looking at how it actually works in the wild.

It’s essentially a re-write, so it may be hard to read… Let me know if some other form would be easier.

Some responses to your comments below, perhaps giving insight into the choices I made or what the pass is currently doing and what I haven’t addressed yet.

Once you decide to break these constraints, you have effectively
designated the offending paths as “cold”. It’s up to you how to
decide which paths are cold. You’ll need some threshold based on
confidence in the profile, and we can measure the effectiveness of
your heuristics and add more over time.

FYI, I ended up with a heuristic that requires a hot edge to be at least 4x more likely than any competing edge in the CFG. This is very loosely based on the definition of a ‘hot’ edge in BranchProbabilityInfo (80% vs. 20%).
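
To make that threshold concrete, here is a minimal, self-contained sketch (illustrative names, not code from the patch): an edge only counts as “hot” if its weight is at least 4x the weight of every competing edge leaving the same block.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative hot-edge test: a candidate successor edge is "hot" only if
// its weight is at least HotFactor (4x) times the weight of every other
// edge leaving the same block. (Hypothetical names, not from the patch.)
constexpr uint64_t HotFactor = 4;

bool isHotEdge(const std::vector<uint64_t> &SuccWeights, size_t Candidate) {
  for (size_t I = 0; I < SuccWeights.size(); ++I) {
    if (I == Candidate)
      continue;
    if (SuccWeights[Candidate] < HotFactor * SuccWeights[I])
      return false;
  }
  return true;
}
```

With 80/20 weights the 80% edge qualifies (80 >= 4 * 20), while a 70/30 split does not, which roughly matches the BranchProbabilityInfo-style threshold mentioned above.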

I don’t care much about how you layout the cold chains as long as they
are not interleaved with the rest of the code, thus breaking the
topological ordering and forcing extra branches. In practice, this
means they get their own little Siberia after the function return,
effectively splitting the function into two sections. However, the
notion of “coldness” is really relative to the current loop head.

I’ve not really tried to ensure this happens correctly. Mostly, the pass is just forming hot paths through the code when they are really hot. I’ll work on better sectioning off the cold paths that are cut off in a follow-up patch.

  1. Construct chains of blocks within the loop. Here the “entry” chain
    is the loop header, or function entry. You can probably represent
    each already laid out inner loop as a single chain. The graph of
    chains within a loop is acyclic except for irreducible control
    flow, which doesn’t have any structure to preserve anyway. There’s
    no need to compute SCCs.

I’m actually not satisfied with how I’ve done this. I think there is likely a cleaner / better solution than the one I have in the patch, but it largely works, and is easy to change later. The patch just walks forward over the function, and tries to chain blocks together with their successors when the CFG permits, choosing successors based on whatever (even weak) signal the probability info gives us.
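
As a rough illustration of that forward walk (a hypothetical, self-contained sketch, not the code in the patch), one can greedily grow each chain by repeatedly appending the most probable not-yet-placed successor:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of the greedy forward walk described above: walk the
// blocks in order, and grow a chain from each unplaced block by repeatedly
// appending its most probable successor that is not yet in any chain.
struct Block {
  std::vector<std::pair<size_t, double>> Succs; // (successor id, probability)
};

std::vector<std::vector<size_t>> buildChains(const std::vector<Block> &CFG) {
  std::vector<bool> InChain(CFG.size(), false);
  std::vector<std::vector<size_t>> Chains;
  for (size_t BB = 0; BB < CFG.size(); ++BB) {
    if (InChain[BB])
      continue;
    std::vector<size_t> Chain;
    size_t Cur = BB;
    while (true) {
      Chain.push_back(Cur);
      InChain[Cur] = true;
      // Pick the most probable successor not yet placed in any chain,
      // using whatever (even weak) probability signal is available.
      size_t Best = CFG.size(); // sentinel: no viable successor
      double BestProb = -1.0;
      for (const auto &[Succ, Prob] : CFG[Cur].Succs)
        if (!InChain[Succ] && Prob > BestProb) {
          Best = Succ;
          BestProb = Prob;
        }
      if (Best == CFG.size())
        break;
      Cur = Best;
    }
    Chains.push_back(std::move(Chain));
  }
  return Chains;
}
```

On the A–F example further down, this walk would produce the chains [A, B, D, F] and [C, E].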

  2. Splice the chains together, preserving topological order. Assigning
    each block to a DFS reverse-post-order number provides a nice
    ordering that is easy to compare or sort.

Currently I’m doing this once at the very end. I wonder if it would be better to do it iteratively as we walk up through the loops?
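
For what it’s worth, the splice step might look roughly like this (an illustrative sketch assuming every block already carries an RPO number; none of these names come from the patch): order the chains by the RPO number of their head block, then concatenate.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative splice step: sort chains by the reverse-post-order (RPO)
// number of each chain's head block, then concatenate them. This preserves
// a topological order whenever the chain graph is acyclic.
std::vector<size_t> spliceChains(std::vector<std::vector<size_t>> Chains,
                                 const std::vector<size_t> &RPONumber) {
  std::sort(Chains.begin(), Chains.end(),
            [&](const std::vector<size_t> &A, const std::vector<size_t> &B) {
              return RPONumber[A.front()] < RPONumber[B.front()];
            });
  std::vector<size_t> Layout;
  for (const auto &Chain : Chains)
    Layout.insert(Layout.end(), Chain.begin(), Chain.end());
  return Layout;
}
```

Keying on the head block alone is enough because the blocks within a chain were already placed in a forward order; the irreducible-control-flow caveat above is exactly the case where the chain graph stops being acyclic.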

Let me give you a little example to think about:

A:
br B(80%), C(20%)
B:
br D
C:
br D(90%), E(10%)
D:
br F
E:
br F
F:
ret

This is a fine topological sort but bad layout based on expected
branch probability (even with low confidence in the profile).

The only good layout that preserves topological ordering is:
A, B, C, E, D, F

I made a test case out of this, and I think it works. It’s a bit confusing as codegen just replaces branches to F with a ret. That removes the edge from D to F and makes the following ordering also viable: A, B, C, D, E. That’s the ordering it chooses, and it also duplicates D up into B instead of branching from B to D. :: shrug ::. It looks pretty good to me though. =]

Finally, we need some way to validate the design and verify the
implementation. Weighted statistics will be very useful here, similar
to those that Jakob used in the register allocator. For example:

For each nonadjacent edge:
TakenBranchFrequency += frequency(Edge)

This metric will prefer the most “extreme” layout because it assumes
perfect accuracy of block frequencies. It would be nice to look at the
same metric computed from skewed branch probabilities to see how
sensitive it is to small shifts in branch behavior.

Taken branch frequency is the most obvious, but doesn’t completely
capture locality. You could also measure distance jumped fairly
easily, also weighted by edge frequency.

It might be useful to gather these stats as late as possible to take
into consideration the final impact of layout, including the effect on
any later passes.

Cool, I’m working on this next. I’ve also uncovered some completely broken aspects of branch probability. Going back and adding tests and fixes for it. =]

mbp-redux.patch (32.8 KB)

After chatting briefly w/ Owen on IRC, he encouraged me to just commit this since it’s behind an off-by-default flag and doesn’t touch anything else. I’m going to go ahead with that so I can more easily start working on statistics and other associated cleanup. It’ll also help me debug some weird probabilities.

Feel free to carry on with discussion here or on the commit thread. =] Thanks again for all the really great feedback and thoughts!!

That’s fine. I’ve been following along and what you’re doing sounds right. I’ll try to provide feedback on your latest commit once I have time to fully understand how it works.

Andy

Great. I didn’t expect it to work as a test case without tweaking. The case where no blocks get folded or tail-duplicated is interesting. You’ll have to disable those codegen optimizations, or just add blocks and instructions until they’re defeated.

Andy

Once you move a block “out of line”, it will naturally end up floating to the end of the function when you do a good job splicing the other warm chains. So I don’t think it needs to be explicit in the design. The explicit decision is simply when to move a chain out of line (no longer topological).

Depending on the algorithm you may need to do some work to ensure loops are followed by their “best” loop exit.

Andy