Question regarding basic-block placement optimization

Hello,

I’m working on basic-block placement optimizations based on branch probability information. I’ve run into a stumbling block though. One of the existing passes to do this, essentially a dead pass ‘block-placement’ operates on the IR, reordering the blocks of the function, and relying on the code generator to preserve that order unless it has a reason to rearrange it. This pass is entirely untested AFAICT, and it doesn’t work any more…

That said, I’ve replicated its functionality with a much updated transformation algorithm (no longer O(N^2), and finds better orderings!) and predicated it on the new frequency and probability information. It “works” great, but it doesn’t actually optimize anything. It re-arranges all the blocks of the IR function, laying them out perfectly. And then the code generation layer throws that away and seems to establish its own ordering altogether.

What’s the best approach here? Is ignoring the order of blocks in the IR function desired behavior? (It does seem reasonable, it just seems to have not always been that way, so I wondered.)

Should I sink my pass down to a codegen pass?

If this really should live as a CodeGen pass, then what should become of the existing code-placement pass in the CodeGen layer? Currently that pass does two things:

  1. It aligns loop bodies. This seems like a completely separate concern, although likely a good one to do after any other placement optimizations.
  2. It rearranges very specific loop structures using very specific heuristics. Some of these seem redundant in the presence of probability based information, but others seem orthogonal. Honestly, I don’t understand any of this very well, so it’s likely I’m just misreading it.

I’m a bit wary of implementing this in the CodeGen layer as I worry that the layout of the MBBs and their various branches will have already made assumptions regarding the structure and shape of the CFG of the function… But it would also perhaps give more stability to the resulting orderings, improving the benefits of the pass.

Thoughts?

Chandler,

Can you please elaborate on the following statement?

” I've replicated its functionality with a much updated transformation algorithm (no longer O(N^2), and finds better orderings!)”

Is this a new algorithm or something you can refer to. Thanks.

Sergei

I think this should really live as a CodeGen pass. Is there any good reason to make it an IR pass?

Cameron

So, as it happens, I was completely wrong here. CodeGen correctly preserves the ordering of blocks from IR, unless it can do folding, etc. My original test cases ended up being folded into other structures, hiding the fact that the ordering wasn’t what was changing.

I’ve run much more real-world style test cases through the pass, and it is working beautifully. I’ll be attaching a patch shortly, I’m currently cleaning it up based on some over-the-shoulder feedback from Nick Lewycky.

As for why it should be an IR pass, mostly because once the selection dag runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires known about the terminators (which we only some of the time do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don’t have the nicely analyzable ‘switch’ terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.

Operating at the IR level makes writing the pass a breeze, and we can quickly and efficiently simply lay the blocks out in the ideal ordering. Then the SelectionDAG and other layers can preserve this modulo folding and other optimization opportunities.

At the most, we should likely add probability awareness to some of these CodeGen passes to turn off (or on) their transforms around unlikely (or likely) edges.

As for why it should be an IR pass, mostly because once the selection dag runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires known about the terminators (which we only some of the time do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don't have the nicely analyzable 'switch' terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.

We should be able to analyze more terminators in the backend. Most other compilers have this ability. If we don't fix this problem now, we will only increase the distance between the functionality of the backend IR and the mid-level IR.

Operating at the IR level makes writing the pass a breeze, and we can quickly and efficiently simply lay the blocks out in the ideal ordering. Then the SelectionDAG and other layers can preserve this modulo folding and other optimization opportunities.

The biggest problem with doing block layout at the IR level is that you don't have the final CFG. Lots of passes can modify the CFG, and they will have to rely on the sorts of local heuristics that already exist (and are known to perform poorly in some cases) to redo block layout after these changes. You also can't represent jump fall-throughs in the mid-level IR (nor should you be able to).

Cameron

I think this should really live as a CodeGen pass. Is there any good reason to make it an IR pass?

So, as it happens, I was completely wrong here. CodeGen correctly preserves the ordering of blocks from IR, unless it can do folding, etc.

That’s right. However, the CFG changes quite a bit during CodeGen.

Switches can be lowered into branch trees, multiple passes can split critical edges, and then there is taildup and tailmerge.

An IR code layout algorithm simply doesn’t know the final CFG.

As for why it should be an IR pass, mostly because once the selection dag runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires known about the terminators (which we only some of the time do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don’t have the nicely analyzable ‘switch’ terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.

Those are all the wrong reasons for not doing the right thing.

Some basic blocks are glued together and must be placed next to each other. That situation can be recognized by “MBB->canFallThrough() && TII->AnalyzeBranch(MBB…)”.

Treat glued-together blocks as super-blocks, and everything should be as breezy as IR.

I realize the MBB interface is not the prettiest. Suggestions for cleaning it up are very welcome.

/jakob

I think this should really live as a CodeGen pass. Is there any good reason to make it an IR pass?

So, as it happens, I was completely wrong here. CodeGen correctly preserves the ordering of blocks from IR, unless it can do folding, etc.

That’s right. However, the CFG changes quite a bit during CodeGen.

Switches can be lowered into branch trees, multiple passes can split critical edges, and then there is taildup and tailmerge.

An IR code layout algorithm simply doesn’t know the final CFG.

To be clear, I don’t disagree with any of this. =] However, my rough experiments thus far (hoping to have real benchmark data soon) seem to indicate that just giving a baseline ordering of block to the CodeGen layer

As for why it should be an IR pass, mostly because once the selection dag runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires known about the terminators (which we only some of the time do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don’t have the nicely analyzable ‘switch’ terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.

Those are all the wrong reasons for not doing the right thing.

Sorry, I’m not trying to do the wrong thing because of this… Currently, it feels like a trade-off in terms of cost/benefit. It’s not yet clear to me that the benefit of doing this analysis in the CodeGen layer outweighs the cost and I was trying to clarify what the costs I perceive are.

Some basic blocks are glued together and must be placed next to each other. That situation can be recognized by “MBB->canFallThrough() && TII->AnalyzeBranch(MBB…)”.

Treat glued-together blocks as super-blocks, and everything should be as breezy as IR.

But that’s just the thing – a primary goal of this pass would be to change the fall-through pattern. Currently, that can be done very easily, although to a limited extent, by changing the IR which enters the selection dag.

Maybe what we need is to have this pass at both layers? Then the codegen layer can work on the glued-together blocks to check for (and correct) any inappropriate CFG changes made in the intervening passes?

Also, it’s still not clear to me how to analyze switches in CodeGen, but that’s likely my lack of having read the appropriate interfaces thoroughly.

Treat glued-together blocks as super-blocks, and everything should be as breezy as IR.

But that’s just the thing – a primary goal of this pass would be to change the fall-through pattern.

That’s not a problem. I wasn’t talking about normal fall-through blocks. You simply call MBB->updateTerminator() for those.

Also, it’s still not clear to me how to analyze switches in CodeGen, but that’s likely my lack of having read the appropriate interfaces thoroughly.

Switches aren’t real, so they don’t exist in CodeGen.

Parts of switches can be lowered to jump tables and indirect branches.

An indirect branch will cause AnalyzeBranch() to fail, but canFallThrough() will still return false, and it is safe to move the successors around. This also works for computed goto.

/jakob

I think it’s mostly about understanding how MBBs work.

Ignoring calls and returns, most machines have three kinds of branches:

  1. Unconditional
  2. Conditional
  3. Indirect.

The AnalyzeBranch() function understands the first two kinds, so if that function returns false (as in it’s false that it didn’t succeed) you can move the successors around, and you know that placing a successor immediately after the block and calling updateTerminator() will give you a fall-through.

If AnalyzeBranch() fails, you can still check if the last instruction in the block is an unpredicated barrier. If so, it is still safe to move the successors around, but that block will never be a fall-through. The canFallThrough() function implements this check.

If the last instruction in the block is predicated or not a barrier, you must keep it together with its layout successor. This should only happen in rare cases where it is necessary. For example, I am planning to lower invoke instructions into call instructions that are terminators. This is necessary to accurately model control flow to landing pads. Such a call instruction must fall through to its layout successor.

Some experimental targets don’t implement AnalyzeBranch, so everything looks like an indirect branch. Those targets get the code placement they deserve.

I am not claiming the API is awesome, but the information you need is there, and you have the same freedom as for IR.

We explicitly designed the branch weights so switch lowering could annotate all the new branches with exact weights. It would be a shame to ignore that information.

So the benefits are:

  • Profile-driven fall-through layout of lowered switches. That should be a pretty big deal.
  • Proper placement of split critical edges.
  • The ability to implement stuff like: “Don’t put too many branches in a fetch group, or you’ll freak out the branch predictor”.

/jakob

As for why it should be an IR pass, mostly because once the selection dag runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires known about the terminators (which we only some of the time do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don’t have the nicely analyzable ‘switch’ terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.

Those are all the wrong reasons for not doing the right thing.

Sorry, I’m not trying to do the wrong thing because of this… Currently, it feels like a trade-off in terms of cost/benefit. It’s not yet clear to me that the benefit of doing this analysis in the CodeGen layer outweighs the cost and I was trying to clarify what the costs I perceive are.

I think it’s mostly about understanding how MBBs work.

Indeed, that seems to be the case. =D Thanks for explaining things below, it helped me a lot.

Ignoring calls and returns, most machines have three kinds of branches:

  1. Unconditional
  2. Conditional
  3. Indirect.

The AnalyzeBranch() function understands the first two kinds, so if that function returns false (as in it’s false that it didn’t succeed) you can move the successors around, and you know that placing a successor immediately after the block and calling updateTerminator() will give you a fall-through.

If AnalyzeBranch() fails, you can still check if the last instruction in the block is an unpredicated barrier. If so, it is still safe to move the successors around, but that block will never be a fall-through. The canFallThrough() function implements this check.

If the last instruction in the block is predicated or not a barrier, you must keep it together with its layout successor. This should only happen in rare cases where it is necessary. For example, I am planning to lower invoke instructions into call instructions that are terminators. This is necessary to accurately model control flow to landing pads. Such a call instruction must fall through to its layout successor.

Some experimental targets don’t implement AnalyzeBranch, so everything looks like an indirect branch. Those targets get the code placement they deserve.

I am not claiming the API is awesome, but the information you need is there, and you have the same freedom as for IR.

We explicitly designed the branch weights so switch lowering could annotate all the new branches with exact weights. It would be a shame to ignore that information.

So the benefits are:

  • Profile-driven fall-through layout of lowered switches. That should be a pretty big deal.
  • Proper placement of split critical edges.
  • The ability to implement stuff like: “Don’t put too many branches in a fetch group, or you’ll freak out the branch predictor”.

These all seem like really good reasons, and your explanation helps me a lot. I’ll take a stab at re-implementing this on MBBs. In the mean time, I’ve attached my patch with the IR-level patch as most of the interesting logic will remain the same. Please be gentle, it’s my first proper LLVM pass, and my knowledge of optimization pass research & papers is limited. The part I’m least pleased with is the computation of weight for each chain in an SCC of chains, but so far I’ve not come up with anything better. =] I’ve tested this on several ad-hoc test cases, but nothing really thorough. Going to try moving it to the codegen layer first.

One question that remains, mostly to ensure I’ve understood you correctly: For switches, it is represented as an N-ary conditional terminator, and the targets of the switch can be freely intermingled with other MBBs?

I’ll attach an updated patch to work on MBBs when I have one…

block_placement_ir.patch (23.6 KB)

Ok, wow that wasn’t hard at all. This is still very much a rough draft, but it’s probably better to review that the previous patch. One big caveat, I know I have an iteration bug in here somewhere that is inf-looping. Just ran out of steam debugging it, will pick it back up again later today to shake it out.

Still, for the test cases that don’t tickle the iteration bug, it generates beautiful, well laid out code. =]

% cat ifchain.ll
; RUN: opt < %s -analyze -block-freq | FileCheck %s

declare void @error(i32 %i, i32 %a, i32 %b)

define i32 @test(i32 %i, i32* %a, i32 %b) {
entry:
%gep1 = getelementptr i32* %a, i32 1
%val1 = load i32* %gep1
%cond1 = icmp ugt i32 %val1, 1
br i1 %cond1, label %then1, label %else1, !prof !0

then1:
call void @error(i32 %i, i32 1, i32 %b)
br label %else1

else1:
%gep2 = getelementptr i32* %a, i32 2
%val2 = load i32* %gep2
%cond2 = icmp ugt i32 %val2, 2
br i1 %cond2, label %then2, label %else2, !prof !0

then2:
call void @error(i32 %i, i32 1, i32 %b)
br label %else2

else2:
%gep3 = getelementptr i32* %a, i32 3
%val3 = load i32* %gep3
%cond3 = icmp ugt i32 %val3, 3
br i1 %cond3, label %then3, label %else3, !prof !0

then3:
call void @error(i32 %i, i32 1, i32 %b)
br label %else3

else3:
%gep4 = getelementptr i32* %a, i32 4
%val4 = load i32* %gep4
%cond4 = icmp ugt i32 %val4, 4
br i1 %cond4, label %then4, label %else4, !prof !0

then4:
call void @error(i32 %i, i32 1, i32 %b)
br label %else4

else4:
%gep5 = getelementptr i32* %a, i32 3
%val5 = load i32* %gep5
%cond5 = icmp ugt i32 %val5, 3
br i1 %cond5, label %then5, label %exit, !prof !0

then5:
call void @error(i32 %i, i32 1, i32 %b)
br label %exit

exit:
ret i32 %b
}

!0 = metadata !{metadata !“branch_weights”, i32 4, i32 64}

% ./bin/llc -O2 -o - ifchain.ll
.file “ifchain.ll”
.text
.globl test
.align 16, 0x90
.type test,@function
test: # @test
.Ltmp4:
.cfi_startproc

BB#0: # %entry

pushq %rbp
.Ltmp5:
.cfi_def_cfa_offset 16
pushq %r14
.Ltmp6:
.cfi_def_cfa_offset 24
pushq %rbx
.Ltmp7:
.cfi_def_cfa_offset 32
.Ltmp8:
.cfi_offset %rbx, -32
.Ltmp9:
.cfi_offset %r14, -24
.Ltmp10:
.cfi_offset %rbp, -16
movl %edx, %ebx
movq %rsi, %r14
movl %edi, %ebp
cmpl $2, 4(%r14)
jb .LBB0_2
.LBB0_2: # %else1
cmpl $3, 8(%r14)
jb .LBB0_4
.LBB0_4: # %else2
cmpl $4, 12(%r14)
jb .LBB0_6
.LBB0_6: # %else3
cmpl $5, 16(%r14)
jb .LBB0_8
.LBB0_8: # %else4
cmpl $4, 12(%r14)
jb .LBB0_10
.LBB0_10: # %exit
movl %ebx, %eax
popq %rbx
popq %r14
popq %rbp
ret
.LBB0_1: # %then1
movl %ebp, %edi
movl $1, %esi
movl %ebx, %edx
callq error
.LBB0_3: # %then2
movl %ebp, %edi
movl $1, %esi
movl %ebx, %edx
callq error
.LBB0_5: # %then3
movl %ebp, %edi
movl $1, %esi
movl %ebx, %edx
callq error
.LBB0_7: # %then4
movl %ebp, %edi
movl $1, %esi
movl %ebx, %edx
callq error
.LBB0_9: # %then5
movl %ebp, %edi
movl $1, %esi
movl %ebx, %edx
callq error
.Ltmp11:
.size test, .Ltmp11-test
.Ltmp12:
.cfi_endproc
.Leh_func_end0:

.section “.note.GNU-stack”,"",@progbits

block_placement_codegen.patch (25.3 KB)

Ok, wow that wasn’t hard at all. This is still very much a rough draft, but it’s probably better to review that the previous patch. One big caveat, I know I have an iteration bug in here somewhere that is inf-looping. Just ran out of steam debugging it, will pick it back up again later today to shake it out.

Oh, of course, i’m not updating the terminators yet… I’ll add that…

Could that cause an infloop when iterating over MBBs in a function?
Specifically what I’m seeing is a single MBB iterator which when incremented returns itself. Every time. even calling ‘dump’ on the MachineFunction inf-loops, so clearly I’ve corrupted the state of the function somehow.

Ok, wow that wasn't hard at all.

Awesome :wink:

This is still *very* much a rough draft, but it's probably better to review that the previous patch. One big caveat, I know I have an iteration bug in here somewhere that is inf-looping. Just ran out of steam debugging it, will pick it back up again later today to shake it out.

Maybe splicing a block into it current position will create a loop?

Some random notes:

- Please add a description of the algorithm.
- Please add a comment to the BlockChain class.
- Use a separate anonymous namespace per class, and don't indent for the namespace.

+BlockChain *BlockPlacement2::CreateChain(MachineBasicBlock *BB) {
+ Chains.push_back(BlockChain(BlockToChain, BB));
+ BlockToChain[BB] = &Chains.back();
+ assert(ActiveChains.insert(&Chains.back()));
+ return &Chains.back();
+}

Whoa, you are storing pointers into a growing vector. You should at least assert(Chains.size() != Chains.capacity()) before pushing.

+ RequiredChain->BBEnd = ++MachineFunction::iterator(From);

Does C++ allow this these days? I would prefer llvm::next(MachineFunction::iterator(From))

Note that MachineFunction::iterator decays into an MBB pointer, so you can say FI->canFallThrough() and AnalyzeBranch(*FI...)

+ WeightedEdge WE = { BaseFrequency * MBPI->getEdgeProbability(From, To),

Did we add API for this computation? We intended to add a function that computes this and saturates on overflow etc.

Still, for the test cases that don't tickle the iteration bug, it generates beautiful, well laid out code. =]

I am sure others have strong opinions about the algorithm :wink:

I would like to see a more explicit handling of loop layout. There is a tradeoff between entering the loop with a branch, and having the exit block last. It is not clear to me that this algorithm handles that properly.

Be careful about placing too much trust in the behavior of branch probabilities. They go out of whack when they saturate.

Second, you should handle blocks that can never fall through (indirect branches from jump tables). There is no need to try to place their successors.

/jakob

Ok, wow that wasn't hard at all.

Awesome :wink:

This is still *very* much a rough draft, but it's probably better to review that the previous patch. One big caveat, I know I have an iteration bug in here somewhere that is inf-looping. Just ran out of steam debugging it, will pick it back up again later today to shake it out.

Maybe splicing a block into it current position will create a loop?

Some random notes:

- Please add a description of the algorithm.
- Please add a comment to the BlockChain class.
- Use a separate anonymous namespace per class, and don't indent for the namespace.

+BlockChain *BlockPlacement2::CreateChain(MachineBasicBlock *BB) {
+ Chains.push_back(BlockChain(BlockToChain, BB));
+ BlockToChain[BB] = &Chains.back();
+ assert(ActiveChains.insert(&Chains.back()));
+ return &Chains.back();
+}

Whoa, you are storing pointers into a growing vector. You should at least assert(Chains.size() != Chains.capacity()) before pushing.

+ RequiredChain->BBEnd = ++MachineFunction::iterator(From);

Does C++ allow this these days? I would prefer llvm::next(MachineFunction::iterator(From))

Note that MachineFunction::iterator decays into an MBB pointer, so you can say FI->canFallThrough() and AnalyzeBranch(*FI...)

+ WeightedEdge WE = { BaseFrequency * MBPI->getEdgeProbability(From, To),

Did we add API for this computation? We intended to add a function that computes this and saturates on overflow etc.

Still, for the test cases that don't tickle the iteration bug, it generates beautiful, well laid out code. =]

I am sure others have strong opinions about the algorithm :wink:

I would like to see a more explicit handling of loop layout. There is a tradeoff between entering the loop with a branch, and having the exit block last. It is not clear to me that this algorithm handles that properly.

I agree. The current CodePlacementOpt pass handles blocks layout within a loop. The new BB placement optimization should be integrated with it.

Evan

This is still *very* much a rough draft, but it's probably better to review that the previous patch. One big caveat, I know I have an iteration bug in here somewhere that is inf-looping. Just ran out of steam debugging it, will pick it back up again later today to shake it out.

Some random notes:

- Please add a description of the algorithm.
- Please add a comment to the BlockChain class.
- Use a separate anonymous namespace per class, and don't indent for the namespace.

...

Note that MachineFunction::iterator decays into an MBB pointer, so you can say FI->canFallThrough() and AnalyzeBranch(*FI...)

+ WeightedEdge WE = { BaseFrequency * MBPI->getEdgeProbability(From, To),

Did we add API for this computation? We intended to add a function that computes this and saturates on overflow etc.

Overflow is handled transparently in the overloaded BlockFrequency::operator*(BranchProbability). But you could use the existing API anyway by adding a helper to MachineBlockFrequencyInfo calling BlockFrequencyImpl::getEdgeFreq(Src,Dst).

Still, for the test cases that don't tickle the iteration bug, it generates beautiful, well laid out code. =]

I am sure others have strong opinions about the algorithm :wink:

I may be one of those others ;-). Although handling SCCs (loops) probably saves Chandler's design from going too far off the rails. I don't remember that being in the published algorithm.

I would like to see a more explicit handling of loop layout. There is a tradeoff between entering the loop with a branch, and having the exit block last. It is not clear to me that this algorithm handles that properly.

How do you ensure that BlockChains start at loop headers?

How do you ensure that loop exits are laid out following the loop body, especially in the presence of critical edges?

Why not layout blocks according to MachineLoopInfo? The SCC abstraction seems overly generic and unnecessary.

When you merge chains, why don't you delete the edge between the chains?

Why do you run profile guided block layout after the existing CodePlacementOpt? Shouldn't it be the other way around so that CodePlacementOpt can cleanup loops, which BlockPlacement2 doesn't handle well? I think the answer is that BlockPlacement2 actually depends on loops being laid out sensibly before running, but that needs to be explained.

Be careful about placing too much trust in the behavior of branch probabilities. They go out of whack when they saturate.

Second, you should handle blocks that can never fall through (indirect branches from jump tables). There is no need to try to place their successors.

Please put this under some flag for now. Enabling it by default opens up a whole set of issues that are premature to discuss. Until the overall code layout strategy is more coherent, people will only want to enable profile guided layout when they have complete profile info, or really need to take full advantage of builtin expect. If you're not one of those people, you'll want to leave this disabled to avoid unnecessary misery when debugging generated disassembly, and reduce the number of places that codegen makes semi-random decisions that perturb the code.

Otherwise, the code looks nice! Good comments.

-Andy

Thanks for all of the comments Andy and Jakob. A new patch is attached that is much less of a rough draft. Sorry for any confusion due to the early state of the patch.

Also, many many thanks to Jakob for his explanation of how the branches between MBBs should be handled, and Andy, Jim, and Eric who helped answer my questions on IRC when I ran into stumbling blocks. Also, Nick, who shoulder surfed and helped with a lot of the tricky debugging. The previous patch had some… very annoying to track down bugs. =]

Still, I never intended this to be on-by-default at first. This is a starting point, that I hope can be improved into something that is on-by-default eventually, but I’ll be the first to say it’s almost certainly not ready for that just yet.

I’m more hopeful that we can delete the existing block placement pass, and direct anyone interested in profile file guided stuff to write a simple pass to load profiles into metadata. I suspect this pass is already superior to that one.

Responses to some comments inline:

This is still very much a rough draft, but it’s probably better to review that the previous patch. One big caveat, I know I have an iteration bug in here somewhere that is inf-looping. Just ran out of steam debugging it, will pick it back up again later today to shake it out.

Some random notes:

  • Please add a description of the algorithm.

There is more described now in various comments, but I’m going to work on an overview in the file comments. Not done yet, but wanted to get it back out for review in a more polished state.

  • Please add a comment to the BlockChain class.

Done.

  • Use a separate anonymous namespace per class, and don’t indent for the namespace.

Done.

Note that MachineFunction::iterator decays into an MBB pointer, so you can say FI->canFallThrough() and AnalyzeBranch(*FI…)

I’m aware, but there were a few weird places where it didn’t happen, and ‘From’ seemed like a more readable name anyways… Lemme know if you’d really rather I use the iterator everywhere.

  • WeightedEdge WE = { BaseFrequency * MBPI->getEdgeProbability(From, To),

Did we add API for this computation? We intended to add a function that computes this and saturates on overflow etc.

Overflow is handled transparently in the overloaded BlockFrequency::operator*(BranchProbability). But you could use the existing API anyway by adding a helper to MachineBlockFrequencyInfo calling BlockFrequencyImpl::getEdgeFreq(Src,Dst).

Yea, this should be handled by the BlockFrequency object correctly here, but if you’d rather see the helper I can add that. Was just keeping this patch more focused on the pass.

Still, for the test cases that don’t tickle the iteration bug, it generates beautiful, well laid out code. =]

I am sure others have strong opinions about the algorithm :wink:

I may be one of those others ;-). Although handling SCCs (loops) probably saves Chandler’s design from going too far off the rails. I don’t remember that being in the published algorithm.

It definitely isn’t. The published algorithm is very hand-wave-y about exactly how you select between chains when there isn’t an obvious ordering.

I would like to see a more explicit handling of loop layout. There is a tradeoff between entering the loop with a branch, and having the exit block last. It is not clear to me that this algorithm handles that properly.

Keep in mind that the SCCs are across block chains. Most of the loops I saw formed a single chain with back edges, and so would be laid out exactly as they came into this pass, modulo fallthrough branches that for some reason we have probability information that says aren’t likely to be taken. That said, I haven’t done a thorough investigation of this… See below…

How do you ensure that BlockChains start at loop headers?

How do you ensure that loop exits are laid out following the loop body, especially in the presence of critical edges?

These are really interesting questions. I’m not sure how to answer them. Is this something we could continue to iterate on? Maybe you could point me at how to dig into these issues?

Why not layout blocks according to MachineLoopInfo? The SCC abstraction seems overly generic and unnecessary.

I’m not sure – the SCCs here are often at a much coarser granualarity than those in the MBB graph. If anything, it might be good to use MachineLoopInfo to force loops into a single pre-laid-out chain; but I haven’t looked at that pass yet. I’m interested in adding this kind of special and careful logic to the pass though, and making it much more loop-structure-aware.

When you merge chains, why don’t you delete the edge between the chains?

No good reason. The edges were a wart on the entire design. They’re gone now, and we just use the BB → successor list → block-to-chain mapping sequence to deduce edges when needed.

Why do you run profile guided block layout after the existing CodePlacementOpt? Shouldn’t it be the other way around so that CodePlacementOpt can cleanup loops, which BlockPlacement2 doesn’t handle well?

Yep, I just slapped it in there for testing. I’ve put it immediately before the CodePlacementOpt pass, but I know very little about the other passes. Let me know if there is a better home for it.

I think the answer is that BlockPlacement2 actually depends on loops being laid out sensibly before running, but that needs to be explained.

Essentially, yes. How should we document it? Does that change if we teach it to explicitly use some of the loop analysis passes?

Be careful about placing too much trust in the behavior of branch probabilities. They go out of whack when they saturate.

Second, you should handle blocks that can never fall through (indirect branches from jump tables). There is no need to try to place their successors.

Please put this under some flag for now. Enabling it by default opens up a whole set of issues that are premature to discuss. Until the overall code layout strategy is more coherent, people will only want to enable profile guided layout when they have complete profile info, or really need to take full advantage of builtin expect. If you’re not one of those people, you’ll want to leave this disabled to avoid unnecessary misery when debugging generated disassembly, and reduce the number of places that codegen makes semi-random decisions that perturb the code.

Couldn’t agree more. The goal is to get something started.

Otherwise, the code looks nice! Good comments.

Thanks! Please let me know both what needs to be done to get this first version checked in, and what the important next steps are.

On my end, I’m working on getting some benchmark numbers next.

block_placement_codegen.patch (29.2 KB)

A new patch is attached that is *much* less of a rough draft. Sorry for any confusion due to the early state of the patch.

Thanks, Chandler. This is great stuff.

Still, I never intended this to be on-by-default at first. This is a starting point, that I hope can be improved into something that is on-by-default eventually, but I'll be the first to say it's almost certainly not ready for that just yet.

I would like to see this go in ASAP under a flag. I prefer to see development as commits rather than a series of updated patches.

Could you rename it to MachineBlockPlacement.cpp first, though? That makes it clear it's a CodeGen pass, and the BlockPlacement2 name is icky.

I'm more hopeful that we can delete the existing block placement pass, and direct anyone interested in profile file guided stuff to write a simple pass to load profiles into metadata. I suspect this pass is already superior to that one.

I also see it as a replacement for CodePlacementOpt.cpp, so I think your flag should run your pass instead of CodePlacementOpt rather than before or after it.

You should simply clone the loop alignment stuff, it's pretty trivial.

/jakob

A new patch is attached that is much less of a rough draft. Sorry for any confusion due to the early state of the patch.

Thanks, Chandler. This is great stuff.

Still, I never intended this to be on-by-default at first. This is a starting point, that I hope can be improved into something that is on-by-default eventually, but I’ll be the first to say it’s almost certainly not ready for that just yet.

I would like to see this go in ASAP under a flag. I prefer to see development as commits rather than a series of updated patches.

Awesome, same way I feel. Checked in as r142641. I’ve got some generic cleanup I’ll go ahead and commit to it over the next few days.

Also, let me know whether you’d prefer post-commit or pre-commit review for various feature additions we’ve talked about here. I’m fine either way, so its really what makes things easier on your end. I’ll definitely want review on all of it as some of this (especially the loop structure issues) is still a bit new to me. =D

At least some of it (the loop alignment from CodePlacementOpt) is obvious how to port across, but I’m sure there will be more tricky elements around the loop structures.

Could you rename it to MachineBlockPlacement.cpp first, though? That makes it clear it’s a CodeGen pass, and the BlockPlacement2 name is icky.

Yea, the 2 thing was only intended to work around a brief duplication with the existing pass. MachineBlockPlacement solves both problems nicely.

I’m more hopeful that we can delete the existing block placement pass, and direct anyone interested in profile file guided stuff to write a simple pass to load profiles into metadata. I suspect this pass is already superior to that one.

I also see it as a replacement for CodePlacementOpt.cpp, so I think your flag should run your pass instead of CodePlacementOpt rather than before or after it.

You should simply clone the loop alignment stuff, it’s pretty trivial.

Done, and in progress. SHould have the loop alignment stuff cloned right away, and then i’ll start looking at the loop structure issues. I’ll probably have some questions for you and/or andy about exactly how that should work, whether CodePlacementOpt is doing the right thing, etc.

A new patch is attached that is much less of a rough draft. Sorry for any confusion due to the early state of the patch.

Thanks, Chandler. This is great stuff.

Still, I never intended this to be on-by-default at first. This is a starting point, that I hope can be improved into something that is on-by-default eventually, but I’ll be the first to say it’s almost certainly not ready for that just yet.

I would like to see this go in ASAP under a flag. I prefer to see development as commits rather than a series of updated patches.

Awesome, same way I feel. Checked in as r142641. I’ve got some generic cleanup I’ll go ahead and commit to it over the next few days.
Done, and in progress. SHould have the loop alignment stuff cloned right away, and then i’ll start looking at the loop structure issues. I’ll probably have some questions for you and/or andy about exactly how that should work, whether CodePlacementOpt is doing the right thing, etc.

Your work looks good and there’s talk of making this the default
layout pass. So let’s figure out the goals and requirements. We agree
that preserving “structure” of the CFG is important in the absence of
profile information to the contrary. This is important for performance
stability in the presence of imperfect profile data and unbiased
branches. But it’s also important for the sanity of fellow compiler
engineers and anyone debugging the disassembly. We can define
structure very simply by saying that we want contiguous loops and
topologically ordered regions within a loop. Those are not the
strongest constraints. We could still end up with strangely shuffled
code, but they are easy constraints to define.

Once you decide to break these constraints, you have effectively
designated the offending paths as “cold”. It’s up to you how to
decide which paths are cold. You’ll need some threshold based on
confidence in the profile, and we can measure the effectiveness of
your heuristics and add more over time. The key point is that the
layout algorithm needs to make an explicit binary decision at some
point to violate the natural structure of the CFG.

I don’t care much about how you layout the cold chains as long as they
are not interleaved with the rest of the code, thus breaking the
topological ordering and forcing extra branches. In practice, this
means they get their own little Siberia after the function return,
effectively splitting the function into two sections. However, the
notion of “coldness” is really relative to the current loop head. So
you may layout a nice contiguous, topologically ordered inner loop,
then determine the whole thing is cold relative to its outer loop, so
send the whole thing packing to Siberia. That’s exactly what we
want. The loop itself is sane, but doesn’t get in the way of hotter
enclosing code.

Even within the constraints imposed by CFG structure, you can reduce
the likelihood of taken branches. Here you can employ any
profile-based or static heuristics with lower confidence, because the
resulting layout will be reasonably good either way. This will still
do a great job in the absence of accurate profile information, builtin
expect, or very strong static heuristics (e.g. landing pads are cold).

This is where the algorithm gets interesting and you can be creative,
but I’ll give you a rough idea so we’re on the same page:

  1. Visit each loop independently inside-out.

  2. Construct chains of blocks within the loop. Here the “entry” chain
    is the loop header, or function entry. You can probably represent
    each already laid out inner loop as a single chain. The graph of
    chains within a loop is acyclic except for irreducible control
    flow, which doesn’t have any structure to preserve anyway. There’s
    no need to compute SCCs.

  3. Stop constructing the chain at a CFG merge. Chaining past a merge
    would immediately violate topological ordering, which is only ok if
    other predecessors of the merge have been designated “cold”.

  4. Splice the chains together, preserving topological order. Assigning
    each block to a DFS reverse-post-order number provides a nice
    ordering that is easy to compare or sort.

Let me give you a little example to think about:

A:
br B(80%), C(20%)
B:
br D
C:
br D(90%), E(10%)
D:
br F
E:
br F
F:
ret

This is a fine topological sort but bad layout based on expected
branch probability (even with low confidence in the profile).

The only good layout that preserves topological ordering is:
A, B, C, E, D, F

.8 * 1 branch + .18 * 2 branches + .02 * 2 branches

This approach likely covers the existing functionality of
CodePlacement, which attempts to keep loops contiguous.
You will probably need to do the loop tail adjustment that
CodePlacement currently does as a special case.
(Shift the tail of the loop to precede its header, resulting in
a fall-through loop test.)

We can discuss the design more, but you may want to think about it and
come with your own design first.

Finally, we need some way to validate the design and verify the
implementation. Weighted statistics will be very useful here, similar
to those that Jakob used in the register allocator. For example:

For each nonadjacent edge:
TakenBranchFrequency += frequency(Edge)

This metric will prefer the most “extreme” layout because it assumes
perfect accuracy of block frequencies. It would be nice to look at the
same metric computed from skewed branch probabilities to see how
sensitive it is to small shifts in branch behavior.

Taken branch frequency is the most obvious, but doesn’t completely
capture locality. You could also measure distance jumped fairly
easily, also weighted by edge frequency.

It might be useful to gather these stats as late as possible to take
into consideration the final impact of layout, including the effect on
any later passes.

-Andy

A new patch is attached that is much less of a rough draft. Sorry for any confusion due to the early state of the patch.

Thanks, Chandler. This is great stuff.

Still, I never intended this to be on-by-default at first. This is a starting point, that I hope can be improved into something that is on-by-default eventually, but I’ll be the first to say it’s almost certainly not ready for that just yet.

I would like to see this go in ASAP under a flag. I prefer to see development as commits rather than a series of updated patches.

Awesome, same way I feel. Checked in as r142641. I’ve got some generic cleanup I’ll go ahead and commit to it over the next few days.

Done, and in progress. SHould have the loop alignment stuff cloned right away, and then i’ll start looking at the loop structure issues. I’ll probably have some questions for you and/or andy about exactly how that should work, whether CodePlacementOpt is doing the right thing, etc.

Your work looks good and there’s talk of making this the default
layout pass. So let’s figure out the goals and requirements. We agree
that preserving “structure” of the CFG is important in the absence of
profile information to the contrary. This is important for performance
stability in the presence of imperfect profile data and unbiased
branches. But it’s also important for the sanity of fellow compiler
engineers and anyone debugging the disassembly. We can define
structure very simply by saying that we want contiguous loops and
topologically ordered regions within a loop. Those are not the
strongest constraints. We could still end up with strangely shuffled
code, but they are easy constraints to define.

Once you decide to break these constraints, you have effectively
designated the offending paths as “cold”. It’s up to you how to
decide which paths are cold. You’ll need some threshold based on
confidence in the profile, and we can measure the effectiveness of
your heuristics and add more over time. The key point is that the
layout algorithm needs to make an explicit binary decision at some
point to violate the natural structure of the CFG.

I don’t care much about how you layout the cold chains as long as they
are not interleaved with the rest of the code, thus breaking the
topological ordering and forcing extra branches. In practice, this
means they get their own little Siberia after the function return,
effectively splitting the function into two sections. However, the
notion of “coldness” is really relative to the current loop head. So
you may layout a nice contiguous, topologically ordered inner loop,
then determine the whole thing is cold relative to its outer loop, so
send the whole thing packing to Siberia. That’s exactly what we
want. The loop itself is sane, but doesn’t get in the way of hotter
enclosing code.

Even within the constraints imposed by CFG structure, you can reduce
the likelihood of taken branches. Here you can employ any
profile-based or static heuristics with lower confidence, because the
resulting layout will be reasonably good either way. This will still
do a great job in the absence of accurate profile information, builtin
expect, or very strong static heuristics (e.g. landing pads are cold).

This is where the algorithm gets interesting and you can be creative,
but I’ll give you a rough idea so we’re on the same page:

  1. Visit each loop independently inside-out.

  2. Construct chains of blocks within the loop. Here the “entry” chain
    is the loop header, or function entry. You can probably represent
    each already laid out inner loop as a single chain. The graph of
    chains within a loop is acyclic except for irreducible control
    flow, which doesn’t have any structure to preserve anyway. There’s
    no need to compute SCCs.

  3. Stop constructing the chain at a CFG merge. Chaining past a merge
    would immediately violate topological ordering, which is only ok if
    other predecessors of the merge have been designated “cold”.

  4. Splice the chains together, preserving topological order. Assigning
    each block to a DFS reverse-post-order number provides a nice
    ordering that is easy to compare or sort.

FYI, I’m still working on the details, and the code is a complete hack, but I have a big chunk of this working now. =] The more precise strategy I implemented follows your suggestion very closely. I’ve got lots more ideas to layer on top of this, but this is my first stab at providing a better underpinning.

First, it walks all loops in the function from the inside out just as you describe. =] Really like this, it’s a very nice model to use.

For each loop, after having walked all sub-loops, it walks forward across all the blocks in the loop. For each block not part of a chain, it creates a single-block chain. If the block has more than one predecessor (I think this is what you were after w/ a “merge point”?), we’re done. If the block is the start of the chain it belongs to, and if it is the most likely of the in-loop successors of the previous block, and that previous block is the tail of its chain, we merge its chain to the previous block’s chain. This should only form chains which are valid in the existing topological ordering within a loop, and don’t cross loop boundaries or cross incoming branches.

Then we traverse the BB graph in RPO, and for each block look up its chain. The first time we encounter a chain, we splice it into the function.

I think this satisfies your desire to preserve the structural components of the CFG? If not, I’d love to understand where I’ve deviated / what I’ve misunderstood.

It already makes some minimal use of the probabilities to select which topology-preserving fall-through options to prefer. I’m also going to tie-break with the existing layout. I’ve not started to really integrate it with probabilities though, I wanted to get something that fit the constraint model you’re looking for first, and had clean traversal semantics.

I’ll clean the patch up and mail it out tomorrow. The implementation gets tremendously simpler by A) using the loop info, and B) exclusively walking forward over the blocks. The first simplified the traversal greatly, and the second made it easy to just store pointers to basic blocks for the chains rather than moving blocks around in the function itself. That gets rid of a bunch of other code, and the most bug-prone of the other code. So all in all, I like where this is going.