[LLVM IR] Basic Block reordering in LLVM IR

Hello,
I was trying to find out whether there is any pass in LLVM IR that does basic block reordering. BOLT does BB reordering at the machine-code level; however, I am unsure whether this happens at the IR level at all.
It would be great if I could get any clarification on that.
Thanks,
Vidush

Block reordering really only makes sense when you are serializing the blocks, which happens at the machine level (which is where BOLT operates, IIUC). Block ordering can give you advantages such as avoiding a taken branch on a common path, or collecting hot blocks together so they are adjacent in the cache.
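To make that intuition concrete, here is a toy sketch in plain Python (nothing LLVM-specific; the block names and edge weights are invented for illustration) of a greedy layout that keeps the hottest successor of each block as its fall-through:

```python
# Toy sketch of machine-level block layout: greedily chain each block to
# its hottest not-yet-placed successor, so the common path falls through
# and cold blocks drift to the end. All names/weights here are made up.

def layout(entry, edges):
    """Order blocks to maximize fall-through along hot edges.

    `edges` maps (src, dst) pairs to profile weights.
    """
    order, placed = [entry], {entry}
    while True:
        cur = order[-1]
        # Hottest not-yet-placed successor of the last placed block.
        succs = [(w, dst) for (src, dst), w in edges.items()
                 if src == cur and dst not in placed]
        if succs:
            _, nxt = max(succs)
        else:
            # No fall-through candidate: start a new chain at the hottest
            # remaining edge target.
            rest = [(w, dst) for (src, dst), w in edges.items()
                    if dst not in placed]
            if not rest:
                return order
            _, nxt = max(rest)
        order.append(nxt)
        placed.add(nxt)

# A diamond CFG where the 'then' side is hot:
edges = {("entry", "then"): 90, ("entry", "else"): 10,
         ("then", "exit"): 90, ("else", "exit"): 10}
print(layout("entry", edges))  # ['entry', 'then', 'exit', 'else']
```

The hot path `entry -> then -> exit` ends up contiguous, so the common case takes no branch to a distant target; the cold `else` block is placed last. Real layout passes (e.g. MachineBlockPlacement, BOLT) use far more sophisticated heuristics, but the fall-through idea is the same.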

Block “order” doesn’t mean as much in the IR. The basic blocks do form a directed graph, and so there is an “order” in the control-flow sense. Also there are algorithms for traversing that graph that sometimes have “order” in the name, such as Depth First Order. But block reordering in the machine-code sense doesn’t really apply to IR.
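To illustrate the distinction, here is a small sketch in plain Python (not the LLVM API; the CFG is invented): a depth-first traversal is determined entirely by the edges of the graph, so it never consults whatever list the blocks happen to be stored in.

```python
# Illustrative sketch: the "order" of blocks in IR is a property of the
# control-flow graph, not of the block list. This DFS takes only the
# entry block and the successor map; storage order never enters into it.

def depth_first_order(entry, succs):
    """Preorder DFS over a CFG given as {block: [successor blocks]}."""
    seen, order = set(), []
    def visit(b):
        if b in seen:
            return
        seen.add(b)
        order.append(b)
        for s in succs.get(b, []):
            visit(s)
    visit(entry)
    return order

# A simple loop CFG: entry -> loop -> {body -> loop, exit}
cfg = {"entry": ["loop"], "loop": ["body", "exit"], "body": ["loop"]}
print(depth_first_order("entry", cfg))  # ['entry', 'loop', 'body', 'exit']
```

However the four blocks are permuted in storage, the traversal result is the same, which is the sense in which IR block order carries no meaning.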

Looping in @jdoerfert. I saw some performance improvements from random basic block orders at the IR level; for example, around 6% for bzip2 compared to an -O3 baseline.

That didn’t really answer the question.

Regarding “it doesn’t make much sense”: a) Not all backends go through the upstream MIR level; thus, if you don’t control your backend, you can’t reorder your blocks there. b) We have plenty of algorithms that do not traverse blocks in a graph-like fashion; I honestly expect reordering the block list to have an impact on them (not necessarily conceptually, but due to implementation artifacts).

Regarding “it doesn’t mean much in IR”: we see different results in some preliminary tests. The question still is what capabilities we have for reordering blocks based on some bespoke heuristic we could provide. If the answer is “BOLT and Propeller”, we will have to build something new, as most of our hardware FLOPs are neither X86 nor ARM [BOLT] and there is no Linux perf support [Propeller].

We don’t necessarily have to go through BOLT and Propeller. Context-sensitive PGO (utilizing a second round of profiling post-inlining) can capture most of the performance gains of post-link optimizers like BOLT and Propeller by enabling better optimizations such as BB layout.

Additionally, I think BOLT and Propeller might be a bit more generic than you’re thinking here (though I might be missing some obvious limitations), or at least significant elements can be reused on devices like GPUs even if the entire workflow can’t be. BOLT has been ported to other (CPU) architectures, including RISC-V. Object-file-format support is still sparse (ELF only). I don’t think there are any conceptual limitations to porting to more architectures/object formats, but I might be missing something there.

The Propeller infrastructure currently works with sampled profiles (from what I can discern), but there’s no reason it couldn’t use an instrumented profile. At the end of the day, the Propeller tooling generates a list of BB sections/symbol orderings that the linker (lld) then uses to put code into the right places. That tool can easily be substituted out. The fact that the linker is involved might preclude its use in non-open toolchains, but it should be applicable barring significant restrictions to GPU linking that I’m not aware of (in open toolchains).

Anyways, kind of off the main topic of the thread.

The order of blocks in LLVM IR doesn’t have any semantic significance. And it’s hard to predict what influence, if any, reordering would have on the backend. So we generally try to ignore block ordering in IR when possible. Passes that care about the order in which they iterate over a function don’t iterate in block-list order; they use depth-first order or something similar.
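One common edge-derived order is reverse post-order (RPO), which visits every block before its successors (along acyclic edges) no matter how the block list is stored. A minimal sketch in plain Python (not the LLVM API; the CFG below is invented):

```python
# Sketch of reverse post-order over a CFG given as {block: [successors]}.
# RPO is a traversal of the graph's edges, so permuting the stored block
# list cannot change the result.

def reverse_post_order(entry, succs):
    seen, post = set(), []
    def visit(b):
        if b in seen:
            return
        seen.add(b)
        for s in succs.get(b, []):
            visit(s)
        post.append(b)  # post-order: record a block after its successors
    visit(entry)
    return post[::-1]   # reversing the post-order gives RPO

# A diamond: A branches to B and C, which both reach D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(reverse_post_order("A", cfg))  # ['A', 'C', 'B', 'D']
```

Note that in the result, A precedes both B and C, and both precede D, which is exactly the "predecessors first" property dataflow analyses want. (LLVM provides this kind of traversal via its ADT iterators, e.g. `ReversePostOrderTraversal`.)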

At the MIR level, we do perform basic block reordering… and presumably proprietary targets which don’t use the LLVM infrastructure have something similar. We could add additional block reordering passes if we thought it was useful, similar to instruction scheduling… but I can’t think of any optimizations that actually care about knowing the BB order early.

Agreed, which means we can reorder it as we want, but it might not stay this way.

Agreed. Preferably, this order should not influence passes at all. Whether it does in practice is unclear, but that is not necessarily what we are hoping for anyway. We might be able to test this and provide feedback on if/which passes depend on it but shouldn’t.

From what we’ve seen, we do not always emit an optimal order. Whether it makes sense to improve the MIR reordering or to add one at the IR level is not clear yet. Since there is no IR pass, we might look into the MIR one for now. If we find something interesting, we’ll report back.

For LLVM IR, we use basic blocks for analysis and optimization, but they are not executable.
When the linker places basic blocks into sections (hot, cold, mildly warm), the order matters. We map the executable into memory and execute it.
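As a rough sketch of what that section placement amounts to (plain Python; the thresholds, counts, and block names are all invented, and real toolchains derive this from profile data rather than fixed cutoffs):

```python
# Toy sketch: bucketing blocks into text sections by profile hotness.
# Thresholds and execution counts below are illustrative only.

def assign_sections(counts, hot=1000, warm=10):
    """Partition blocks into hot/warm/cold by execution count."""
    sections = {"hot": [], "warm": [], "cold": []}
    for block, n in counts.items():
        if n >= hot:
            sections["hot"].append(block)
        elif n >= warm:
            sections["warm"].append(block)
        else:
            sections["cold"].append(block)
    return sections

print(assign_sections({"entry": 5000, "check": 50, "slowpath": 3}))
# {'hot': ['entry'], 'warm': ['check'], 'cold': ['slowpath']}
```

Grouping hot blocks into one contiguous section is what buys the instruction-cache and TLB locality once the executable is mapped into memory, which is why the order only starts to matter at (or after) link time.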