[GSoC] Optimizing for size

Hi all,

I'm interested in adding code size optimizations as a GSoC project. This is
necessary when compiling for embedded devices, where LLVM should optimize for
size even at the expense of speed. I'm still working on my proposal, but I'd
like some advice on the technical parts and overall project plan.

First, I would add a way to determine which parts of the code should be optimized
for size. This is currently done with an OptimizeForSize attribute on
functions, which is added by clang with -Os. I would add a more flexible method
that determines whether a given BB should be optimized for size, depending on
profiling information (if available), -Os, and other options. I could put
this in ProfileInfoT, since it would interact closely with profiling
information.
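
As a rough illustration, the query could look something like this. The
function name and the coldness threshold are hypothetical, and I'm assuming
the existing ProfileInfo interface, where getExecutionCount returns a
negative value when no data is available:

  #include "llvm/BasicBlock.h"
  #include "llvm/Analysis/ProfileInfo.h"
  using namespace llvm;

  // Hypothetical query: should this block be optimized for size?
  bool shouldOptimizeForSize(const BasicBlock *BB, ProfileInfo *PI,
                             bool OptForSize) {
    if (OptForSize)
      return true;                    // -Os: optimize everything for size.
    if (PI) {
      // With profile data, rarely executed blocks become size candidates.
      double Count = PI->getExecutionCount(BB);
      if (Count >= 0 && Count < 1.0)  // negative count means "no data"
        return true;
    }
    return false;
  }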

The next step is to make optimization passes use this information. This means
updating passes that currently check the OptimizeForSize attribute to use the
new information instead. I would also make JumpThreading, InlineCost, and
possibly other passes aware of size optimization.
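
For example, a pass that currently makes one decision per function would
switch to a per-block check (roughly; this uses the hypothetical query
sketched above):

  // Before: one decision for the whole function.
  if (F->hasFnAttr(Attribute::OptimizeForSize))
    Threshold = OptSizeThreshold;

  // After: the decision can vary per basic block.
  if (shouldOptimizeForSize(BB, PI, OptForSize))
    Threshold = OptSizeThreshold;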

After working on existing passes, I would add a new MergeBlocks pass. This is
an IPO pass that would combine equivalent basic blocks, extracting them into
new functions. Research has shown that this can decrease code size by 5-6%. The
new pass will be based on CodeExtractor and MergeFunctions; it will create a
hash table of every basic block, based on the number of instructions of
different types, and then perform a detailed comparison on blocks with the same
hash. I would first write an inefficient preliminary version that only finds
obvious cases. This can be used to get an impression of the technique's
effectiveness. I would then work on making it more efficient and able to merge
blocks in more difficult cases.
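
The coarse hash could be as simple as counting instructions per opcode.
This is just my illustration of the idea, not code from an existing pass:

  #include <map>
  #include "llvm/BasicBlock.h"
  #include "llvm/Instruction.h"
  using namespace llvm;

  // Coarse hash over opcode counts: blocks with different hashes can
  // never be equivalent, so only equal hashes need a detailed comparison.
  unsigned hashBlock(const BasicBlock &BB) {
    std::map<unsigned, unsigned> Counts;
    for (BasicBlock::const_iterator I = BB.begin(), E = BB.end(); I != E; ++I)
      ++Counts[I->getOpcode()];
    unsigned Hash = 0;
    for (std::map<unsigned, unsigned>::const_iterator It = Counts.begin(),
         End = Counts.end(); It != End; ++It)
      Hash = Hash * 37 + It->first * 5 + It->second;
    return Hash;
  }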

Tentative schedule:
2-3 weeks: Add -Os to llc and opt. Use profiling information or -Os to
           determine which BBs should be optimized for size.
3-4 weeks: Make existing passes aware of size optimization.
2-3 weeks: Preliminary implementation of block merging.
2-3 weeks: Improve block merging algorithm speed.
1-4 weeks: Make block merging handle more difficult cases.

Any thoughts?

Thanks,
Sean Bartell
(wtachi in #llvm)

Hi Sean,

Last year I wrote my master's thesis on a MergeBlocks optimization at the
LLVM IR level (at St.-Petersburg State Polytechnical University),
supervised by Anton Korobeynikov.

I'll comment on the MergeBlocks part of your proposal.

> After working on existing passes, I would add a new MergeBlocks pass. This is
> an IPO pass that would combine equivalent basic blocks, extracting them into
> new functions. Research has shown that this can decrease code size by 5-6%. The
> new pass will be based on CodeExtractor and MergeFunctions; it will create a
> hash table of every basic block, based on the number of instructions of
> different types, and then perform a detailed comparison on blocks with the same
> hash. I would first write an inefficient preliminary version that only finds
> obvious cases. This can be used to get an impression of the technique's
> effectiveness. I would then work on making it more efficient and able to merge
> blocks in more difficult cases.

My implementation of a MergeBasicBlocks pass (mergebb for short) managed
to reduce code size by ~3% on some samples from the LLVM test suite, on
both x86 and ARM.

Detecting mergeable basic blocks is pretty easy and straightforward: you
can introduce an ordering on them and detect all mergeable groups in
N log N time (very fast in practice, even without any hashing).
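
One such ordering, to illustrate (my sketch here, not the actual thesis
code), is the lexicographic order of each block's opcode sequence; after
sorting by it, candidate groups are contiguous runs:

  #include <algorithm>
  #include <string>
  #include <vector>
  #include "llvm/BasicBlock.h"
  #include "llvm/Instruction.h"
  using namespace llvm;

  // Canonical key: the opcode sequence. Equal keys are only a necessary
  // condition; operands still have to be compared in detail.
  static std::string blockKey(const BasicBlock *BB) {
    std::string Key;
    for (BasicBlock::const_iterator I = BB->begin(), E = BB->end(); I != E; ++I)
      Key.push_back((char)I->getOpcode());
    return Key;
  }

  struct KeyLess {
    bool operator()(const BasicBlock *A, const BasicBlock *B) const {
      return blockKey(A) < blockKey(B);
    }
  };

  // Sort all candidate blocks; mergeable groups become adjacent.
  void sortCandidates(std::vector<BasicBlock*> &Blocks) {
    std::sort(Blocks.begin(), Blocks.end(), KeyLess());
  }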

The hard part is deciding whether a particular group is worth extracting
into a function. I ended up using a simple heuristic (like
block_size * A - n_inputs * B - n_outputs * C) with hand-picked magic A, B,
and C coefficients for every platform. This may not be the best way to do
it; I believe one should use the code generator to actually measure the
real difference that extracting each particular basic-block group would
make. That, however, may be really slow.
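
In code the heuristic boils down to something like this (the coefficients
here are illustrative, not the actual tuned values):

  // Estimated saving per removed copy: extracting a group of GroupSize
  // identical blocks removes GroupSize - 1 copies of the body, but every
  // call site pays for passing inputs and recovering outputs.
  int profit(unsigned BlockSize, unsigned NumInputs, unsigned NumOutputs,
             unsigned GroupSize) {
    const int A = 4, B = 2, C = 3;  // hand-picked, per-target magic
    int PerCopy = (int)(BlockSize * A) - (int)(NumInputs * B)
                - (int)(NumOutputs * C);
    return PerCopy * (int)(GroupSize - 1);  // extract only if positive
  }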

To conclude, I'd say that, in my mind:
- LLVM IR is not well suited for a mergebb pass (at least not without using
  codegen to decide which blocks are worth merging). While it is very easy
  to detect equivalent blocks and extract them into functions, you have
  almost no control over code generation, which is crucial for this kind of
  optimization. Once again, without running codegen on each basic-block
  group, you have no good way to tell whether a particular group is worth
  extracting at all.
- Platform-specific low-level code generation tweaks and traditional -Os
  (-O3 without the things that blow up the size) have an order of magnitude
  bigger impact on the size of the generated code, and are "the way to go".

            Gregory