GSoC 2011: Superoptimization for LLVM IR

Hello,

I would like to present my GSoC 2011 project for LLVM below. It would be very nice to hear your suggestions and opinions, thanks!

Superoptimization for LLVM IR

Objective

This project focuses on implementing superoptimization algorithms targeted at
the LLVM IR. The project uses arbitrary LLVM bitcode as a training set to
discover new peephole optimizations that can later be integrated into LLVM's
instruction simplification phase or into a separate peephole optimization pass.

Motivation

Code optimization is the task of finding a faster sequence of instructions that
computes the same result as an existing code sequence. The old sequence may then
be replaced by the new one, yielding a faster program with the same behavior.
The problem of finding a code sequence that performs some computation in optimal
time is undecidable and, in practice, optimizations are discovered by humans
using analytical reasoning. Finding new optimizations in an automated fashion,
using a search algorithm, is a different approach to the problem and became
known in the literature as superoptimization, a term coined by Massalin [2].
Using computer-aided optimization discovery, different kinds of optimizations
that go unnoticed by humans may be detected, especially those specific to the
target machine code [1]. This project will focus on implementing a
superoptimizer that exhaustively investigates new optimizations for the LLVM IR.
Using this approach, I hope to contribute significantly to the current LLVM
instruction simplification phase by introducing new substitutions identified in
this project.

Methodology

There is already a superoptimizer for LLVM in the initial stages of development.
The existing superoptimizer currently limits its search space to a few
transformations. This project will focus on improving the current superoptimizer
by extending the number of transformations applied and thus greatly increasing
the search space.

To evaluate the superoptimizer's effectiveness, it will be run on LLVM bitcode
compiled from the entire LLVM test suite and SPEC in order to harvest possible
code sequence optimizations. Once successfully identified, these optimizations
could be implemented directly as peephole optimizations for the LLVM IR.
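
To make the harvesting step concrete, here is a rough sketch of such a pass. It
is illustrative only, and it is written against the range-based iteration API of
recent LLVM releases rather than the LLVM 2.9 headers: walk every basic block of
a module and record fixed-length windows of consecutive instructions as
candidate sequences.

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Module.h"
#include <vector>

// Collect every window of `Len` consecutive instructions in the module; each
// window is a candidate sequence for the superoptimizer to examine.
std::vector<std::vector<const llvm::Instruction *>>
harvestWindows(const llvm::Module &M, unsigned Len) {
  std::vector<std::vector<const llvm::Instruction *>> Windows;
  for (const llvm::Function &F : M)
    for (const llvm::BasicBlock &BB : F) {
      std::vector<const llvm::Instruction *> Window;
      for (const llvm::Instruction &I : BB) {
        Window.push_back(&I);
        if (Window.size() > Len)
          Window.erase(Window.begin());   // keep only the last Len instructions
        if (Window.size() == Len)
          Windows.push_back(Window);      // record this candidate sequence
      }
    }
  return Windows;
}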

Timeline

This project is expected to last three months, as follows.

• 1st month, 1st week - Study the LLVM IR in order to establish a
list of axioms, that is, valid transformations that determine the search
algorithm's capability. There is an important trade-off to investigate here,
since using fewer transformations will lead to faster searches, but several
good optimizations may be missed.
• 1st month, 2nd week - Thorough study of the currently implemented
code base and design of a new algorithm to support arbitrary transformations.
• 1st month, 3rd week until end of project - Implementation of the
search algorithm.
• 2nd month - Test phase begins. Test the algorithm with simple
transformations, searching for optimized code sequences.
• 3rd month - Harvest new optimizations and implement fixes to the
search algorithm as more performance feedback from the training data is
received.

Biography

I am a Master's student at the University of Campinas (UNICAMP), Brazil
(timezone GMT-3). In my Master's thesis, I have already worked with
superoptimization theory and automatic code sequence searching. However, instead
of using it for optimization purposes, I use it to automatically generate an
LLVM backend from a machine description written in the ArchC architecture
description language. The backend patterns are fixed, and the implementation of
each pattern is inferred using a search algorithm similar to the one used in
superoptimizers. Thus, I have experience both with superoptimizer theory and
with the LLVM code base.

References

[1] Sorav Bansal and Alex Aiken. Automatic generation of peephole
superoptimizers. SIGPLAN Not., 41:394–403, October 2006.

[2] Henry Massalin. Superoptimizer: a look at the smallest program. In
ASPLOS-II: Proceedings of the Second International Conference on Architectural
Support for Programming Languages and Operating Systems, pages 122–126,
Los Alamitos, CA, USA, 1987. IEEE Computer Society Press.

Auler,GSoC2011.pdf (100 KB)

I am no LLVM/compiler/optimization expert and am not able to mentor this project -

I just want to say that this sounds very interesting, and I'm hoping that maybe
someone else could mentor it, or at least leave some useful comments, as it sounds
like something that could improve LLVM a lot - doesn't it?

2011/4/6 Rafael Auler <rafaelauler@gmail.com>

A bit of feedback on this proposal:

It worries me that you talk about "extending the number of transformations." The point of a superoptimizer is that it has no specific model of transformations. Rather, it is simply looking for cheap code fragments that are equivalent to expensive ones.

You have not specified how you will perform equivalence checking between code sequences. The best answer that I know of is to use the technique from the Bansal and Aiken paper, which is to translate bitcode sequences into a SAT or SMT instance and just ask if they are equivalent functions.
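
Purely as an illustration of that technique (the solver choice and the example
pair here are assumptions, not something from the paper or the proposal): each
sequence is encoded as a bitvector formula and an SMT solver such as Z3 is asked
whether any input can make the two differ, e.g. "mul x, 8" versus "shl x, 3"
over 32-bit values.

// Equivalence check in the style of Bansal and Aiken, sketched with the Z3
// C++ API.  Build with something like: g++ check.cpp -lz3
#include <z3++.h>
#include <iostream>

int main() {
  z3::context ctx;
  z3::expr x = ctx.bv_const("x", 32);                  // symbolic 32-bit input

  z3::expr original  = x * ctx.bv_val(8, 32);          // mul x, 8
  z3::expr candidate = z3::shl(x, ctx.bv_val(3, 32));  // shl x, 3

  z3::solver solver(ctx);
  solver.add(original != candidate);                   // ask for a distinguishing input

  if (solver.check() == z3::unsat)
    std::cout << "equivalent for all inputs\n";
  else
    std::cout << "not equivalent; counterexample:\n" << solver.get_model() << "\n";
  return 0;
}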

You have not specified the cost model that you will use to assess the desirability of each potential transformation. Simply replacing larger sequences with smaller ones is not necessarily a good idea, since you may end up replacing a series of shift operations with a division, or something like that.

A linear sequence of LLVM instructions is perhaps not the best starting point. A better primitive unit of code would be a subgraph of the PDG. This will ensure that the superoptimizer is looking at instructions that are semantically related. Also, you will need to consider dependencies in any case to make this work, so you might as well do it from the outset.
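
As an illustrative aside (an assumption of this sketch, not something specified
in the thread): a cheap approximation of such a unit is the set of instructions
reachable through def-use (data-dependence) edges from a root instruction; a
full PDG would also track control dependences.

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/Casting.h"
#include <set>

// Collect `Root` together with the in-block instructions it transitively
// depends on; control dependences are ignored in this sketch.
void collectDataDeps(const llvm::Instruction *Root,
                     std::set<const llvm::Instruction *> &Seen) {
  if (!Seen.insert(Root).second)
    return;                                            // already visited
  for (const llvm::Value *Op : Root->operands())
    if (const auto *OpI = llvm::dyn_cast<llvm::Instruction>(Op))
      if (OpI->getParent() == Root->getParent())
        collectDataDeps(OpI, Seen);
}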

John Regehr

It worries me that you talk about "extending the number of
transformations." The point of a superoptimizer is that it has no
specific model of transformations. Rather, it is simply looking for cheap
code fragments that are equivalent to expensive ones.

I could stand to be more clear...

Could you describe explicitly the role of transformations in your superoptimizer framework? And also give examples of what kinds of transformations you are talking about?

Could you describe how it is established that two code fragments are equivalent?

John

Hi Rafael,

Methodology

There is already a superoptimizer for LLVM in the initial stages of development.
The existing superoptimizer currently limits its search space to a few
transformations. This project will focus on improving the current superoptimizer
by extending the number of transformations applied and thus greatly increasing
the search space.

it would be great if you could extend my superoptimizer to do fully general
superoptimization.

To evaluate the superoptimizer's effectiveness, it will be run on LLVM bitcode
compiled from the entire LLVM test suite and SPEC in order to harvest possible
code sequence optimizations. Once successfully identified, these optimizations
could be implemented directly as peephole optimizations for the LLVM IR.

Right, this is exactly what has been done with the existing superoptimizer,
which, in spite of its simplicity, discovered a lot of commonly occurring
simplifications in SPEC, the LLVM testsuite and the Ada ACATS testsuite. A
bunch of these were implemented in LLVM 2.9, mostly in InstructionSimplify.

• 1st month, 1st week - Study the LLVM IR in order to establish a
list of axioms, that is, valid transformations that determine the search
algorithm's capability. There is an important trade-off to investigate here,
since using fewer transformations will lead to faster searches, but several
good optimizations may be missed.

It sounds like you are planning to follow the approach of Joshi, Nelson and
Randall ("Denali: A Goal-directed Superoptimizer") in that you don't intend
to exhaustively enumerate all possible code sequences, and see if they are
the same as the original only better; but instead start from the original code
sequence and repeatedly apply transformations that change code to equivalent
code, looking to see if you can eventually get something better than the
original. Is that right?

Ciao, Duncan.

Hello all, thanks for the feedback!

It sounds like you are planning to follow the approach of Joshi, Nelson and
Randall (“Denali: A Goal-directed Superoptimizer”) in that you don’t intend
to exhaustively enumerate all possible code sequences, and see if they are
the same as the original only better; but instead start from the original code
sequence and repeatedly apply transformations that change code to equivalent
code, looking to see if you can eventually get something better than the
original. Is that right?

Exactly. If we use axioms/transformations/identities that don't modify the code's behavior, but just interpret the instruction semantics in a different way so as to find other (and hopefully simpler) instructions that do the same job, then a SAT prover is no longer necessary, since the search space is constrained to expressions that are semantically equivalent.

Example:
Transformation: (ADD Operand_A, Constant 0) <- equivalent to -> Operand_A

Using this transformation, the search algorithm can now simplify "add A with 0" to "A" or, in what may initially not look so useful, turn any operand A into an add of A plus 0. Later, this is important to prove that we can use an add instruction to load any operand. Well, I think you got the idea of using a set of axioms... in practice, this is what lets us prove that an arbitrary sequence A is equivalent to a sequence B and, if B is shorter than A, use B instead.

Speaking of implementation, there will be a big list of transformations and a "for" loop trying each transformation and recursively applying more transformations, until we can show that one code sequence is equivalent to another. This is a short overview of the algorithm and of how I have worked in this area. If you prefer another approach described formally in another paper, I can study it and use a different algorithm.
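
To make that loop concrete, here is a minimal, self-contained sketch. It only illustrates the shape of the search: expressions are plain strings, the two rewrite rules are toy stand-ins for a real axiom list, and "shorter is better" stands in for a real cost model.

// Toy sketch of the transformation-based search: starting from the original
// sequence, apply equivalence-preserving rewrite rules recursively (up to a
// depth limit) and remember the cheapest form seen.
#include <iostream>
#include <set>
#include <string>
#include <vector>

using Rewrite = bool (*)(const std::string &, std::string &);

// Axiom: "add X, 0" <-> "X".
static bool addZero(const std::string &in, std::string &out) {
  if (in.rfind("add ", 0) == 0 && in.size() > 7 &&
      in.compare(in.size() - 3, 3, ", 0") == 0) {
    out = in.substr(4, in.size() - 7);
    return true;
  }
  return false;
}

// Axiom: "mul X, 2" <-> "shl X, 1".
static bool mulTwo(const std::string &in, std::string &out) {
  if (in.rfind("mul ", 0) == 0 && in.size() > 7 &&
      in.compare(in.size() - 3, 3, ", 2") == 0) {
    out = "shl " + in.substr(4, in.size() - 7) + ", 1";
    return true;
  }
  return false;
}

static const std::vector<Rewrite> rules = {addZero, mulTwo};

// Depth-limited exploration of the space of equivalent expressions.
static void search(const std::string &expr, int depth,
                   std::set<std::string> &seen, std::string &best) {
  if (expr.size() < best.size())
    best = expr;                          // toy cost model: shorter is better
  if (depth == 0 || !seen.insert(expr).second)
    return;
  for (Rewrite rule : rules) {
    std::string out;
    if (rule(expr, out))
      search(out, depth - 1, seen, best);
  }
}

int main() {
  std::string original = "add %a, 0";
  std::string best = original;
  std::set<std::string> seen;
  search(original, 4, seen, best);
  std::cout << original << "  =>  " << best << "\n";  // prints: add %a, 0  =>  %a
}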

Note that this system is a simple theorem prover, since it applies several equivalence rules to an expression to try to transform it into the answer. The difference is that we don't have an "answer" to prove: any faster sequence will suffice as the "answer".

Regarding the "shift vs. multiply" performance question, the instructions will have weights. Since we are working at the IR level, we can't assume either one is better. That's why superoptimization is best applied as a peephole optimization in a backend, just prior to code emission - when we know more about the architecture, we can make more appropriate decisions and, with superoptimization, use obscure target machine instructions to compute apparently unrelated expressions. If we use it at the IR level, some suggested optimizations may not necessarily lead to better code after the backend passes (lowering/transformation to the SelectionDAG). That's the downside of using it at the IR level. On the other hand, many other simple substitutions may be learned by the algorithm and improve InstructionSimplify.
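
As a sketch of what such weights might look like (the opcodes and numbers below are invented purely for illustration, not a proposed cost table):

#include <map>
#include <string>
#include <vector>

// Hypothetical IR-level cost model: every opcode gets a weight, and a rewrite
// is accepted only if it lowers the total weight of the sequence, so a chain
// of shifts is never "simplified" into a division.
static const std::map<std::string, unsigned> kWeights = {
    {"add", 1}, {"sub", 1}, {"shl", 1}, {"lshr", 1}, {"mul", 3},
    {"udiv", 10}, {"sdiv", 10}};

unsigned sequenceCost(const std::vector<std::string> &Opcodes) {
  unsigned Cost = 0;
  for (const std::string &Op : Opcodes) {
    auto It = kWeights.find(Op);
    Cost += (It != kWeights.end()) ? It->second : 2;  // default weight
  }
  return Cost;
}

// A candidate replacement B for a sequence A is kept only when
// sequenceCost(B) < sequenceCost(A).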

Thanks,

-Rafael

IMO a superoptimizer would yield fewer benefits for LLVM compared to
other compilers.

If you check the patches to the instcombine pass, you'll find that people
keep backing "correct" optimizations out, not because the optimizations
violate the semantics of LLVM IR, but because they generate wrong code
sequences when lowering to machine code.

An example:

%3 = fcmp olt float %1, %2
%6 = fcmp olt float %4, %5
%7 = and i1 %3, %6
%8 = and i1 %7, %foo

Sometimes you'll be screwed if you try to reassociate %7 and %8. I
don't see an easy way of catching these cases in theorem provers.

Haohui

Hi Haohui,

IMO a superoptimizer would yield fewer benefits for LLVM compared to
other compilers.

If you check the patches to the instcombine pass, you'll find that people
keep backing "correct" optimizations out, not because the optimizations
violate the semantics of LLVM IR, but because they generate wrong code
sequences when lowering to machine code.

it sounds like you have had some bad experiences. In my opinion it is
completely correct to disable instcombine optimizations that expose a
code generator bug, but only until the code generators are fixed. It
should not be an excuse to never fix the code generators. The usual
strategy is to open a bug report about the codegen problem so that it
is not forgotten.

An example:

%3 = fcmp olt float %1, %2
%6 = fcmp olt float %4, %5
%7 = and i1 %3, %6
%8 = and i1 %7, %foo

Sometimes you'll be screwed if you try to reassociate %7 and %8. I
don't see an easy way of catching these cases in theorem provers.

I've already implemented a lot of simplifications found by the existing
superoptimizer, and none of them were reverted, so I think your concern
is exaggerated.

Ciao, Duncan.

Hi Rafael, don't forget to submit your proposal to GSOC (via the GSOC web-page)
- the deadline is today!

    It sounds like you are planning to follow the approach of Joshi, Nelson and
    Randall ("Denali: A Goal-directed Superoptimizer") in that you don't intend
    to exhaustively enumerate all possible code sequences, and see if they are
    the same as the original only better; but instead start from the original code
    sequence and repeatedly apply transformations that change code to equivalent
    code, looking to see if you can eventually get something better than the
    original. Is that right?

Exactly. If we use axioms/transformations/identities that don't modify the
code's behavior, but just interpret the instruction semantics in a different
way so as to find other (and hopefully simpler) instructions that do the same
job, then a SAT prover is no longer necessary, since the search space is
constrained to expressions that are semantically equivalent.

Example:
Transformation: (ADD Operand_A, Constant 0) <- equivalent to -> Operand_A

Using this transformation, the search algorithm can now simplify "add A with 0"
to "A" or, in what may initially not look so useful, turn any operand A into
an add of A plus 0. Later, this is important to prove that we can use an add
instruction to load any operand. Well, I think you got the idea of using a set
of axioms... in practice, this is what lets us prove that an arbitrary sequence
A is equivalent to a sequence B and, if B is shorter than A, use B instead.
Speaking of implementation, there will be a big list of transformations and a
"for" loop trying each transformation and recursively applying more
transformations, until we can show that one code sequence is equivalent to
another. This is a short overview of the algorithm and of how I have worked in
this area. If you prefer another approach described formally in another paper,
I can study it and use a different algorithm.

I am fine with this approach, even though it can miss interesting reductions,
since it is way more efficient (from what I hear) than the fully general
approach, so it should be able to handle code sequences longer than just a
couple of instructions.

Ciao, Duncan.

Ok, I’ve submitted now a revised version considering all these valuable comments about the project. I hope to contribute to the LLVM community as much as the LLVM project contributed to my thesis. Thanks.