I have been working on a hobby project outside office hours, and I think it has reached the point of being a good proof of concept. I would like to develop it further, make it more robust, and add optimisations, but I am sharing it now, while it is still relatively small and easy to read and review, to see if people are interested.
This RFC is to gauge interest in including a new experimental tool, “llvm-superasm”, in the llvm-project. It consumes assembly code and produces (super) optimised assembly code. “Super optimisation” is in the name because it produces optimal results using the Z3 SMT solver. This first version and proof of concept focuses on instruction scheduling: the scheduling problem is formulated as a constraint set and solved by Z3, currently for a single basic block.
Next, I would like to implement other optimisations: i) register allocation/renaming, and ii) software pipelining as described in [1]. This is a reimplementation of the ideas in [1], but leveraging everything that the llvm-project has to offer in this area: option parsing, assembly parsing, architecture and instruction descriptions, instruction printers, etc. For instruction scheduling, this proof-of-concept implementation uses the instruction and latency descriptions to calculate a better schedule if one exists. I see two main use cases for such an assembly super optimiser:
- Some people write optimised routines in assembly, e.g. library writers, and this tool could help with that.
- A post-processing tool for assembly produced by other tools, e.g. compilers.
I have named the tool “llvm-superasm” and would propose adding it under the “llvm-project/llvm/tools/llvm-superasm” directory if people find it useful. I am open to a different name and location, of course. I have squashed 10 patches into one big patch so that people can get an impression and overview of the code and the tool; please find it on Phabricator: D152949 [RFC] Assembly Code Super Optimiser. I can upload the individual patches later for easier reviewing. I have developed this tool in my free time and would like to continue to do so, either in-tree if the community is interested in including the tool, or out of tree. I am interested in any opinions on this.
[1] https://eprint.iacr.org/2022/1303.pdf
Please find below some more implementation details of the compilation flow.
Compilation Flow
The tool in its current form is, in essence, a driver for the LLVM libraries responsible for assembly parsing and for retrieving architecture and instruction descriptions, plus the Z3 SMT solver. Z3 is already an optional dependency of the LLVM project, where it can be used by the Clang static analyzer. As input, the tool consumes assembly code, for example:
mul r1, r2, r3
mul r5, r1, r1
add r4, r2, r3
sub r6, r2, #1
A dependence graph is created and, as a debug feature, can be printed in Graphviz format. For this example, there are 4 nodes corresponding to the 4 assembly statements, and 1 edge corresponding to the true dependency between the two MUL instructions:
digraph deps {
N0 [label = " mul r1, r2, r3"];
N1 [label = " mul r5, r1, r1"];
N2 [label = " add.w r4, r2, r3"];
N3 [label = " sub.w r6, r2, #1"];
N0 -> N1 [label = "R1:true"];
}
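To make the graph construction concrete, here is a minimal sketch of how true (read-after-write) dependences can be derived from a linear instruction sequence. The tuple encoding, function name, and register names are illustrative, not the tool's actual data structures, and only true dependences are modelled, as in the example graph above:

```python
def true_dependences(instrs):
    """instrs: list of (dest, [sources]) tuples, in program order.
    Returns edges (producer_index, consumer_index, register)."""
    edges = []
    last_def = {}  # register -> index of its most recent writer
    for i, (dest, srcs) in enumerate(instrs):
        for s in dict.fromkeys(srcs):  # dedupe repeated source operands
            if s in last_def:
                edges.append((last_def[s], i, s))
        last_def[dest] = i
    return edges

# The four-instruction example from above:
program = [
    ("r1", ["r2", "r3"]),  # mul r1, r2, r3
    ("r5", ["r1", "r1"]),  # mul r5, r1, r1
    ("r4", ["r2", "r3"]),  # add r4, r2, r3
    ("r6", ["r2"]),        # sub r6, r2, #1
]

print(true_dependences(program))  # the single N0 -> N1 edge on r1
```

r2 and r3 are not written inside the block, so only the r1 def-use pair produces an edge, matching the single edge in the Graphviz dump.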
If a MUL instruction has a latency of 2 cycles, then the ADD and SUB instructions can be scheduled in between the two dependent MULs to hide the latency and avoid pipeline stalls. The dependence graph, combined with the instruction latencies, is translated into a constraint set that is set up to find a minimum schedule. This is the corresponding constraint system in text format, which serves as input to the Z3 solver:
(declare-const N0 Int)
(assert (> N0 0))
(declare-const N1 Int)
(assert (> N1 0))
(declare-const N2 Int)
(assert (> N2 0))
(declare-const N3 Int)
(assert (> N3 0))
(assert (not (or (= N0 N1) (= N0 N2) (= N0 N3) )))
(assert (not (or (= N1 N2) (= N1 N3) )))
(assert (not (or (= N2 N3) )))
(assert (< (+ N0 2) N1))
(minimize (+ N0 N1 N2 N3 ))
(check-sat)
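For illustration, the constraint text above can be generated mechanically from the node count and the latency-annotated dependence edges. This is a hypothetical sketch (the tool itself builds the constraints through the Z3 C API rather than emitting text):

```python
def emit_constraints(num_nodes, edges):
    """edges: list of (src, dst, latency) dependence triples."""
    lines = []
    # One positive integer position per instruction.
    for n in range(num_nodes):
        lines.append(f"(declare-const N{n} Int)")
        lines.append(f"(assert (> N{n} 0))")
    # All positions must be pairwise distinct.
    for i in range(num_nodes - 1):
        eqs = " ".join(f"(= N{i} N{j})" for j in range(i + 1, num_nodes))
        lines.append(f"(assert (not (or {eqs} )))")
    # A consumer must be scheduled after its producer's latency.
    for src, dst, lat in edges:
        lines.append(f"(assert (< (+ N{src} {lat}) N{dst}))")
    total = " ".join(f"N{n}" for n in range(num_nodes))
    lines.append(f"(minimize (+ {total} ))")
    lines.append("(check-sat)")
    return "\n".join(lines)

# Reproduces the constraint system above: 4 nodes, one edge with latency 2.
print(emit_constraints(4, [(0, 1, 2)]))
```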
As a debug feature, the tool can dump the constraints in the Z3 textual format shown above, but to query the solver it uses Z3's C API. The constraint system above declares 4 variables corresponding to the 4 statements; the value of each variable will be the new position of its instruction in the instruction sequence. The constraints specify that these positions must be greater than 0, must be pairwise distinct, and that N1 must come more than two positions after N0 (N1 > N0 + 2) because of the dependency and the 2-cycle latency. Finally, we ask the solver to minimise the sum of these variables. For this example, the solver finds the following solution:
N0 -> 1
N1 -> 4
N2 -> 3
N3 -> 2
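As a sanity check, this tiny example can also be solved by brute force. The sketch below assumes, for this example, that the positions form a permutation of 1..n, and confirms the solver's assignment is among the optimal ones; this search explodes combinatorially with block size, which is exactly why the tool delegates to Z3:

```python
from itertools import permutations

def optimal_schedules(num_nodes, edges):
    """Enumerate assignments of positions 1..num_nodes and return the
    minimal-sum ones satisfying the latency constraints."""
    best, best_sum = [], None
    for pos in permutations(range(1, num_nodes + 1)):
        # Same constraint as the SMT system: pos[dst] > pos[src] + latency.
        if all(pos[dst] > pos[src] + lat for src, dst, lat in edges):
            s = sum(pos)
            if best_sum is None or s < best_sum:
                best, best_sum = [pos], s
            elif s == best_sum:
                best.append(pos)
    return best

solutions = optimal_schedules(4, [(0, 1, 2)])
print(solutions)  # (1, 4, 3, 2), the solver's assignment, is among them
```

The constraint forces N0 = 1 and N1 = 4; the two independent instructions can fill positions 2 and 3 in either order, so the solver's answer is one of two equally optimal schedules.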
As mentioned, these integer values are the instructions' new positions in the sequence: the first node gets position 1 in the new schedule, the second node gets position 4, and so on. This achieves exactly what we wanted: the two MUL instructions are scheduled apart. Finally, the assignment is translated into a new schedule:
mul r1, r2, r3
sub.w r6, r2, #1
add.w r4, r2, r3
mul r5, r1, r1
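The final rewriting step is then just a sort of the original instructions by their assigned positions; a minimal sketch, using the example's assignment:

```python
instrs = [
    "mul r1, r2, r3",
    "mul r5, r1, r1",
    "add.w r4, r2, r3",
    "sub.w r6, r2, #1",
]
positions = {0: 1, 1: 4, 2: 3, 3: 2}  # node index -> solver-assigned position

# Sort (position, instruction) pairs by position and drop the positions.
schedule = [ins for _, ins in
            sorted((positions[i], ins) for i, ins in enumerate(instrs))]
for line in schedule:
    print(line)
```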
Next steps
There are a few limitations at the moment, for example:
- Only single basic blocks can be handled. This restriction certainly needs to be lifted, also to support software pipelining.
- I have not yet looked at spilling and reloading.
- And probably more that I have not yet encountered.
And to make it more than a proof of concept, it would:
- Need to be made more robust, with more error handling, etc.
- Need a lot more testing.
After that, I would like to work on the real differentiator, which in my opinion is software pipelining. That could make the tool interesting for some out-of-order cores as well, and not only for in-order cores, where scheduling matters most. Software pipelining operates across basic blocks, so implementing it will require adding a CFG data structure.