[RFC] Assembly Super Optimiser

I have been working on a hobby project outside office hours. I think I have reached the point of having a good proof-of-concept. I would like to develop it further, make it more robust and add optimisations, but am sharing it at this point to see if people are interested while it is still relatively small and easy to read/review.

This RFC is to see if there is any interest in including a new experimental tool, “llvm-superasm”, in the llvm-project. It consumes assembly code and produces (super) optimised assembly code. Super optimisation is in the name because it produces optimal results using a solver. This first proof-of-concept version focuses on instruction scheduling: the instruction scheduling problem is formulated as a constraint set and solved by the Z3 SMT solver, currently for a single basic block.

Next, I would like to add and implement other optimisations: i) register allocation/renaming, and ii) software pipelining as described in [1]. This is a reimplementation of the ideas in [1], but leveraging everything that the llvm-project has to offer in this area: option parsing, assembly parsing, architecture and instruction descriptions, instruction printers, etc. For instruction scheduling, this proof-of-concept implementation uses the instruction and latency descriptions to calculate a better schedule if there is one. I see two main use-cases for such an assembly super optimiser tool:

  • Some people write optimised routines in assembly, e.g. library writers, and this tool could help with that.
  • A post-processing tool for assembly produced by other tools, e.g. compilers.

I have named the tool “llvm-superasm” and would propose to add it to the “llvm-project/llvm/tools/llvm-superasm” directory if people find this tool useful. I am open to a different name and location of course. I have squashed 10 patches into one big patch so that people can get an impression and overview of the code and the tool; please find it here on Phabricator: D152949 [RFC] Assembly Code Super Optimiser. I can upload the separate patches for easier reviewing later. I have developed this tool in my free time and would like to continue to do so. I could continue doing this in-tree if the community is interested in inclusion of this tool, or I could keep doing this out of tree. I am interested in any opinion on this.

[1] https://eprint.iacr.org/2022/1303.pdf

Please find below some more implementation details of the compilation flow.

Compilation Flow

The tool in its current form is in essence a driver for certain LLVM libraries that are responsible for assembly parsing and for retrieving architecture and instruction descriptions, plus the Z3 SMT solver. Z3 is already an optional dependency of the LLVM project, as it can be used by the Clang static analyzer. As input, the tool consumes assembly code, for example:

mul r1, r2, r3
mul r5, r1, r1
add r4, r2, r3
sub r6, r2, #1

A dependence graph is created and can be printed in Graphviz format as a debug feature. For this example, we have 4 nodes corresponding to the 4 assembly instructions, and 1 edge corresponding to the true dependency between the two MUL instructions:

digraph deps {
  N0 [label = "   mul     r1, r2, r3"];
  N1 [label = "   mul     r5, r1, r1"];
  N2 [label = "   add.w   r4, r2, r3"];
  N3 [label = "   sub.w   r6, r2, #1"];
  N0 -> N1 [label = "R1:true"];
}
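
To give an impression of how such edges could be computed, below is a minimal sketch, not the actual code from the patch, that derives true-dependence edges from the parsed MCInsts by matching register defs against later register uses. The buildTrueDeps helper and its edge representation are hypothetical, and implicit operands, memory dependences, and anti/output dependences are ignored for brevity.

// Minimal sketch (not the patch's implementation): derive true-dependence
// edges, such as the N0 -> N1 edge above, by matching the register defs of
// earlier instructions against the register uses of later ones.
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/MC/MCInst.h"
#include "llvm/MC/MCInstrInfo.h"

using namespace llvm;

// Edges are (producer index, consumer index) pairs, e.g. (0, 1) for N0 -> N1.
static void buildTrueDeps(ArrayRef<MCInst> Insts, const MCInstrInfo &MCII,
                          SmallVectorImpl<std::pair<unsigned, unsigned>> &Edges) {
  for (unsigned I = 0, E = Insts.size(); I != E; ++I) {
    // Collect the registers explicitly defined by instruction I.
    const MCInstrDesc &DescI = MCII.get(Insts[I].getOpcode());
    SmallVector<unsigned, 2> Defs;
    for (unsigned Op = 0; Op != DescI.getNumDefs(); ++Op)
      if (Insts[I].getOperand(Op).isReg())
        Defs.push_back(Insts[I].getOperand(Op).getReg());

    // Any later instruction that reads one of those registers has a true
    // (read-after-write) dependence on instruction I.
    for (unsigned J = I + 1; J != E; ++J) {
      const MCInstrDesc &DescJ = MCII.get(Insts[J].getOpcode());
      for (unsigned Op = DescJ.getNumDefs(), NumOps = Insts[J].getNumOperands();
           Op != NumOps; ++Op) {
        const MCOperand &MO = Insts[J].getOperand(Op);
        if (MO.isReg() && is_contained(Defs, (unsigned)MO.getReg())) {
          Edges.push_back({I, J}); // e.g. N0 -> N1, labelled "R1:true" above
          break;
        }
      }
    }
  }
}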

If a MUL instruction has a latency of 2 cycles, then the ADD and SUB instructions can be scheduled in between the two dependent MULs to hide the latency and avoid pipeline stalls. The dependence graph, combined with these instruction latencies, is translated into a constraint set, which is set up to find a minimum schedule. This is the corresponding constraint system in text format that serves as input to the Z3 solver:

(declare-const N0 Int)
(assert (> N0 0))
(declare-const N1 Int)
(assert (> N1 0))
(declare-const N2 Int)
(assert (> N2 0))
(declare-const N3 Int)
(assert (> N3 0))
(assert (not (or (= N0 N1) (= N0 N2) (= N0 N3) )))
(assert (not (or (= N1 N2) (= N1 N3) )))
(assert (not (or (= N2 N3) )))
(assert (< (+ N0 2) N1))
(minimize (+ N0 N1 N2 N3 ))
(check-sat)

As a debug feature, the tool allows dumping the constraints in the Z3 textual format shown above, but to query the Z3 solver the tool uses its C API. The constraint system above specifies that there are 4 variables corresponding to the 4 statements. The value of each variable will correspond to the new position of the instruction in the instruction sequence. The constraints specify that these positions must be greater than 0, must all be different from each other, and that the position of N1 must be greater than N0 plus the MUL latency of 2 because of the dependency. Finally, we ask the solver to minimise the sum of these variables. The solver finds a solution by assigning integer values to the node variables. For this example, it finds the following solution:

N0 -> 1
N1 -> 4
N2 -> 3
N3 -> 2

As mentioned, these integer values correspond to the instructions' new positions in the instruction sequence. So, the first node gets position 1 in the new schedule, the second node gets position 4, etc. What we thus achieve is that the two MUL instructions are scheduled apart. Finally, this assignment is translated into a new schedule:

mul     r1, r2, r3
sub.w   r6, r2, #1
add.w   r4, r2, r3
mul     r5, r1, r1
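
For completeness, this is roughly what building and solving that constraint system through the Z3 C API could look like. It is a self-contained sketch rather than the implementation from the patch: the node count and the MUL latency of 2 are hard-coded to match the example, and the pairwise not-equal constraints are expressed with Z3_mk_distinct, which is equivalent to the "not (or (= ...))" form shown above.

// Sketch only (not the patch's implementation): build and solve the example
// constraint system above via the Z3 C API.
#include <z3.h>
#include <cstdio>

int main() {
  Z3_config Cfg = Z3_mk_config();
  Z3_context Ctx = Z3_mk_context(Cfg);
  Z3_del_config(Cfg);
  // 'minimize' objectives require the optimize context, not a plain solver.
  Z3_optimize Opt = Z3_mk_optimize(Ctx);
  Z3_optimize_inc_ref(Ctx, Opt);

  Z3_sort IntSort = Z3_mk_int_sort(Ctx);
  const unsigned NumNodes = 4;
  Z3_ast N[NumNodes];
  Z3_ast Zero = Z3_mk_int(Ctx, 0, IntSort);

  // One integer variable per instruction; each position must be > 0.
  for (unsigned I = 0; I != NumNodes; ++I) {
    char Name[8];
    std::snprintf(Name, sizeof(Name), "N%u", I);
    N[I] = Z3_mk_const(Ctx, Z3_mk_string_symbol(Ctx, Name), IntSort);
    Z3_optimize_assert(Ctx, Opt, Z3_mk_gt(Ctx, N[I], Zero));
  }

  // All positions must be distinct.
  Z3_optimize_assert(Ctx, Opt, Z3_mk_distinct(Ctx, NumNodes, N));

  // Dependence N0 -> N1 with a MUL latency of 2: N0 + 2 < N1.
  Z3_ast AddArgs[2] = {N[0], Z3_mk_int(Ctx, 2, IntSort)};
  Z3_optimize_assert(Ctx, Opt,
                     Z3_mk_lt(Ctx, Z3_mk_add(Ctx, 2, AddArgs), N[1]));

  // Objective: minimise the sum of all positions.
  Z3_optimize_minimize(Ctx, Opt, Z3_mk_add(Ctx, NumNodes, N));

  if (Z3_optimize_check(Ctx, Opt, 0, nullptr) == Z3_L_TRUE) {
    Z3_model M = Z3_optimize_get_model(Ctx, Opt);
    Z3_model_inc_ref(Ctx, M);
    for (unsigned I = 0; I != NumNodes; ++I) {
      Z3_ast Val;
      int Pos = 0;
      if (Z3_model_eval(Ctx, M, N[I], true, &Val))
        Z3_get_numeral_int(Ctx, Val, &Pos);
      std::printf("N%u -> %d\n", I, Pos); // e.g. N0 -> 1, N1 -> 4, ...
    }
    Z3_model_dec_ref(Ctx, M);
  }

  Z3_optimize_dec_ref(Ctx, Opt);
  Z3_del_context(Ctx);
  return 0;
}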

Next steps

There are a few limitations at the moment, for example:

  • Only single basic blocks can be handled at the moment. That restriction certainly needs to be lifted, not least to support software pipelining.
  • I have not yet looked at spilling and reloading.
  • And probably some more that I have not yet encountered.

And to make it more than a proof-of-concept, it would:

  • Need to be made more robust, with more error handling, etc.,
  • And be tested a lot more.

After that, I would like to work on the real differentiator, which in my opinion is software pipelining; that could potentially make this also interesting for some out-of-order cores (and not only the in-order cores for which scheduling matters most). Implementing software pipelining requires a CFG, so a CFG data structure would need to be added.

Hey. As one of the authors of [1], I’m obviously excited to see this. As a TL;DR, we trialled this for various DSP and crypto algorithms on multiple in-order/OOO Arm cores (M55, M85, A55, A72), and saw substantial improvements even compared to prior handwritten assembly. Our tool, SLOTHY (GitHub: slothy-optimizer/slothy, assembly super optimization via constraint solving), uses CP-SAT instead of Z3, and ideally one would keep that flexible.

As I see it, this would be independent of and perhaps even combinable with souper (GitHub: google/souper, a superoptimizer for LLVM IR): souper does peephole optimization on LLVM IR, while the proposal here is (largely) instruction-preserving scheduling/allocation/SW-pipelining optimization at the assembly level.

An essential input here is of course a reasonably detailed uArch model, so it would also be interesting to explore if/how this RFC could work with @reidtatge’s [RFC] MDL: A Micro-Architecture Description Language for LLVM. [1] encodes things in a very ad-hoc way so far (e.g. slothy/targets/aarch64/cortex_a55.py for the A55).

Looking forward to seeing where this goes.

Do you have any plans to make it into a legit MIR pass?

Fair point, I can certainly see that these techniques could be applied to MIR too, but I didn’t have plans to implement them there. If this were part of the compiler, then I would probably look at it as some sort of new register allocator/scheduler, and people tend to only use the default one, so I think it would have some adoption problems. That’s why I see value in a separate tool, which also catches the use case of people who write optimised assembly.

Yes, exactly, thanks for clarifying that.

I am not familiar with that work, but I think llvm-superasm should use whatever is available in LLVM. So if that work helps, or could give better results, then that’s interesting and needs looking into.

Yes, and all this information is already in LLVM, so leveraging that is key in my opinion. Modern ISAs are really big; even Armv8-M, which is more focused on micro-controllers, is really big, as you know. So we really don’t want to be duplicating all of that information in a separate tool outside the llvm-project. That’s why I think such a tool should be part of the llvm-project.

Indeed, it would be suboptimal to run the superoptimizer as a MIR pass and then have the register allocator undo the gains later on.

If that’s the case, maybe you can leverage LLVM’s MCA library? MCA also uses the instruction scheduling data available in LLVM but goes one step further to find potential processor hazards w/ coarse-grained simulation.

Interesting work! The MDL infrastructure ought to be generally orthogonal to this work, since it’s currently just a plug-in replacement for Schedules and Itineraries. However, it does handle a much larger class of micro-architectures, so that might be interesting to this project.

Thanks, that’s interesting. This sounds relevant for software pipelining, which is something I would like to do but haven’t looked into. I will make a note of this and leave a TODO.

Thanks for clarifying!

Yes, any (micro)architecture with high scheduling sensitivity and/or register pressure would likely greatly benefit here. It was pretty interesting to see how much one can get out of MVE/Helium (the motivating example for the work), despite only 8 vector registers and pretty tricky scheduling constraints.

This is interesting. We wanted to do something similar for our library work: we wanted to use LLVM’s dead code elimination pass for the assembler to optimise our handwritten asm.

Interesting! I have a very basic question. I feel like the idea lives in a similar area to BOLT, which can also optimize the output of other compilers. How do you think about this?

No problem, thanks for reading and commenting. The way I look at it, both BOLT and this are “post-processing tools”. BOLT works with profile data and operates on binaries. It disassembles functions and reconstructs the control flow graph (CFG) before doing any optimisations. So there are some similarities with this, but there are also big differences:

  • Assembly writers probably would like a tool that works on assembly, not on a binary.
  • Unleashing a SAT solver on a full binary is probably going to be a time-consuming exercise, so it would be better to feed it selected assembly snippets.
  • The optimisations BOLT and this tool perform are quite different, so I am not sure that integrating this into BOLT would be a good fit.

But I see why you brought this up: it would be interesting to see if some infrastructure could be shared. The one thing that I can think of is the CFG. Not sure if this is possible, or applicable, but I am going to take a look.

/me shrug

When I write assembler, I’m asking myself:

  1. is there a more concise way to write this, such as by using fewer or smaller instructions?
  2. do I need to be using so many registers? Are the values still live, or are they dead, which might allow reusing the registers now?

Scheduling never crosses my mind; does it even matter for OOO cores? :-/

Having tooling help with 1 and 2 would be nice, so perhaps a framework where someone could start working on such features would be a good starting point.

Probably not in general, but for some combinations of OOO core + workload it certainly does. E.g. in the paper we got a 20% improvement for some post-quantum crypto (Neon) workload over the best prior implementation handwritten for the same uArch (Cortex-A72). See Section 3.10 in https://eprint.iacr.org/2022/1303.pdf for some thoughts on this.

I should have pointed this out (better) in the RFC: big OOO cores would definitely not be my primary target for this, but proper software pipelining should not be completely dismissed, especially since there are different flavours of OOO, as Hanno pointed out. And the market for in-order cores, for which this is indeed more profitable, is not really small: DSPs, micro-controllers, in-order CPUs, even GPU cores.

Fair points. Point 1 won’t be addressed. Point 2, however, is something I would like to work on next, so I would like to (partially) address that.

Not really, and running code scheduled for a 1-wide in-order machine on a great big OoO machine is often worse than running completely unscheduled code.

Would anyone have some advice on what next steps could be? Are there any policies around inclusion of a new “sub-tool”? Could e.g. the incubator process help here?

Given the size of the tool, I would expect an LGTM should suffice. Clang-IR is an incubator in a separate git repository. I would assume that llvm-cm also just needs an LGTM.

cc @clattner

The patch on Phabricator showed the prototype with all squashed commits. I have now also uploaded all patches to my llvm-project fork on GitHub:

https://github.com/sjoerdmeijer/llvm-superasm/tree/main/llvm/tools/llvm-superasm

where all the individual commits can be seen:

Comparing llvm:main...sjoerdmeijer:main · llvm/llvm-project · GitHub