Hey. As one of the authors of [1], I’m obviously excited to see this. As a TL;DR, we trialled this for various DSP and crypto algorithms on multiple in-order/OOO Arm cores (M55,M85,A55,A72), and saw substantial improvements even compared to prior handwritten assembly. Our tool (GitHub - slothy-optimizer/slothy: Assembly super optimization via constraint solving) uses CP-SAT instead of Z3, and ideally one would keep that flexible.
As I see it, this would be independent of and perhaps even combinable with souper (GitHub - google/souper: A superoptimizer for LLVM IR): The latter does peephole optimization on LLVM IR, while the above is (largely-)instruction-preserving scheduling/allocation/SW-pipelining optimization at the assembly level.
An essential input here is of course a reasonably detailed uArch model, so it would also be interesting to explore if/how this RFC could work with @reidtatge’s [RFC] MDL: A Micro-Architecture Description Language for LLVM. [1] encodes things in a very ad-hoc way so far (e.g. slothy/targets/aarch64/cortex_a55.py at main · slothy-optimizer/slothy · GitHub for A55)
Looking forward to seeing where this goes.