We recently open-sourced TPDE and our fast LLVM baseline back-end (TPDE-LLVM), which is 10-20x faster than the LLVM -O0 back-end with similar runtime performance and 10-30% larger code size. We support a typical subset of LLVM-IR and only target x86-64 and AArch64. Posting this here, as this might be interesting for the LLVM community – questions/comments welcome!
Data on SPEC CPU 2017 (x86-64; compile-time speedup and code size relative to the LLVM 19 -O0 back-end, so higher is better for compile time and lower is better for code size):
| Benchmark | Compile-Time Speedup (O0 IR) | Code Size (O0 IR) | Compile-Time Speedup (O1 IR) | Code Size (O1 IR) |
|---|---|---|---|---|
| 600.perl | 11.39x | 1.27x | 15.06x | 0.97x |
| 602.gcc | 12.54x | 1.32x | 17.55x | 1.01x |
| 605.mcf | 9.72x | 1.27x | 12.47x | 0.92x |
| 620.omnetpp | 21.46x | 1.24x | 26.49x | 1.03x |
| 623.xalanc | 18.99x | 1.24x | 24.80x | 0.98x |
| 625.x264 | 10.52x | 1.26x | 15.19x | 0.97x |
| 631.deepsjeng | 9.60x | 1.25x | 17.56x | 0.97x |
| 641.leela | 21.44x | 1.24x | 18.36x | 0.95x |
| 657.xz | 10.95x | 1.30x | 15.15x | 0.92x |
| geomean | 13.34x | 1.27x | 17.58x | 0.97x |
Results for AArch64 are similar (slightly higher speedups due to GlobalISel). Obviously, on optimized IR, LLVM’s optimizing back-ends fare much better in terms of run-time (~2x) and code size (~2x), but we don’t intend to compete there.
## How Does It Work?
See this paper. In essence, we do three passes: one IR cleanup/preparation pass, one analysis pass (loop + liveness), and one codegen pass, which performs lowering, register allocation, and machine-code encoding in combination.
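For illustration, the per-function driver might look like the minimal sketch below. All helper names are hypothetical placeholders, not TPDE's actual API; see the paper and repository for the real structure.

```cpp
// Hypothetical sketch of the three-pass structure described above.
// The helper functions are placeholders, not TPDE's actual API.
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"

struct AnalysisInfo { /* loop nesting + value liveness */ };

void prepareFunction(llvm::Function &F);                        // pass 1 (placeholder)
AnalysisInfo analyzeFunction(const llvm::Function &F);          // pass 2 (placeholder)
void emitFunction(llvm::Function &F, const AnalysisInfo &Info); // pass 3 (placeholder)

void compileModule(llvm::Module &M) {
  for (llvm::Function &F : M) {
    if (F.isDeclaration())
      continue;
    // Pass 1: IR cleanup/preparation, e.g. rewriting ConstantExprs into
    // plain instructions so codegen never sees nested constant operands.
    prepareFunction(F);
    // Pass 2: analysis -- loop nesting and liveness, which later guide
    // where values live during codegen.
    AnalysisInfo Info = analyzeFunction(F);
    // Pass 3: a single walk over the IR that lowers instructions,
    // allocates registers, and encodes machine code at the same time.
    emitFunction(F, Info);
  }
}
```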
## Features & Future Plans
Currently, the goal is to support typical Clang O0/O1 IR, so there are a lot of unsupported features. Flang code often works, but sometimes hits unsupported floating-point operations. Rust code works in principle, but some popular crates use illegal (= not natively supported) vector types, which are not implemented yet. (Legalizing vector types/operations is very annoying.)
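To illustrate why legalization is tedious: even the simplest strategy, scalarizing an illegal vector operation element by element, takes a fair amount of boilerplate per operation. A minimal sketch using the LLVM C++ API (not TPDE's actual legalization code):

```cpp
// Hypothetical sketch: scalarize a binary operation on a vector type the
// target doesn't support natively, e.g. <3 x i32>. Not TPDE's actual code.
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

llvm::Value *scalarizeBinOp(llvm::BinaryOperator *BO) {
  auto *VT = llvm::cast<llvm::FixedVectorType>(BO->getType());
  llvm::IRBuilder<> B(BO);
  llvm::Value *Result = llvm::PoisonValue::get(VT);
  for (unsigned I = 0, E = VT->getNumElements(); I != E; ++I) {
    // Extract both operand elements, apply the scalar op, re-insert.
    llvm::Value *L = B.CreateExtractElement(BO->getOperand(0), I);
    llvm::Value *R = B.CreateExtractElement(BO->getOperand(1), I);
    llvm::Value *Elt = B.CreateBinOp(BO->getOpcode(), L, R);
    Result = B.CreateInsertElement(Result, Elt, I);
  }
  BO->replaceAllUsesWith(Result);
  BO->eraseFromParent();
  return Result;
}
```

And this only covers element-wise binary ops; shuffles, reductions, and memory accesses of illegal types each need their own lowering.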
Besides some more IR features, our current plans include DWARF support (we already have a prototype) and better register allocation than “spill everything”. If someone provides sufficient motivation, other big items would be non-ELF platforms, non-small-PIC code models, and other targets.
## Speculatively Answered Questions
### How to use TPDE-LLVM?
The LLVM back-end is usable as a library (e.g., for JIT compilation, also usable with ORC JIT), as an llc-like tool, and can be integrated into Clang (needs a patch; plugins can’t provide a custom back-end right now). Some more details here.
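As an illustration of the JIT use case, a custom back-end can be plugged into ORC's LLJIT through its compile-function hook. In the sketch below, `tpde_llvm::compileToObject` is a hypothetical placeholder, not TPDE's actual API; see the repository for the real entry points.

```cpp
// Hypothetical sketch: using a custom back-end as the compiler in an ORC
// LLJIT instance. tpde_llvm::compileToObject is a placeholder.
#include "llvm/ExecutionEngine/Orc/IRCompileLayer.h"
#include "llvm/ExecutionEngine/Orc/LLJIT.h"
#include "llvm/Support/MemoryBuffer.h"

namespace tpde_llvm {
// Placeholder: compile a module to an in-memory object file.
llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>>
compileToObject(llvm::Module &M);
} // namespace tpde_llvm

// Adapter satisfying ORC's IRCompiler interface.
class TPDECompiler : public llvm::orc::IRCompileLayer::IRCompiler {
public:
  // Real code would derive mangling options from the target triple.
  TPDECompiler()
      : IRCompiler(llvm::orc::IRSymbolMapper::ManglingOptions{}) {}

  llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>>
  operator()(llvm::Module &M) override {
    return tpde_llvm::compileToObject(M);
  }
};

llvm::Expected<std::unique_ptr<llvm::orc::LLJIT>> makeJIT() {
  return llvm::orc::LLJITBuilder()
      .setCompileFunctionCreator(
          [](llvm::orc::JITTargetMachineBuilder)
              -> llvm::Expected<
                  std::unique_ptr<llvm::orc::IRCompileLayer::IRCompiler>> {
            return std::make_unique<TPDECompiler>();
          })
      .create();
}
```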
### Why not make LLVM faster?
We did: LLVM 18->20 got 18% faster on x86-64. I think another 10-20% might be reasonably achievable, but I’d expect going beyond that to require deeper changes. Even then, a 10x speedup is unlikely to be achievable.
### Which changes to LLVM-IR could allow even faster compilation?
- No `ConstantExpr` inside functions. These are hard to handle, so before compilation, we rewrite them into instructions – by iterating over all instruction operands to find such constants (and also walking through constant aggregates); see the sketch after this list.
- No arbitrarily-sized struct/array values. The only (IMHO) valid reasons for struct/array values inside functions are arguments (e.g. AArch64 HFA/HVA) and multiple return values, but there’s no point in loading a `[6000 x {i32, i64}]`. (We currently have quadratic runtime in the number of elements per value.)
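A minimal sketch of the `ConstantExpr` rewrite using LLVM's public API, ignoring PHI operands and constant aggregates, which the real pass has to handle:

```cpp
// Minimal sketch: materialize ConstantExpr operands as instructions.
// Assumes only one level of nesting; the real rewrite also walks through
// constant aggregates and handles PHIs by inserting into the incoming block.
#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

void rewriteConstantExprs(llvm::Function &F) {
  for (llvm::BasicBlock &BB : F) {
    for (llvm::Instruction &I : BB) {
      if (llvm::isa<llvm::PHINode>(I))
        continue; // PHIs need the instruction in the incoming block instead.
      for (llvm::Use &U : I.operands()) {
        auto *CE = llvm::dyn_cast<llvm::ConstantExpr>(U.get());
        if (!CE)
          continue;
        // Turn the constant expression into a real instruction right
        // before its user, then redirect the use to it.
        llvm::Instruction *NI = CE->getAsInstruction();
        NI->insertBefore(&I);
        U.set(NI);
      }
    }
  }
}
```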
Not directly related to performance, but would make things easier:
- No direct access to thread-local globals (see e.g. the proposal here). We cannot easily generate a function call at arbitrary places, so we rewrite all accesses to use the `llvm.threadlocal.address` intrinsic; a sketch follows this list.
- No arbitrary bit-width arithmetic. `i260` already doesn’t work well, yet Clang occasionally generates such types (bitfield structs) – we don’t support multi-word integers other than `i128`.
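A minimal sketch of such a TLS rewrite, assuming all uses of the thread-local global are plain instruction operands (uses inside `ConstantExpr`s would have to be rewritten first):

```cpp
// Minimal sketch: route direct TLS accesses through
// @llvm.threadlocal.address. PHI users would need the call emitted in the
// incoming block instead; this sketch ignores that case.
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

void rewriteTLSAccesses(llvm::Module &M) {
  for (llvm::GlobalVariable &GV : M.globals()) {
    if (!GV.isThreadLocal())
      continue;
    // early_inc_range: U.set() unlinks the use from GV's use list, which
    // would otherwise invalidate the iterator.
    for (llvm::Use &U : llvm::make_early_inc_range(GV.uses())) {
      auto *UserInst = llvm::dyn_cast<llvm::Instruction>(U.getUser());
      if (!UserInst)
        continue;
      // Replace the direct global operand with an intrinsic call emitted
      // right before the user, so the back-end only sees the call.
      llvm::IRBuilder<> B(UserInst);
      U.set(B.CreateThreadLocalAddress(&GV));
    }
  }
}
```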
## Random “Fun” Facts
- We currently use 4 padding bytes in `Instruction` to store instruction numbers, as there is no field to store auxiliary data.
- `PHINode::getIncomingValueForBlock` causes quadratic compile-time for blocks with many predecessors. Thus, for blocks with >1k predecessors (happens for auto-generated code), we sort incoming entries by block pointer and use binary search (see the sketch below).
- `llvm::successors` is slow, so we collect and cache successors once. (This could be changed: `SwitchInst` should store blocks together instead of interleaved with `ConstantInt`s. Maybe even make `SwitchInst` store `uint64_t`s instead of arbitrary-width integers?)
- We track the performance of `tpde-llc` (similar to the LLVM compile-time-tracker), but ~90% (77-95%) of the time is spent in bitcode parsing…
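A minimal sketch of the sorted-PHI workaround (hypothetical helper, not TPDE's actual data structure):

```cpp
// Hypothetical sketch: for PHIs in blocks with many predecessors, sort the
// incoming (block, value) pairs once, then look values up by binary search
// instead of doing a linear scan per incoming edge.
#include <algorithm>
#include <utility>
#include <vector>
#include "llvm/IR/Instructions.h"

struct SortedPHIIncoming {
  std::vector<std::pair<llvm::BasicBlock *, llvm::Value *>> Entries;

  explicit SortedPHIIncoming(const llvm::PHINode &PN) {
    Entries.reserve(PN.getNumIncomingValues());
    for (unsigned I = 0, E = PN.getNumIncomingValues(); I != E; ++I)
      Entries.emplace_back(PN.getIncomingBlock(I), PN.getIncomingValue(I));
    // One O(n log n) sort by block pointer replaces an O(n) scan per query.
    std::sort(Entries.begin(), Entries.end());
  }

  llvm::Value *lookup(llvm::BasicBlock *BB) const {
    auto It = std::lower_bound(
        Entries.begin(), Entries.end(), BB,
        [](const auto &Entry, llvm::BasicBlock *B) { return Entry.first < B; });
    return (It != Entries.end() && It->first == BB) ? It->second : nullptr;
  }
};
```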