TPDE-LLVM: 10-20x Faster LLVM -O0 Back-End

5 years ago, @nikic wrote:

I can’t say a 10% improvement is making LLVM fast again, we would need a 10x improvement for it to deserve that label.

We recently open-sourced TPDE and our fast LLVM baseline back-end (TPDE-LLVM), which is 10-20x faster than the LLVM -O0 back-end with similar runtime performance and 10-30% larger code size. We support a typical subset of LLVM-IR and only target x86-64 and AArch64. Posting this here, as this might be interesting for the LLVM community – questions/comments welcome!

Data on SPEC CPU 2017 (x86-64, compile-time speedup and code size relative to LLVM 19 -O0 back-end):

Benchmark       Comp-Time (O0 IR)  Code Size (O0 IR)  Comp-Time (O1 IR)  Code Size (O1 IR)
600.perl        11.39x             1.27x              15.06x             0.97x
602.gcc         12.54x             1.32x              17.55x             1.01x
605.mcf          9.72x             1.27x              12.47x             0.92x
620.omnetpp     21.46x             1.24x              26.49x             1.03x
623.xalanc      18.99x             1.24x              24.80x             0.98x
625.x264        10.52x             1.26x              15.19x             0.97x
631.deepsjeng    9.60x             1.25x              17.56x             0.97x
641.leela       21.44x             1.24x              18.36x             0.95x
657.xz          10.95x             1.30x              15.15x             0.92x
geomean         13.34x             1.27x              17.58x             0.97x

Results for AArch64 are similar (a bit higher speedups due to GlobalISel). Obviously, on optimized IR LLVM’s optimizing back-ends fare much better in terms of run-time (~2x better) and code size (~2x better), but we don’t intend to compete there.

How Does It Work?

See this paper. In essence, we do three passes: one IR cleanup/preparation pass, one analysis pass (loop+liveness), and one codegen pass, which performs lowering, regalloc, and machine code encoding in combination.
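To make the pipeline concrete, here is a toy sketch of the three-pass structure, with invented data structures (instructions as `(dest, opcode, operands)` tuples) that stand in for TPDE's real IR handling:

```python
# Toy sketch of the three-pass pipeline described above (not TPDE's real code).

def prepare(ir):
    # Pass 1: IR cleanup/preparation (here just dropping no-ops; the real
    # pass e.g. rewrites ConstantExprs into instructions).
    return [ins for ins in ir if ins[1] != "nop"]

def analyze(ir):
    # Pass 2: number instructions and record each value's last use, a
    # stand-in for the real loop + liveness analysis.
    last_use = {}
    for idx, (_dest, _op, operands) in enumerate(ir):
        for o in operands:
            last_use[o] = idx
    return last_use

def codegen(ir, last_use):
    # Pass 3: lowering, regalloc, and encoding fused into one walk;
    # here we just emit text and release values at their last use.
    out = []
    for idx, (dest, op, operands) in enumerate(ir):
        out.append(f"{op} -> {dest}")
        out += [f"free {o}" for o in operands if last_use[o] == idx]
    return out

ir = prepare([("a", "add", ["x", "y"]), ("t", "nop", []), ("c", "mul", ["a", "x"])])
print(codegen(ir, analyze(ir)))
# ['add -> a', 'free y', 'mul -> c', 'free a', 'free x']
```

The key property the sketch tries to convey: after the cheap analysis pass, code generation never revisits an instruction, so the whole back-end runs in a small constant number of linear walks.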

Features & Future Plans

Currently, the goal is to support typical Clang O0/O1 IR, so there are a lot of unsupported features. Flang-generated code often works, but sometimes hits missing floating-point operations. Rust code works in principle, but some popular crates use illegal (= not natively supported) vector types, which are not implemented yet. (Legalizing vector types/operations is very annoying.)

Besides more IR features, our current plans include DWARF support (we already have a prototype) and better register allocation than “spill everything”. If someone provides sufficient motivation, other big items are non-ELF platforms, non-small-PIC code models, and other targets.

Speculatively Answered Questions

How to use TPDE-LLVM?

The LLVM back-end is usable as a library (e.g., for JIT compilation, also usable with ORC JIT), as an llc-like tool, and can be integrated into Clang (this needs a patch; plugins can’t currently provide a custom back-end). Some more details here.

Why not make LLVM faster?

We did: LLVM 18->20 got 18% faster on x86-64. I think another 10-20% might be reasonable to achieve, but I’d expect going beyond that to require deeper changes. Even then, a 10x speedup is unlikely to be achievable.

Which changes to LLVM-IR could allow even faster compilation?

  • No ConstantExpr inside functions. These are hard to handle, so before compilation, we rewrite them to instructions – by iterating over all instruction operands to find such constants (and also walking through constant aggregates).
  • No arbitrarily-sized struct/array values. The only (IMHO) valid reasons for struct/array values inside functions are arguments (e.g. AArch64 HFA/HVA) and multiple return values; there’s no point in loading a [6000 x {i32, i64}]. (We currently have quadratic runtime in the number of elements per value.)
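The ConstantExpr rewrite can be illustrated with a toy model (my own invention – constant expressions here are nested tuples rather than llvm::ConstantExpr, and only one nesting level is shown):

```python
# Hedged toy model of hoisting ConstantExprs out of instruction operands
# into standalone instructions before codegen.

def rewrite_constexprs(func):
    # func: list of (dest, opcode, operands); a tuple operand plays the
    # role of a ConstantExpr like "getelementptr(@g, 8)".
    new_body = []
    for dest, op, operands in func:
        fixed = []
        for o in operands:
            if isinstance(o, tuple):                       # found a ConstantExpr
                tmp = f"%ce{len(new_body)}"
                new_body.append((tmp, o[0], list(o[1:])))  # hoist to instruction
                fixed.append(tmp)
            else:
                fixed.append(o)
        new_body.append((dest, op, fixed))
    return new_body

func = [("%v", "load", [("gep", "@g", 8)])]
print(rewrite_constexprs(func))
# [('%ce0', 'gep', ['@g', 8]), ('%v', 'load', ['%ce0'])]
```

The real pass additionally has to recurse into constant aggregates, which is exactly the per-operand walking cost the bullet point complains about.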

Not directly related to performance, but would make things easier:

  • No direct access to thread-local globals (see e.g. the proposal here). We cannot easily generate a function call at arbitrary places, so we rewrite all accesses to use the intrinsic.
  • No arbitrary bit-width arithmetic. i260 already doesn’t work well, yet Clang occasionally generates such types (for structs with bitfields) – we don’t support multi-word integers other than i128.
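To show why wide types like i260 are painful, here is my own illustration (not TPDE code) of multi-word lowering: a single i260 add turns into ceil(260/64) = 5 limb adds with carry propagation, plus a final truncation:

```python
# Illustration of multi-word integer lowering: an iN add as 64-bit limb adds.

def to_limbs(v, bits, limb=64):
    n = -(-bits // limb)  # ceil(bits / limb)
    return [(v >> (limb * i)) & ((1 << limb) - 1) for i in range(n)]

def wide_add(a, b, bits, limb=64):
    carry, limbs = 0, []
    for x, y in zip(to_limbs(a, bits, limb), to_limbs(b, bits, limb)):
        s = x + y + carry
        limbs.append(s & ((1 << limb) - 1))   # keep the low 64 bits
        carry = s >> limb                      # propagate the carry
    v = sum(l << (limb * i) for i, l in enumerate(limbs))
    return v & ((1 << bits) - 1)               # wrap to N bits, like LLVM's add

print(wide_add((1 << 260) - 1, 1, 260))        # wraps around to 0
```

Every other operation (mul, shifts, comparisons) needs a similar expansion, which is why only the i128 special case is supported.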

Random “Fun” Facts

  • We currently use 4 padding bytes in Instruction to store instruction numbers, as there is no field to store auxiliary data.
  • PHINode::getIncomingValForBlock causes quadratic compile-time for blocks with many predecessors. Thus, for blocks with >1k predecessors (happens for auto-generated code), we sort incoming entries by block pointer and use binary search.
  • llvm::successors is slow, so we collect and cache successors once. (This could be changed: SwitchInst should store blocks together instead of interleaved with ConstantInts. Maybe even make SwitchInst store uint64_ts instead of arbitrary-width integers?)
  • We track the performance of tpde-llc (similar to LLVM c-t-t), but ~90% (77-95%) of the time is spent in bitcode parsing…
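The PHI-lookup workaround from the second bullet can be sketched as follows (toy code with integer block ids standing in for block pointers):

```python
# Sketch of the getIncomingValForBlock workaround: for blocks with many
# predecessors, sort (block, value) pairs once and binary-search per lookup,
# instead of a linear scan that makes the whole PHI quadratic.
import bisect

def build_index(incoming):
    # incoming: list of (block_id, value); sort by block id (the real code
    # sorts by block pointer).
    return sorted(incoming)

def incoming_val_for_block(index, block_id):
    i = bisect.bisect_left(index, (block_id,))   # O(log n) lookup
    if i < len(index) and index[i][0] == block_id:
        return index[i][1]
    raise KeyError(block_id)

index = build_index([(b, f"v{b}") for b in range(2000, -1, -1)])
print(incoming_val_for_block(index, 1234))  # v1234
```

The one-time sort costs O(n log n), which pays off as soon as many incoming values are queried for the same huge PHI.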
16 Likes

I guess you meant “Compile Speed O0/O1 IR” rather than compile time

1 Like

Could you elaborate on this a little further? To me, this sentence implies that GlobalISel is slower than SelectionDAGISel (and thus TPDE achieved a higher speedup), but I thought GlobalISel was generally supposed to be faster than SelectionDAGISel – is that not the case here?

Combining this with the later statement that multi-word integers other than i128 are currently not supported: could you briefly talk about the support status of type legalization in TPDE? And do you think type legalization will become a compile-time bottleneck if it’s supported?

It is. I was referring to GlobalISel -O0 vs. FastISel, where GlobalISel is slower.

We don’t really do legalization. We lower every type to one or more “basic types” (similar to MVT or LLT), so e.g. i128 becomes i128,i128 (two 64-bit parts) and i54 becomes i64. When compiling an instruction, we have to check its type and do ad-hoc legalization as required, e.g. sign/zero-extending the i54 to an i64 before a division.
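The ad-hoc legalization step can be made concrete with a small sketch (my own illustration, not TPDE's codegen): an i54 value lives zero-extended in a 64-bit register, and a signed operation first has to propagate bit 53 upward.

```python
# Sign-extend an iN value stored in a wider register, as done ad hoc before
# e.g. a signed division.

def sext(v, from_bits, to_bits=64):
    sign = 1 << (from_bits - 1)
    mask = (1 << to_bits) - 1
    # Classic branch-free sign extension: flip the sign bit, then subtract it.
    return ((v ^ sign) - sign) & mask

# -1 as an i54, stored zero-extended in a 64-bit register:
raw = (1 << 54) - 1
print(hex(sext(raw, 54)))  # 0xffffffffffffffff, i.e. -1 as i64
```

In hardware terms this is a single `sbfx`/`movsx`-style instruction per operand, emitted only for the few instructions that actually care about the value's signedness.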

The big problem is vector types, and right now we simply don’t support illegal vector types. Going forward, I don’t think supporting element types other than i1/i8/i16/i32/i64/ptr/half/bfloat/float/double is reasonable, which already simplifies things. For these restricted element types, the current plan is to always scalarize illegal types (e.g., <6 x i8> would behave like [6 x i8], except that it supports arithmetic operations and casts) and be deliberately incompatible with LLVM’s undocumented ABI. i1 vectors might need special handling. (Alternatively, rewrite the IR to get rid of such types. But since such types are rare and we already look at all instructions for type lowering, I think even this could be detected cheaply without much cost for the common case.)
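A minimal sketch of what scalarizing an illegal vector means in practice (toy code assuming the plan above; TPDE's actual lowering would emit per-lane machine instructions):

```python
# Scalarized <6 x i8> add: six independent i8 lanes, each wrapping at 8 bits.

def vec_add_i8(a, b):
    assert len(a) == len(b)
    return [(x + y) & 0xFF for x, y in zip(a, b)]

print(vec_add_i8([250, 1, 2, 3, 4, 5], [10, 1, 1, 1, 1, 1]))
# lane 0 wraps: 250 + 10 = 260 -> 4
```

Scalarization trades code size for simplicity: no widening/splitting decisions, no target shuffle lowering, just one scalar op per lane.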

I don’t think other non-obscure types need legalization?

No, I don’t think so. Just a lot of effort for little gain. (I’d rather have front-end not emit illegal types in the first place… unlikely to happen, though.)

Those are some extremely impressive numbers. Do you plan to upstream this as part of LLVM?

Well, this is already some form of type legalization 🙂 Some of the things you described here are not really far away from what LLVM is doing. (P.S. I think it should be “i128 becomes i64,i64 (two 64-bit parts)”.)

I think whether a type is non-obscure or not really depends on the target architecture. For instance, i8/i16/i32/i64 might be considered normal or reasonable for x86, but on AArch64 both i8 and i16 are illegal scalar types; by default, RISC-V has only a single legal integer scalar type (i.e. i32 for rv32 and i64 for rv64). Therefore, the need for type legalization might be more pressing on architectures other than x86. That being said, I agree that type legalization for vectors is hard, and scalarizing seems like a good starting point.

Don’t get me wrong, I think 10x ~ 20x is an impressive speedup. The reason I’m interested in legalization here is that it’s key to correctness – after all, no matter how fast the compiler is, it has to generate correct code.

3 Likes

This is awesome, congrats on the work.

I work at Wasmer. We have been using Cranelift for our development builds of Wasm modules (into assembly) because it was faster to compile. We have been using LLVM in prod because the assembly produced was waaaay faster.

However, with your changes we may be able to switch back from Cranelift to LLVM for development purposes. Having 10x faster builds is a considerable improvement.

Keep up the good work. We are going to trial it and will post our findings here.

1 Like