PartialExecuter: proving code unreachable with an Interpreter-based approach

Hi all!
This is Carlo, working on a fork of LLVM, Cheerp, that compiles C++ to JavaScript and WebAssembly.
Working on reducing code size, we developed PartialExecuter, which I believe opens up a new class of Interpreter-based optimizations on LLVM’s IR.

The starting challenge was printf: say that at a given point the symbol is internalized and there are no indirect uses. Then you can enumerate all call sites, check whether the format strings are compile-time constants, and keep only the logic relevant to the set of formats actually used.
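A tiny illustration of the idea:

```cpp
#include <cstdio>

int main() {
    int x = 42;
    // Both format strings are compile-time constants, so across all call
    // sites only "%d" and plain-text handling are ever reachable: e.g. the
    // floating-point formatting logic inside printf is never executed and
    // could be stripped from the final binary.
    printf("value: %d\n", x);
    printf("done\n");
    return 0;
}
```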

While this is clearly possible by hand, it has been formalized and generalized at the compiler level, using more or less these components:

  • LLVM’s ExecutionEngine / Interpreter to evaluate instructions as GenericValues, handle call stacks, etc.
  • An ad-hoc visit of an SCC graph to decide where control flow should go in case of non-decidable branches
  • A set of visited edges, updated during the visit, so that BasicBlock-to-BasicBlock edges that were never visited can be removed at the end of the pass (sketched below)
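A rough sketch of that last component, assuming an edge set keyed on (predecessor, successor) pairs; the names here are mine, not the actual PartialExecuter API:

```cpp
#include <utility>
#include "llvm/ADT/DenseSet.h"
#include "llvm/IR/BasicBlock.h"

// One CFG edge, from a terminator's block to one of its successors.
using Edge = std::pair<llvm::BasicBlock *, llvm::BasicBlock *>;

struct EdgeVisitTracker {
  llvm::DenseSet<Edge> Visited;

  void markVisited(llvm::BasicBlock *From, llvm::BasicBlock *To) {
    Visited.insert({From, To});
  }

  // After exploration, any From->To edge absent from this set was never
  // taken on any emulated path, so it can be removed from the CFG.
  bool wasVisited(llvm::BasicBlock *From, llvm::BasicBlock *To) const {
    return Visited.count({From, To});
  }
};
```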

I believe this approach is a somewhat original contribution, and I would look forward to feedback / discussion on whether this could make sense as part of the LLVM framework.

This optimization is fully generic, but having tested it only as part of a specific pipeline, I expect some of the logic might need adapting to handle generic IR.

Code is available here: cheerp-compiler/PartialExecuter.cpp at master · leaningtech/cheerp-compiler · GitHub, while an article I wrote is here: PartialExecuter: Reducing WebAssembly size by exploring all executions in LLVM | by Carlo Piovesan | leaningtech | Mar, 2022 | Medium

Reachability analysis is standard in compilers and verification tools. LLVM has an algorithm for that as well.
The differences between algorithms are the abstraction for the values/memory, the sensitivity of the analysis (e.g., flow sensitive, path-sensitive, …), exploration order, etc. This impacts speed and precision.

Could you clarify how your analysis works? Have you compared with LLVM’s algorithm? Which benchmarks did you run?

Thanks!

Hi @nlopes, thanks for the questions.
I believe I’m missing some of the terminology; I could benefit from some pointers (even just the relevant passes that might overlap).

PartialExecuter’s analysis collects all call sites of a given function (possibly adding a virtual one for external or indirectly-used functions).
Then, for each call site, it ‘emulates’ all possible execution paths, computing actual values when all operands are known / this is possible, and skipping Instructions otherwise.
The tricky part is encountering terminators that branch on non-computed values.
There the idea is to explore an augmented SCC graph, keeping track, for every component, of its incoming edges (a condensed sketch follows this list):

  • No incoming edges → no visit
  • Single incoming edge → visit starting from destination, with phis known
  • Multiple incoming edges but with the same destination → visit starting from the destination, with phis unknown
  • Multiple incoming edges with multiple destinations → the whole SCC is marked as visited
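A condensed sketch of that case analysis (hypothetical names, simplified; Edge is a (predecessor, successor) pair of blocks):

```cpp
#include <utility>
#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/BasicBlock.h"

using Edge = std::pair<llvm::BasicBlock *, llvm::BasicBlock *>;

enum class SCCVisit {
  Skip,             // no incoming edges -> no visit
  VisitKnownPhis,   // single incoming edge -> phis have known values
  VisitUnknownPhis, // many edges, one destination -> phis non-computable
  MarkAllVisited    // multiple destinations -> mark the whole SCC visited
};

SCCVisit classifySCC(llvm::ArrayRef<Edge> Incoming) {
  if (Incoming.empty())
    return SCCVisit::Skip;
  if (Incoming.size() == 1)
    return SCCVisit::VisitKnownPhis;
  llvm::BasicBlock *Dest = Incoming.front().second;
  for (const Edge &E : Incoming)
    if (E.second != Dest)
      return SCCVisit::MarkAllVisited;
  return SCCVisit::VisitUnknownPhis;
}
```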

There are two classes of test cases that this covers but that other LLVM optimizations currently don’t:

  • cases where constant propagation can’t be performed, but there are invariants to be propagated
    e.g. a function isNumberPrime(n), where we know that (for example) inputs are either 17, 24 or 120, can be greatly simplified (basically to n == 17), while there is no way to constant-propagate the concept of “the input will be one of the values of a given set” (see the sketch after this list)
  • cases like printf, where the function uses a format string that happens to be known. Without an interpreter (plus handling of loops / call stacks) I don’t believe there are generic solutions to say “double handling is not part of any format string, so we can skip the relevant code”
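To make the first case concrete, a made-up isNumberPrime along those lines:

```cpp
// Generic primality test by trial division.
bool isNumberPrime(unsigned n) {
  if (n < 2)
    return false;
  for (unsigned d = 2; d * d <= n; ++d)
    if (n % d == 0)
      return false;
  return true;
}

// If these are the only call sites, only 17, 24 and 120 ever flow into n:
bool a() { return isNumberPrime(17); }  // true
bool b() { return isNumberPrime(24); }  // false
bool c() { return isNumberPrime(120); } // false

// Only a handful of paths through the loop are ever executed, so the
// function can effectively be reduced to:
bool isNumberPrimeReduced(unsigned n) { return n == 17; }
```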

In both cases PartialExecuter does this indirectly, by keeping track of the executed code paths.

PartialExecuter has been developed in the context of WebAssembly, where basically all libraries become internalized and have to be part of the resulting file; there, for example, the difference between including a generic printf (able to handle arbitrary format strings) and a stripped-down printf can easily be 50% of the total size of the function.
I ran the LLVM / clang tests that are compatible with Cheerp (I’d say about 50% of the total), and there enabling the optimization works (except for a test that gets completely optimized out…). Not sure what the relevant benchmarks would be for this; do you have any pointers on this?

I see, so you execute the functions with concrete values. Since they are all internal, you are able to see all the call sites.
This kind of analysis is potentially very expensive, so it’s important to have a lot of throttling.

Maybe have a look at this overview of the inter-procedural optimizations implemented in LLVM: https://llvm.org/devmtg/2020-09/slides/A_Deep_Dive_into_Interprocedural_Optimization.pdf

It sounds like you are propagating the set of values that can flow into each function argument. The rest of the machinery for propagating these through instructions is mostly there. If you place !range information, LLVM should be able to use it and do the optimizations for you.
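For instance, a minimal sketch of attaching !range metadata with LLVM’s MDBuilder (the function and the concrete range are just for illustration, assuming the analysis proved the result lies in [17, 121)):

```cpp
#include "llvm/ADT/APInt.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/MDBuilder.h"

// Mark an integer-producing instruction (e.g. a load or call) as only
// yielding values in the half-open range [17, 121).
void annotateRange(llvm::Instruction *I) {
  llvm::MDBuilder MDB(I->getContext());
  I->setMetadata(llvm::LLVMContext::MD_range,
                 MDB.createRange(llvm::APInt(32, 17), llvm::APInt(32, 121)));
}
```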

Let me reiterate: the hard part is to write an analysis that is quick enough and optimizes what you want.
You can try LLVM’s test suite (test-suite Guide — LLVM 15.0.0git documentation). Or just compile any big application like a browser.

The thing I somehow struggle to explain, and where I might need pointers on how to frame it, is the central insight:

  • a family of execution paths is built
  • for each actual execution path there is a map between Instructions and the values computed so far at the current point of the path
  • when executing a path, computable instructions are interpreted and their actual value added to the map, while other instructions are skipped
  • if a path meets an unconditional branch → it takes it
  • if a path meets a conditional branch that depends on a computed value → it takes it
  • if a path meets a conditional branch/switch that depends on a non-computed value → it creates N paths

When the procedure terminates, the set of visited BBs is the union of the BBs visited by any path.
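In pseudo-C++, the exploration loop might look like this (a sketch only; tryInterpret and markEdgeVisited are hypothetical placeholders, and SwitchInst and non-branch terminators are elided):

```cpp
#include <deque>
#include <map>
#include <optional>
#include "llvm/ExecutionEngine/GenericValue.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

struct Path {
  BasicBlock *BB;                              // current position
  std::map<const Value *, GenericValue> Known; // values computed on this path
};

// Hypothetical helpers: interpret I if all its operands are known on P, and
// record a taken CFG edge for the final never-visited-edge removal.
std::optional<GenericValue> tryInterpret(const Instruction &I, const Path &P);
void markEdgeVisited(BasicBlock *From, BasicBlock *To);

void explore(Path Start) {
  std::deque<Path> Worklist;
  Worklist.push_back(std::move(Start));
  while (!Worklist.empty()) {
    Path P = std::move(Worklist.front());
    Worklist.pop_front();
    for (Instruction &I : *P.BB) {
      if (auto *Br = dyn_cast<BranchInst>(&I)) {
        auto It = Br->isConditional() ? P.Known.find(Br->getCondition())
                                      : P.Known.end();
        if (Br->isUnconditional() || It != P.Known.end()) {
          // Unconditional, or the condition was computed -> take one side.
          BasicBlock *Next =
              Br->isUnconditional()
                  ? Br->getSuccessor(0)
                  : Br->getSuccessor(It->second.IntVal == 0 ? 1 : 0);
          markEdgeVisited(P.BB, Next);
          P.BB = Next;
          Worklist.push_back(std::move(P));
        } else {
          // Non-computed condition -> fork one path per successor.
          for (unsigned S = 0; S != Br->getNumSuccessors(); ++S) {
            Path Fork = P;
            Fork.BB = Br->getSuccessor(S);
            markEdgeVisited(P.BB, Fork.BB);
            Worklist.push_back(std::move(Fork));
          }
        }
        break; // done with this block
      }
      if (auto V = tryInterpret(I, P))
        P.Known[&I] = *V; // computable -> remember its value
      // otherwise the instruction is skipped
    }
  }
}
```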

To avoid combinatorial explosion, the main insight is:

  • merging paths thanks to SCC decomposition; in particular, while processing an SCC (see the sketch after this list):
    • all entering paths pass through the same BB → set the relevant PHINodes (possibly to “non-computable”), merge the paths, and continue with a single path starting at that BB
    • multiple headers are actually visited → all BBs in the SCC are marked as fully visited, and all outgoing edges are assigned to a new path
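A sketch of the first (single-header) case, reusing the hypothetical Path from above and simplified to integer values:

```cpp
#include <map>
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ExecutionEngine/GenericValue.h"
#include "llvm/IR/BasicBlock.h"

using namespace llvm;

struct Path {
  BasicBlock *BB;
  std::map<const Value *, GenericValue> Known;
};

// Merge all paths entering an SCC through the same header block. A value
// survives only if every incoming path computed it and they all agree;
// anything else becomes "non-computable" (here simply dropped).
Path mergeAtHeader(ArrayRef<Path> Incoming, BasicBlock *Header) {
  Path Merged{Header, Incoming.front().Known};
  for (const Path &P : Incoming.drop_front()) {
    for (auto It = Merged.Known.begin(); It != Merged.Known.end();) {
      auto Other = P.Known.find(It->first);
      if (Other == P.Known.end() || Other->second.IntVal != It->second.IntVal)
        It = Merged.Known.erase(It); // disagreement -> non-computable
      else
        ++It;
    }
  }
  return Merged;
}
```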

There are a few additional insights needed to handle loops and to speed this up further, but this should be the gist. Current running times are somewhat higher than those of other optimizations, taking say 1-5% of the total time spent in opt on bigger codebases, but more testing and benchmarking is needed.