FWIW, I’ve put a bunch of time into thinking about at least how to build better frontend benchmarks for Carbon, and it seems reasonably applicable to Clang as well. We even have at least one rudimentary benchmark that covers both Clang and Carbon. All of this is open source under the LLVM license, and if there is interest and a good way, happy to contribute it or find ways to share it. =]
This is a very different benchmarking approach from the others mentioned so far. The goal is not to create a reasonable proxy for what real compile times will be – that, I think, is well covered by the techniques mentioned here. Instead, the goal is to create interesting but extremely consistent measurements of important code paths through the frontend.
Why: I moved away from proxies because my experience is that it is very hard to get good measurements from them. Realistic compiles tend to be fairly noisy and to change over time. To get really good data, you want to be able to do many runs and get very accurate timings. However, compiling the same proxy over and over again either requires the proxy to be so large that it takes too long for constant use, or results in misleading timings as significant amounts of work done on the first compilation provide optimizations (from branch prediction to cache populating) to all subsequent compilations. That doesn’t mean we shouldn’t have proxies to use as a baseline periodically, but I don’t think it’s easy to use these to get coverage or precision for doing targeted improvements.
The technique I came up with is to create a code generator that synthesizes “interesting” patterns of code in a way that is randomly permuted but completely consistent. So the total number of characters in identifiers is always the same, and the histogram of lengths is the same, but the actual identifiers and the order in which identifiers of a given length are used vary on each generation. Similarly for the number of declarations, the code constructs, etc.
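To make the idea concrete, here is a minimal sketch of the “permuted but consistent” property: the length histogram (and therefore total identifier bytes) is identical on every run, but the actual names and their ordering depend on the seed. The histogram values and naming scheme here are made-up placeholders, not the actual Carbon generator.

```python
import random
import string

# Fixed histogram: identifier length -> how many identifiers of that length.
# These counts never change, so every generated file does the same "work".
LENGTH_HISTOGRAM = {4: 10, 8: 25, 12: 5}

def generate_identifiers(seed):
    rng = random.Random(seed)
    # Expand the histogram into a flat list of lengths, then shuffle so the
    # *order* in which lengths appear varies per seed.
    lengths = [n for n, count in LENGTH_HISTOGRAM.items() for _ in range(count)]
    rng.shuffle(lengths)
    # Synthesize a fresh random name for each length slot.
    return [
        rng.choice(string.ascii_lowercase)
        + "".join(rng.choice(string.ascii_lowercase + string.digits)
                  for _ in range(n - 1))
        for n in lengths
    ]

ids_a = generate_identifiers(seed=1)
ids_b = generate_identifiers(seed=2)
# Same total bytes and same length histogram, but different names/order.
assert sum(map(len, ids_a)) == sum(map(len, ids_b))
assert sorted(map(len, ids_a)) == sorted(map(len, ids_b))
```

The same trick applies to declaration counts, nesting depths, and so on: fix the distribution, permute the instance.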
The key is to synthesize code that will do the same amount of work, but in an unpredictable order to accurately measure the cold-execution time of the compiler, as that’s what matters for improving performance in practice. And you have to be really serious about what counts as “work” – one of the fastest parts of Clang is skipping comments, but changing the # of bytes in the input by having comment strings 2x longer has a larger effect on my measured compile times than most other changes. So every byte counts here from my experience. This is compounded by wanting the structure of the code to roughly match what clang-format or a human would produce, but you can’t afford to have a non-deterministic number or position of newlines.
Once you have this synthesis system, you can use more traditional microbenchmark techniques (linked above, built on top of google/benchmark) to actually build reasonably stable measurements. My experience is then that you need to do many runs and aggregate them, as there are too many process-specific variations that will otherwise drown any data in noise. I wrote a custom benchmark runner to do this and compute good statistical measures: carbon-lang/scripts/bench_runner.py at trunk in carbon-language/carbon-lang.
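The aggregation step can be sketched roughly like this: summarize many per-run samples with robust statistics (median plus median absolute deviation) so a few noisy outlier runs don’t swamp the signal. This is an illustrative assumption about the approach, not the actual bench_runner.py interface.

```python
import statistics

def summarize(samples):
    """Robustly summarize noisy timing samples.

    Returns (median, median absolute deviation). Unlike mean/stddev, these
    are barely perturbed by a few runs that hit a slow machine state.
    """
    med = statistics.median(samples)
    mad = statistics.median(abs(s - med) for s in samples)
    return med, mad

# One run hit a bad scheduling window; the summary is unaffected by it.
samples = [10.2, 10.1, 10.3, 25.0, 10.2]
med, mad = summarize(samples)
```

A mean over those samples would report ~13.2 and a huge variance; the median/MAD pair stays anchored at the typical run.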
The source generation in Carbon is specifically designed to support multi-language synthesis, and I want to expand it to cover more interesting patterns. Right now it only generates one specific pattern that I care a lot about: large sequences of declarations of classes with lots of methods and a decent number of member variables, but no definitions. Basically, “boring” but super large header files. Even this has found lots of nice optimizations.
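For a rough idea of what that one pattern looks like, here is a toy generator of such a “boring but super large header”: many class declarations, each with lots of method declarations and some member variables, and no definitions. All names and parameters here are placeholders (and it omits the identifier permutation described above for brevity).

```python
def make_header(num_classes, methods_per_class, fields_per_class):
    """Emit a C++-style header of declaration-only classes."""
    lines = []
    for c in range(num_classes):
        lines.append(f"class C{c} {{")
        lines.append(" public:")
        for m in range(methods_per_class):
            # Declarations only -- no bodies, so the frontend does name and
            # type work but no definition processing.
            lines.append(f"  int method{m}(int arg0, int arg1);")
        lines.append(" private:")
        for f in range(fields_per_class):
            lines.append(f"  int field{f}_;")
        lines.append("};")
        lines.append("")
    return "\n".join(lines)

header = make_header(num_classes=100, methods_per_class=20, fields_per_class=5)
```

In the real generator, each identifier would come from the permuted-but-consistent pool so repeated generations do identical work in an unpredictable order.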
One thing I’ve been trying to expand it to is benchmarking the complete compiler invocation by running the compiler in a subprocess. For small files, it turns out that the compile time of C++ (and Carbon!) is completely dominated by process startup. This is how I found some of the dynamically initialized string tables in Clang a while back. There is probably more to find here.
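A minimal sketch of that subprocess-level measurement, assuming a wall-clock timer around a full compiler invocation (the command and flags below are placeholders, not the actual harness):

```python
import subprocess
import time

def time_invocation(cmd, runs):
    """Time a complete subprocess invocation, including process startup.

    For tiny inputs this is dominated by exec/startup cost, which is
    exactly what this measurement is meant to expose. Returns the minimum
    over the runs as the least-noisy sample.
    """
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        times.append(time.perf_counter() - start)
    return min(times)

# e.g. time_invocation(["clang++", "-fsyntax-only", "tiny.cpp"], runs=50)
```

Comparing this end-to-end number against the in-process frontend time for the same tiny input makes the startup overhead (dynamic initializers, loader work, etc.) directly visible.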
Anyways, happy to answer questions here, and would love to have help expanding source generation to cover more patterns. This approach is really powerful, but building the source generation is also really difficult – much harder than the benchmark itself, it turns out. But the rewards look promising.