Macro performance: Lexer and SourceManager

Continuing a thread started on phabricator: https://reviews.llvm.org/D20401, summoning @nickdesaulniers @hokein.

Summary so far:

  • SourceManager optimizes to produce compact SourceLocations
  • The tradeoff is getFileID() is relatively expensive, and needed for most nontrivial operations on SourceLocation
  • Example: lexing macro args (see linked review thread) ends up being a hotspot for compiling linux kernel (macro heavy). Clang is significantly slower than GCC.
  • We found some simplifications & minor speedups of that lexing code.
  • Improving efficiency of getFileID() itself would be great
  • It mostly scans/searches through offsets in SLocEntrys, storing these contiguously would make searching faster. (@hokein looking at this)
  • The search is opportunistically bounded, @hokein found extra bounds to use in ⚙ D135132 [SourceManager] Improve getFileIDLocal.
1 Like

Adding additional context:

Compile times are important; if you want developers to use Clang (which helps their code bases support Clang as a compiler), having it be faster than the competition is important to winning developer mindshare. We’ve had multiple complaints about this from Linux kernel developers.

Since 2020, Clang has been slower than the competition when compiling the Linux kernel by 7% to 28%.

Here’s a fresh profile of a Linux kernel build with clang.


(Instructions for reproducing these measurements)

As you can see, the entirety of kernel build times are dominated by Clang. Not the linker, not other host utilities invoked during the build, not LLVM optimization passes. Just Clang (the front end).

We’ve made tiny progress recently putting a dent in these with ⚙ D131532 [Sema] Avoid isNullPointerConstant invocation which shaved off ~2.1% of cycles, but there’s a lot more we can be doing.

(Independently of algorithmic optimizations, I’m playing with staticly linking clang against musl+llvm-libc, then building it with LTO, PGO, and BOLT. I’m also looking into how we can measure+automate something like iwyu for the kernel).

https://reviews.llvm.org/D20401 introduced a major performance regression for kernel builds, and I wasn’t the only one to pinpoint that commit as being problematic.

There’s obviously more than just getFileIDLocal that could use some attention, so posting profiles in case others are interested in “Make Clang Fast Again” (MCFA).

3 Likes

SourceManager has a one-element cache for getFileID(): if the answer is the same as last time, return faster.
There seems to be room for improvement here:

The hit-test against the cache doesn’t seem trivially cheap
It calls getSLocEntry() and does some math that depends on local vs loaded. Caching (FileID, SourceLocation Begin, SourceLocation End) and just comparing SourceLocation numerically might be better?

A bigger or 2-level cache seems likely to help
The one-element one looks like it was low-hanging fruit. From compiling a random linux source file (drivers/gpu/drm/drm_property.c):

getFileID
  calls=861952
  hits=449530
  misses=412422
LRU age of result (in getFileID cache misses)
  -1=80397 i.e. first call for this File ID
  0=449530 i.e. cache hits
  1=101481 ==> cache size 2 prevents +25% misses
  2=32883  ==> +8%
  3=38852 ==> +9%
  4=23962 ==> +6%
  5=19073 ==> +5%
  6=24234 ==> +6%
  7=16805 ==> +4%
  8=7979 ==> +2%
  9=10941 ==> +3%
  10=10563 ==> +3%
  (rest are <2000)

So an 8-element cache would avoid most of the getFileIDSlow() calls.

Looking at a less macro-heavy file (clang/lib/Sema/SemaExpr.cpp), the cache hit rate is higher (~70% instead of ~50%) but the fractions of cache misses that would be covered by a larger-but-still-small cache are similar.

Playing with the caching strategy is a double edged sword. If you optimize for large codebases like the Linux kernel, you might harm smaller codebases that have most macros defined in the same file.

On a miss of the one entry cache, SourceManager::getFileIDLocal will do 8 linear probes before starting a binary search. From my summer intern’s measurements, doing 1 linear probe resulted in optimal hit rates for the linear probe, but didn’t make a measurable difference in compile times. (For Linux kernel builds). Once we start binary searches it’s game over for performance.

Point being, it’s good to brainstorm hypotheses then code them up. But ultimately “data wins arguments.” We must also be careful that data from code base be applicable for others, lest we over-optimize for the Linux kernel. (I mean, I wouldn’t mind, but I do have to rebuild LLVM once in a while too.)

1 Like

If the SourceLocation compression is so costly in CPU, I wonder if we are making the wrong speed/memory tradeoff here. Take a look at the SourceLocation overflow thread, which has been a recurring issue for certain users since 2019:

With 64 bits, we could use a much simpler scheme. The downside is potentially greatly increased memory usage, which must be measured and weighed.

Thank you for looking into improving compile time performance, I strongly encourage the work in this area!

I agree that compile times are very important; I’ve heard complaints from people other than Linux kernel developers about our compile time performance as well.

It’s also worth noting that the compile time performance questions also matter greatly for newer versions of C++. During the switch from C++14 to C++17 as the default language mode for Clang, we found that there was a large performance difference in some of the STL headers due to the amount of extra code in those headers in the later standard mode (⚙ D131465 C++/ObjC++: switch to gnu++17 as the default standard).

We have a patch series for supporting 64-bit SourceLocations as an option, because a significant number of our customers need it. (There’s a widely used automotive code framework which re-#includes the same large file enough times to overflow 32 bits, and it’s not really feasible to ask everyone to “just” stop using an industry-standard thing.)

It half-landed a couple of years ago, but the other half stalled in code review. (As I remember, the immediately hard part was the patch that had to worry about integer overflow at every site where a 32-bit offset was added to an existing SourceLocation – many call sites negate the offset while it’s still 32-bit, and then it’s too late to avoid silently introducing an error of 2^32. There’s also a question about the stable libclang ABI, which if I remember rightly we dealt with for the moment by not changing it, and simply having it return failure if it has to communicate a too-big SourceLocation.)

If people want to try to measure the memory-usage overhead of 64-bit SourceLocations then perhaps I should try to rebase those patches back to current main, and get them working again?

The remaining patches are:

1 Like

Good point. I believe we’re actually making the right tradeoff in general though. Usually we store zillions of SourceLocations in the AST during parsing and only decode a few of them during analysis. The design makes encoding (usually) cheap, and storing cheap, and decoding expensive.

The issue with macro-heavy linux code is that we’re having to decode SourceLocations as part of parsing. Maybe we can avoid using the encoded SourceLocation representation only inside the lexer?

There are a few other places where might call getFileID() a lot, e.g. algorithms like isBeforeInTranslationUnit, and in tools that aim to map locations for whole files at a time. But these are relatively rare cases, especially within the compiler proper.


SourceLocation overflow … With 64 bits, we could use a much simpler scheme

Indeed. Though if the RAM cost is too high to push this to everyone (as I expect) then having the current complex scheme coexist with a simple one sounds nightmarish.

If the goal is to get an estimate of %overhead, then applying them against an old revision would also be fine and presumably less work.

(I don’t have a strong opinion about whether the option should live in-tree, but if it turned out to be cheap enough to be default that’s a different story!)


Definitely. I think this lexer/sourcemanager stuff is unlikely to be very interesting to typical C++ though, it shows up on a Linux profile because there’s no templates to instantiate :slight_smile:

⚙ D135440 [SourceManager] Speedup getFileIDLocal with a separate Offset Table. proposes burning a bit of RAM to make getFileID() ~30% faster.

(It builds a redundant side-table to search over more compactly. Eventually this could become the source of truth and we’d get the RAM back. But that would take some more work)