64-bit source locations have been proposed and discussed a few times in the past (1, 2), but the discussions stalled without a final solution.
I’d like to bring this back with fresh data and a prototype. The motivation: with C++ modules becoming more common, we’re running into SourceLocation exhaustion more frequently. This issue is going to become more pressing as adoption grows.
I built a small prototype to explore the impact of switching to 64-bit source locations and gathered some performance data.
Compile-time and memory usage were measured in the LLVM compile-time tracker:
compile time (instructions executed): full details here
→ the difference looks negligible
memory usage (max-rss): full details here
→ up to 4% increase overall
Peak memory usage for selected files:

| File | 32-bit sloc | 64-bit sloc |
| --- | --- | --- |
| SemaExpr.cpp | 787 MB | 829 MB |
| FindTarget.cpp | 774 MB | 817 MB |
| tramp3d-v4.cpp | 273 MB | 286 MB |
I haven’t measured the impact on PCM file sizes, but I think it’s negligible—source locations are already encoded as 64 bits in PCM.
Alternative: as a middle ground, I also experimented with extending the existing 32-bit SourceLocation space to 4GB by removing the macro bit. This does extend the range but comes at a compile-time cost (in the worst case, isMacroID falls back to a binary search), ~4% slower in the benchmarks.
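To make the trade-off in this alternative concrete, here is a minimal sketch of the two ways to answer "is this a macro location?". It is not the prototype's code: `kMacroBit`, `OffsetRange`, and both functions are hypothetical names, and the sorted side table of macro ranges is an assumption about how a lookup without the flag bit could work.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// With a dedicated flag bit (the top bit of the 32-bit encoding), the check
// is a single mask, but only 31 bits remain for offsets.
constexpr std::uint32_t kMacroBit = 1u << 31;
constexpr bool isMacroWithBit(std::uint32_t raw) {
  return (raw & kMacroBit) != 0;
}

// Without the flag bit all 32 bits are offset, so the answer has to come
// from somewhere else. Assumption: a sorted, non-overlapping table of
// half-open [begin, end) offset ranges that belong to macro expansions.
struct OffsetRange {
  std::uint32_t begin;
  std::uint32_t end;
};

inline bool isMacroWithoutBit(const std::vector<OffsetRange> &macroRanges,
                              std::uint32_t offset) {
  // Find the first range that starts after `offset`; the candidate range,
  // if any, is the one just before it. This is the O(log n) worst case.
  auto it = std::upper_bound(
      macroRanges.begin(), macroRanges.end(), offset,
      [](std::uint32_t value, const OffsetRange &r) { return value < r.begin; });
  if (it == macroRanges.begin())
    return false;
  --it;
  return offset < it->end;
}
```

The first version costs one instruction; the second costs a search on every query that cannot be answered from a cache, which is consistent with the slowdown reported above.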
Given the data above, is the memory overhead acceptable to move forward, even as a default? What other tradeoffs or factors should we consider before proceeding?
Thank you for looking into this again! I agree it’s a topic we’d like to find a solution to, and it’s not an easy problem.
The primary concern boils down to the fact that everyone has to pay the overhead costs, regardless of whether they need that many source locations or not. For example, folks compiling large C code bases are not going to be excited to have measurable compile-time performance regressions in support of a C++ feature that’s only needed by a small percentage of C++ users. It’s an easier pill to swallow if the costs are limited to the folks who need the functionality.
A 4% memory pressure increase is worrying due to template instantiation depth limits. FWIW, I’m already hitting problems with those limits in debug builds on Windows (I started to consistently get File F:\source\llvm-project\clang\test\SemaTemplate\instantiation-depth-default.cpp Line 11: stack nearly exhausted; compilation time may suffer, and crashes due to stack overflow are likely failures with this release). Making that problem worse gets us one step closer to having to seriously consider refactoring the template instantiation engine and that’s a bigger worry (to me) than 64-bit source locations due to the potential for quietly breaking user code.
While this all sounds negative, it’s not me saying “no, we can’t do this” so much as me saying “there are very good reasons we’ve been cautious about this move in the past and I think those reasons still exist today”.
I agree this is a real problem. It seems the work stalled because the community did not agree on switching to 64-bit locations (especially in the first link you mentioned). For example, our infrastructure heavily uses modules and we care about RSS a lot. To me, 64-bit source locations are curing a symptom rather than the underlying problem.
EDIT:
Here is a promising but unexplored idea by Richard:
“One possible approach: set a checkpoint in the SLocEntry space when we begin processing a macro expansion or macro argument expansion. If we get to the end of the expansion, see if we actually created any tokens. If not, roll back to that checkpoint and recover all the SLocEntry space we allocated in the mean time. That actually seems like it would be pretty quick and straightforward to implement, now I come to think of it… hm. Worth a shot? (Of course, we don’t know if that will help at all in the boost case, because we don’t know what pattern of macro expansions they’re using.)”
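As a rough illustration of the checkpoint idea quoted above, here is a toy model. It is not Clang's SourceManager: `ToySLocSpace`, `ToyEntry`, `expandMacro`, and the `producedTokens` flag are all hypothetical, and the sketch assumes nothing else allocates between the checkpoint and the rollback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for one entry in the SLocEntry table.
struct ToyEntry {
  std::uint32_t offset;
  std::uint32_t length;
};

class ToySLocSpace {
public:
  struct Checkpoint {
    std::size_t numEntries;
    std::uint32_t nextOffset;
  };

  // Reserve `length` offsets and record an entry for them.
  std::uint32_t allocate(std::uint32_t length) {
    std::uint32_t start = nextOffset_;
    entries_.push_back({start, length});
    nextOffset_ += length;
    return start;
  }

  Checkpoint checkpoint() const { return {entries_.size(), nextOffset_}; }

  // Drop every entry created since `cp` and give its offsets back.
  void rollback(const Checkpoint &cp) {
    assert(cp.numEntries <= entries_.size() && cp.nextOffset <= nextOffset_);
    entries_.resize(cp.numEntries);
    nextOffset_ = cp.nextOffset;
  }

  std::uint32_t used() const { return nextOffset_; }
  std::size_t numEntries() const { return entries_.size(); }

private:
  std::vector<ToyEntry> entries_;
  std::uint32_t nextOffset_ = 0;
};

// Models one macro expansion: take a checkpoint, allocate space for the
// expansion, and roll back if the expansion turned out to produce no tokens.
inline void expandMacro(ToySLocSpace &space, std::uint32_t expansionLength,
                        bool producedTokens) {
  ToySLocSpace::Checkpoint cp = space.checkpoint();
  space.allocate(expansionLength);
  if (!producedTokens)
    space.rollback(cp);
}
```

In this model an expansion that yields tokens keeps its offsets, while an empty one leaves `used()` unchanged. Whether the real SLocEntry table can be truncated this cheaply is exactly the open question in the quote.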
Another would be to re-model the source locations as a binary tree but that would require a lot more work.
Thanks for working on that.
I agree with you that the SourceLocation issue is a sword of Damocles that modules might aggravate.
First off, I don’t think doubling the number of SourceLocations at a 4% compile time cost is at all worth it.
We would merely delay the inevitable. And compile times matter more than memory pressure in many cases.
I do share Aaron’s concerns about stack overflows, though.
One thing I’d like to see explored is whether we can keep SourceRange at 64 bits, or keep the size to, say, 40 bits in frequently used AST nodes and in functions that often end up in deeply recursive stacks. I realize this is a much more involved effort, but it is probably worth investigating.
I think I agree with Corentin here: this is something we have to do eventually, and folks run into it somewhat frequently. We have to pull off the bandaid at some point.
One thought I had: I wonder if we could find some way to ‘pack’ source locations by making SourceRange ‘better’, or by changing how source locations are stored in the AST (which might require we be better about storing SourceRange). IMO, if Decl had some sort of SourceLocationList rather than every single AST node at every level having a ‘handful of SourceLocation objects’, we might be able to store them more efficiently and alleviate the problems Aaron is concerned about.
Not sure I have a complete idea on how that looks, but perhaps it is something we can do to ‘better’ make this transition.
Out of curiosity, when doing the performance measurements, did you measure performance on a 32-bit build of Clang as well as a 64-bit one? We still support and release 32-bit builds (for example: https://github.com/llvm/llvm-project/releases/download/llvmorg-20.1.5/LLVM-20.1.5-win32.exe). If the performance is significantly worse for 32-bit builds, we should consider whether that’s a blocker for this or whether we’re fine with degraded performance there, or perhaps we want to stop supporting that entirely.
I think increasing peak memory usage by 4% on C++ heavy compilations is a reasonable tradeoff.
I think we can say that 32-bit builds of clang are not performance critical. We support 32-bit builds for compatibility, but most major development platforms are 64-bit. At this point, most remaining 32-bit ISA applications seem to be for embedded hardware, where Clang is usually not performance critical.
I acknowledge that instruction counts don’t measure the performance impact of the additional memory usage, but I think we can intuitively say that AST nodes tend to be pretty cold. Clang builds up a giant block of AST nodes, and then code generates some subset of ODR-used declarations, so I doubt that larger AST nodes are really going to hurt compile times that much, even though I assume that Clang is often blocked waiting for memory/cache.
Given the proposals that folks have put on the table for recovering this 4% of memory usage, I will say that I would look elsewhere first. If we really want to improve peak memory (max RSS), I bet we could squeeze out 4% improvements elsewhere at lower cost. Adding caches to decls and making SourceLocations more context-dependent is a significant increase in complexity. Mainly, just ask, if SourceLocations were 64-bit today, would we consider it worth optimizing them back to 32-bit, or is there other higher leverage work we could do to reduce the number and size of AST data structures?
Together, these two node types contribute ~25 MB, ~60% of the total memory increase.
For OpaqueValueExpr, I think it’s possible to keep the size unchanged, which could bring us 6 MB back.
For DeclRefExpr, optimization here is more challenging. It stores 3 SourceLocations. Unless we redesign how we store SourceLocation/SourceRange, there’s limited opportunity here.
If a maximum 4% increase in memory usage is seen as a hard blocker, I agree with Reid that we’ll need to reclaim that cost elsewhere. Moving to 64-bit source locations does come with a price. While we might not be able to eliminate that cost entirely, if we can offset it in other areas, can we then consider making the transition?
The 4% refers to memory usage, not compile-time overhead. (The alternative approach which removes the macro bit has a 4% compile-time impact, but I don’t think we plan to pursue that option.)
Perhaps we could keep SourceRange at 64 bits by storing only 48 bits (which should be big enough) for the first location and using the remaining 16 bits to encode a delta. This assumes that both locations in a range are typically close together. The trade-off here is increased compute cost for encoding and decoding ranges.
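A minimal sketch of that 48 + 16 bit encoding, to show where the encode/decode cost and the limits come from. `PackedRange` and its members are hypothetical names, not an existing Clang type, and the sketch simply refuses ranges whose end is more than 65535 offsets past the begin; a real design would need a fallback for those.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical packed range: a 48-bit begin offset plus a 16-bit forward
// delta, stored in one 64-bit word.
class PackedRange {
public:
  static constexpr std::uint64_t MaxBegin = (std::uint64_t{1} << 48) - 1;
  static constexpr std::uint64_t MaxDelta = 0xFFFF;

  // Returns std::nullopt when the range cannot be represented.
  static std::optional<PackedRange> encode(std::uint64_t begin,
                                           std::uint64_t end) {
    if (begin > MaxBegin || end < begin || end - begin > MaxDelta)
      return std::nullopt;
    return PackedRange((begin << 16) | (end - begin));
  }

  // For callers that already know the range fits.
  static PackedRange encodeChecked(std::uint64_t begin, std::uint64_t end) {
    std::optional<PackedRange> r = encode(begin, end);
    assert(r && "range does not fit in 48 + 16 bits");
    return *r;
  }

  // Decoding costs a shift for the begin, plus a mask and an add for the end.
  std::uint64_t begin() const { return bits_ >> 16; }
  std::uint64_t end() const { return begin() + (bits_ & MaxDelta); }

private:
  explicit PackedRange(std::uint64_t bits) : bits_(bits) {}
  std::uint64_t bits_;
};

static_assert(sizeof(PackedRange) == sizeof(std::uint64_t),
              "the packed range stays at 64 bits");
```

The interesting design question is how often real ranges exceed the 16-bit delta (for example, a range covering a large function body), since every such range needs the slower fallback path.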
Can you try an experiment out for me? I’m wondering how this impacts max template instantiation depth behavior (that’s usually what is most noticeably sensitive to memory pressure in terms of compiler behavior). Can you try a test to see what depth you’re able to reach without your changes and what depth you’re able to reach with your changes and report back the behavior?
Depending on your hardware, you may have to use -ftemplate-depth= to allow deeper instantiation stacks than the default of 1024, and -ftemplate-backtrace-limit= to make the diagnostic output more bearable.
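For anyone who wants to repeat the experiment, a probe along these lines should be enough. This is only a sketch: the `Depth` template and the `PROBE_DEPTH` macro are made up for illustration and are not taken from the Clang test suite.

```cpp
// Each instantiation of Depth<N> requires Depth<N - 1>, so instantiating
// Depth<PROBE_DEPTH> builds an instantiation stack PROBE_DEPTH levels deep.
template <unsigned N> struct Depth {
  static constexpr unsigned value = Depth<N - 1>::value + 1;
};

// Base case that ends the recursion.
template <> struct Depth<0> {
  static constexpr unsigned value = 0;
};

// Hypothetical knob: override on the command line with -DPROBE_DEPTH=<n>.
#ifndef PROBE_DEPTH
#define PROBE_DEPTH 512
#endif

static_assert(Depth<PROBE_DEPTH>::value == PROBE_DEPTH,
              "forces the full chain of instantiations");
```

Compile it with `-fsyntax-only`, raising `-DPROBE_DEPTH=` together with `-ftemplate-depth=` until the compiler warns about stack exhaustion or crashes, once with a 32-bit-sloc build and once with a 64-bit-sloc build, and compare the largest depth each one reaches.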
If SourceLocation grows, we should move any SourceLocations out of the Stmt union; the point of the union is so data can be packed with the first few bitfield bits used to determine the subclass, but we aren’t getting any useful packing putting a 64-bit value in there. That should eliminate the one SourceLocation.
The second SourceLocation in DeclarationNameLoc isn’t normally used; it’s only relevant for C++ overloaded operators. We can probably use TrailingObjects to avoid paying this cost for most DeclRefExprs; we just never bothered because there wasn’t actually any cost with 32-bit source locations.
Has anyone ever hit source location limits without reparsing headers a bunch of times? I expect people are hitting this with named modules because the GMF encourages parsing headers a bunch of times. I wonder if there’s a solution that addresses this specific issue, as I agree with Richard in (1) that 1GiB of external source ought to be enough for anybody.
We use header modules extensively internally, and we’ve hit the SourceLocation limit multiple times in the past – this remains a real risk.
With header modules, the underlying issue is that when the same textual header is included in multiple modules, it consumes SourceLocation space separately in each module. Any compilation that imports those modules ends up allocating space for that same header multiple times, even for include-guarded headers.
In the worst-case scenario, a guarded header used in n modules will consume n times as much SourceLocation space as it would in a non-modular build.
There are ways to address this, such as deduplicating textual headers on imports at the cost of diagnostic quality.
In contrast, 64-bit SourceLocation appears to be an obvious and straightforward way to mitigate the problem — especially if this transition is expected to happen eventually anyway.
I’ve heard this pattern described as “not modularizing from the bottom up”. My understanding is that Google has put significant effort into modularizing from the bottom up, so these are the residual headers that don’t modularize well, i.e. glibc, linux kernel headers, and user headers that were module-hostile, right? I want to make it clear that significant effort was applied, and this is likely to be a common failure for other modules users.
I’ve heard that other Clang vendors have shipped with 64-bit slocs, and perhaps it would be good to get their input.
I see a dynamic here where we have 3 parties:
Toolchain vendors
Clang maintainers
Users
Modularizing is hard. Users often get it wrong, and often fail to modularize bottom-up, leading to source location exhaustion. They turn to their vendor for support, and the vendor has limited ability to tell the user “you’re doing it wrong, modularize your headers differently to waste fewer source locations”, so vendors tend to enable the 64-bit source location escape hatch. Clang maintainers have expert knowledge and are willing to refactor the codebases they care about to reduce wasteful textual inclusions, and would rather maintain the status quo and use less memory.
I’ve seen enough source location exhaustion reports now to say that we need to make some technical change to mitigate the problem. Building C++ is complicated, and it’s something that non-experts have to do all the time. We should be designing this system for non-experts. If we can find other ways to conserve source locations in the presence of messy modules builds, great, otherwise, we should reconsider our implementation limits.
This feels accurate. My only concern is that letting a header belong to multiple modules can have a large performance impact, and cause pretty hard-to-debug issues in other ways. But it’s still true that it’s really useful to be able to just let the compiler do type merging instead of needing to modularize perfectly. Given that we don’t ban non-modular includes, we should do a better job supporting them.
I do still think that we should encourage users to not have non-modular includes due to the perf and semantics issues, but I’m not against improving things here.
How should we be nudging users towards better builds? The compiler contract is to effectively be silent upon success, so there’s no user feedback for messy dependency graphs. I think the biggest success story in this space is -ftime-trace, with ClangBuildAnalyzer for aggregation. I wonder if we should consider upstreaming that tool so it can be bundled into the clang distribution, possibly as a busyboxed utility inside clang, or a standalone Python script.
At Google, we are preparing to sample, collect, and aggregate -ftime-report results across the codebase; see @ayzhao’s PRs to emit the pass timings as JSON. Are there some performance insights into the module/header graph that Clang could surface as statistics that we could feed into some aggregated analysis tool, ideally something in tree, and then document as a supported tool for build optimization? Something like a sorted list of re-includes, saying “header foo.h was included X times from Y modules using Z tokens”, to guide the priority of modularization.
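To sketch what such a report could compute: the snippet below aggregates per-inclusion records into the "X times from Y modules using Z tokens" shape and sorts by total tokens. Everything here is hypothetical, including the `Inclusion` record; Clang does not emit this data today as far as I know, so the input format is an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// One observed textual inclusion of a header while building a module.
struct Inclusion {
  std::string header;
  std::string module;
  std::uint64_t tokens;
};

// Aggregated cost of one header across the whole build.
struct HeaderCost {
  std::string header;
  std::uint64_t inclusions = 0; // X: how many times it was included
  std::uint64_t modules = 0;    // Y: how many distinct modules included it
  std::uint64_t tokens = 0;     // Z: total tokens spent on it
};

inline std::vector<HeaderCost> rankReincludes(const std::vector<Inclusion> &log) {
  struct Acc {
    std::uint64_t inclusions = 0;
    std::uint64_t tokens = 0;
    std::set<std::string> modules;
  };
  std::map<std::string, Acc> byHeader;
  for (const Inclusion &inc : log) {
    Acc &a = byHeader[inc.header];
    ++a.inclusions;
    a.tokens += inc.tokens;
    a.modules.insert(inc.module);
  }

  std::vector<HeaderCost> out;
  for (const auto &[header, a] : byHeader)
    out.push_back({header, a.inclusions,
                   static_cast<std::uint64_t>(a.modules.size()), a.tokens});

  // Most expensive headers first, so they can be modularized first.
  std::sort(out.begin(), out.end(),
            [](const HeaderCost &l, const HeaderCost &r) {
              return l.tokens > r.tokens;
            });
  return out;
}
```

Sorting by total tokens is just one choice; ranking by consumed SourceLocation space would speak more directly to the exhaustion problem discussed in this thread.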