[RFC] Clang SourceLocation overflow

Hmm, but will this actually be better than making SourceLocations 64 bits wide? Assuming we use the side table unconditionally, and assuming that most SourceLocations are stored in exactly one place (not true with templates, but often the case otherwise), this means that each time we store a SourceLocation, we’ll use up 96 bits of storage (32 bits for the SourceLocation and 64 bits for the table entry). That would be 50% worse than the simpler approach of making SourceLocations always be 64 bits wide. If we could use regular 32-bit source locations for smaller values, and then switch to the side table once we get (say) halfway through them, maybe that would be acceptable, but it would mean that we take a constant-factor memory usage (and, due to loss of locality, probably performance) hit for all translation units larger than some threshold. Also, we’d still need full-width source locations for the spelling and expansion locations of the macro expansion SLocEntrys, so those are going to grow significantly larger – and if we have a large number of macro expansions (such as in the boost case), that’ll be a big cost.
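
Just to make that arithmetic concrete, here is a rough sketch of the per-stored-location cost of the two schemes; the struct names are invented for illustration and are not a proposed layout.

#include <cstdint>

// Scheme A: keep SourceLocation at 32 bits, but give every stored location
// a 64-bit side-table entry as well.
struct StoredLoc32 { uint32_t Raw; };            // 4 bytes in place
struct SideTableEntry { uint64_t FullOffset; };  // 8 bytes in the table
static_assert(sizeof(StoredLoc32) + sizeof(SideTableEntry) == 12,
              "32 + 64 bits = 96 bits per stored location");

// Scheme B: simply make SourceLocation 64 bits wide.
struct StoredLoc64 { uint64_t Raw; };            // 8 bytes in place
static_assert(sizeof(StoredLoc64) == 8, "64 bits per stored location");

// 96 / 64 = 1.5, i.e. the unconditional side table costs about 50% more
// storage per stored location than just widening SourceLocation.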

Hmm… yes, that is true. I would expect the SLocEntrys to just be the larger 64-bit version (and the macro expansions would use those instead), so those are equal to the 64-bit size. Spending a bit (the ½ thing above) to switch between the two representations I think makes this a size regression for anything >1GB, since the extra tag bit halves the directly-addressable range (I think we currently handle a 2GB TU). We’d end up taking a size hit vs. just using 64-bit source-locs somewhere between 1GB and 2GB, but I haven’t done the math yet.

I don’t think we’re at the point where we can answer this question yet – I don’t think we sufficiently understand the problem. From what I’ve seen so far, it still seems likely to me that the problem is that we’re allocating too much of the SourceLocation address space for macro expansions that don’t need locations to be allocated for them, either due to an outright bug or a pathological case in the boost code.

I’m not sure how we would omit those; from my look at the SourceManager it seems that all of those would be necessary (though perhaps I don’t have a sufficiently good idea of how that works).

I think it would be worth looking at a SourceManager dump from around the crash and analyzing what we’re spending the source location space on – is this a death by a million cuts situation, or are there some surprisingly large allocations in there? If the former, are there patterns? Are we generating duplicate locations (e.g., repeatedly expanding the same macros with the same arguments in a way we can reuse)? Are there lots of SLocEntries that are unreferenced by file locations (e.g., macro expansions that ended up producing no tokens)? There may be some smart way we can conditionally disable updateLocForMacroArgTokens (https://github.com/llvm/llvm-project/blob/master/clang/lib/Lex/TokenLexer.cpp#L1054) or make it lazy, so we don’t assign full location information to preprocessor tokens that are formed during macro expansion but don’t result in any real tokens – or, if there isn’t a smart way to do that, maybe we can give boost a way of disabling location tracking for some of its macros.
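
For what it’s worth, the kind of accounting I have in mind could look roughly like the sketch below: walk the local SLocEntry table and split the consumed offset space into file entries vs. expansion entries. I’m writing the SourceManager accessor names from memory, so treat them as approximate rather than authoritative.

#include "clang/Basic/SourceManager.h"
#include "llvm/Support/raw_ostream.h"
#include <cstdint>

// Rough sketch: how much of the local SourceLocation offset space is spent
// on file entries vs. macro expansion entries, and how many of the latter
// are macro-argument expansions.
static void accountSLocSpace(const clang::SourceManager &SM) {
  uint64_t FileSpace = 0, ExpansionSpace = 0, MacroArgExpansions = 0;
  unsigned NumEntries = SM.local_sloc_entry_size();
  for (unsigned I = 0; I != NumEntries; ++I) {
    const clang::SrcMgr::SLocEntry &E = SM.getLocalSLocEntry(I);
    // The offset space owned by entry I runs up to the next entry's offset
    // (or up to the next unallocated offset, for the last entry).
    uint64_t End = (I + 1 != NumEntries)
                       ? SM.getLocalSLocEntry(I + 1).getOffset()
                       : SM.getNextLocalOffset();
    uint64_t Size = End - E.getOffset();
    if (E.isExpansion()) {
      ExpansionSpace += Size;
      if (E.getExpansion().isMacroArgExpansion())
        ++MacroArgExpansions;
    } else {
      FileSpace += Size;
    }
  }
  llvm::errs() << "file offset space: " << FileSpace
               << ", expansion offset space: " << ExpansionSpace
               << ", macro-arg expansions: " << MacroArgExpansions << "\n";
}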

I should be getting a reproducer that doesn’t have a lot of other stuff happening in the next day or so, so I’ll share that dump when I can.

I think it would also be worth pursuing the 64-bit source locations option (making SourceLocation larger, with no side table) as a configure-time flag, and fixing up all the places where we make the 32-bit assumption. If nothing else, that’ll let us measure the memory and compile time overhead of doubling our source location storage. I expect that the numbers we get back from that will make that option an immediate non-starter, but perhaps not – in particular, if we could store a SourceLocation as a (FileID, offset) pair, that would save us from doing binary searches to locate the FileID for a given SourceLocation, which we do sometimes spend a non-trivial amount of our time doing, and it’d be worth measuring whether we can improve compile time at the cost of memory usage by doing that.
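
To illustrate the (FileID, offset) idea, here is a minimal sketch of such an encoding (all of the names below are invented for illustration, not a proposed patch): the FileID index lives in the high half of a 64-bit word, so recovering it is a shift rather than a binary search over SLocEntry offsets.

#include <cstdint>

// Hypothetical 64-bit source location: high 32 bits index the FileID /
// SLocEntry, low 32 bits are the offset within that entry.
struct WideSourceLocation {
  uint64_t Raw;

  static WideSourceLocation get(uint32_t FileIDIndex, uint32_t Offset) {
    return {(uint64_t(FileIDIndex) << 32) | Offset};
  }
  // No binary search needed: the owning FileID is just the high half.
  uint32_t fileIDIndex() const { return uint32_t(Raw >> 32); }
  uint32_t offsetInFile() const { return uint32_t(Raw); }
};
static_assert(sizeof(WideSourceLocation) == 8,
              "still a single 64-bit word per stored location");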

That sounds like a really good starting point. Configure-time I think is the right place to do this. FileID + offset would be nice (as you say, that search looks really expensive), though it would likely still be 64 bits (32 for the FileID + 32 for the offset).

I’d also be interested in seeing whether we can add a flag to turn off tracking macro expansion locations entirely (so we’d only ever track spelling locations). That would result in some QoI loss, but it might also allow cases such as this boost example to work without too much effort from us.

It’d be interesting to see how much maintenance burden we think such a flag would add.

Hmm… I will have to think on that.

As for how we’d omit those: I don’t have any particular mechanism in mind. Theoretically, we could imagine (recursively) garbage-collecting SLocEntrys that are not referenced by any token, but actually doing it that way doesn’t seem feasible.

One possible approach: set a checkpoint in the SLocEntry space when we begin processing a macro expansion or macro argument expansion. If we get to the end of the expansion, see if we actually created any tokens. If not, roll back to that checkpoint and recover all the SLocEntry space we allocated in the meantime. That actually seems like it would be pretty quick and straightforward to implement, now I come to think of it… hm. Worth a shot? (Of course, we don’t know if that will help at all in the boost case, because we don’t know what pattern of macro expansions they’re using.)
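
To sketch the shape of that, using a toy table in place of the real SourceManager state (none of this is actual Clang API):

#include <cstdint>
#include <vector>

// Toy stand-in for the SLocEntry table, just to show the checkpoint shape.
struct ToySLocTable {
  struct Entry { uint64_t Offset; };
  std::vector<Entry> Entries;
  uint64_t NextOffset = 0;

  struct Checkpoint { size_t NumEntries; uint64_t NextOffset; };

  // Remember where we are before starting a macro (argument) expansion.
  Checkpoint mark() const { return {Entries.size(), NextOffset}; }

  // Allocate offset space for one expansion entry.
  void allocate(uint64_t Size) {
    Entries.push_back({NextOffset});
    NextOffset += Size;
  }

  // If the expansion produced no tokens (so nothing can reference these
  // entries), give the space back so it can be reused.
  void rollBackTo(const Checkpoint &C) {
    Entries.resize(C.NumEntries);
    NextOffset = C.NextOffset;
  }
};

// Usage pattern during an expansion:
//   auto CP = Table.mark();
//   ... expand, allocating entries as usual ...
//   if (producedNoTokens) Table.rollBackTo(CP);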

What also might make a dent in paring down ultimately-unnecessary expansions from macro logic, assuming that is indeed the problem, is something like this:

Whenever a macro argument is only used in a concatenation that expands to another macro name, then upon successful concatenation, it and all the expansions that resulted in it should be deallocated/removed from SLocEntry space/etc.

E.g. consider a simplified “if” macro:

#define PP_IF(cond, A, B) PP_IF_##cond(A, B)
#define PP_IF_0(A, B) A
#define PP_IF_1(A, B) B

Now first note it’s possible we might call PP_IF(foo, A, B), which results in an identifier PP_IF_foo that must be interpreted beyond the preprocessor, so we probably want to keep expansion info for the “cond” arg in that case.

But whenever we call PP_IF(COMPLICATED_TEST(a,b,c), A, B), as soon as we verify that COMPLICATED_TEST(a,b,c) expands to a 1/0, and thus that the concatenation results in PP_IF_1/PP_IF_0, which is a macro name, we can be sure we no longer need any of the expansion buffers resulting from COMPLICATED_TEST(a,b,c); we will never need to refer to a SourceLocation in them. And in the Boost examples there can be quite a lot of expansions in there.
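
To make that concrete, here is a tiny worked expansion. COMPLICATED_TEST is a stand-in, and I’ve added the extra forwarding macro that real preprocessor libraries use so that the argument is expanded before the ## paste (the simplified PP_IF above would paste the unexpanded argument); the forwarding names are invented for illustration.

#define COMPLICATED_TEST(a, b, c) 1   /* stand-in for a deep Boost-style test */
#define PP_IF_FWD(cond, A, B) PP_IF_##cond(A, B)
#define PP_IF_EXP(cond, A, B) PP_IF_FWD(cond, A, B)   /* expands cond before the paste */

PP_IF_EXP(COMPLICATED_TEST(x, y, z), foo, bar)
/* 1. COMPLICATED_TEST(x, y, z) expands to 1 while PP_IF_EXP's arguments are expanded */
/* 2. PP_IF_FWD(1, foo, bar) pastes PP_IF_##1 into PP_IF_1, which is a macro name */
/* 3. PP_IF_1(foo, bar) expands to bar */
/* Only "bar" survives into the token stream; every buffer created while expanding */
/* COMPLICATED_TEST(x, y, z) is now unreachable from any token. */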

I’m not sure how to implement this, or whether it would only turn the concatenation of buffers into Swiss cheese that would have to be expensively rearranged for it to help us at all. But since all preprocessor logic is performed by concatenating expanded arguments to form other macro names, it might be worth considering a solution along these lines.

I've posted an RFC patch which implements this approach, including some benchmark results: https://reviews.llvm.org/D97204

Could you please have a look?