[RFC] Refactoring DILocation to use compact function-local storage

tl;dr: by changing source locations (DILocations) from being global MDNode objects to being efficiently-stored function-local data, memory usage in debug info builds can be cut significantly; this comes with considerable disruption to downstream forks and users of LLVM’s API, and so needs careful consideration.

Background

In LLVM’s debug info metadata, source locations of instructions are represented with the DILocation class, a subclass of MDNode, which contains the fields (Scope, InlinedAt, Line, Column, IsImplicitCode). Although it is necessary for debug info, DILocation ends up comprising a significant portion of LLVM’s memory consumption depending on the input. The exact percentage varies greatly depending on the source and build configuration, anywhere from <0.1% to 10% at peak memory usage (and often significantly higher during optimization passes). The reason for the heavy memory consumption can roughly be summarized as follows:

  • DILocation uses 16-24 bytes to store source location data, and 24 bytes to store generic MDNode data (sketched after this list); the generic MDNode data is useful in some contexts, such as parsing, but is wasted throughout most of compilation.
  • DILocations are far more numerous than other MDNodes; in the cases I looked through they generally comprised anywhere from 30-80% of all MDNodes, scaling with the amount of inlining that takes place as we duplicate every inlined DILocation.
  • For each DILocation, we must also add a DenseMap entry in the LLVM context object as part of the uniquing behaviour of metadata, which can quickly end up consuming a measurable % of program memory.
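
As a rough illustration of that per-node cost, here is a minimal conceptual sketch in C++ (hypothetical names; this is not LLVM’s actual class layout, which packs its fields differently):

    #include <cstdint>

    // Hypothetical sketch of the payload each DILocation carries today.
    struct SketchDILocation {
      void *Scope;       // the DIScope this location belongs to
      void *InlinedAt;   // the DILocation of the call site, if inlined
      uint32_t Line;
      uint16_t Column;
      bool IsImplicitCode;
      // On top of this, every node pays the generic MDNode bookkeeping
      // (operand storage, uniquing state) plus a DenseMap entry in the
      // LLVMContext used to unique identical nodes.
    };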

It is certainly possible to make a variety of improvements that partially address each of these points, but they can’t be fundamentally changed without breaking some of the behaviours that LLVM currently relies on. With that in mind, we (Sony) decided to experiment with completely reworking source locations, and believe the result is worth implementing in full.

Proposal

In the prototype, we’ve split the DILocation fields above into two separate structs: “Context” data (Scope, InlinedAt) and “Location” data (Line, Column, IsImplicitCode). These are stored in two separate arrays owned by a DISubprogram, such that source locations are now function-local metadata. Finally, instead of holding a pointer to a DILocation, each Instruction holds a pair of uint32_t indexes into the context and location arrays.
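
A hedged sketch of what this looks like, with hypothetical names (the prototype’s actual classes and encodings differ):

    #include <cstdint>
    #include <vector>

    // "Context" data: deduplicated per function.
    struct LocationContext {
      void *Scope;                   // DIScope*
      // The inlinedAt reference is itself a pair of indexes (discussed further
      // below in the thread); some sentinel value would mean "not inlined".
      uint32_t InlinedAtLocationIdx;
      uint32_t InlinedAtContextIdx;
    };

    // "Location" data: the line-level fields.
    struct LineLocation {
      uint32_t Line;
      uint16_t Column;
      bool IsImplicitCode;
    };

    // Per-function arrays, owned by the DISubprogram in this design.
    struct SubprogramLocationTable {
      std::vector<LocationContext> Contexts;
      std::vector<LineLocation> Locations;
    };

    // Stored on each Instruction in place of the old DILocation pointer.
    struct InstrDebugLoc {
      uint32_t LocationIdx;
      uint32_t ContextIdx;
    };

    // Resolving a location now needs the owning subprogram's table as well.
    inline LineLocation getLineLocation(const SubprogramLocationTable &SP,
                                        InstrDebugLoc DL) {
      return SP.Locations[DL.LocationIdx];
    }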

This has some significant implications for usage of source locations: Instructions no longer have a direct reference to their own source location, so a reference to the owning DISubprogram is needed to resolve it. Although this is sometimes inconvenient, it is a logical limitation: DILocations are only used in a function-local context, so everywhere they are used, a reference to the owning DISubprogram is either present or easily obtainable. In return, the prototype so far reduces the overall memory cost of DILocations by around 50% for most inputs tested in the CTMark suite, and since the current implementation is largely unrefined and unoptimized, we expect further improvements.

For this post I’ll leave the technical explanation at just a rough overview, as the implementation is very much a work in progress - the MIR backend has yet to be fully implemented, the prototype as a whole is not review-ready, and a number of core components will change before the final implementation. For more details on the implementation, however, see the draft prototype and accompanying documentation here: Prototype: Replace DILocations with function-local source locations by SLTozer · Pull Request #133949 · llvm/llvm-project · GitHub

What comes next

The concept is not fully proven yet, though the prototype fundamentally works: there are still bugs and missing features, but the “tricky” cases are solved or have known solutions. Runtime performance impacts are still unclear - the prototype currently has a high runtime cost in Clang (which will be fixed later), and performs roughly on par with the current representation during optimizations, but this may change either way as the implementation is finalized.

More challenging than the implementation itself, however, is the rollout - the change fundamentally affects all APIs that interact with source locations, such as the C API, which currently uses opaque MetadataRefs to pass debug locations around. While the in-tree updates to uses of DILocations are mostly trivial, this still creates work for any maintainers of downstream forks; unlike other significant rewrites, such as the replacement of debug intrinsics with debug records, there is no simple runtime fallback for this change. Therefore, if the approach is accepted, the transition would need careful planning to avoid causing too much disruption.

What we’re interested in right now is input from stakeholders, primarily anyone who:

  • has experience working with LLVM’s metadata model,
  • has a strong interest in reducing memory consumption in debug info builds,
  • consumes any of LLVM’s APIs that may be affected by these changes,
  • maintains a downstream fork of LLVM, particularly with any debug-info-related changes.

Any input is welcome, but we would particularly appreciate any technical feedback on the design direction, any issues you foresee with this approach, how this change could/would affect your own usage of or modifications to LLVM, and any suggestions for how this change could be made easier to work with/transition into.

Note also that by packing source location data more efficiently, we free up some headroom to expand our representation of source locations: this effort was motivated by our work on Key Instructions, a feature which adds new fields to DILocation to improve stepping in a debugger (more details here). This rewrite reduces the cost of adding these fields, and may similarly benefit any other projects that look to extend LLVM’s representation of source locations.

Pinging a few individuals who this may be particularly relevant to: @adrian.prantl @dblaikie @echristo @nikic

3 Likes

+@rnk

Awesome - a really long-standing issue worth looking into.

A couple of questions:

  • What’s the impact on a non-debug build? I would’ve thought a more built-in representation might add overhead to non-debug builds, which could be difficult, but I forget how it’s all organized so I’m not 100% sure on that (like the dbgloc of an instruction is already a special pointer, so replacing that with two ints is size-neutral, maybe?)

  • You mention these things are stored in the DISubprogram - but there are some cases where instructions with locations are in functions without subprograms; though the instructions perhaps still reference /a/ DISubprogram, they may be harder to handle? Specifically, the situation I’m thinking of is a function with debug info being inlined into a function without debug info. (We preserve the debug info through the inlining in case the non-debug-info function is further inlined into another function with debug info.)

  • Two uint32 indexes - any sense of whether that can be shrunk further? Perhaps at least the scopes could be int16 or something, though shrinking only one might not save space, since alignment padding would take up the saved space anyway.

1 Like

Thanks! (Also looks like the @rnk ping might not have registered).

What’s the impact on a non-debug build?

It should be nothing - the storage lives in the DISubprogram (in the current design), so it uses up nothing extra when there is no debug info, and the pair of indexes occupies the same space as the old DebugLoc field (which was present with and without debug info).
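
For illustration only (not code from the prototype), the size argument boils down to:

    #include <cstdint>
    #include <cstdio>

    struct IndexPairDebugLoc { uint32_t LocationIdx, ContextIdx; };

    int main() {
      // On a typical 64-bit host the pair of 32-bit indexes occupies the same
      // 8 bytes as the pointer it replaces, so Instruction does not grow
      // whether or not debug info is present.
      std::printf("pointer: %zu bytes, index pair: %zu bytes\n",
                  sizeof(void *), sizeof(IndexPairDebugLoc));
    }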

there are some cases where instructions with locations are in functions without subprograms

Indeed, that’s one of the “tricky” cases I encountered. It’s not a particularly common case (in order for it to matter we need to inline from a debug function into a nodebug function, and then inline that into a debug function), so we can afford tradeoffs that are “expensive only when needed”. The two natural solutions, as I see it, are either 1) to move the storage into the function instead, which simplifies things but adds a (potentially small) cost to non-debug builds, or 2) to create a smaller piece of metadata which just holds the function-local metadata storage, and is attached to nodebug functions (could be a !dbg attachment, but it doesn’t need to be) that have had debug locations inlined into them. The latter seems like the best solution to me, since it only costs anything when this situation arises, but I’ve not jumped in to measure this yet!
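
To make option 2 slightly more concrete, here’s a hedged sketch with hypothetical names (nothing like this exists in the prototype yet): a small holder that owns only the function-local location arrays, attached to a nodebug function once locations have been inlined into it, so the cost is paid only when this situation actually arises.

    #include <cstdint>
    #include <vector>

    struct LocationContext { void *Scope; uint32_t InlinedAtLocationIdx, InlinedAtContextIdx; };
    struct LineLocation    { uint32_t Line; uint16_t Column; bool IsImplicitCode; };

    // Minimal metadata-like holder for function-local location storage on a
    // nodebug function (e.g. hung off a !dbg-style attachment). Functions with
    // a real DISubprogram would keep their storage there as before.
    struct NoDebugLocationTable {
      std::vector<LocationContext> Contexts;
      std::vector<LineLocation> Locations;
    };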

Two uint32 indexes - any sense of whether that can be shrunk further

Quite possibly - though in my case I’ve mainly considered whether they could be shrunk to fit more indexes in, to enable further debug info enhancements - but reducing the size of Instruction for all builds would also be a positive outcome! In any case, I don’t have a great sense of what the upper bound of the indexes might be - it seems rather unlikely that you’d ever come close to having 2^32 (scope+inlinedAt) combinations in a function, but 2^16 sounds like the sort of limit that could actually be hit in some pathological case (e.g. giant generated switch blocks where every branch calls into some nested inline functions).

1 Like

Great work, I’m glad to see folks working on this! I think it would be a great enhancement if we could make source locations sufficiently cheap that we could always track them, which is, BTW, the norm for MLIR.

This comes up when people add backend diagnostics. While they are discouraged, there are many of them, and users continue to ask for more of them. We could make these diagnostics significantly more user-friendly with ubiquitous source location information.

Speaking of the design, the direction I was contemplating before was, what inspiration can we borrow from the Clang SourceManager / SourceLocation representation? I think the Clang SourceManager contains too many details about the C preprocessor to migrate from Clang to LLVM, but we could push down some kind of file/buffer list and token offset encoding table as part of the LLVM IR serialization. Clang is able to represent the source location of every token in the compilation in 32-bits of information (64-bit sloc build configs notwithstanding). LLVM needs to augment that with more information (scope & inlinedAt), but even with one 64-bit integer and some tables, we can represent a lot more sloc information than we do today.

The challenges here probably all have to do with transformations. How do we merge your DISubprogram tables when we inline source locations? If we had module-level source manager-like tables, how do we merge them during full and thin LTO? Are there ways that we could represent the inlined call stack information to make inlining cheap, fast, and convenient? It seems like there are some similarities between source files and inline call frames. I can imagine representing an inlined call site as an array of arbitrary source locations, and the indices which point into them are that source location combined with an inlinedAt call site location. This allows nesting, and could be contained in a module-wide 64-bit source location index space.

Another thing to consider here is the readability of the textual representation. I think readable debug info goes a long way to helping transform authors accurately update source locations. Scoping is an important aspect of project management, so don’t let this derail the project, but it would be good if the in memory data model supports readable textual representations. I’ve been dreaming of locations that look like our asm comments, something like load i32, ptr %x #dbg(!1234:AsmPrinter.cpp:4565:12), where “!1234” is the real scope link (can this also serve for inlinedAt?), and AsmPrinter.cpp is a file basename that exists mainly for readability.

Something else to consider is that sample PGO folks would like to encode more information into the source location. They’ve already loaded up DWARF discriminators with 24 bits of flow-sensitivity information, but this is an extremely opaque bitfield representation (original RFC, but I’d love to see current docs). I’ve been getting questions like, would it be possible to encode which source locations are inside this critical section, as defined by this C++ RAII critical section variable? This roughly maps to the DWARF scope, but I’m not sure it’s sufficient.

cc @ZequanWu , who I was encouraging to look into this space in a few months.

2 Likes

+1 since this, in turn, mitigates profile degradation for SamplePGO as optimizations continue to evolve.

Thanks for the detailed response; I’m not familiar enough with Clang’s SourceManager to draw parallels between the two, but one of the advantages of using per-function source locations rather than per-module is that we can keep the storage vectors tightly allocated, and we almost never have to perform large or frequent reallocations. I’ll respond to your points, though in case my short explanations don’t suffice, I also have a readme on the development branch that explains some details of the implementation and the decisions made.

Inlining is quite straightforward; essentially, when we inline bar into foo, we can directly copy bar’s context and location arrays to the end of foo’s. Then, for each inlined instruction, we update its source location by incrementing its indexes by the original length of the corresponding array in foo, so that it matches up with the copied entries in the combined array.
For representing inline callstacks, I mentioned that the context array has an inlinedAt field - just as the current inlinedAt field in a DILocation is a pointer to another DILocation, the inlinedAt field in the index-based DebugLoc model is another DebugLoc, i.e. a pair of indexes; these indexes point to the (Context, Location) entries that correspond to the original call. Thus, we traverse an inlined call stack by following the inlinedAt indexes back through the Context array.
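
A minimal sketch, with hypothetical names, of that index fix-up (the real prototype also has to rebase the copied contexts’ own inlinedAt indexes and chain them onto the call site’s location, which is omitted here):

    #include <cstdint>
    #include <vector>

    struct InstrDebugLoc { uint32_t LocationIdx, ContextIdx; };

    // Append bar's per-function arrays onto foo's, then rebase the indexes of
    // every cloned (inlined) instruction by foo's original array lengths so
    // they point at the copied entries.
    template <typename ContextT, typename LocationT>
    void appendInlinedLocations(std::vector<ContextT> &FooContexts,
                                std::vector<LocationT> &FooLocations,
                                const std::vector<ContextT> &BarContexts,
                                const std::vector<LocationT> &BarLocations,
                                std::vector<InstrDebugLoc *> &ClonedInstrLocs) {
      const auto CtxOffset = static_cast<uint32_t>(FooContexts.size());
      const auto LocOffset = static_cast<uint32_t>(FooLocations.size());

      FooContexts.insert(FooContexts.end(), BarContexts.begin(), BarContexts.end());
      FooLocations.insert(FooLocations.end(), BarLocations.begin(), BarLocations.end());

      for (InstrDebugLoc *DL : ClonedInstrLocs) {
        DL->ContextIdx += CtxOffset;
        DL->LocationIdx += LocOffset;
      }
    }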

The textual representation I’m using at the moment (purely for development purposes, not intended as a final design) directly prints the indexes for each instruction, e.g. load i32, ptr %x !DebugLoc(loc: 2, context: 1), and then the corresponding DISubprogram contains arrays, e.g. ..., srcLocs: [(0, 0), (5, 1), (6, 5)], context: [(!2), (!4)], .... This does not improve clarity by any means, but it does have one advantage in that it simplifies the parsing of IR; this is not a performance critical task anywhere (to my knowledge) so this may be irrelevant, but it does remove the need for us to track forward references to location metadata - when you parse the indexes, you can store them directly in the instruction without needing to create temporary metadata.
I have, however, started adding information into comments to make this easier to parse visually. For now that just takes the form of printing the subprogram ID for easier lookup, i.e. !DebugLoc(loc: 2, context: 1) ; !2, but I intend to eventually add detailed comment printouts similar to what you described to make it much simpler to understand. Taking your approach as-is might also work, I just haven’t thought about it too much yet; the main difficulty would be in relation to inlining, where we have some distinct DILocations that aren’t attached to instructions but that must be preserved without combining identical-but-distinct inlinedAt locations - which would be tricky if we only printed out location information directly rather than maintaining an explicit index.

At the moment, all I can confidently say is that there should be more headroom for adding information to source locations; with a more detailed look at any specific scheme, I could make a better guess as to how it would slot into the current design. Briefly though, the two ways we can add more information are 1) expanding either “Context” or “Location” to contain more fields, or 2) adding an additional index to DebugLoc. The former is the easier option in most cases, as while it increases the cost of storing all source locations, there is no strict limit in place on the size of those structs. The latter is riskier, since we only have 64 bits to use for indexes without inflating the size of Instruction, which would be very undesirable; but it may be feasible to try and squeeze a 3rd (probably not a 4th) index in there, if we have some additional information to store that is not highly correlated with either the context or location fields.

It’s a really great effort! Debug info is indeed a major source of memory consumption during compilation. I wanted to try your PR, but found that the PR build fails. May I ask if this work is still in progress? I’m very interested in reducing memory consumption in debug info builds.

The prototype covers a limited area of the compiler - it has successfully run middle-end optimizations for most of the programs in the LLVM test suite, but it isn’t yet implemented fully enough to perform a complete compilation correctly. The work will be completed, but it’s a reasonably large chunk of work that I’ve not yet been able to allocate the time to finish, so unfortunately it may take some time until it’s done.

1 Like