[RFC] Source expressions as metadata

Our team is interested in the problem of providing coherent memory reference information to users in optimization remarks. For example, when a loop is peeled for alignment, it may be useful to the sophisticated user to know which memory references in their source code benefit from alignment, and which remain unaligned. When vectorizing, the user may benefit from knowing that a memory reference pattern required a gather load rather than a sequential load. There are many other examples in various loop optimizations.

The difficulty is providing an easily digestible description of the memory reference. With debug information available, we can usually identify the line number and column number of the memory reference from the associated gep instruction, which is a good start. But it would be preferable to describe the memory reference in the syntax of the source language.

For GSoC ’23, there was a student project (https://github.com/llvm/llvm-project/pull/66591, https://discourse.llvm.org/t/map-llvm-values-to-corresponding-source-level-expressions/68450) that approached this by using debug information to infer a source-level expression string corresponding to a Value. This proof-of-concept demonstrated that such an approach is possible in practice, but raised some concerns. It’s clear that the work to generalize the prototype for all instructions and type information would be substantial. More importantly, the inferred expression will often be only an approximation of the original source-level expression; the likelihood of accuracy diminishes as optimization proceeds. Another issue is matching the source language syntax, which might require the analysis to support multiple “dialects” when producing the expression strings. Finally, the compilation time required is linear, but still nonzero.

The ideal approach would be to use the file, line, and column number information to simply grab the expression from the source code. But this violates separability; we can’t guarantee the back end has access to the filesystem used by the front end. Furthermore, there would still be a source language dependency when determining the endpoint of the memory reference.
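To make the endpoint problem concrete, here is a minimal, self-contained C++ sketch (not LLVM or Clang code) that slices a memory reference out of an in-memory source buffer, given a 1-based line and column. The names `offsetOf` and `sliceMemRef` are hypothetical, and the endpoint rule (identifier characters, `.`, `->`, and balanced `[...]`) is an assumption that only suits C-like syntax; another source language would need a different rule, which is exactly the dependency mentioned above.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Convert a 1-based (line, column) pair into a byte offset in `src`.
// Returns std::nullopt if the position lies outside the buffer or the line.
std::optional<std::size_t> offsetOf(const std::string &src, unsigned line,
                                    unsigned col) {
  if (line == 0 || col == 0)
    return std::nullopt;
  std::size_t pos = 0;
  for (unsigned l = 1; l < line; ++l) {
    pos = src.find('\n', pos);
    if (pos == std::string::npos)
      return std::nullopt;
    ++pos; // step past the newline
  }
  std::size_t eol = src.find('\n', pos);
  if (eol == std::string::npos)
    eol = src.size();
  std::size_t off = pos + (col - 1);
  if (off >= eol)
    return std::nullopt;
  return off;
}

// Slice a C-like memory reference starting at (line, col). ASSUMPTION: the
// reference consists of identifier characters, '.', "->", and balanced
// subscripts; anything else at bracket depth 0 ends the reference.
std::string sliceMemRef(const std::string &src, unsigned line, unsigned col) {
  std::optional<std::size_t> start = offsetOf(src, line, col);
  if (!start)
    return "";
  std::size_t i = *start;
  int depth = 0;
  while (i < src.size()) {
    char c = src[i];
    if (c == '[') {
      ++depth;
    } else if (c == ']') {
      if (depth == 0)
        break; // closing bracket that belongs to an enclosing expression
      --depth;
    } else if (depth == 0) {
      bool arrow = c == '-' && i + 1 < src.size() && src[i + 1] == '>';
      if (arrow) {
        i += 2;
        continue;
      }
      bool ident =
          std::isalnum(static_cast<unsigned char>(c)) || c == '_' || c == '.';
      if (!ident)
        break;
    }
    ++i;
  }
  return src.substr(*start, i - *start);
}
```

Even this toy version needs the source buffer at hand and a language-specific scanner, so it does not address the separability concern; it only illustrates why the endpoint cannot be derived from the line and column alone.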

An obvious but key point is that the source code doesn’t change during compilation; we always want to point back to the same expression string for a given memory reference at any point during compilation. This argues for maintaining the source expression as metadata.

One possibility would be to have the front ends provide the expression strings as metadata along with the debug location for the memory reference. Obtaining the metadata from the front end solves the accuracy problem and keeps the front and back ends decoupled. We thought briefly about extending the DILocation metadata to include the source expression string, but after running it by one of our debug information experts, we realize there are strong reasons to keep the location information small, so this probably wouldn’t fly. This is too bad, since storing it as part of DILocation would allow us to piggyback on existing propagation strategies and optimization decisions for DILocations, for example determining which one to associate with an instruction after commoning.

Assuming that’s not possible, we would propose adding a new metadata tag (say, !srcexpr) on instructions, pointing to something like !DISourceExpr(!"EXPR_STRING", !). We would have additional work to do to ensure this was propagated as successfully as DILocation data.

To minimize the overhead of this approach, we would initially propose that we only create source-expression metadata for memory references, as this is the use case we have in mind. It would be “easy” to extend to other expression nodes if other use cases arise. The extra information could also be added under control of a switch. We would envision having this metadata generated by default when -Rpass is selected in order to support optimization reporting, as is already done for location data.
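As a rough illustration of the scheme, the following self-contained C++ sketch models a front end that records the source expression only on loads and stores, and only when a switch is on, plus a remark printer that falls back to the bare location otherwise. This is a toy model, not LLVM code: `Inst`, `EmitOptions`, `emit`, and `remark` are hypothetical names, and the `srcExpr` field merely stands in for a !srcexpr attachment.

```cpp
#include <optional>
#include <string>

enum class Opcode { Load, Store, Add, GetElementPtr };

struct DebugLoc {
  unsigned line = 0;
  unsigned col = 0;
};

// Toy instruction. `srcExpr` stands in for a !srcexpr metadata attachment.
struct Inst {
  Opcode op;
  DebugLoc loc;
  std::optional<std::string> srcExpr;
};

// Hypothetical switch; the proposal suggests enabling it along with -Rpass.
struct EmitOptions {
  bool emitSrcExpr = false;
};

bool isMemRef(Opcode op) { return op == Opcode::Load || op == Opcode::Store; }

// "Front end": attach the expression text only to memory references, and
// only when the switch is on, to keep the metadata overhead small.
Inst emit(Opcode op, DebugLoc loc, const std::string &text,
          const EmitOptions &opts) {
  Inst inst{op, loc, std::nullopt};
  if (opts.emitSrcExpr && isMemRef(op))
    inst.srcExpr = text;
  return inst;
}

// "Optimization remark": name the expression if it is known, otherwise fall
// back to a generic description at the debug location.
std::string remark(const Inst &inst, const std::string &what) {
  std::string where =
      std::to_string(inst.loc.line) + ":" + std::to_string(inst.loc.col);
  std::string subject = inst.srcExpr ? "'" + *inst.srcExpr + "'"
                                     : std::string("memory reference");
  return where + ": " + subject + " " + what;
}
```

With the switch on, `remark(emit(Opcode::Load, DebugLoc{3, 12}, "b[i + 1]", opts), "requires a gather load")` names the expression; with it off, the same call degrades to a location-only message.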

If obtaining the expression strings from the front end is not practical for some reason, it would be possible to use an inference approach like the GSoC ’23 prototype, executed at the front of the LLVM pipeline to minimize the differences from the source expression. But as noted, this is not by any means ideal, and differences would still often exist.

I’m personally not well-versed in the front-end code, so I’m not certain how difficult it is to identify the string representing a memory reference from the source. I assume that after parsing and formation of the ASTs it might be challenging to get back to the source tokens. So I’d be interested in thoughts from front-end experts on how feasible this might be.

What thoughts do you have on this proposal? Are there alternatives we haven’t considered? Are there significant downsides we’ve missed?

Thanks for reading this far! We appreciate any discussion.

Cheers,
Bill

Tagging @karthiksenthil @sguggill @abhinavgaba @eepshtey @AaronBallman

Tagging @jmorse @OCHyams as this feels vaguely related to Assignment Tracking, although in the AT case you’re looking at writes, whereas this work appears to care more about reads. But I have to wonder if there’s something to leverage.

Hi @bill.schmidt, thank you for your RFC. I have been looking into and discussing a very similar approach with @StephenTozer recently (just a few days back). While working on the project, we initially focused on building the source-level expressions irrespective of how accurate they are, and mainly focused on memory references. You have already mentioned other challenges that we faced as well.

I’ve recently been thinking about how we could have something called “Instruction Tracking” for a similar reason. Existing mechanisms like !DILocation, as you have suggested, provide some tracking through source locations but fall short of tracking how instructions evolve through optimizations, and there are also the size issues you found on further research.

I really like the idea of introducing new metadata for expressions, but the problem is ensuring that !srcexpr metadata is correctly propagated through complex optimizations that might split, merge, or significantly alter the original instructions. I’m still not sure what the best way to ensure that is; as far as I’m aware, metadata is not guaranteed to be preserved and could be dropped at any time (mostly during complex transformations).

If we go down that route, there are other potential challenges that I was looking at, namely the impact on compile time and memory usage. To mitigate this, since we would still be limiting metadata generation to certain types of instructions (e.g., memory references), I think the effect should be small, and we can allow it to be controlled by compiler flags (possibly going further, down to accessing the DIFile, in some cases).

And I think the complexity might lie in accurately capturing and converting the high-level abstract syntax tree representations of expressions into string metadata that can be carried through the optimization phases. (Leaving this to the front-end folks.)

I would love to see more work in this direction and am happy to help out with the project if needed. I was just thinking about posting an RFC on the same topic, so it’s great that you have posted and started the discussion :smile:. (I discussed the “Mapping LLVM Values to the source level expression” project at FOSDEM ’24, and people were interested in the project and suggested ideas.)

Thank You!

Thank you for the RFC! This seems to be closely related to an existing deficiency we have in the frontend regarding communicating diagnostics from the backend to the user: [RFC] Improving Clang's middle and back end diagnostics (CC @nickdesaulniers). The concerns raised on that thread are primarily about finding an acceptable tradeoff for functionality vs compile time overhead. It may be worth mining that discussion to further refine your ideas.

Yep. Instruction::clone copies metadata, but instructions created other ways need metadata manually copied over (copyMetadata, combineMetadata). I guess surveying call sites for the following could provide some insight: setDebugLoc, applyMergedLocation, dropLocation, copyMetadata, combineMetadata.
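The propagation burden can be pictured with a small, self-contained C++ model. It does not use the LLVM API: `ToyInst`, `copyKind`, and `mergeKind` are hypothetical, and the merge policy shown (keep the attachment only when both instructions carry the same string) is an assumed conservative choice for illustration, not a description of what LLVM's `combineMetadata` does.

```cpp
#include <map>
#include <string>

// Toy instruction: metadata is a map from kind name to a string payload.
struct ToyInst {
  std::string opcode;
  std::map<std::string, std::string> md;

  // Cloning carries every attachment along, as described for
  // Instruction::clone above.
  ToyInst clone() const { return *this; }
};

// An instruction built from scratch has no metadata, so a transform that
// replaces `from` with `to` has to copy the attachment explicitly.
void copyKind(const ToyInst &from, ToyInst &to, const std::string &kind) {
  auto it = from.md.find(kind);
  if (it != from.md.end())
    to.md[kind] = it->second;
}

// When two instructions are commoned into `kept`, retain the attachment only
// if both sides agree. ASSUMPTION: dropping is preferable to reporting an
// expression that describes only one of the merged references.
void mergeKind(ToyInst &kept, const ToyInst &removed, const std::string &kind) {
  auto a = kept.md.find(kind);
  auto b = removed.md.find(kind);
  bool same = a != kept.md.end() && b != removed.md.end() &&
              a->second == b->second;
  if (!same)
    kept.md.erase(kind);
}
```

Every transform that builds a replacement without cloning is a place where a call like `copyKind` can be forgotten, which is why surveying the existing call sites for debug locations seems like a reasonable first step.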

Similarly, as Paul mentions

With Assignment Tracking, LLVM attaches !DIAssignID metadata to store-like instructions; this metadata is maintained (with a best-effort approach) through optimisations. Just like with DILocations, there’s code for merging DIAssignID metadata when instructions are merged. Looking at DIAssignID for inspiration won’t give you a full picture because LLVM only has to preserve it on store-like instructions. It might be useful to look at all the same.


The thread that @AaronBallman links above mentions an approach which involves changing DILocations to look something closer to clang SourceLocations, which are effectively offsets into source buffers AFAICT (I’m not a front-end person though). I imagine moving towards something like that in LLVM would involve keeping the source code around in IR too (as you mention, the source file may not be available in the back end)? That approach would probably help with your problem. It does sound like quite a substantial undertaking though.

I also can’t help here, sorry!

I think it’ll ultimately depend on the performance characteristics (looking at the metadata coming out of a front-end part prototype could help). That said, the approach sounds fairly reasonable to me at this stage given the alternatives. YMMV, I’d be interested in hearing what others say.

Thanks for the discussion so far! My biggest takeaway from the discussion thread pointed to by @AaronBallman is confirmation that separate metadata (similar to the “ad hoc debug info” provided on !srcloc) is a preferred approach.

I also see that it’s common to want to have something similar to debug location information that’s treated in the same manner for propagation purposes. This is a common theme with the Assignment Tracking thread, the backend diagnostics thread, and this discussion. It seems we might benefit from a framework that would simplify propagation of metadata that wants to be treated similarly to DILocation data.

I’m still interested in hearing opinions from front-end experts on the feasibility of identifying the full expression associated with a store or load (i.e., a substring of the LHS or RHS of an assignment involving memory that will turn into one or more load/store IR instructions) and attaching it to those IR instructions. I admit this is because I am daunted by trying to figure it out myself with no knowledge of the FE structure. I’ll dive in if I must, but leaning on the knowledge of the masters seems more efficient. :wink:

Thanks again!
Bill
