Our team is interested in the problem of providing coherent memory reference information to users in optimization remarks. For example, when a loop is peeled for alignment, it may be useful to the sophisticated user to know which memory references in their source code benefit from alignment, and which remain unaligned. When vectorizing, the user may benefit from knowing that a memory reference pattern required a gather load rather than a sequential load. There are many other examples in various loop optimizations.
The difficulty is providing an easily digestible description of the memory reference. With debug information available, we can usually identify the line number and column number of the memory reference from the associated gep instruction, which is a good start. But it would be preferable to describe the memory reference in the syntax of the source language.
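To make the starting point concrete, here is a small IR sketch (variable names and metadata numbers are invented for illustration) of what a memory reference looks like today: the gep and load carry a `!dbg` attachment pointing at a `DILocation`, which gives us line and column but nothing resembling the original expression text.

```llvm
; Hypothetical IR for something like "a[i] += 1" at line 5, column 3.
; The !dbg attachment is what current remarks can report from.
%idx = getelementptr inbounds i32, ptr %a, i64 %i, !dbg !10
%v   = load i32, ptr %idx, !dbg !10

!10 = !DILocation(line: 5, column: 3, scope: !6)
```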
For GSoC '23, there was a student project (https://github.com/llvm/llvm-project/pull/66591, https://discourse.llvm.org/t/map-llvm-values-to-corresponding-source-level-expressions/68450) that approached this by using debug information to infer a source-level expression string corresponding to a Value. This proof of concept demonstrated that such an approach is possible in practice, but it raised some concerns. It's clear that the work to generalize the prototype to all instructions and type information would be substantial. More importantly, the inferred expression will often be only an approximation of the original source-level expression, and the likelihood of accuracy diminishes as optimization proceeds. Another issue is matching the source language syntax, which might require the analysis to support multiple "dialects" when producing the expression strings. Finally, the compilation-time cost is linear but still nonzero.
Ideally, we would use the file, line, and column information to simply grab the expression from the source code. But this violates separability: we can't guarantee the back end has access to the filesystem used by the front end. Furthermore, there would still be a source-language dependency when determining the endpoint of the memory reference.
An obvious but key point is that the source code doesn't change during compilation; we always want to point back to the same expression string for a given memory reference at any point during compilation. This argues for maintaining the source expression as metadata.
One possibility would be to have the front ends provide the expression strings as metadata along with the debug location for the memory reference. Obtaining the metadata from the front end solves the accuracy problem and keeps the front and back ends decoupled. We thought briefly about extending the DILocation metadata to include the source expression string, but after running it by one of our debug information experts, we realized there are strong reasons to keep the location information small, so this probably wouldn't fly. This is too bad, since storing it as part of DILocation would allow us to piggyback on existing propagation strategies and optimization decisions for DILocations, for example determining which one to associate with an instruction after commoning.
Assuming that's not possible, we would propose adding a new metadata tag (say, !srcexpr) on instructions, pointing to something like !DISourceExpr(!"EXPR_STRING", !). We would have additional work to do to ensure this was propagated as successfully as DILocation data.
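As a strawman (the node name, field name, and expression string below are all hypothetical, not a worked-out design), the attachment might look like this in IR: the load keeps its existing `!dbg` location and additionally gains a `!srcexpr` node holding the original source text.

```llvm
; Sketch only; !DISourceExpr syntax is a placeholder for discussion.
%v = load i32, ptr %idx, !dbg !10, !srcexpr !20

!10 = !DILocation(line: 5, column: 3, scope: !6)
!20 = !DISourceExpr(expr: !"p->buf[i]")
```

One design question this raises is whether the node needs more than the bare string, e.g. a link back to the enclosing scope so the string can be disambiguated across inlined copies.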
To minimize the overhead of this approach, we would initially propose that we only create source-expression metadata for memory references, as this is the use case we have in mind. It would be "easy" to extend to other expression nodes if other use cases arise. The extra information could also be added under control of a switch. We would envision having this metadata generated by default when -Rpass is selected in order to support optimization reporting, as is already done for location data.
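To sketch the intended user experience (the file name, expression, and remark wording here are invented for illustration, not output of any existing pass), a vectorizer remark could then quote the expression directly instead of leaving the user to map line and column back to their source:

```
t.c:5:3: remark: 'p->buf[idx[i]]' is not a sequential access and
requires a gather load [-Rpass-analysis=loop-vectorize]
```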
If obtaining the expression strings from the front end is not practical for some reason, it would be possible to use an inference approach like the GSoC '23 prototype, executed at the front of the LLVM pipeline to minimize the differences from the source expression. But as noted, this is not by any means ideal, and differences would still often exist.
I'm personally not well-versed in the front-end code, so I'm not certain how difficult it is to identify the string representing a memory reference from the source. I assume that after parsing and formation of the ASTs it might be challenging to get back to the source tokens. So I'd be interested in thoughts from front-end experts on how feasible this might be.
What thoughts do you have on this proposal? Are there alternatives we haven't considered? Are there significant downsides we've missed?
Thanks for reading this far! We appreciate any discussion.
Cheers,
Bill