Map LLVM values to corresponding source-level expressions

Description of the Project: Developers often use compiler generated remarks and analysis reports to optimize their code. While compilers in general are good at including source code positions (i.e., line and column numbers) in the generated messages, it is useful if these generated messages also include the corresponding source-level expressions. LLVM implementation uses a small set of intrinsic functions to define a mapping between LLVM program objects and the source-level expressions. The goal of this project is to use the information included within these intrinsic functions to either generate the source expression corresponding to LLVM values or to propose and implement solutions to get the same if the existing information is insufficient. Optimizing memory accesses in a program is important for application performance. We specifically intend to use compiler analysis messages that report source-level memory accesses corresponding to the LLVM load/store pointer values that inhibit compiler optimizations. As an example, we can use this information to report memory access dependences that inhibit vectorization.

Expected result: Provide an interface which takes an LLVM value at any point in the LLVM transformations pipeline and returns a string corresponding to the equivalent source-level expression. We are especially interested in using this interface to map addresses used in load/store instructions to equivalent source-level memory references.
Desirable skills: Intermediate C++, familiarity with LLVM core or willingness to learn the same, self-motivation.

Project size: Medium

Difficulty: Medium

Confirmed Mentor(s): @sguggill Karthik Senthil

1 Like

Hey, Shivam this side,
Iā€™m currently pre-final year undergrad in India. And currently also working part time at KDAB(currently working on clazy(a static analysis tool, much specific to Qt that is based on the Clang API) .

I had also previously contributed at LLVM in clang static analyser and few patches stuff related to AST. And also few patches is in review for libcxx (or i also need to update them, sorry for delaying :frowning: ) .

Thanks for the project, itā€™s always great , if the modern compilers can help developers with more information about the stuff happening on optimising side / memory usage or anything which helps them to identify the performance problems, errors etc. And debugger always plays vital role here.
Iā€™m not much in debuggers and never explored it much but pretty interested in this project, always been interested in backend of LLVM and especially IR , and Iā€™m aware that LLVM debug info is generated using itā€™s IR and the intrinsics we have.
Currently reading this Source Level Debugging with LLVM ā€” LLVM 17.0.0git documentation , I hope it helps to explore more about LLVM debug info or if you have any suggestions, Iā€™d love to know.

Iā€™ll try to comment here with some more idea and my learnings asap.(just stuck in college) .
Thanks :wink:

Hey @sguggill , few questions.
Do we just want to use LLVM intrinsics to generate the source level expression? Or there is also one possible thing is to traverse over the LLVM IR to generate the corresponding source level expression taking an LLVM Value as input.
What do you think, what would be a good approach , to use llvm intrinsics or traversing IR for corresponding the value?

It seems to me that Intrinsics would be better, as it can be a more direct and efficient approach for these operations, since the intrinsic functions are already implemented in LLVM and have been optimized for performance.
If we go with intrinsics, I found out that metadata is the essential method to get information such as debugging information or information about data types. In our case, we are directly interested into the intrinsics which contains the addresses and source location details.

Could you evaluate the term LLVM Value ? If Iā€™m right it means any instruction in the pipeline, right? And also are we only interested in Load and Store instructions ?

@phyBrackets thank you for expressing interest in the project and letting us know a bit about yourself. To answer some of your questions, consider the following trivial example:

long foo(long *lp, long n1, long n2)
{
  return lp[2 * n1 + n2];
}

If you compile above using ā€˜clang -O2 -S -g -emit-llvm test.cā€™, you will see something like the following being generated. I have only shown the relevant parts.

define dso_local i64 @foo(ptr nocapture noundef readonly %0, i64 noundef %1, i64 noundef %2) local_unnamed_addr #0 !dbg !9 {
  call void @llvm.dbg.value(metadata ptr %0, metadata !15, metadata !DIExpression()), !dbg !18
  call void @llvm.dbg.value(metadata i64 %1, metadata !16, metadata !DIExpression()), !dbg !18
  call void @llvm.dbg.value(metadata i64 %2, metadata !17, metadata !DIExpression()), !dbg !18
  %4 = shl nsw i64 %1, 1, !dbg !19
  %5 = add nsw i64 %4, %2, !dbg !20
  %6 = getelementptr inbounds i64, ptr %0, i64 %5, !dbg !21
  %7 = load i64, ptr %6, align 8, !dbg !21, !tbaa !22
  ret i64 %7, !dbg !26
}
...
...
!15 = !DILocalVariable(name: "lp", arg: 1, scope: !9, file: !1, line: 1, type: !13)
!16 = !DILocalVariable(name: "n1", arg: 2, scope: !9, file: !1, line: 1, type: !12)
!17 = !DILocalVariable(name: "n2", arg: 3, scope: !9, file: !1, line: 1, type: !12)

Using the information from the llvm.dbg.value intrinsics and seeing how %4, %5, %6, %7 are computed, we should be able to infer something like the following:

%4 -->   n1 << 1
%5 -->   (n1 << 1) + n2
%6 -->   &lp[(n1 << 1) + n2]
%7 --> *(&lp[(n1 << 1) + n2]) --> lp[(n1 << 1) + n2]

As you can see from this example:

  1. Computing the equivalent source expressions of interest will involve walking LLVM IR and using information provided in debug intrinsics.
  2. Even though our current interest is addresses used in load/store instructions, in order to compute the needed information we will need to compute the same for other LLVM instructions
  3. Optimizations may make it impossible to recover the original source expression. In the case above, for example 2 * n1 is optimized to n1 << 1 and recovering the original expression may not be possible. Can we do something to make this possible?

I hope I have answered your questions.

Hey @sguggill , Thanks for the nice and easy explanation, I think I understood it correctly.
I guess, It will be quite straight forward to compute the equivalent source expression by traversing the LLVM IR and using the metadata.
But yes, the transformations/optimizations makes it impossible to get the original source expression even with the help of metadata, as it is not preserve.
There can be multiple solution for that

  1. Preserving the metadata, which is not possible in the current context i think because metadata is designed such that itā€™s always safe to drop. Do you have any comments on this?
  2. Iā€™m reading a paper about the code tracking and preserve the information about the source level constructs, https://hal.inria.fr/inria-00572785/document , It is also based on the same problem that metadata is not preserve and get lost after certain optimizations,
    This paper done a great job in explaining about how Multi-versioning will be helpful in code tracking.

Cloning the module can help preserve a copy of the original LLVM IR before any optimizations are applied, which can be useful for comparing the optimized code with the unoptimized code. This can help to identify which optimizations are causing changes to the code and can also help with debugging any issues that may arise in the optimized code.
Once we have a clone of the module, we can apply optimization passes to the original module while still having access to the unoptimized code in the cloned module. We can then use the information gathered from the cloned module to help map the LLVM values back to their original source-level expressions.
Iā€™m not much aware of the LLVM::ValueMapper class but it seems to help in the cloning and re(mapping) LLVM: llvm::ValueMapper Class Reference .
Btw Iā€™m still in a doubt of, do we have access to the original LLVM IR in any way for cloning it, or Iā€™m missing something? Or for having the clone of original module, do we need to go for symbolic execution?

Let me know your views on it, Iā€™ll try to research more about it and will let you know if Iā€™ll find something more interesting and a feasible solution.

Thanks

Hi Shivam,

Thanks for the updates. I believe the major focus of the project will be on how we can reconstruct the source level expressions for a given llvm::Value. I agree the approach must be based on using def-use chains of the IR and consuming information from the debug intrinsics.

Regarding your suggestions on techniques to preserve source-level debug information metadata, I think that would be a secondary task and would need a much deeper investigation. Cloning a full module arbitrarily would be costly from LLVM middle-end/back-end perspective since they can be very memory intensive. We can discuss about that further, maybe it would be good to come up with pros/cons for your proposed techniques?

Also I have created a Discourse account now and will receive updates made on the post. I will ask Satish to update the post with my Discourse handle.

Thank you,
Karthik

Replying over here, Hi @karthiksenthil ,
Yes I agree, Cloning the full module can be very memory-intensive, especially for large modules, which can impact the performance of LLVM-based optimization passes and can be time-consuming, which can impact the overall compilation time.
There are few pros about this technique which may be relevant Cloning the full module ensures that all original source-level debug information metadata is preserved and the technique does not require any additional analysis.

And this leads to a further idea, which is to clone only relevant part of the module instead cloning the whole module but yes then it requires additional analysis, The advantage of this approach may be that it would reduce the memory and time overhead of cloning the entire module, and only focus on the relevant parts. However, it may require more upfront analysis to identify the specific parts of the function to clone, and the remapping of values may need to be carefully handled to ensure correctness.

In this technique, we need to be very specific to the transformation and optimizations that changes the metadata, so this approach probably require too much handling over each kind of transformations but surely a better approach rather than cloning the whole module.

Hey @sguggill @karthiksenthil , Iā€™m thinking of writing a Draft proposal about this project, Could you suggest what should be the initial/main focus of the project? How much we care about the original source level expression initially?
I think initially we can focused on the base idea of the project which is mapping the LLVM Value to source level expression, after that we can look into what would be a better and efficient approach to infer the original source expression.

Although ig llvm.dbg.value can able to give essential metadata info about the original source expression but some optimizations may make it impossible to recover the original source expressions, so we really need to look up for different ideas.

Hi @phyBrackets, thanks for following up and working on a draft proposal for this project. We believe that the primary focus of this project should be on how to construct a source-level expression for a given llvm::Value by tracing associated LLVM debug intrinsics. Having a mechanism to build such an expression would ensure that we could determine them for interesting Values in the IR (for example, memory accesses) early in the LLVM pass pipeline and cache them in a central storage like metadata. We can then investigate after every transform if this cached expression needs to be updated or retained as is. This approach should cover our secondary concern about correctness of the source-level expression. We can also discuss about other alternate approaches to track and maintain these expressions accurately.

This was actually one of the topics of the thesis I wrote with a friend of mine a couple of years ago (most of the relevant bits are in ch. 3). The work was a bit hurried and needed a lot of rewriting to upstream, so I never got around to it, but maybe the code can be useful as inspiration. It evolved organically, mostly by compiling CPython and waiting for the pass to crash while processing some new pattern, and then adding support for that (CPython implements OO code manually and downcasts pointers to subtypes left and right, so it was a pretty good test for some gnarlier C patterns). The implementation does some redundant stuff due to later finding more general solutions for some patterns, while leaving the initial attempts as dead code. The pseudo-code in the thesis may be useful as an architectural guide as it prunes away the redundancies.

Some things to consider, from my memory:

  • We didnā€™t start out with opaque pointers in mind, but during development gradually realised we needed to rely more and more on debug info metadata. In the end we realised the actual IR types shouldnā€™t be relied upon at all, but for legacy reasons our code kind of does and uses the debug info metadata to ā€œcalibrateā€ the types. After struggling with compiling CPython for a while I came to the conclusion that even typed pointers should be treated as opaque, but didnā€™t quite end up fixing the implementation.
  • Should the architecture be written with support for multiple source languages in mind? We only considered generating C source code for the purposes of our thesis, but LLVM is used for many other languages and C code may not be any less confusing than LLVM IR to e.g. a Rust programmer. More complex languages are harder to decompile accurately, however, especially after optimisations since parts of a higher level construct being emitted as multiple operations may be optimised away.
  • I thought a bit about how to handle the string construction efficiently, but ended up having to get something working before worrying too much about that. Iā€™m still not sure whether our string handling is much of a performance bottleneck/leads to memory bloat or not. More complex languages may require more speculative generation as multiple IR operations map to one higher level construct. Maybe this requires separating the phases of finding higher level constructs from actually generating the strings?
  • There isnā€™t always a 1:1 mapping between source code and IR: sometimes multiple source code constructions result in the same IR, e.g. *ptr and ptr[0]. Which pattern should be used when there are multiple options?
  • Related to the previous point, often the actual source code is present on the file system during LLVMā€™s compilation, and the path to it exists in DIFile metadata. Should a first option be to try to open the file and just go to line/column and fetch the ground truth, and only falling back to generating source code when this fails? What are the performance, memory and security implications of doing that?
  • What should be generated when parts of or the entire expression is lacking debug info metadata? Nothing (i.e. falling back to the current behaviour)?

As my implementation needed a rewrite, didnā€™t handle opaque pointers (which are becoming more standard every passing day), and I never had the time to fix it to an upstreamable state I absolutely welcome this initiative. Optimisation remarks have the potential to be so much more than they are, and with some work on the remark messages they could end up being used by many more (non-expert) programmers than they currently are, and this is a great step in that direction.

Another project the might be relevant for inspiration is the old C backend, although Iā€™m not too familiar with the code, and I assume it operates on Machine IR and emits whole translation units instead of individual expressions.

Thanks @karthiksenthil , for some more info!

Hey @hnrklssn , Thanks for the nice write up, It really helps in getting more insights about the problem.
What may be one option regarding this problem is, When there are multiple options resulting in the same IR, it is important to choose the pattern that is most commonly used and consistent with the coding style. This will ensure that the generated source-level expression is familiar to the programmer and easier to understand.

And regarding the presence of actual source code on the file system, it is a good option to go with open the file and fetch the ground truth by going to the line and column specified in the debug info metadata. However, there are might be certain implications that need to be considered. For instance, it can have performance implications if the file system is slow or if there are a large number of files to be opened. Memory implications may arise if the files are large or if multiple files need to be opened simultaneously.
Not much aware about the security implications about that, but it might be the case when files are untrusted in any way?

I guess falling back to the current behavior is a reasonable option. However, it would be better to generate as much of the source-level expression as possible, even if some parts are missing. This will still provide some context for the programmer to understand the operation being performed. What do you think?