If you intend to manage source locations for an optimizing compiler, you need a much richer data structure than Clang & LLVM currently support. The compiler company I worked for went down this road ~25 years ago. It’s actually quite straightforward to build, albeit potentially expensive in terms of memory and compilation time. But the result is that you can support accurate debugging of fully optimized code. The model we used didn’t even try to represent any source location smaller than a line of code, because we had far less memory available back then, but the approach applies equally well to any granularity of source location you choose to implement. That said:
- Any given instruction may “originate from” an arbitrary number of distinct source locations (call them LOCs). Consider inline expansion of a method, or macro expansion as two obvious examples; there are many more. The resulting instructions “originate from” the location of the call (or macro) and also from the particular line(s) in the method (macro) that generated them. So we need to be able to map one instruction/program-address to many LOCs.
- Any given line of code may be implemented by a bunch of instructions that may or may not be contiguous. So we need to be able to map a single LOC to multiple instructions, ranges of instructions, multiple non-contiguous ranges of instructions, etc.
- Depending on later optimization, there could legitimately be multiple distinct instructions each of which is “a beginning” of that LOC. More multiplicity.
- Depending on later optimization, there could be multiple distinct instances of a single LOC. Consider a method that is inline-expanded at K call sites. Each LOC within that method has (at least) K instances. Each maps back to the same LOC, but the instances aren’t the same as each other. This is an important and possibly non-obvious part of the design. Loop unrolling is another example; hoisting an invariant out of a loop is yet another.
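A minimal sketch of the data model those bullets imply (all names here are hypothetical, not from any real compiler): each instruction carries a *set* of LOC-instance tags, and an instance pairs a source line with a replication ID so that the same line can exist as several distinct instances.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LocInstance:
    """One instance of one source line; `instance` distinguishes
    replications (inlining, unrolling, ...) of the same line."""
    file: str
    line: int
    instance: int = 0

@dataclass
class Instruction:
    opcode: str
    locs: frozenset  # every LOC instance this instruction is part of

# One instruction can originate from many LOCs, e.g. an inlined call:
call_site = LocInstance("main.c", 10)
callee_line = LocInstance("util.c", 42, instance=1)
add = Instruction("add", frozenset({call_site, callee_line}))
# ...and, conversely, one LOC instance is normally shared by many
# (possibly non-contiguous) instructions.
```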
We found by experience that attempting to maintain beginnings (and endings) of LOC instances through the entire compiler was too hard; we always messed it up. So what we did instead was to tag each syntax tree (or Optimizer IR node, or instruction, etc.) with all the LOC instances that it’s part of. We did some quite simple LOC management (see below) throughout IR generation, optimization, and code generation, then computed the beginnings, ends, and ranges of LOC instances at the end of the compilation process via the obvious flow-based algorithms.
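For straight-line code, the “obvious flow-based algorithm” can be as little as one pass over the final instruction order, collecting maximal contiguous runs per LOC instance. A sketch under simplifying assumptions (single region, no control flow; a production version would run this per basic block and merge across the flow graph):

```python
def compute_ranges(instrs):
    """instrs: ordered (address, set_of_loc_tags) pairs for one
    straight-line region. Returns {loc: [(start, end), ...]} of
    maximal contiguous runs; the beginning of a LOC instance is the
    start of its first run, its end is the end of its last run."""
    ranges = {}    # loc -> list of (start_addr, end_addr) runs
    last_idx = {}  # loc -> index of the last instruction tagged with it
    for i, (addr, locs) in enumerate(instrs):
        for loc in locs:
            runs = ranges.setdefault(loc, [])
            if last_idx.get(loc) == i - 1:
                runs[-1] = (runs[-1][0], addr)  # extend the current run
            else:
                runs.append((addr, addr))       # start a new run
            last_idx[loc] = i
    return ranges

# Line A's code interleaved with line B's after scheduling:
r = compute_ranges([(0, {"A"}), (4, {"A", "B"}), (8, {"B"}), (12, {"A"})])
# r["A"] == [(0, 4), (12, 12)] -> line A begins at 0 and ends at 12,
# in two non-contiguous pieces; r["B"] == [(4, 8)]
```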
Tagging everything with all its LOC info and recovering beginnings/ends after all transformations are complete makes managing source location data inside the compiler essentially trivial. It devolves into five simple cases:
- When you delete code/IR/instructions (I’ll call all of these code from now on), delete the LOC data along with them. With a reasonable implementation this comes for free when deleting the code.
- When moving code, move the LOC data along with it/them. Also a freebie.
- When combining code, union the LOC info. This sometimes happens in combination with the next case.
- When creating new code (simple case) copy the LOC info from the origin from which it was created onto the new code.
- When replicating code (more complex case) — inline expansion, loop unrolling, macro expansion… — create a new instance of each LOC for each replication of the origin. New instances are treated as distinct lines of code for the purpose of computing begin/end locations. When setting a breakpoint at the beginning of line N, for example, you would get breakpoint(s) at each beginning of each distinct instance of line N. For fancier debug support, one could set breakpoints at particular instances of line N without stopping at other instances, and so on.
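The five cases reduce to small set operations on the tags. A sketch, with a hypothetical `Node` IR type and a hypothetical `fresh_instance` helper (tags here are plain `(file, line, instance)` tuples):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str
    locs: set = field(default_factory=set)  # LOC-instance tags

_counter = 0
def fresh_instance(loc):
    """Hypothetical helper: mint a brand-new instance of the same
    source line, distinct from every existing instance."""
    global _counter
    _counter += 1
    return (loc[0], loc[1], _counter)

# Case 1, delete: tags are fields of the node, so they die with it.
# Case 2, move:   tags are fields of the node, so they travel with it.

def combine(a, b):            # case 3: union the LOC info
    return Node("combined", a.locs | b.locs)

def create_from(origin, op):  # case 4: copy LOC info from the origin
    return Node(op, set(origin.locs))

def replicate(node):          # case 5: new instance per replication
    return Node(node.op, {fresh_instance(loc) for loc in node.locs})
```

The point of case 5 is that the replicated node still maps back to the same source line, but its tags compare unequal to the original’s, so the later begin/end computation treats each copy as its own instance.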
The pain of implementing such an approach in the compiler is that somebody has to visit more or less the entire code base to ensure that each thing that transforms code applies the correct choice from the five cases above. The good news is that each such decision is both local and essentially trivial: we finished the first cut at the update in our compilers’ code (~400KLOC at the time) in about two person-weeks of effort. Followed, of course, by testing & debugging time to smoke out the incorrect or missing decisions.
The result of this approach is that you get quality source location information for minimum-to-low optimization levels at reasonable expense in memory & compile time and with only modest expansion of the size of the debug info. At higher optimization levels you continue to get fully accurate source location information (modulo bugs, as with all things). Depending on how aggressive your optimizations are, that location information may be anything from somewhat non-obvious to utterly brain-bending. The increase in size for the debug info can be large, but is roughly proportional to the degree to which code has been transformed.
You can propagate LOCs through essentially arbitrary optimizations: loop unrolling, vectorization, loop pipelining, cache blocking, or whatever. You can even track source locations through software-pipelined, unrolled, and otherwise optimized VLIW code, if that happens to be the problem at hand. As an added benefit, it supports things like providing auditors looking at safety-critical code with the exact origin(s) of every instruction in the program. It even helps compiler implementers understand what their optimizer has done to the program.
Obviously, your debugger has to understand the more complicated model of source locations. For example, asking for a breakpoint at the beginning of a source line may require setting either one or many breakpoints in the program. Similarly, if you stop at an arbitrary instruction and ask the debugger where you are in the source, you could wind up with something like “Stopped at the beginning of line-K-instance-X, somewhere during execution of line-L-instance-1, and just after the end of line-A”. The most useful breakpoint operations were: “stop at the beginning of (each instance of) a line” and “stop just after the last instruction of (each instance of) a line”, because those are the points where either nothing for that line has yet executed or where everything for that line is complete. Obviously, arbitrary other stuff could be in flight there as well, including prior or subsequent lines, or whatever.
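On the debugger side, “break at the beginning of line N” then just becomes one breakpoint per instance of that line. A sketch, assuming the hypothetical per-instance ranges format described above (`{(file, line, instance): [(start, end), ...]}`), and simplifying by taking the start of each instance’s first range as its single beginning:

```python
def breakpoints_at_line(ranges, line):
    """Return one breakpoint address per instance of `line`, i.e.
    one per inline expansion / unrolled copy of that line."""
    return sorted(runs[0][0]
                  for (f, ln, inst), runs in ranges.items()
                  if ln == line)

# Line 42 inlined at two call sites -> two breakpoints:
r = {("util.c", 42, 1): [(0x100, 0x10C)],
     ("util.c", 42, 2): [(0x2A0, 0x2A8), (0x2C0, 0x2C0)]}
breakpoints_at_line(r, 42)  # -> [0x100, 0x2A0]
```

A fancier debugger would filter on `inst` as well, to stop at one particular instance without stopping at the others.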
We also implemented a way to accurately explain to the debugger how to find the value of variables post optimization, but that’d need another big wall of text so I’ll skip it for now.
Sorry to go on so long. My primary message is that propagating rich source location information through a compiler is a problem that is conceptually easy, is tedious but feasible to implement, and that has known proven solutions. Further details on request, if anyone is interested.