Call for testing -- non-instruction debug-info

Hi all,

Please help us eliminate debug intrinsics from LLVM by testing our experimental new debug-info code on your codebases. We’re aiming to have identical outputs for all passes and targets, so lending us some coverage will greatly help.

Over in this thread [0] we’ve been producing an alternative representation of debug-info intrinsics in LLVM, with commits tagged “RemoveDIs”. I’m happy to say that from commit f85a38e21cf, what’s in-tree is able to produce identical [1] binaries of clang 3.4 and stage2 clang in dbg.value mode and RemoveDIs mode. It also makes -g compile-times faster, usually in proportion to the amount of inlining done. Writing optimisation passes will become simpler too, as debug instructions won’t appear in your data structures.

However, deploying this is not simply a matter of making all the lit tests pass: we don’t have comprehensive testing of the debug-info behaviour of all passes. Instead, we’ve been building a number of large programs and checking that the final binary is identical. That is a good assurance of correctness, but it only covers the optimisation passes those programs happen to exercise. Thus, it would be fantastic if developers who regularly compile large codebases could try building them with and without the new debug-info mode, compare whether the binary output is the same, and report any crashes or differences. That will flush out any undiscovered edge cases. Most of the reduction work can be done by llvm-reduce. Comprehensive instructions on configuration and on reducing differences can be found in the gist script at [2]; note that testing this requires flipping a CMake flag and rebuilding. Please report any issues found to the GitHub ticket at [3].
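As a rough illustration of the "is the output identical" check, here is a minimal Python sketch that compares two build trees file by file using SHA-256 digests. It is not the gist script at [2]; the function names and the idea of keeping the two builds in separate directories are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collect_files(root):
    """Map each file's path relative to root onto its full path."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): p for p in root.rglob("*") if p.is_file()}


def differing_outputs(baseline_dir, removedis_dir):
    """List relative paths that differ in content or exist in only one tree.

    baseline_dir and removedis_dir are hypothetical names for the output
    directories of the dbg.value build and the RemoveDIs build.
    """
    base = collect_files(baseline_dir)
    new = collect_files(removedis_dir)
    differing = []
    for name in sorted(base.keys() | new.keys()):
        if name not in base or name not in new:
            differing.append(name)
        elif file_digest(base[name]) != file_digest(new[name]):
            differing.append(name)
    return differing
```

Any object file this reports is a candidate for reduction with llvm-reduce, following the instructions in [2].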

There are a number of limitations and pieces of remaining work documented at [0], but I would expect all targets to not crash or experience a noticeable compile-time performance regression in the new mode. Targets using SelectionDAG should have debug-info that is just as good as before, and if you apply the bug-for-bug fixes and workarounds in the patches in [4], they should produce an identical output binary with and without RemoveDIs. We haven’t worked on GlobalISel yet, but there’s a workaround for it in [4] too.

In an ideal world, we’d have all of this sorted in the next month, with a view to switching this mode on-by-default for LLVM-18. I’ve written more details on that plan into the other discourse thread about what we’re doing.

Many thanks to @OCHyams and @StephenTozer who’ve been working on this with me for a long while.

[0] [RFC] Instruction API changes needed to eliminate debug intrinsics from IR - #10 by jmorse
[1] “Identical” if you apply the equivalence patches at [4], which makes the new RemoveDIs code match existing dbg.value behaviours bug-for-bug, which we’d rather not commit.
[2] Example script for reducing problems with RemoveDIs. · GitHub
[3] [DebugInfo][RemoveDIs] Umbrella ticket for problems with "new" non-instr debug-info · Issue #74735 · llvm/llvm-project · GitHub
[4] GitHub - jmorse/llvm-project at removedis-comparison-patches

Hi! I’ll be happy to run this on our engine(s), but the instructions for comparing are heavily geared towards Unix-based systems. Would tests on Windows with PDB be useful here, and how would we compare the output for identity in that case?

That’d be fantastic, cheers! The three key things needed for this testing are:

  • Clang built with the LLVM_EXPERIMENTAL_DEBUGINFO_ITERATORS cmake flag set to “On”, which installs an extra bit in BasicBlock::iterator
  • DebugInfo’s “Assignment tracking” disabled with -Xclang -fexperimental-assignment-tracking=disabled (it’s not ready yet)
  • The new mode turned on with -mllvm --experimental-debuginfo-iterators=true
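To make the last two points concrete, here is a small Python sketch that assembles the two compile commands to compare. The flags are the ones listed above; the `-g -O2 -c` options, the `clang` executable name, and passing `=false` for the baseline are assumptions made for this example, so adapt them to your own build.

```python
def clang_args(source, output, removedis):
    """Build a clang command line for one side of the comparison.

    removedis=True turns the new debug-info mode on; removedis=False
    passes the same option with "false" (an assumption here) so that the
    two commands differ in exactly one argument.
    """
    mode = "true" if removedis else "false"
    return [
        "clang", "-g", "-O2", "-c", source, "-o", output,
        # Assignment tracking is disabled in both builds, as recommended above.
        "-Xclang", "-fexperimental-assignment-tracking=disabled",
        "-mllvm", f"--experimental-debuginfo-iterators={mode}",
    ]


# Hypothetical file names, for illustration only.
baseline = clang_args("foo.c", "foo.dbgvalue.o", removedis=False)
candidate = clang_args("foo.c", "foo.removedis.o", removedis=True)
```

Each list can be passed to `subprocess.run` as-is. Both commands need a clang built with LLVM_EXPERIMENTAL_DEBUGINFO_ITERATORS enabled.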

In the last day I’ve found a rare crash that seems to be fixed by one of the equivalence patches; see the branch on my GitHub repo linked above in [4]. This is unfortunate, and we’ll try to narrow it down and get the fix into main ASAP.

For equivalence using PDBs, I’ll admit that I’m not a PDB expert in any way, but my confidence that they can be easily compared is low. I understand that a PDB is more of a “database” of debug-info records that gets accumulated across multiple compilations, and that it features timestamps and GUIDs which will get in the way. I imagine the object files are a more stable thing to examine. I took a glance at whether I could coax clang-cl to produce DWARF output and didn’t find a way.

It might also be possible to use llvm-pdbutil to dump the contents of a PDB and diff it, but there might also be spurious differences in there.
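If you try the dump-and-diff route, one way to cope with spurious differences is to mask the volatile fields before diffing. The Python sketch below works on two text dumps that have already been produced. The masking patterns (GUIDs, and `Age`, `Signature` and `Timestamp` fields) are guesses at what might vary between builds, not a description of the real llvm-pdbutil output format, so treat them as placeholders to adjust.

```python
import difflib
import re

# Hypothetical patterns for fields that are likely to vary between builds.
VOLATILE = [
    (re.compile(r"\{?[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}?"), "<GUID>"),
    (re.compile(r"(?i)\b(age|signature|timestamp)\s*[:=]\s*\S+"), r"\1: <MASKED>"),
]


def normalise(text):
    """Split a dump into lines and replace volatile fields with fixed tokens."""
    lines = []
    for line in text.splitlines():
        for pattern, replacement in VOLATILE:
            line = pattern.sub(replacement, line)
        lines.append(line)
    return lines


def diff_dumps(baseline_text, removedis_text):
    """Return unified-diff lines between two dumps after masking volatile fields."""
    return list(
        difflib.unified_diff(
            normalise(baseline_text),
            normalise(removedis_text),
            "dbg.value",
            "RemoveDIs",
            lineterm="",
        )
    )
```

An empty result means the dumps agree once the masked fields are ignored; anything left over is worth a closer look.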

I’d hope that the debug-info-visual-analyzer tool, whatever it ended up being called (I’m on PTO and can’t really look right now) would be able to compare two Windows cases. I’m fairly sure it understands PDBs.

We might be able to compile a thousand or so Fedora packages with this turned on, but is there still value in testing if we can’t do the before/after binary comparison?

Paul wrote:

I’d hope that the debug-info-visual-analyzer tool, whatever it ended up being called (I’m on PTO and can’t really look right now) would be able to compare two Windows cases. I’m fairly sure it understands PDBs.

Good point – llvm-debuginfo-analyzer prints the logical meaning of debug-info instead of how it’s represented in any particular format, so it should be a stable way of comparing the file contents. However, I believe it doesn’t support the presentation of variable locations right now – CC @CarlosAlbertoEnciso is that still the case? Variable locations are the primary thing that debug-intrinsics affect.

Tom wrote:

We might be able to compile a thousand or so Fedora packages with this turned on, but is there still value in testing if we can’t do the before/after binary comparison?

IMO yes: there’s still a risk of debug-info being attached to PHI instructions in various circumstances, which eventually causes verifier errors. Getting more confidence that there are no crash-causing issues (or detecting them if there are any) would be valuable.

For the record, the most common issues that we’ve been addressing are:

  • The order of debug-info records changing, causing various DWARF records to change order too (this can eventually re-order variable assignments, which is bad),
  • Variable locations appearing one instruction later (or earlier) than expected,
  • Variable locations disappearing.

Realistically, aside from any crashes, these failure modes are low-impact enough to be a tolerable risk. My major worry is that there are likely optimisation passes out there that aren’t well covered by clang stage-2 builds and the sort of game code we test on, and those risk crashes.

(NB, I would now recommend testing main from 4b64138ba4 onwards as we’ve brought in some further fixes)

The llvm-debuginfo-analyzer tool fully supports variable locations for all the supported debug-info formats (DWARF, CodeView) and binary file formats (ELF, COFF, PDB and Mach-O). Some variable locations are represented at a high level of detail and others are printed as DWARF/CodeView constants.

This is extracted from the official documentation:

ATTRIBUTES

The following options enable attributes given for the printed elements. The attributes are divided in categories based on the type of data being added, such as: internal offsets in the binary file, location descriptors, register names, user source filenames, additional element transformations, toolchain name, binary file format, etc.

--attribute=<value[,value,...]>

With value being one of the options in the following lists.

The following attributes describe the debug location information for a symbol or scope. It includes the symbol percentage coverage and any gaps within the location layout; ranges determining the code sections attached to a function. When descriptors are used, the target processor registers are displayed.

=coverage: Symbol location coverage.
=gaps: Missing debug location (gaps).
=location: Symbol debug location.
=range: Debug location ranges.
=register: Processor register names.
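As a sketch of how these attributes might be combined for a before/after comparison, the Python helper below builds a command line that requests the location-related attributes listed above. The `--print=symbols` option and the executable name are assumptions based on the tool's documentation, so check them against the version you have installed.

```python
# Location-related attribute values quoted from the documentation above.
KNOWN_LOCATION_ATTRIBUTES = ("coverage", "gaps", "location", "range", "register")


def analyzer_args(binary, attributes=KNOWN_LOCATION_ATTRIBUTES):
    """Build an llvm-debuginfo-analyzer command line for one binary.

    Raises ValueError for attribute names not in the list above, so a
    typo does not silently produce a less detailed report.
    """
    unknown = [name for name in attributes if name not in KNOWN_LOCATION_ATTRIBUTES]
    if unknown:
        raise ValueError("unknown attribute(s): " + ", ".join(unknown))
    return [
        "llvm-debuginfo-analyzer",
        "--attribute=" + ",".join(attributes),
        "--print=symbols",  # assumption: print symbols so their locations are shown
        binary,
    ]
```

Running the resulting command on the dbg.value and RemoveDIs builds of the same file and diffing the two text outputs should highlight changes in variable locations.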