[RFC] Require real or annotated source locations on all instructions

Currently, we (Sony) are undertaking work to prevent unnecessarily-dropped source locations in LLVM by adding a new rule for updating debug info in the compiler. The rule adds a requirement that each LLVM instruction created must either be assigned a source location (almost always copying from one or more existing instructions), or else it must be assigned an annotation that categorises the reason that no source location is appropriate for that instruction.
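To make the rule concrete, here is a minimal sketch of the two cases as they might appear inside a transform. `Instruction::setDebugLoc()`/`getDebugLoc()` are existing LLVM APIs; the annotation constructor shown (`DebugLoc::getCompilerGenerated()`) is an illustrative placeholder and may not match the exact names used by the patches.

```cpp
#include "llvm/IR/DebugLoc.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Case 1: the new instruction replaces an existing one, so it copies that
// instruction's source location.
void replaceMulByTwoWithAdd(BinaryOperator *Mul) {
  auto *Add = BinaryOperator::CreateAdd(Mul->getOperand(0), Mul->getOperand(0),
                                        "", Mul);
  Add->setDebugLoc(Mul->getDebugLoc());
  Mul->replaceAllUsesWith(Add);
  Mul->eraseFromParent();
}

// Case 2: the new instruction corresponds to no user-written source code, so
// rather than being left with an empty location it carries an annotation
// explaining why. DebugLoc::getCompilerGenerated() is a placeholder name.
void insertCompilerGeneratedAdd(Instruction *InsertPt, Value *LHS, Value *RHS) {
  auto *Add = BinaryOperator::CreateAdd(LHS, RHS, "gen.off", InsertPt);
  Add->setDebugLoc(DebugLoc::getCompilerGenerated());
}
```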

This rule can be checked using the Debugify tool, allowing us to detect missing source locations without the false positives that Debugify currently produces. The benefit is twofold: we can find and fix unintentionally-missing source locations, and every intentionally-missing source location has its reason recorded, for example instructions that have had their locations dropped following the rules for “when to drop an instruction location”, instructions that are compiler-generated, or instructions whose locations we simply cannot or do not know how to represent.
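For context, one of those “when to drop an instruction location” cases looks roughly like the sketch below; `Instruction::moveBefore()` and `Instruction::dropLocation()` are existing APIs, and under the new rule the dropped location would additionally carry an annotation recording that it was dropped deliberately.

```cpp
#include "llvm/IR/Instruction.h"

using namespace llvm;

// When an instruction is hoisted/speculated into a block it did not
// originally execute in, keeping its old location produces misleading
// stepping and profiles, so the update guidance is to drop the location.
void speculateIntoPredecessor(Instruction *I, Instruction *InsertPt) {
  I->moveBefore(InsertPt);
  I->dropLocation();
}
```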

This post is firstly to draw attention to the change to the debug info update rules, and secondly to seek approval/support for adding verification of this rule to LLVM’s CI. Currently we have a script that tests for missing source locations in LLVM, using Debugify and targeting the llvm-test-suite’s CTMark suite. We would like to turn this into an LLVM Labs buildbot, run nightly: new source location errors appear infrequently, and the Debugify reports can provide a stack trace of LLVM at the point(s) where missing source locations appear, so even with a nightly cadence it would be quite easy to identify the commit(s) responsible for any new errors and swiftly fix forward.

In summary, we are looking to add a new form of debug-info-based correctness requirement to the compiler, relating to the completeness of source location information. In future, we may add more debug info requirements; for example, a “temporary” annotation conveys that the annotated instruction has no location because it is transient and will not appear in the generated object code, a property that could be verified at the end of compilation as a form of correctness testing. This all builds towards our long-term goal of more thorough and robust testing of debug info quality in LLVM, to prevent degradation over time and simplify ongoing maintenance.
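As a rough illustration of what that end-of-compilation check could look like (the `DebugLoc::isTemporary()` query below is hypothetical; only the overall shape of the walk is meant):

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Walk every instruction at the end of the pipeline; anything still carrying
// the "temporary" annotation should have been deleted by now, so report it.
bool verifyNoTemporaryLocations(const Module &M) {
  bool Ok = true;
  for (const Function &F : M)
    for (const BasicBlock &BB : F)
      for (const Instruction &I : BB)
        if (I.getDebugLoc().isTemporary()) { // hypothetical annotation query
          errs() << "temporary-annotated instruction survived: " << I << "\n";
          Ok = false;
        }
  return Ok;
}
```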

This is a successor to a previous proposal; noting for the participants in the discussion of that proposal: @dblaikie, I think this matches your suggestion of using Debugify for regression testing, and @nikic, this solves the InstCombine issue by moving the verification into an optional check that runs in between optimization passes, so there’s no issue with deferring the application of a source location to an instruction.

10 Likes

Requiring source locations for all instructions can significantly improve our ability to map hardware-sampled profiles (for AutoFDO) and callstack profiles (for MemProf) during profile use. For example, our change in #129960 improved AutoFDO performance on a large internal workload by 0.5% while also improving MemProf coverage by ~5%. This change preserves locations where the merge APIs previously dropped line information.
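For readers less familiar with the merge APIs: when a transform folds two instructions into one (for example, sinking a common instruction out of both arms of an if/else), the usual pattern is roughly the sketch below, using the existing `Instruction::applyMergedLocation()` (which calls `DILocation::getMergedLocation()` underneath). Historically the merged location could discard the line number when the two inputs disagreed, which is the behaviour the change above improves on; the exact merge semantics have evolved over time, so treat the comments as approximate.

```cpp
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Fold 'Removed' into the equivalent instruction 'Kept'. Rather than picking
// one location arbitrarily, the two locations are merged; if the merge ends
// up with no usable line, profile samples on either original line can no
// longer be attributed to the surviving instruction.
void foldInto(Instruction *Kept, Instruction *Removed) {
  Kept->applyMergedLocation(Kept->getDebugLoc(), Removed->getDebugLoc());
  Removed->replaceAllUsesWith(Kept);
  Removed->eraseFromParent();
}
```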

So we believe that addressing the gap where source locations are lost altogether has the potential to improve PGO quality even more. Along the lines of the future requirements that you mention, we are interested in synthetic anchors which profile mapping can use when a source location is unknown (or cannot be known). Speaking on behalf of the compiler opt team at Google, we are supportive of the direction. In the short term, it makes debug-info-based PGO more reliable, and in the longer term we can build on it to improve coverage.

2 Likes

Can you clarify the plan for enforcing the new rule?

Is the transition plan that 100% coverage is only guaranteed for a single benchmark that works right now, and the contract is that we don’t regress that benchmark?

1 Like

More or less exactly that, yes - currently all of the projects in the llvm-test-suite CTMark suite compile at -O2 -g without any missing locations (following a series of bug fixes detected by running Debugify against that suite), and the goal would be to treat new missing source locations in that configuration as regressions.
In future we could broaden coverage by testing more optimization passes (e.g. -O3) or building other projects (e.g. stage 2 clang), but CTMark is fast to build and covers a reasonable breadth of the compiler, hence its use as a starting point.

What targets would this be enforced for? Do you have an idea how often targets other than amd64 (or whatever you’re testing) use different code paths (or entire functions/passes) where such issues arise?

What about CodeGen and beyond? As I understand it, this is only about IR-level debug info, not the resulting MIR (or the intermediate DAGs in the case of SelectionDAG).

Initially x86, though there’s no reason this couldn’t be tested for other targets as well; I haven’t tried compiling for non-x86 targets yet, so it’s not clear how many new bugs might turn up there. Currently, the testing only covers up to CodeGen (i.e. no coverage of ISel or machine passes), but that’s something I’m looking into at the moment - there’s no fundamental limitation, it just requires some extra work.

As an update, we now have a staging buildbot which tests for missing non-annotated debug locations when building CTMark: Buildbot

Currently, the bot is failing due to missing source locations that have accumulated in LLVM since earlier this year; I’ll be working to fix those cases, gradually bringing the bot towards a green build, at which point we will be able to clearly identify any new patches that introduce missing non-annotated source locations. Keeping the bot green is not currently a requirement for new patches, but our intention is to monitor the bot and manually notify patch authors of any such cases as they appear, both to raise awareness of the new rules and to ensure that either we or the original author can provide a fast fix. In the future, if keeping the bot green proves to be simple, then we may open an RFC to move the bot from staging to production, i.e. making it a requirement for new patch submissions.

2 Likes

If it helps, for the similar (I think) effort on profile metadata propagation, @boomanaiden154-1 did this, which is then ingested this way. This helps us keep the bot green and, as we fix areas, avoid regressions. For instance, we fixed all failures in LICM, so now if that turns the bot red, we can meaningfully point the patch author to the regression.

3 Likes

This is amazing progress!

Snehasish made this exact point above back in July, but since then a Google director came up with this pithy phrasing, which you should feel free to steal and repeat as a mantra:

Every missing source location is a missed optimization. (*)

(*) This is only actually true if you are using sample PGO, which I assume most users aren’t. But I know hyperscalers in the LLVM community do (Meta, Google, others).

I think I made this point at EuroLLVM to everyone I met in person, but sometimes repetition helps land the point.

I wish I could say that, for this exact reason, a bunch of folks are going to help stand up this buildbot, but “under-promise and over-deliver” seems like an appropriate adage.

3 Likes

Just a follow-up: after a few more commits, the buildbot has gone green - this should mean that any failed builds from here on represent a bug or missed annotation that is either new, or exposed by a new commit.

1 Like