Call for testing -- new variable location tracking solution

Hi llvm-dev@,

tl;dr: if you build a large optimised x86 codebase with clang and want
better variable locations when debugging, please consider testing the
"-Xclang -fexperimental-debug-variable-locations" command line switch
for clang.
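
For example, an illustrative invocation (any ordinary optimised-debug
build flags apply alongside it):

$ clang -O2 -g -Xclang -fexperimental-debug-variable-locations -c foo.c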

I've been putting together a new mechanism for tracking variable
locations after instruction selection [0-2], and it's now reaching
maturity. From r8612417e5a5 it's able to build stage2 clang, various
benchmarks and a popular game engine, delivering improved variable
location coverage as described in [2]. However, I fear the
compile-time performance characteristics are going to be substantially
different. I've tried my best to keep things fast, but when there's
more data being produced there'll inevitably be some kind of slowdown,
and it's hard to determine which workloads might be affected. Thus: if
you'd like to lend a hand, please consider running a build with this
flag, and see whether there's a disproportionate compile-time
performance drop. CTMark times show a 1% to 5% performance cost right
now [3], higher for -O0 [4] simply because I haven't focused on -O0
(yet).

There are a few more coverage-improving patches (such as D104519) that
are yet to land / be published, but what's in-tree already gives good
improvements versus DBG_VALUE-based tracking. Right now only x86 works
really well -- I've made a start on AArch64 to ease adoption [5], but
it's not all there yet.

The overall mechanism involves annotating instructions with debugging
information and referring to variable values by the instructions that
compute them, rather than attaching locations to registers. As a
consequence, additional book-keeping is needed in target-specific
optimisations. Adding that support is probably more of a marathon than
a sprint; documenting exactly what needs to be done is in the "TODO"
column too.
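
To illustrate, here's a rough sketch (not literal MIR from a real
compilation). Today, a variable's location names a register, which
every optimisation that touches the register must remember to update:

  $rax = MOV64ri 42
  DBG_VALUE $rax, $noreg, !11, !DIExpression()

With instruction referencing, the defining instruction is numbered and
the variable refers to (instruction, operand); a concrete location is
computed from that after optimisation:

  $rax = MOV64ri 42, debug-instr-number 1
  DBG_INSTR_REF 1, 0, !11, !DIExpression()

(Here !11 stands in for the variable's DILocalVariable metadata.)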

We could consider turning this feature on by default for major targets
sometime around llvm-14; I don't have a plan for that yet, but
previous feedback has been positive and the coverage improvements are
encouraging.

[0] [llvm-dev] [RFC] DebugInfo: A different way of specifying variable locations post-isel
[1] [llvm-dev] [RFC] A value-tracking LiveDebugValues implementation
[2] [llvm-dev] [DebugInfo] A value-tracking variable location update
[3] LLVM Compile-Time Tracker
[4] LLVM Compile-Time Tracker
[5] See stack at D104519 [DebugInfo][InstrRef] Track subregisters in stack slots

Hi Jeremy,

Thanks for doing this!

I was playing with this a bit, using gdb-7.11 as a testbed (compiled for x86_64 with -g -O2, using llvm trunk on RHEL7).
(NOTE: the same code base was used for building both gdb-built-with-llvm-trunk and gdb-built-with-new-varloc-enabled.)

The final locstats numbers look promising:

$ llvm-locstats gdb-built-with-llvm-trunk

Hi Djordje,

It's great to see that improvement in coverage, particularly when not
all the improvements have landed yet. Note that a substantial portion
of the new locations might be entry values: knowing the desired values
of variables lets us better identify when entry values can be used.
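
(For reference when reading the stats: an entry value describes a
variable in terms of the value a register had on entry to the
function, and llvm-dwarfdump renders it along the lines of

  DW_OP_entry_value(DW_OP_reg5 RDI)

with the debugger recovering the value by unwinding to the call site.)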

The thing I am concerned about is the number of 0% covered variables. It has increased when using this new feature, and I was wondering if there is any reason for that. In addition, the number of debug variables generated is different.

Several things could be going on here; my bet would be that it's
related to D95617 -- unfortunately, the different ways that we
represent variables that have been optimised out can affect the DWARF
produced. David points out in this comment [0] that the fix isn't
complete (non-inline functions with abstract origins can gain/lose
variables depending on our internal representation); I took a look,
but fixing it wasn't straightforward (I think subprograms can
acquire abstract origins after they're dwarf-emitted?).

That could explain why there are more zero-percent variables in the
output (different representation leads to different variable count);
however the total number of variables _decreases_, which I wouldn't
expect. If you can get a small reproducer for that I'd really like to
look at it.

[0] D95617 [DWARF] Inlined variables with no location should not have a DW_TAG_variable

Hi Jeremy,

That could explain why there are more zero-percent variables in the
output (different representation leads to different variable count);
however the total number of variables decreases, which I wouldn’t
expect. If you can get a small reproducer for that I’d really like to
look at it.

I’ll try to experiment with this next week. Also, the amount of call_site debugging information differs, which may indicate that the number of call instructions is different.

Best,
Djordje

(I think subprograms can
acquire abstract origins after they're dwarf-emitted?).

I'd hope not... The out-of-line instance and all inlined instances
should be pointing to the same abstract origin. If the out-of-line
instance gets emitted first, *maybe* what you suggest can happen,
because we haven't yet noticed there are inline instances? But
that's not how the DWARF is supposed to look.
--paulr

(I think subprograms can
acquire abstract origins after they’re dwarf-emitted?).

I’d hope not… The out-of-line instance and all inlined instances
should be pointing to the same abstract origin. If the out-of-line
instance gets emitted first, maybe what you suggest can happen,
because we haven’t yet noticed there are inline instances?

Yeah, if we’re talking order of construction, that can certainly happen: when the concrete subprogram is created we don’t know whether it’ll need an abstract definition (because we don’t know if there are any inlined instances yet), so we delay filling out the attributes that would go on the abstract definition, wait until the end, and then either fill them out on the concrete subprogram or, if we’ve encountered/created an abstract definition by that point, emit the DW_AT_abstract_origin on the concrete definition instead.
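
To sketch the intended shape -- an illustrative DIE tree, not dumped
from a real binary:

  DW_TAG_subprogram             (abstract definition)
    DW_AT_name ("f")
    ... parameter and variable DIEs live here ...

  DW_TAG_subprogram             (out-of-line, concrete definition)
    DW_AT_abstract_origin       (-> the abstract definition)
    DW_AT_low_pc / DW_AT_high_pc

  DW_TAG_inlined_subroutine     (each inlined instance)
    DW_AT_abstract_origin       (-> the abstract definition)

Both the concrete and inlined instances point at the single abstract
definition, which owns the name, type and variable DIEs.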

Hi!

Just wanted to say I’ve tried building Chrome with this (with -g2 -O2) on Linux and didn’t see a noticeable difference in compile time.
Unfortunately running llvm-locstats fails on the chrome binary, so no coverage stats.

-Amy

Hi Amy,

Just wanted to say I've tried building Chrome with this (with -g2 -O2) on Linux and didn't see a noticeable difference in compile time.

Excellent -- that's really reassuring given how large Chrome is.
Thanks for trying a build with the flag.

Unfortunately running llvm-locstats fails on the chrome binary, so no coverage stats.

That's a shame; however, I'm confident there won't be a coverage
regression -- detecting any performance cliff-edges was my main aim
here.

(Sorry for the divergence)

Hi Amy,

Unfortunately running llvm-locstats fails on the chrome binary, so no coverage stats.

Can you please file a bug for this, so we can take a look? I am wondering whether the problem is on the llvm-locstats side or on the llvm-dwarfdump --statistics side.
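
(One way to narrow it down: run the underlying command directly, e.g.

$ llvm-dwarfdump --statistics chrome

and see whether the statistics collection itself fails, or whether it
succeeds and llvm-locstats trips over its output.)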

Best,
Djordje

Just FYI.

I ran some tests using this flag on a large set of tests inside Google and saw only very slight degradations (~1%) in memory usage and wall clock time.

Focusing on optimized builds (-O3) with debug information on a representative set of medium-to-large programs:

The ratio sum_all_params(#bytes in parent scope covered by DW_AT_location) / sum_all_params(#bytes in parent scope) improved (increased) by 6-8%.

The ratio sum_all_variables(#bytes in parent scope covered by DW_AT_location) / sum_all_variables(#bytes in parent scope) mostly improved (increased) by 3-4%. One extremely large program had a 22% improvement; however, another very large program had a 30% regression.

The ratio sum_all_local_vars(#bytes in parent scope covered by DW_AT_location) / sum_all_local_vars(#bytes in parent scope) also showed regressions in those two large programs.

The ratio sum_all_params(#bytes in parent scope covered by DW_OP_entry_value) / sum_all_params(#bytes in parent scope) showed 2-4% improvements across all the test programs.

Unsurprisingly, the size of the .debug_loclists section increased by 5-11%.
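
For concreteness, each ratio above divides two counters taken from the llvm-dwarfdump --statistics JSON output. A minimal sketch of the computation in Python, assuming the key names quoted above (they vary between LLVM versions):

  import json, sys

  # Load the JSON produced by: llvm-dwarfdump --statistics <binary>
  stats = json.load(open(sys.argv[1]))

  covered = stats["sum_all_params(#bytes in parent scope covered by DW_AT_location)"]
  total = stats["sum_all_params(#bytes in parent scope)"]
  print("param location coverage: %.1f%%" % (100.0 * covered / total))

The same computation with the sum_all_variables or sum_all_local_vars keys gives the other ratios.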

Hi Caroline,

I ran some tests using this flag on a large set of tests inside Google and saw only very slight degradations (~1%) in memory usage and wall clock time.

That's great to hear, and highly encouraging that there's no
significant increase in resource usage, thanks for testing out the
flag,

The ratio sum_all_params(#bytes in parent scope covered by DW_AT_location) / sum_all_params(#bytes in parent scope) improved (increased) by 6-8%.

The ratio sum_all_variables(#bytes in parent scope covered by DW_AT_location) / sum_all_variables(#bytes in parent scope) *mostly* improved (increased) by 3-4%. One extremely large program had a 22% improvement; however, another very large program had a 30% regression.

The ratio sum_all_local_vars(#bytes in parent scope covered by DW_AT_location) / sum_all_local_vars(#bytes in parent scope) also showed regressions in those two large programs.

The ratio sum_all_params(#bytes in parent scope covered by DW_OP_entry_value) / sum_all_params(#bytes in parent scope) showed 2-4% improvements across all the test programs.

These improvements are encouraging too; the magnitude of the outliers
(22% up, 30% down) is surprising, especially on large codebases. I'm
aware of a few issues that lose coverage versus the current
location-based tracking: storing subregisters to the stack, and
register spills that are fused into other instructions (I have
patches, although they've yet to land). It's possible that those will
rectify some of the coverage regression.

I don't have a good explanation for the 22% increase; it does sound
too good to be true. If you have any spare cycles, looking at the
coverage gains and judging whether they're false locations or true
gains would be appreciated.