BoF: Debug info for optimized code.

Debugging optimized code is a topic that generated a lot of
interest at previous dev meetings, but the discussion often fell
short because we have all been busy improving the compile-time
impact of debug info. This year we'd like to make up for that by
inviting you to a dedicated session on just "-O" and "-g"!

We will be giving a quick introduction to summarize the current
state of debug info handling in LLVM and highlight some of the
biggest problem areas. To get a productive discussion going,
here are some points we'd like to invite you to think about ahead
of time:

* What are the major pain points that you or your customers
  experience? What works — what doesn't?

* Do you have out-of-tree patches that improve the optimized
  debugging experience? What are your experiences with them, and
  what are the challenges in upstreaming them?

* We would really like to start tracking debug info quality the
  same way we track code size and performance numbers. What are
  useful metrics for your use-case?

See you all at the dev meeting!

Adrian Prantl
Fred Riss

I would love to be able to attend this BoF, but unfortunately I cannot. I would however like to add a couple of things for consideration during this discussion.

Our custom processor is a VLIW processor, and I think there is a growing trend in CPU design toward VLIW as transistor counts increase but clock speeds do not. When we compile at '-O0' we do not use more than one functional unit at a time (except for the predication unit plus one other, because we have to). This means that each "instruction" corresponds to a particular line of code, and an assembly '.loc' directive can be emitted for that instruction, which produces a reasonably straightforward correspondence between the instruction and the source code location.

Optimisation for scalar processors introduces all sorts of fun - code motion, code duplication, code elimination, you name it - but even so, an instruction generally corresponds to some line of code in the source, even if it is not obvious to the person debugging why the line tracking jumps around all the time.

But VLIW optimisation introduces a new twist: each of the functional units can correspond to a different line of code, which is very hard to follow when debugging optimised code. The more functional units the VLIW architecture supports, the more difficult this becomes. To complicate matters further, predication units can control some functional units and not others, so a single instruction may contain elements that are conditional and others that are not, all while corresponding to several different lines of code.

I do not pretend to know much about DWARF and the representation of debug information, but there appears to be little or no support for the idea that a single "instruction" can correspond to multiple diverse lines in the source file. It would be useful for an assembly '.loc' directive to take a list of source locations, and for the debug metadata to represent this one-to-many mapping ('.debug_{[a]ranges|lines}' etc.). At the moment we are limited to picking one of the several possible locations and hoping that it is helpful to the programmer when debugging their optimised program.

Perhaps this is simply my misunderstanding of the degree of metadata support in DWARF, but I don't see any obvious way of representing the one-to-many mapping in LLVM for bundles, or the selective predication of the elements in a bundle. This kind of bundling happens very late at the MI level, and by then the DebugLoc abstractions are getting quite difficult. For instance, how do I tell LLVM that some of the instructions in a bundle containing a predicate are controlled by that predicate, while other instructions are always active? And how does this carry through to the debug metadata? The DWARF support in LLVM is (thankfully) largely independent of the target, but it still needs to support these abstractions and use-cases.

Another area that is really very painful for programmers, whether on VLIW or scalar targets, is C++ templates. This is an enormously complex area to debug, in particular because the optimised program usually involves extensive inlining and multiple tiers of template specialisations. Even reasoning about the specialisations, whether at '-O0' or fully optimised, is incredibly hard.

I don't yet have out-of-tree contributions to make, because I have to admit I struggle with this too, but I have some ideas which I would like to promote in the future once I have ironed out the details. But it would be good to have the topic of VLIW and C++ templates added to the hot-topics for the BoF.

Thanks and I am looking forward to reading about the outcome from this BoF,

  MartinO

There is. There is even a patch for LLVM:
https://reviews.llvm.org/D16697

-Krzysztof

Thanks Krzysztof, I hadn't noticed this.

The patch refers to the target providing an 'op_index' register, but this seems like something that can only be handled by an integrated assembler. We use an external assembler; are there new directives that we need to support for this? At the moment our assembler cannot accept '.loc' directives between the operations in a VLIW instruction - is this something we need to implement to get this level of VLIW debug support?

Thanks,

  MartinO

That would certainly seem to be the conceptually simplest way to go: treat each operation in the VLIW instruction the same as a scalar instruction, and have “.loc” be able to address them.

You’d then need a way to indicate to the debugger whether it is known correct to execute the operations sequentially instead of in parallel (I believe Itanium specifies that this is always the case), or whether they must be done in parallel to be correct, e.g. “r1<-r2; r2<-r1” to swap values (“true” VLIW).

Even when the architecture says the result is the same as if all operations were done in parallel from the same set of input registers, in practice many or most instructions will work correctly if the operations are executed sequentially, and it would be nice for debuggers to be able to do this for single-stepping – and to be able to execute the operations in source program order, in the event they have been mixed up within the instruction (perhaps because each operation position is limited to certain operation types).
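The difference can be sketched concretely. Here is a minimal Python model (the register names and the swap bundle come from the example above; nothing here is target-specific) of the two execution semantics, showing why the swap is only correct when the bundle executes in parallel:

```python
# Sketch: parallel vs. sequential semantics of one VLIW bundle.
# Each operation writes one destination register from a function of
# the register state. All names and values are illustrative only.

def run_parallel(regs, bundle):
    # All operations read the same initial register state.
    snapshot = dict(regs)
    for dst, fn in bundle:
        regs[dst] = fn(snapshot)
    return regs

def run_sequential(regs, bundle):
    # Each operation sees the writes of the earlier operations.
    for dst, fn in bundle:
        regs[dst] = fn(regs)
    return regs

# "r1 <- r2; r2 <- r1": a register swap, correct only in parallel.
swap = [("r1", lambda r: r["r2"]), ("r2", lambda r: r["r1"])]

print(run_parallel({"r1": 1, "r2": 2}, swap))    # {'r1': 2, 'r2': 1}
print(run_sequential({"r1": 1, "r2": 2}, swap))  # {'r1': 2, 'r2': 2}
```

A debugger that single-steps operations one at a time is effectively choosing the sequential semantics, which is why it needs to know which subsets of operations truly depend on executing together.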

So it would be useful to have some kind of “operation bundle” information, where each bundle is some arbitrary subset of the operations in the instruction, which must be executed in parallel.

Note: this is somewhat the opposite of another notion of "bundle" in VLIW, where it means “everything in a bundle MAY be executed in parallel, but different bundles MUST be executed sequentially”.

Hi Martin,
Yes, the patch only changes the format of the line information; more work will be needed to implement it fully across all tools.
Your concern still stands---more focus on debug information for VLIW architectures would be welcome. I was only pointing out that the capacity of the debug information to carry this data does in fact exist, and that at least one step toward getting it into LLVM has been attempted (the patch was reverted shortly after commit).
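For anyone unfamiliar with the mechanism: the DWARF line-number program (since DWARF 4) addresses an individual VLIW operation by the pair (instruction address, op_index), with maximum_operations_per_instruction in the line-table header giving the bundle width. A rough Python sketch of such a table (all addresses, files, and lines are made up for illustration) shows how one instruction address can map to several source lines:

```python
# Sketch of a DWARF-style line table for a VLIW target. DWARF keys
# each row on (address, op_index), so the operations sharing one
# instruction address can each carry their own source location.
# All numbers here are invented for illustration.

line_table = {
    # (address, op_index): (file, line, column)
    (0x1000, 0): ("a.c", 10, 3),  # first op in the bundle
    (0x1000, 1): ("a.c", 12, 7),  # second op, different source line
    (0x1000, 2): ("a.c", 11, 1),  # third op, yet another line
    (0x1004, 0): ("a.c", 13, 3),  # next instruction word
}

def source_lines_at(address):
    """All source lines covered by the operations at one address."""
    return sorted({line for (addr, _), (_, line, _) in line_table.items()
                   if addr == address})

print(source_lines_at(0x1000))  # [10, 11, 12] - one address, three lines
```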

-Krzysztof


From: "Paul Robinson via llvm-dev" <llvm-dev@lists.llvm.org>
To: llvm-dev@lists.llvm.org
Sent: Thursday, November 10, 2016 4:07:06 PM
Subject: Re: [llvm-dev] BoF: Debug info for optimized code.

At the BoF session, Reid Kleckner wrote a few notes on the whiteboard
and then I got a photo of it before the next session started up. I've
transcribed those notes here, and expanded on them a bit with my own
thoughts. If anybody else has notes/thoughts, please share them.

Whiteboard notes
----------------
Variable info metrics
- Induction variable tracking
- Contrast -O0 vs -O2 variables, breakpoint locations
- Track line info for side effects only
  (semantic stepping) "key" instructions

Unpacking that a bit...

Induction variable tracking
---------------------------
Somebody (Hal?) observed that in counted loops (I = 1 to N) the counter

Yes, it was me. It was pointed out (in conversations after the BoF) that we already have a pass (SROA?) that builds expressions for things, but that's pretty limited. We'll need utilities to build more-general expressions (and maybe some kind of SCEV visitor to build them), and also, for full generality, debug intrinsics that take multiple value operands so that we can write DWARF expressions that refer to multiple values (which is currently not possible).

Thanks again,
Hal


Thanks for writing this up, Paul, and thanks everyone who participated in the session! I found it to be a very productive discussion.

Unpacking that a bit...

Induction variable tracking
---------------------------
Somebody (Hal?) observed that in counted loops (I = 1 to N) the counter

Yes, it was me. It was pointed out (in conversations after the BoF) that we already have some pass (SROA?) that builds expressions for things; but that's pretty limited.

Yes that was SROA. There is also a patch lying around in review limbo that does a similar thing for the type legalizer.

We'll need utilities to build more-general expressions (and maybe some kind of SCEV visitor to build them), and also for full generality, debug intrinsics that take multiple value operands so that we can write DWARF expressions that refer to multiple values (which is currently not possible).

To expand on this, the problem is that we cannot refer to IR from metadata, so in order to support this we could, for example, extend llvm.dbg.value() to accept multiple operands:

; Straw man syntax for calculating *(ptr+ofs).
; The first argument is pushed implicitly, for the subsequent ones we'll need a placeholder.
call @llvm.dbg.value(metadata i64* %ptr, metadata i64 %ofs, i64 0,
                     !DIExpression(DW_OP_LLVM_push_arg, DW_OP_plus, DW_OP_deref))

or something like that.
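To make the straw man concrete, here is a toy evaluator (plain Python; this mirrors only the proposed semantics above, not how LLVM or a real DWARF consumer works) for that expression, where the first operand is pushed implicitly and the hypothetical DW_OP_LLVM_push_arg placeholder pushes the next operand:

```python
# Toy stack-machine evaluation of the straw-man expression
#   DW_OP_LLVM_push_arg, DW_OP_plus, DW_OP_deref
# applied to the operands (%ptr, %ofs), i.e. computing *(ptr + ofs).
# DW_OP_LLVM_push_arg is the hypothetical placeholder from the
# straw-man syntax; memory is a dict standing in for target memory.

def eval_expr(ops, args, memory):
    args = list(args)
    stack = [args.pop(0)]  # the first operand is pushed implicitly
    for op in ops:
        if op == "DW_OP_LLVM_push_arg":
            stack.append(args.pop(0))  # placeholder: push next operand
        elif op == "DW_OP_plus":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "DW_OP_deref":
            stack.append(memory[stack.pop()])
        else:
            raise ValueError(f"unhandled op {op}")
    return stack[-1]

# *(ptr + ofs) with ptr = 0x100, ofs = 8, and memory[0x108] = 42
memory = {0x108: 42}
expr = ["DW_OP_LLVM_push_arg", "DW_OP_plus", "DW_OP_deref"]
print(eval_expr(expr, [0x100, 8], memory))  # 42
```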

Thanks again,
Hal

often gets transformed into something else more useful (e.g. an
offset instead of an index). DWARF is powerful enough to express
how to recover the original counter value, if only the induction
transformation had a way to describe what it did (or more
precisely, how to recover the original value after what it did).
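As a concrete (entirely made-up) example: if strength reduction rewrote an index i over 4-byte elements into a pointer increment, the transforming pass knows i = (p - base) / 4, which is expressible as a DWARF expression (roughly DW_OP_breg for p and base, DW_OP_minus, DW_OP_constu 4, DW_OP_div). Evaluated by hand in a small sketch:

```python
# Sketch: recovering an original induction variable from its
# transformed replacement. Suppose the optimizer rewrote
#     for (i = 0; i < n; i++) ... a[i] ...
# into a pointer walk, so only p = a + 4*i survives in a register.
# The pass that did this knows the inverse mapping, which is what
# the debug info would need to record. All addresses are invented.

def recover_index(p, base, elem_size=4):
    # The arithmetic a DWARF expression would encode: (p - base) / size.
    return (p - base) // elem_size

base = 0x2000          # hypothetical address of a[0]
p = base + 4 * 7       # pointer value during the 8th iteration
print(recover_index(p, base))  # 7
```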

Contrast -O0 vs -O2 variables, breakpoint locations
---------------------------------------------------
This came up during a discussion on debug-info-quality testing and
metrics. One metric for the quality of debug info for optimized
code is to compare what is "available" at -O0 to what is
"available" at -O2. This can be applied to both kinds of debug
info affected by optimizations: whether a variable is available
(has a defined location) and whether a breakpoint is available
(the line has a defined "is-a-statement" address).

If you look at the set of instructions where a variable has a
valid location, how does that set compare to the set of
instructions for the lexical scope that contains the variable? If
you look at the sets of breakpoint locations described by the line
table, how does the set for -O2 compare to the set for -O0?

It's not hard to imagine tooling that would permit comparisons of
this kind, and some people have had tooling like that in previous
jobs.
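Such a metric can be sketched in a few lines. The following Python illustration (all ranges and numbers invented) computes the fraction of a variable's lexical scope over which it has a valid location, which could then be compared between -O0 and -O2 builds:

```python
# Sketch of a simple debug-info quality metric: what fraction of a
# variable's lexical scope carries a valid location for it? Ranges
# are half-open [start, end) offsets; all numbers are illustrative.

def covered(ranges):
    return sum(end - start for start, end in ranges)

def location_coverage(scope, location_ranges):
    return covered(location_ranges) / (scope[1] - scope[0])

scope = (0, 100)                    # the variable's lexical scope
o0_locs = [(0, 100)]                # -O0: located everywhere
o2_locs = [(10, 40), (70, 90)]      # -O2: location lost elsewhere

print(location_coverage(scope, o0_locs))  # 1.0
print(location_coverage(scope, o2_locs))  # 0.5
```

The breakpoint-location comparison would work the same way, with the line table's is-a-statement addresses in place of variable location ranges.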

Track line info for side effects only
(aka semantic stepping or "key" instructions)
---------------------------------------------
This idea is based on two observations:
(1) Optimization tends to shuffle instructions around, so that you
    end up with instructions "from" a given source line being mixed
    in with instructions "from" other source lines. If we very
    precisely track the source line for every instruction, then
    single-stepping through "the source" in a debugger becomes very
    back-and-forth, and choosing a good place to set a breakpoint
    on "the line" becomes a dicey proposition.
(2) If you look at the set of instructions generated for a given
    line, it's easy to conclude that "some are more equal than
    others." This means for something like a simple assignment, the
    load is kind of important, the ZEXT not so much, and the store
    is really the thing.
So, picking and choosing which instructions to mark as good
stopping places could well improve the user experience without
significantly interfering with the user's ability to see what
their program is doing.

[Okay, I'm really going beyond what we said in the BoF, but I
think it's a worthwhile point to expand upon.]

Let's unpack an assignment from an 'unsigned short' to an
'unsigned long' as an example. This basically turns into a
load/ZEXT/store sequence.

If you have an optimization that hoists the load+ZEXT above an
'if' or loop-top, but leaves the store down inside the 'then' part
or loop body, is it really important to tag the load+ZEXT with the
original source line? If you want to stop on "the line," doing it
just before the store is really the critical thing.

That is, the store is the "key" or "semantically significant"
instruction here, and the load/ZEXT are not so important. You can
have a smooth, user-friendly debugging experience if you mark the
store as a good stopping point for that statement, and not mark
the load/ZEXT that way (even though, pedantically, the load/ZEXT
are also "from" the same source statement).
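The effect on a debugger can be sketched as follows (Python illustration; the addresses, mnemonics, and line numbers are invented): only instructions with the line table's is-a-statement flag set are offered as breakpoint locations, so marking just the store gives one clean stop per statement.

```python
# Sketch: "key" instructions and the is_stmt flag. For the
# load/ZEXT/store sequence discussed above, a debugger setting a
# breakpoint on a line picks only addresses where is_stmt is set.
# All addresses and mnemonics here are made up.

instructions = [
    {"addr": 0x10, "op": "load",  "line": 42, "is_stmt": False},
    {"addr": 0x14, "op": "zext",  "line": 42, "is_stmt": False},
    # ... hoisted code "from" other lines may sit in between ...
    {"addr": 0x30, "op": "store", "line": 42, "is_stmt": True},  # the "key"
]

def breakpoint_addresses(line):
    return [i["addr"] for i in instructions
            if i["line"] == line and i["is_stmt"]]

print(breakpoint_addresses(42))  # only the store's address
```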

Now, how far you take this idea, and in what circumstances, is
arguable, because it very quickly moves into the arena of human
factors, and people may differ in their preferences for a
"precise" versus a "smooth" single-stepping or breakpoint-location
experience. But these things definitely have an effect on the
experience, and we have to be willing to trade off one for the
other in some cases.

Thanks,
--paulr

One topic that also came up in the discussion (after the?) session was the interest in making -O1 enable only optimizations that are known not to have an adverse effect on debuggability, or even in introducing a dedicated -Og mode like GCC has.

-- adrian