[RFC][DebugInfo] Proposed changes to the textual IR representation for debug values

tl;dr following from previous discussions about replacing debug intrinsics with non-instruction equivalents, we are seeking input on what this should look like in IR.

Background

We (Sony) are currently working on a prototype that changes the representation of debug values to no longer use instructions. Previous discussions have been raised on Discourse[0][1] and at LLVM conferences[2] on this topic; this RFC discusses the textual IR design for this model. To briefly summarize our goals in removing debug intrinsics, we are aiming to improve performance by reducing the number of instructions, reduce the amount of pointer chasing when searching for non-debug instructions, and reducing the surface area for -g to affect CodeGen. Since removing debug intrinsics involves making a substantial change to the developer-facing output of the compiler, I expect there to be many concerns regarding the technical limitations of the design, and many further concerns about its legibility, ease-of-editing, and other ergonomic issues; the design presented here is not final, and is intended to start a discussion about what this new debug info should look like, establish what different developers/vendors want from a new design, and hopefully reach a general consensus on the next steps. So without further ado:

Design

Here is an example IR function in current LLVM:

define dso_local i32 @f(i32 %a) !dbg !7 {
entry:
  call void @llvm.dbg.value(metadata i32 %a, metadata !10, metadata !DIExpression()), !dbg !20
  %b = alloca i32, !dbg !20, !DIAssignID !30
  call void @llvm.dbg.declare(metadata i32 %b, metadata !11, metadata !DIExpression()), !dbg !21
  %add = add i32 %a, 5, !dbg !21
  call void @llvm.dbg.value(metadata i32 %add, metadata !10, metadata !DIExpression()), !dbg !20
  call void @llvm.dbg.assign(metadata i32 %add, metadata !12, metadata !DIExpression(), metadata !30, metadata ptr %b, metadata !DIExpression()), !dbg !22
  store i32 %add, ptr %b, !dbg !22
  ret i32 %add, !dbg !23

} 

And here is the same function with our proposed change:

define dso_local i32 @f(i32 %a) !dbg !7 {
entry:
    #dbgrecord value { i32 %a, !10, !DIExpression(), !20 }
  %b = alloca i32, !dbg !20, !DIAssignID !30
    #dbgrecord declare { i32 %b, !11, !DIExpression(), !21 }
  %add = add i32 %a, 5, !dbg !21
    #dbgrecord value { i32 %add, !10, !DIExpression(), !20 }
    #dbgrecord assign { i32 %add, !12, !DIExpression(), !30, ptr %b, !DIExpression(), !22 }
  store i32 %add, ptr %b, !dbg !22
  ret i32 %add, !dbg !23

}

In this design, debug variable intrinsics are replaced by debug records, which have a different syntax but identical meaning. Debug records are printed in the position at which they become live, as with current intrinsics, but with extra indentation and a different syntax to clearly distinguish them from instructions: each begins with #dbgrecord, and then a type corresponding to the equivalent debug variable intrinsic, e.g. “value” for “llvm.dbg.value”. The contents of the curly braces are the same as the arguments to a debug intrinsic, except for the absence of the metadata prefix and including as an extra argument the DILocation metadata that would normally be attached to the intrinsic. Besides those differences, nothing else fundamentally changes about how to read, write, or edit debug info from the existing design.

Design Goals

One of the main aims of this design is to make sure that debug records don’t look like instructions, so that the IR faithfully represents the API to minimize surprising behaviours. For example, if a developer is debugging an opt pass and they print the IR, it should be obvious to them that if they have a pointer to %inst1, calling next() on that pointer should yield a pointer to %inst2:

%inst1 = op1 %a, %b
  #dbgrecord value { %inst1, !10, !DIExpression(), !11 }
%inst2 = op2 %inst1, %c

If instead we used a syntax comparable to an instruction, or even just converted debug records back to instructions for IR, this becomes less clear:

%inst1 = op1 %a, %b
llvm.dbg.value(%inst1, !10, !DIExpression(), !11)
%inst2 = op2 %inst1, %c

In summary, debug records are not instructions, and so it should be made apparent even to developers unfamiliar with IR that they are a distinct language element. Aside from that change, this format retains most of the benefits of the current model, in that it is easy to read debug info, to see when a debug record becomes live, to manually edit, move, add, or delete debug values, and as an additional benefit the indentation should make it easier to skim over debug info when it isn’t of interest to the developer or to skim over regular instructions in the opposite case.

There is one disadvantage to this model however, which is that it does not make obvious the API relationship between instructions and debug values: in the instruction API, debug values are a child of the following instruction (i.e. the instruction at which its live range begins); in this representation, debug values only appear between instructions, so the relationship is unclear or actually misleading since the indent would normally imply that the debug value is a child of the instruction above it. Moving debug values outside of the instruction list altogether would resolve this issue, but would complicate human reading and editing of IR too much; good API design and documentation should be enough to clear up any developer confusion.

Next Steps

The prototype implementation for removing debug intrinsics is still in progress, but some patches that move us towards this goal have already landed, and more will be submitted with time. The textual IR representation for non-instruction debug values is likely the most visible developer-facing change and so demands a healthy amount of input before finalizing a design. All suggestions, comments, and critiques are welcome and desired, but two questions in particular that would be useful to know are: Is this design clear and legible to you? And also, does the design have any apparent limitations (whether present in current LLVM or not) that would interfere with the way you use LLVM?

A patch for LLVM to test out this change will be available soon.

[0] Prototyping a not-an-instruction dbg.value
[1] [RFC] Instruction API changes needed to eliminate debug intrinsics from IR
[2] Debug-info round table notes

1 Like

Overall, I think this seems like a good way to represent the new structure in textual IR. :smile:

The use of distinct indentation is a nice way to visually isolate debug info.

As a small optimisation, I think you could go for the shorter #dbg prefix instead of #dbgrecord. The word “record” here feels a bit redundant. You already allow for a second parameter that defines the “type” of each debug thing (currently value, declare, assign), and this should presumably be enough to distinguish future types of debug things we might cook up down the road.

1 Like

Thanks for working on this! I’m basically very supportive. :slight_smile:

I like the indentation as a way to interleave non-instruction data into the instruction stream.

I would prefer #dbg over #dbgrecord. It is close to @llvm.dbg.value#dbg value in the end.

If we feel that #dbg is ambiguous with !dbg source location data on instructions, I’d actually suggest that we consider replacing !dbg with loc(), as a step towards MLIR convergence. Source locations in LLVM are after all referred to with DebugLoc.

By keeping the IR close to the existing rep, I wonder if we will be able to textually migrate (with sed) most of the LLVM test suite.

I considered the issue of trailing dbg.values, but I guess they don’t exist because we don’t allow these intrinsics after terminators.

2 Likes