This is a proposal to establish a new print command whose primary purpose is to choose how to print a result, using one of the existing commands. This can be thought of as a print abstraction on top of the multiple existing “print” implementations.
LLDB has multiple commands to print the process’s data, including:
- `expression` (aliases `p` and `po`)
- `frame variable` (alias `v`)
- `register read`
- `memory read` (alias `x`)
The majority of users use `p`/`po`, because it’s widely known and in many cases `p` works as an alternative to the other print commands. Although its primary purpose is to evaluate source language statements, it can also be used as a stand-in for other commands that print. For example:
- `p someVar` instead of `v someVar`
- `p $reg` instead of `register read reg`
- `parray 4 (uint32_t*)someVar` instead of `x/4wu someVar`
The last one is a contrived example; the point is that people are more familiar with the source language than with the debugger, and will use it because of that familiarity.
In fact, expression evaluation can go even further. In ObjC, `po 0x12345670` can be used to print an object by its address, providing what is essentially an LLDB specific form of dynamic typing. There are a number of features within and around expression evaluation.
However, there are some downsides to using `p` as a universal print command. Some issues with expression evaluation are:
- the implementation is large and complex, and as a result it has more failure points, and it can be slow
- there can be unwanted side effects
- the source language syntax and semantics can impose limits/burdens on data inspection
To the first point, consider lldb-eval, which supports a syntax that falls between `frame variable` and `expression`, providing a middle ground between the performance and reliability of those two commands.
This proposal assumes readers agree that expression evaluation can be fragile, slower, or both. The document focuses on the other issues, as they represent divergent behavior between the existing `expression` and the other printing commands (specifically `frame variable`).
For performance, reliability, and functionality, users have been advised to print variables with `v` rather than with `p`.
Advising users isn’t always enough. Not all users have heard this advice. Many users don’t know that `v` and other print commands exist. Once they do know, they also need to know when to use the other commands and when not to. Many users don’t have this knowledge. In fact, many users don’t want to have to think about these debugger-centric details. Some users don’t want to learn distinctions that the debugger cares about – distinctions they may find unimportant.
Do What I Mean (DWIM)
From DWIM on Wikipedia:
attempt to anticipate what users intend to do, correcting trivial errors automatically rather than blindly executing users’ explicit but potentially incorrect input.
Some people, after learning about `v`, will ask “Can we have a command that chooses a preferred method to print?” Many users see the `p` command as an abstraction for printing, but to lldb it’s not that, it’s a command which performs in-process expression evaluation – which prints data as one of its effects.
Users don’t always care about the means, just the end result. If we view the existing commands as building blocks, as different implementations of a “print interface”, then we could imagine a DWIM print command that, based on its input and context, determines which printing implementation to use.
For the rest of the document, imagine a new DWIM print command whose job is to print data, without being tied to a specific printing implementation. For one invocation it may use `expression`, and for another invocation it might use a different command, such as `frame variable`.
Simple Cases
The simplest example is printing a local variable, which one can do by running either `p localVar` or `v localVar`. There’s almost no syntax here, only a token. The printed output should be identical in all cases. To implement `print localVar`, the logic can be illustrated with this Python function:
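import lldb

def dwim_print(frame: lldb.SBFrame, user_input: str) -> None:
    # Prefer a direct variable lookup, which needs no expression evaluation.
    value = frame.FindVariable(user_input)
    if not value:
        # An invalid SBValue is falsy: fall back to expression evaluation.
        value = frame.EvaluateExpression(user_input)
    print(value)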
`FindVariable` first looks for the variable (from debug info). If the variable doesn’t exist, then expression evaluation is used instead.
This base logic can easily be extended to support registers:
def dwim_print(frame: lldb.SBFrame, user_input: str) -> None:
    # Lookup order: variable, then register, then full expression evaluation.
    value = frame.FindVariable(user_input)
    if not value:
        value = frame.FindRegister(user_input)
    if not value:
        value = frame.EvaluateExpression(user_input)
    print(value)
The command `print pc` would print the value of `$pc`. With this implementation, if a variable named `pc` exists, that would be used instead. An alternative implementation could print both values, in the rare (unless you work on a debugger) case where both a variable and a register exist by the same name.
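The "print both" alternative can be sketched as follows. This is only an illustration under stated assumptions: `FakeFrame` is a hypothetical stand-in for `lldb.SBFrame` so the example runs without LLDB, and `dwim_lookup_all` is an invented name, not an existing API. The truthiness checks mirror the `if not value` checks used in the functions above.

```python
from typing import Any, Dict, List, Optional, Tuple


class FakeFrame:
    """Hypothetical stand-in for lldb.SBFrame, so the sketch runs without LLDB."""

    def __init__(self, variables: Dict[str, str], registers: Dict[str, str]):
        self.variables = variables
        self.registers = registers

    def FindVariable(self, name: str) -> Optional[str]:
        return self.variables.get(name)

    def FindRegister(self, name: str) -> Optional[str]:
        return self.registers.get(name)

    def EvaluateExpression(self, text: str) -> str:
        # A real frame would compile and run `text`; here it is only echoed.
        return f"<expression result of {text!r}>"


def dwim_lookup_all(frame: Any, user_input: str) -> List[Tuple[str, Any]]:
    """Return every (kind, value) match instead of stopping at the first one."""
    matches: List[Tuple[str, Any]] = []
    variable = frame.FindVariable(user_input)
    if variable:
        matches.append(("variable", variable))
    register = frame.FindRegister(user_input)
    if register:
        matches.append(("register", register))
    if not matches:
        # Fall back to expression evaluation only when nothing else matched.
        matches.append(("expression", frame.EvaluateExpression(user_input)))
    return matches
```

With a frame that has both a variable and a register named `pc`, the function returns two entries, and the command could print both of them, labelled by kind.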
If this was all there was to the story, then it would be straightforward to add a DWIM print command that supports variables, registers, and expressions. However the complexity goes deeper and there are more cases to consider.
Effects of expression
Even with the simple case of evaluating an expression consisting of only a plain variable, there are side effects to be aware of. The `p` command creates “persistent results” – variables named `$0`, `$1`, `$2`, etc. These variables are snapshots of the expression’s result, and effectively have global scope. Persistent results are always created as part of expression evaluation; there is no way to opt out. When using `p`, lldb prints the persistent variable name, but with `po`, the result variable name is not included in the output. To demonstrate:
(lldb) p obj
(NSObject *) $0 = 0x0000600000008000
(lldb) po obj
<NSObject: 0x600000008000>
(lldb) p obj
(NSObject *) $2 = 0x0000600000008000
(lldb) p $1
(NSObject *) $1 = 0x0000600000008000
Even though the `po obj` command didn’t mention `$1`, the persistent result was in fact created. Users who primarily use `po` may not even realize the persistent results exist.
We can see now the first way in which `p obj` and `v obj` differ. If a DWIM print invocation chooses to use `v`, there will be no persistent result. This is an issue if the user wants to make use of the persistent result. In my experience, most users do not use persistent result variables. (This is especially true for users who predominantly use `po` and don’t even see the persistent variable name.) Additionally, a user printing a variable generally doesn’t need a second variable for it.
Some use cases of persistent variables are:
- Referring to data after its variable goes out of scope
- As persistent results are a snapshot, they can be used to compare data temporally, between a current value and a previously persisted value
- Composing `p` commands by passing the persistent result of one expression as an argument into another expression
Considering all of the above, it seems reasonable for a DWIM print command to vary in whether or not it creates a persistent result. Advanced use cases that require persistent results can still use `p`/`expression` directly.
Persistent Results: Memory Tradeoffs
Making persistent results opt-in has an additional benefit: it avoids challenges around memory references. Today, persistent values can result in the user dealing with one or more of: unsafe memory, violated semantics, or unexpected retains/ownership.
In C, pointers captured by a persistent result are inherently unsafe. When using those variables, users won’t know if such pointers are still valid.
In ObjC, pointers within persistent variables could be retained (using ARC) to enforce validity. But LLDB does not do that. Remember that persistent results are global, and if these variables were retained, the memory would never be freed. This could be an inconsequential memory leak, a large leak, or it could affect program behavior, which can even induce new bugs.
C++ has the same issues with raw pointers as C does. But C++ also has smart pointers. If you were to guess, what are the semantics of a persistent result whose type is `std::shared_ptr<T>`? Does the persistent result variable retain the pointer? If you guess yes, then you may be surprised that it does not. The persistent result is a raw data snapshot, and the smart pointer semantics are not adhered to. The alternative would be to preserve the semantics and retain the pointer, but as with ObjC, this could be a simple memory leak, or it could have worse side effects or introduce bugs.
In Swift, the language’s semantics are preserved. A persistent result variable is defined using `let`, and this results in pointers being retained. In Swift, `po obj` can cause memory leaks, or possibly worse.
No matter which choice is made, there are downsides. By making persistent results opt-in, this entire issue can be avoided. If persistent results had non-global scope, such as function/frame scope, this wouldn’t be as much of an issue, but changing scope would limit some of the use cases of persistent variables.
Now that persistent results have been discussed, let’s switch to another difference between `p` and `v`.
Syntax and Semantics
Following local variables, the next common example to consider is printing fields or member variables. Using C++ as an example, the comparison of `p memberVar` and `v memberVar` matches a lot of what has been said above about local variables, and about persistent results. Except what these commands are really doing is `p this->memberVar` and `v this->memberVar`. This introduces some syntax, the arrow operator `->`. To support this, the pseudocode could be changed to:
def dwim_print(frame: lldb.SBFrame, user_input: str) -> None:
    # Variable paths accept input such as "this->memberVar" or "a.b[0]".
    value = frame.GetValueForVariablePath(user_input)
    if not value:
        value = frame.FindRegister(user_input)
    if not value:
        value = frame.EvaluateExpression(user_input)
    print(value)
The change is to use `GetValueForVariablePath` instead of `FindVariable`. This introduces the concept of variable path expressions. Variable path expressions have the following operators:
- member of pointer (`->`)
- member of object (`.`)
- pointer dereference (`*`)
- array (and pointer) subscript (`[]`)
- address-of (`&`)
These operators allow expressions to perform some of the most common data traversal operations, and conveniently, but not necessarily by design, they happen to be mostly a subset of such operations in C/C++/ObjC. However, it’s not a strict subset, of syntax or semantics, which raises some issues.
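As a rough illustration of how a DWIM print command might recognize input that is syntactically a variable path, here is a minimal sketch. The grammar is an assumption derived only from the operator list above; it is not LLDB's actual variable path parser, and `looks_like_variable_path` is a hypothetical helper name. A syntactic match says nothing about operator overloads or synthetic children, which are discussed below.

```python
import re

# Assumed grammar, based only on the operators listed above:
#   path   := prefix* identifier suffix*
#   prefix := '*' | '&'
#   suffix := '.' identifier | '->' identifier | '[' digits ']'
_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
_VARIABLE_PATH = re.compile(
    rf"[*&\s]*{_IDENT}(?:\s*(?:\.|->)\s*{_IDENT}|\s*\[\s*\d+\s*\])*\s*"
)


def looks_like_variable_path(user_input: str) -> bool:
    """Return True when the whole input fits the assumed variable path grammar."""
    return _VARIABLE_PATH.fullmatch(user_input) is not None
```

Inputs such as `this->memberVar`, `a.b[0]`, or `*ptr` match, while `f()`, `a + b`, and `aMap[key]` do not and would be handed to expression evaluation.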
A DWIM print command would receive at least two kinds of syntax, the full syntax of the source language, and the limited syntax of variable paths. Two syntaxes complicate matters. At this juncture, some questions arise:
- For these operators, are the semantics the same between variable paths and expressions?
- What is the future of variable path expressions? Will they evolve to add more syntax/features?
- How do variable paths integrate with other language syntaxes? Should each language be able to provide its own variable path syntax?
Of these, the biggest topic to discuss is semantics. A DWIM print command must be aware of semantic discrepancies when choosing how to evaluate a given expression.
Semantics
This is a multipart discussion.
- Operator Overloading
In C++, with the exception of `.`, the operators used in variable paths can be overloaded. What this means is `p a->b` (for example) could run arbitrary code, while `v a->b` would perform direct data access. If it’s known there’s no `b` field, then the only possible option is expression evaluation. If it’s known there’s no `operator->` overload available for the type of `a`, then expression evaluation isn’t needed. A DWIM print command could use variables when it can determine that it’s safe to do so, and expression evaluation in other cases. This means LLDB needs to analyze the expression, in order to decide whether a variable path evaluation can be used. But before discussing analysis, let’s segue into another aspect of semantics.
- Synthetic Children
For display purposes, LLDB allows data formatters to define synthetic children for data types. This is a crucial feature for debugging, allowing LLDB to support data abstraction, and not burden the programmer with the raw implementation details of every type. Since LLDB shows the user a synthetic structure, it would be weird to not allow the user to reference that structure. As a result, variable path expressions support synthetic children. This is in contrast with the source language, where the debugger’s synthetic view does not exist. A DWIM print command can support synthetic children, but there are cases where there’s a conflict in semantics. The first is the `->` arrow operator, which synthetic children can also override. The second conflict can happen with the `[]` array subscript, which for variable paths uses only numeric indexes. Consider `std::map<int, T> aMap`: the expression `aMap[1]` means different things when treated as a variable path vs as a source language expression. As a variable path, it refers to the second child (whatever that may be), and in C++ it means the value for the key `1`. A DWIM print command has to have logic to handle these and any other edge cases.
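The `aMap[1]` conflict can be made concrete with a small Python model. This is an illustration only: the map contents are invented, and the assumption that synthetic children are the key/value pairs in key order is a guess at how an ordered map would be displayed, not a statement about any specific data formatter.

```python
# Model of `std::map<int, T> aMap` holding three entries (invented data).
a_map = {5: "five", 1: "one", 9: "nine"}

# Assumption: the synthetic children are the key/value pairs in key order.
synthetic_children = sorted(a_map.items())


def variable_path_subscript(index: int):
    """`aMap[index]` as a variable path: the child at position `index`."""
    return synthetic_children[index]


def source_subscript(key: int):
    """`aMap[key]` as a C++ expression: the value stored under `key`."""
    return a_map[key]


print(variable_path_subscript(1))  # second child: the entry for key 5
print(source_subscript(1))         # value for key 1
```

The same text, `aMap[1]`, selects the entry for key 5 under one interpretation and the value for key 1 under the other, which is why the command cannot pick an implementation on syntax alone.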
- Dynamic Typing
Variable paths have an invaluable feature over expressions: they operate on the dynamic types of objects. That is, `v a->b` will use the dynamic type of `a` to find the `b` member. This works even if the static type of `a` has no `b` field. The dynamic type could be a subclass, or an implementation of a protocol. In essence, variable paths perform automatic/implicit downcasting. Since a debugger’s purpose is to inspect a running program, full visibility of types and data at runtime can be indispensable. However, this dynamic behavior diverges from the source language, where static types (naturally) dictate data access.
Ideally, a DWIM print command would behave consistently. If a DWIM print sometimes uses dynamic typing, and sometimes does not, then users could conclude the command is buggy and not use it.
What’s needed is an evaluation mode that supports dynamic typing even when falling back to expression evaluation.
As an aside, this mode of evaluation would solve an all too common issue that arises when users are advised to use `v` instead of `p`: handling of properties. In ObjC and Swift (as well as D, C#, numerous scripting languages and yes even C++) a property is a field that syntactically looks like a plain data member, but is backed by getter/setter functions. Currently with ObjC and Swift, the following code forces lldb users to use `p`:
// ObjC
@interface MyThing : NSObject
@property (nonatomic) int number;
@end
@interface MyClass @end
// Swift
class MyThing {
    var number: Int {
        return /* some computed */
    }
}
In both cases, the following command will fail:
(lldb) v thing.number
The command fails even though the input is a valid expression in the source language. This is a cognitive burden on the user, requiring them to know whether the particular property they want to inspect needs `p` or whether `v` can be used.
Expression Evaluation with Dynamic Types
This section provides a rough answer to the question: How can LLDB support dynamic typing in expression evaluation?
Let’s start with this common command:
(lldb) p object
While printing the result of the expression, LLDB attempts to determine the dynamic (i.e. runtime) type of `object`, for example the exact subclass. If this succeeds, LLDB is able to print more data about the object – the fields of the dynamic type. This is a valuable debugging feature – during a debugging session you want all the state and execution information available, to help understand bugs. Using only the static type information is a limitation.
The use of dynamic typing is available for any expression result, not just for variables as the above shows. Dynamic typing works here too:
(lldb) p func()
The dynamic typing occurs on the expression result, not any of the input’s subexpressions. In other words dynamic typing is done after a result has been returned, not before or in the middle of an expression.
However, once LLDB has shown the user that it knows the dynamic type of a variable, the user might reasonably expect to be able to perform operations that depend on that type, such as:
(lldb) p object.subclassOperation()
In this example, the function `subclassOperation` represents a function declared on the dynamic type, not on the static type. Expression evaluation will fail on this input, as the compiler doesn’t have the information the debugger has: the concrete type of `object`. The compiler only has its static type. Users can work around this by changing their expression to include casts, for example:
(lldb) p ((Subclass&)object).subclassOperation()
However, this is clumsy, and having LLDB seemingly alternate between being aware and unaware of dynamic types is not a user-friendly workflow.
In addition to expression evaluation, LLDB has another kind of expression evaluation: `frame variable`. These expressions can determine the types of variables, and their members (children). This is implemented using memory reads and type metadata, which are operations that are faster and more resilient for accessing data, compared to using full expression evaluation, which is slower and more fragile.
Note: For the purpose of demonstration, assume we have two types: a base class which has a smaller interface, and a subclass which has additional member data and/or functions:
class Base {
public:
    virtual ~Base();
    int baseData;
    void baseFunc();
};
class Sub : public Base {
public:
    int subData;
    void subFunc();
};
Let’s do a quick comparison between `frame variable` and `expression`. Assume there exists a variable named `object`, declared as `Base &`, but whose runtime type is `Sub &`.
(lldb) p object.subData
(lldb) v object.subData
- The `p` command will fail, as there exists no field `subData` on `Base`
- The `v` command will succeed – `object` is determined to be an instance of `Sub`, and its `subData` field is printed
This behavior is limited to expressions that `frame variable` supports, namely direct data access, from variables down into arbitrarily nested data members, and does not include function calls. Thus, neither of the following will work:
(lldb) p object.subFunc()
(lldb) v object.subFunc()
- The `p` command fails because `subFunc` does not exist on the `Base` type
- The `v` command fails because its limited syntax doesn’t support function calls
Proposed Implementation
For consistency and for an improved user experience, LLDB could provide a high level expression evaluation that is a hybrid of `frame variable` and `expression`. This high level evaluation combines the dynamic typing of `frame variable` with the source language support of expression evaluation, to allow users to work seamlessly with runtime data.
To implement this, LLDB could automatically rewrite expressions to leverage valid `frame variable` subexpressions within them. There are two ways to achieve this:
- Rewrite expressions using casts
- Materialize persistent results and use those
In both cases, LLDB will need to perform an initial pass over the expression, to identify and evaluate valid `frame variable` subexpressions.
To identify which parts of an expression are valid `frame variable` subexpressions, LLDB will use the parsers of its embedded compilers (Clang, etc). Let’s start with the first expression we looked at:
(lldb) p object
The Clang AST for this expression is simple.
DeclRefExpr <line:1:1> 'Base':'Base' lvalue Var 'object' 'Base &'
This first expression demonstrates that variable access begins with a `DeclRefExpr` node.
Note: The AST dumps shown in this document have been reduced for readability. For example runtime memory addresses, file paths, and implicit cast nodes have been removed.
Next up consider an expression that accesses a data member on object:
(lldb) p object.baseData
For this expression, the Clang AST is:
MemberExpr <line:1:1, col:8> 'int' lvalue .baseData
`-DeclRefExpr <col:1> 'Base':'Base' lvalue Var 'object' 'Base &'
A new AST node is introduced, `MemberExpr`. This node is used for both `.` and `->` member access. This AST node is used for each step of data access – an expression like `a.b.c` will be represented in AST form as:
MemberExpr ...
`- MemberExpr ...
   `- DeclRefExpr ...
In other words, `frame variable` expressions using `.` and `->` will be represented as a leaf `DeclRefExpr` node, with zero or more `MemberExpr` upward nodes, one for each member access.
Another way of representing this is: `DeclRefExpr > MemberExpr*`
However this considers only the static case, which isn’t interesting as it does not require special handling. Let’s next look at expressions that require dynamic typing to succeed. Let’s see what the AST looks like for this expression:
(lldb) p object.subData
The corresponding AST is:
RecoveryExpr <line:1:1, col:8> '<dependent type>' contains-errors lvalue
`-DeclRefExpr <col:1> 'Base':'Base' lvalue Var 'object' 'Base &'
This shows that Clang’s AST is not limited to syntactic information, but carries semantic information as well. Here we see that accessing `subData` from a base class instance results in an AST containing a `RecoveryExpr` node.
Let’s compare a statically valid expression (`object.baseData`) with a dynamically valid (but statically invalid) expression (`object.subData`).
Both contain the same `DeclRefExpr` leaf node, but the parent node differs in type: `MemberExpr` vs `RecoveryExpr`. While the node type differs, the source location information is identical. In the dynamic case, the `RecoveryExpr` node identifies the point, the immediate predecessor, at which there’s a dynamic type issue to resolve. The predecessor of the `RecoveryExpr` is the `DeclRefExpr`, and that is the subexpression for which we need to determine the dynamic type. In this case, the subexpression is “`object`”.
This structure is not limited to one level of nesting. Imagine our `object` is nested in some outer object (“`outer`”), then the expression `outer.object.subData` will have an AST that looks like:
`-RecoveryExpr <line:20:1, col:14> '<dependent type>' contains-errors lvalue
  `-MemberExpr <col:1, col:7> 'Base' lvalue .object
    `-DeclRefExpr <col:1> 'Outer' lvalue Var 'outer' 'Outer'
In this case, the predecessor to `RecoveryExpr` is the sequence of `DeclRefExpr` and `MemberExpr` that corresponds to `outer.object`. The dynamic type of `outer` is not needed, while the dynamic type of `object` is.
With this information, we can pinpoint where LLDB needs to provide dynamic type information. The `RecoveryExpr` nodes represent seams that stitch together `frame variable` expressions and source language expressions. With this knowledge, we can construct a high level AST where the nodes represent coarse grained subexpressions of either type. The subexpression nodes will be processed by either `frame variable` or `expression`. For illustration, the high level AST for `object.subFunc()` would be:
ExpressionNode '.subFunc()'
`- FrameVariableNode 'object'
As mentioned previously, to glue the two types of expressions, LLDB can rewrite subexpressions with downcasts introduced, or by replacing subexpressions with persistent result variables. Evaluation of this high level AST can be implemented by evaluating nodes bottom up, following data dependency order.
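The two gluing strategies can be sketched as plain string rewrites, given the source range of the `frame variable` subexpression. This is a minimal illustration, not LLDB's implementation: the function names are hypothetical, and the offsets are assumed to be 0-based and end-exclusive, unlike the 1-based columns shown in Clang's AST dumps.

```python
def rewrite_with_cast(expression: str, start: int, end: int, dynamic_type: str) -> str:
    """Wrap expression[start:end] in a C-style cast to `dynamic_type`."""
    subexpression = expression[start:end]
    casted = f"(({dynamic_type}){subexpression})"
    return expression[:start] + casted + expression[end:]


def rewrite_with_persistent_result(expression: str, start: int, end: int, name: str) -> str:
    """Replace expression[start:end] with a persistent result name such as `$0`."""
    return expression[:start] + name + expression[end:]


print(rewrite_with_cast("object.subFunc()", 0, 6, "Sub &"))
print(rewrite_with_persistent_result("object.subFunc()", 0, 6, "$0"))
```

For `object.subFunc()`, the first rewrite yields `((Sub &)object).subFunc()` and the second yields `$0.subFunc()`, where `$0` is assumed to have been materialized from `object` with its dynamic type.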
Dynamic Typing On-Demand
The expressions shown thus far have been cases where the dynamic type is needed, otherwise evaluation will fail. Applying dynamic typing on demand has obvious and non-obvious benefits:
- There’s no reason to determine dynamic typing where it isn’t actually needed
- Using dynamic typing can create confusing and unwanted changes to language semantics
The first should go without saying. The second is an interesting point to discuss. Consider this command:
(lldb) p f(object)
In languages that have function overloading, there can be more than one viable function `f` to call, for example there might be both `f(Base &)` and `f(Sub &)`. If dynamic typing is unconditionally applied to `object` (i.e. even when it’s not required), then our expression evaluator could unintentionally introduce multiple dispatch (aka multimethods) to the target language. Changing semantics of this kind is not a goal, and would confuse users. The proposed algorithm does not allow multiple dispatch to happen, since dynamic typing is performed only where static typing is insufficient. Users who want to control function selection can use explicit casts in their expression.
Dynamic typing can be limited to on-demand by applying only to AST subtrees that have a path matching this pattern:
DeclRefExpr > MemberExpr* > RecoveryExpr
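The pattern above can be checked with a short tree walk. The sketch below uses a hypothetical `Node` class as a stand-in for Clang AST nodes (it is not Clang's API), keeping only the node kind, the covered source text, and the children.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical, minimal stand-in for a Clang AST node."""
    kind: str                      # e.g. "DeclRefExpr", "MemberExpr"
    text: str = ""                 # source text covered by this node
    children: List["Node"] = field(default_factory=list)


def is_variable_path(node: Node) -> bool:
    """True for DeclRefExpr > MemberExpr*, checked from the top node downward."""
    while node.kind == "MemberExpr" and len(node.children) == 1:
        node = node.children[0]
    return node.kind == "DeclRefExpr" and not node.children


def dynamic_typing_sites(root: Node) -> List[str]:
    """Collect the text of each variable path sitting directly under a RecoveryExpr."""
    sites: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.kind == "RecoveryExpr":
            sites.extend(c.text for c in node.children if is_variable_path(c))
        stack.extend(node.children)
    return sites
```

For the `outer.object.subData` tree shown earlier this returns `["outer.object"]`. A statically valid tree, or one like `f(object)` that contains no `RecoveryExpr`, yields an empty list, so no dynamic typing is applied.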
Iterative Expression Evaluation
Thus far we’ve considered cases where dynamic typing is applied to data access chains. Next, we’ll want to consider applying dynamic typing to the result of expression evaluation. Again, starting from the basic case, consider this print command:
(lldb) p f()
If the result type of `f()` is `Base &` (and thus can be dynamic) then LLDB will determine the dynamic type. The user could reasonably expect to make use of that dynamic type and run:
(lldb) p f().subFunc()
The member function `subFunc()` may depend on the dynamic type of `f()`. As we’ve seen, this can be determined by parsing the expression, where once again a `RecoveryExpr` node will indicate that dynamic typing is necessary.
When dynamic typing is necessary, multiple expression evaluations will be required. First the subexpression `f()` is evaluated, and then the remaining subexpression, `.subFunc()`, can be evaluated. There are two ways to construct the second expression:
- Using persistent results, ex: `$0.subFunc()`
- Adding casts to the original expression, ex: `static_cast<Sub&>(f()).subFunc()`
While the second is possible, it has potential issues. Repeated calls to `f()` could have the following problems:
- Side effects
- Slower perceived performance of LLDB
- No guarantee the subexpression is deterministic
- The return value could differ in the second evaluation
- The return value and return type could differ in the second evaluation, resulting in an invalid cast
For these reasons, it seems that persistent results are the preferred way to implement this.
With either method, for this two part example, the sequence of steps will be:
- Evaluate the first expression, `f()`
- Determine the dynamic type of the expression’s result
- Evaluate `<rewritten>.subFunc()`
Where `<rewritten>` is one of the two substitution options, likely to be `$N`.
This example shows two expression evaluations, but an input expression could require more. The evaluation process can be reduced to a single step that is repeated in a loop, until the expression is fully consumed.
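The loop can be sketched as below, assuming the input has already been split at the points where dynamic typing is needed. Both `evaluate_iteratively` and the `evaluate` callback are hypothetical names for illustration; the callback stands in for one round of expression evaluation that returns a persistent result name and the dynamic type it found.

```python
from typing import Callable, List, Tuple


def evaluate_iteratively(
    segments: List[str],
    evaluate: Callable[[str], Tuple[str, str]],
) -> str:
    """Evaluate `segments` left to right, chaining through persistent results.

    `segments` is the input split at the dynamic typing seams,
    e.g. ["f()", ".subFunc()"].
    `evaluate` runs one expression and returns
    (persistent result name, dynamic type name).
    Returns the name of the final persistent result.
    """
    if not segments:
        raise ValueError("expected at least one segment")
    result_name, _dynamic_type = evaluate(segments[0])
    for segment in segments[1:]:
        # The persistent result is assumed to carry its dynamic type, so no
        # cast is needed and the earlier subexpression is not run again.
        result_name, _dynamic_type = evaluate(result_name + segment)
    return result_name
```

For `["f()", ".subFunc()"]` and a callback that hands out `$0`, `$1`, and so on, the callback is invoked with `f()` and then with `$0.subFunc()`, so `f()` itself runs only once.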
Let’s look at the AST to see how we can identify dynamic subexpressions, which indicate whether the expression is statically valid or not. Here are the ASTs for the static (first) and dynamic (second) cases:
`-CXXMemberCallExpr <line:1:1, col:13> 'void'
  `-MemberExpr <col:1, col:5> '<bound member function type>' .subFunc
    `-CallExpr <col:1, col:3> 'Sub':'Sub' lvalue
      `-DeclRefExpr <col:1> 'Sub &()' lvalue Function 'f' 'Sub &()'

`-CallExpr <line:1:1, col:13> '<dependent type>' contains-errors
  `-RecoveryExpr <col:1, col:5> '<dependent type>' contains-errors lvalue
    `-CallExpr <col:1, col:3> 'Base':'Base' lvalue
      `-DeclRefExpr <col:1> 'Base &()' lvalue Function 'f' 'Base &()'
In the first case, the expression can be evaluated as-is. In the second case, expression evaluation will need to evaluate each path from a leaf node to a recovery node. In this case, the [leaf, recovery) path spans columns 1 to 3, and corresponds to `f()`. The remaining part of the expression, `.subFunc()`, will be evaluated in a second expression. The second expression will depend on the persistent result of the first expression. The second expression could be invalid, but that can only be determined after the first expression has been evaluated.
Non-goals of Dynamic Typing
Thus far we’ve looked at single expressions. What about complex expressions, those with multiple statements or with control flow? Should a DWIM print command provide dynamic typing for such expressions? Let’s look at one:
(lldb) p for (const Base &x : vec) printf("%d", x.subData);
What does it mean to print a `for`-loop? Or an `if`-statement? In addition to these hard to answer semantic questions, implementing dynamic typing within control flow statements would require more complicated evaluation logic. It’s a slippery slope towards evaluating more and more C/C++. For example, in this `for`-loop, the evaluator would have to perform the iteration itself, in order to dynamically type the loop variable `x`.
This new DWIM print command would provide dynamic typing only for single expressions that can be evaluated in a linear unconditional order. These are the expressions where dynamic typing is desirable. Expressions that contain control flow, multiple statements, or closures (to name a few) will not support dynamic typing. By limiting the scope of where dynamic types are employed, the mental model should be reasonable for users to understand, and hopefully fairly intuitive.
Miscellaneous
Variable Path Syntax
While variable path syntax is mostly a subset of C/C++/ObjC, that is not true for other languages. Swift, for example, has the `.` operator but not the `->` operator. It has other relevant operators, such as `?` and `!`. For this reason, it might be good to allow language plugins to define their own variable path syntax, to ensure an ideal amount of overlap exists between source language and variable paths.
Transparency
To prevent any misunderstanding between how the user expects evaluation to be done, and how it is done, the DWIM print command could optionally print a command that represents the most direct “low level” way to print the value. One example is when the DWIM print sees that dynamic typing is required for the expression to work, such as:
(lldb) v a->b
note: equivalent command: p ((Subclass *)a)->b
This addresses the case where users want to paste an expression they’ve used into the source and expect it to compile. This won’t always be possible, for example if the expression references registers or persistent results, but that’s also the case today with `p`. These can be thought of as an lldb form of fix-its.
Another benefit to printing equivalent low level commands is educating users by showing them other commands, commands they may want to explore and use themselves.
Conclusion
The primary goal of this proposal is to provide a single print command, which chooses the most reliable, performant, and dynamic method of printing. At its most distilled, the goal is for DWIM print `anyVar` to use `frame variable`, and DWIM print `someExpr()` to use `expression`. In considering each of these, and their differences, there are a number of emergent cases to handle, including:
- persistent results
- syntax/semantic differences between `frame variable` and `expression`
- dynamic typing
Hopefully this proposal has laid out most of the details needed to assess how these aspects can be handled by a DWIM print command.
Thank you for reading, all feedback is appreciated!