[RFC] Data Inspection Language

werat · April 11, 2023, 9:14am

Intro

In this RFC I introduce a Data Inspection Language (DIL) to improve the stability and performance of variable introspection and to address the inconsistencies in existing implementations, with the goal of improving the debugger user experience. The names in this document (e.g. DIL) are placeholders and might change. If you have any suggestions, please leave comments.

DIL is an expression language designed to inspect the data (e.g. variables) in the program. It looks and feels like a source language of the program (e.g. C++), but may deviate from it in certain aspects and can support extra features like dynamic type resolution and synthetic children providers. Quick example:

(lldb) inspect (char*)(&foo->bar + 64)
"hello world"

(Note: the inspect command doesn’t exist in LLDB)

LLDB already has two mechanisms for data inspection: expr and frame variable. The first one, expr, uses a compiler-powered expression evaluator and can execute almost any valid C++ code. It’s very powerful, but among its downsides are instability, poor performance, and lack of support for dynamic types and synthetic children. The second one, frame variable, uses a very simple interpreter that supports a limited number of operations. However, it does dynamic type resolution and can follow fields generated by synthetic field providers. Developers often use these two commands interchangeably, but the results may be different for the same input. Many people don’t realize that there are differences and just use whatever they’re more used to.

The eventual goal for DIL is to completely replace the implementation of the frame variable command (GetValueForExpressionPath()) and to be used by default for the p/print command. The expr command (EvaluateExpression()) will continue to use the compiler-based expression evaluation.

Motivation

A big motivator for introducing DIL is debugger stability. The expression evaluation in LLDB is fragile and can often fail even for simple expressions depending on the complexity of the program and current context. The expression evaluator is used by default in the print command, which many (most) command-line users use for basic data inspection purposes. Simply doing print x might crash the debugger if the program is currently stopped in some tricky context. We hope DIL will eventually be used by default in both print and frame variable (or at least can be aliased by the interested users), which will improve the overall reliability.

Note: compiler-based expression evaluation is not fundamentally unstable, it’s just very hard to get the implementation right; it relies on ClangAST which has a lot of internal invariants and is tricky to construct.

Inconsistency in behaviour is another important factor. Having print and frame variable produce different results is a source of frustration for many users. Educating users on which command to use under which circumstances can only do so much, so removing the inconsistencies between these commands is important for improving the user experience.

Another motivator is performance. EvaluateExpression() can be pretty slow depending on the circumstances and DIL can be made significantly faster. Performance was the primary motivation for creating lldb-eval, which was used for implementing NatVis in the Stadia debugger. The gdb-to-lldb pretty-printer adapter (gala) also relies on expression evaluation to achieve compatibility with GDB.

However, using DIL in data formatters is explicitly out of scope for the current proposal. Expression evaluation is already highly discouraged from use in data formatters and DIL won’t challenge that. It will be accessible via SB API though and it can evolve as we progress through the implementation.

Detailed Design

The idea is that DIL would be a reasonable subset of the source language of the target. This way the users don’t need to worry much about the syntax differences, although I acknowledge that there might be some inconsistencies. Here I am focusing on C++ and will use it as an example. Later I’ll outline the ideas and solutions for supporting other languages.

DIL will support the following operations:

Basic arithmetic – addition, subtraction, multiplication, division, modulo
Bitwise operations – and, or, xor, negation, left/right shifts
Member access – foo->bar and foo.bar
Array subscript (foo[bar])
Dereference and address-of (*foo, &foo)
Type casts – C-style casts ((int)x) and C++-style casts (static_cast, reinterpret_cast, etc.)
Simple function calls (foo(), foo->bar()) (see “Function calls” section below)

Some operations here are language-specific (and might have language-specific semantics) and more operations may be added in the future. Let me know if you think something should be added to or removed from this list.

DIL is to be implemented as a parser + interpreter. It will use the information about types and values provided by LLDB to resolve identifiers, perform type checking and compute the result. This approach avoids using ASTImporter, so it doesn’t depend on the Clang compiler.
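To make the parser + interpreter idea a bit more concrete, here is a minimal sketch of a recursive-descent evaluator for integer arithmetic. It is only an illustration under stated assumptions, not the proposed implementation: ToyEvaluator, Scope and Eval are hypothetical names, and the name-to-value map stands in for the type and value information LLDB would provide. It shows that operator precedence and identifier resolution need no compiler; it makes no attempt to model overflow rules or types.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Toy stand-in for the value information LLDB would provide (hypothetical).
using Scope = std::map<std::string, std::int64_t>;

class ToyEvaluator {
public:
  ToyEvaluator(std::string text, const Scope &scope)
      : text_(std::move(text)), scope_(scope) {}

  // Returns the result, or std::nullopt on a syntax or lookup error.
  std::optional<std::int64_t> Evaluate() {
    Result value = ParseAdditive();
    SkipSpaces();
    if (pos_ != text_.size()) return std::nullopt;  // unparsed trailing input
    return value;
  }

private:
  using Result = std::optional<std::int64_t>;

  bool AtEnd() const { return pos_ >= text_.size(); }
  unsigned char Peek() const { return static_cast<unsigned char>(text_[pos_]); }
  void SkipSpaces() {
    while (!AtEnd() && std::isspace(Peek())) ++pos_;
  }
  bool Consume(char c) {
    SkipSpaces();
    if (!AtEnd() && text_[pos_] == c) { ++pos_; return true; }
    return false;
  }

  // additive := multiplicative (('+' | '-') multiplicative)*
  Result ParseAdditive() {
    Result lhs = ParseMultiplicative();
    while (lhs) {
      char op = Consume('+') ? '+' : Consume('-') ? '-' : '\0';
      if (op == '\0') break;
      Result rhs = ParseMultiplicative();
      if (!rhs) return std::nullopt;
      lhs = (op == '+') ? *lhs + *rhs : *lhs - *rhs;  // overflow not handled
    }
    return lhs;
  }

  // multiplicative := primary (('*' | '/' | '%') primary)*
  Result ParseMultiplicative() {
    Result lhs = ParsePrimary();
    while (lhs) {
      char op = Consume('*') ? '*' : Consume('/') ? '/' : Consume('%') ? '%' : '\0';
      if (op == '\0') break;
      Result rhs = ParsePrimary();
      if (!rhs) return std::nullopt;
      if (op != '*' && *rhs == 0) return std::nullopt;  // division by zero
      lhs = (op == '*') ? *lhs * *rhs : (op == '/') ? *lhs / *rhs : *lhs % *rhs;
    }
    return lhs;
  }

  // primary := number | identifier | '(' additive ')'
  Result ParsePrimary() {
    if (Consume('(')) {
      Result inner = ParseAdditive();
      if (!inner || !Consume(')')) return std::nullopt;
      return inner;
    }
    if (AtEnd()) return std::nullopt;
    if (std::isdigit(Peek())) {
      std::int64_t n = 0;
      while (!AtEnd() && std::isdigit(Peek())) n = n * 10 + (text_[pos_++] - '0');
      return n;
    }
    if (std::isalpha(Peek()) || Peek() == '_') {
      std::string name;
      while (!AtEnd() && (std::isalnum(Peek()) || Peek() == '_')) name += text_[pos_++];
      auto it = scope_.find(name);
      if (it == scope_.end()) return std::nullopt;  // unknown identifier
      return it->second;
    }
    return std::nullopt;
  }

  std::string text_;
  const Scope &scope_;
  std::size_t pos_ = 0;
};

// Convenience wrapper around the evaluator.
std::optional<std::int64_t> Eval(const std::string &text, const Scope &scope) {
  return ToyEvaluator(text, scope).Evaluate();
}
```

With a scope containing x = 10 and y = 3, Eval("x + y * 2", scope) yields 16, while Eval("z + 1", scope) yields no value because z is unknown.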

Language semantics

DIL doesn’t aim to have the exact same semantics as the program source language, but it will follow the “basic” semantics like operator precedence and overflow/underflow rules. It can deviate from the source language if that makes sense from the perspective of user-experience. DIL can provide extra convenience features, e.g. builtin intrinsic functions or something for common complex types.

Dynamic typing

Users often expect the member access operator to work “correctly” even if the static type of the variable doesn’t have the requested field. Consider the following example:

struct Base {
    // some virtual methods
};
struct Deriv : Base {
    int foo;
};
Base* base = new Deriv{};

// Debugger session
(lldb) print base->foo

Base type doesn’t have the field foo, however the actual type of the variable is Deriv. If the user were to print the whole base variable (via print base) they would see something like this:

(lldb) print base
(Deriv*) {
    (int) foo = 42
}

It only makes sense for print base->foo to print 42 as well.

The pure compiler based approach (i.e. the current implementation of EvaluateExpression()) is not able to resolve this, as the compiler always uses the static type information. See another proposal for a hybrid expression evaluation implementation – DWIM Print Command.

DIL will resolve the dynamic types as it parses and evaluates the expression (where possible) and therefore can choose and use the correct field. If necessary, this can be enabled/disabled via a setting.
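As an in-process analogy (not how the debugger itself would do it), the sketch below shows the same distinction in plain C++: the static type Base has no foo, but the dynamic type does. ReadFoo is a hypothetical helper, and the virtual destructor is an assumption standing in for "some virtual methods" so that the type is polymorphic. A debugger would consult runtime type information of the inferior rather than use dynamic_cast.

```cpp
#include <cassert>
#include <optional>

struct Base {
  virtual ~Base() = default;  // makes Base polymorphic, so RTTI is available
};

struct Deriv : Base {
  int foo = 42;
};

// Hypothetical helper: reads `foo` through a Base pointer by looking at the
// dynamic type. `base->foo` would not compile, because the static type of
// `base` is Base.
std::optional<int> ReadFoo(const Base *base) {
  if (const auto *deriv = dynamic_cast<const Deriv *>(base))
    return deriv->foo;
  return std::nullopt;  // the object is not actually a Deriv
}
```

The design point is the same as in the proposal: the lookup succeeds only when the object really is a Deriv, and otherwise reports that the field does not exist.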

Synthetic children providers

Data formatters in LLDB can provide synthetic children for existing objects, which is often used for complex data structures. When the user prints the variable, they see the members generated by the data formatters – this is very handy for visualizing complex objects. The current implementation of EvaluateExpression() does not support synthetic children, which means the user can’t use synthetic fields in the expressions. GetValueForExpressionPath() does support them and the user can access generated fields in frame variable expressions.

DIL will be data-formatter-aware and will support synthetic fields. The current synthetic children API can be used to resolve the children by name and index (see Variable Formatting in the LLDB documentation), so the interpreter can support both foo->synth_child and foo[3], which can be handy for things like containers (e.g. vectors). It will also respect synthetic dereference (defined via $$dereference$$).
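Here is a rough sketch of how an interpreter could route foo[3] and foo->name through a synthetic children provider. Everything in it is hypothetical: ToySyntheticProvider is a simplified interface loosely modeled on the num_children / get_child_index / get_child_at_index callbacks of the existing synthetic children API, and children are plain ints rather than values.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified provider interface.
struct ToySyntheticProvider {
  virtual ~ToySyntheticProvider() = default;
  virtual std::size_t NumChildren() const = 0;
  virtual std::optional<std::size_t> IndexOfChild(const std::string &name) const = 0;
  virtual std::optional<int> ChildAtIndex(std::size_t index) const = 0;
};

// Provider for a vector-like container whose children are named "[0]", "[1]", ...
class ToyVectorProvider : public ToySyntheticProvider {
public:
  explicit ToyVectorProvider(std::vector<int> items) : items_(std::move(items)) {}

  std::size_t NumChildren() const override { return items_.size(); }

  std::optional<std::size_t> IndexOfChild(const std::string &name) const override {
    // Only names of the form "[N]" with a valid index are accepted.
    if (name.size() < 3 || name.front() != '[' || name.back() != ']')
      return std::nullopt;
    std::size_t index = 0;
    for (std::size_t i = 1; i + 1 < name.size(); ++i) {
      if (name[i] < '0' || name[i] > '9') return std::nullopt;
      index = index * 10 + static_cast<std::size_t>(name[i] - '0');
    }
    if (index >= items_.size()) return std::nullopt;
    return index;
  }

  std::optional<int> ChildAtIndex(std::size_t index) const override {
    if (index >= items_.size()) return std::nullopt;
    return items_[index];
  }

private:
  std::vector<int> items_;
};

// What the interpreter could do for `foo[i]`.
std::optional<int> Subscript(const ToySyntheticProvider &provider, std::size_t i) {
  return provider.ChildAtIndex(i);
}

// What the interpreter could do for `foo->name` or `foo.name`.
std::optional<int> Member(const ToySyntheticProvider &provider, const std::string &name) {
  std::optional<std::size_t> index = provider.IndexOfChild(name);
  if (!index) return std::nullopt;
  return provider.ChildAtIndex(*index);
}
```

Both access paths end in the same ChildAtIndex call, which is why supporting name-based and index-based access comes almost for free once the provider is consulted.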

Properties

Some languages make heavy use of properties (or computed fields), e.g. Swift or Objective-C. Generally accessing properties requires executing code, because they’re essentially function calls. In theory it’s possible to call functions, however it requires knowledge of the ABI and is generally non-trivial. The current expression evaluator solves this problem by compiling and executing the whole expression.

DIL needs to support properties for languages that have them. A proper implementation would rely on calling the backing accessor function, however it’s not trivial because of the ABI and is even further complicated by language specific logic (e.g. in Objective-C we need to deal with the ObjC dispatch machinery).

A simpler (temporary) approach might be to use an expression evaluator for accessing properties. The DIL parser can recognize that foo.bar is a property access and can call EvaluateExpression() on this specific field. This may impact the performance and introduce instability, but it will work. It could be disabled by default, e.g. something like this:

(lldb) print foo->bar
Error: Foo::bar is a property and accessing properties requires executing code in the process. This is disabled by default, but you can enable it via "settings set dil.allow-property-access true".
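The gating logic could look roughly like the sketch below. All names are hypothetical (ToySettings, ToyMember, AccessMember), the setting mirrors the placeholder dil.allow-property-access from the message above, and the std::function accessor merely stands in for handing the sub-expression to the expression evaluator.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <variant>

struct ToySettings {
  bool allow_property_access = false;  // disabled by default, as suggested above
};

struct ToyMember {
  std::string qualified_name;
  int stored_value;              // used when the member is a plain field
  std::function<int()> accessor; // non-empty when the member is a property
};

// Either the value or an error message.
using ToyResult = std::variant<int, std::string>;

ToyResult AccessMember(const ToyMember &member, const ToySettings &settings) {
  if (!member.accessor)
    return member.stored_value;  // plain field: no code execution needed
  if (!settings.allow_property_access)
    return member.qualified_name +
           " is a property and accessing properties requires executing code "
           "in the process.";
  // Stand-in for evaluating just this access via the expression evaluator.
  return member.accessor();
}
```

Plain fields never touch the slow path, so the cost and risk of code execution are paid only for members that are actually properties, and only when the user has opted in.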

Function calls

Calling functions is very useful in many situations. As with properties, calling a function in the general case requires complete knowledge of the ABI. The expression evaluator solves this problem by using a compiler.

However, it’s possible to call some functions without “complete” knowledge of the ABI, for example functions with no arguments or functions with primitive arguments. LLDB already has the capability to invoke such “simple” functions and it can be leveraged by DIL. It won’t cover all use cases, but it’s better than nothing and hopefully can cover many simple cases (e.g. calling methods like v->size()).
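One way to picture the "simple function" check is the sketch below. It is purely illustrative: ToyTypeKind, ToySignature and IsSimpleCall are hypothetical, and treating integers, floating-point values and pointers as "primitive" is my assumption, not a definition from LLDB.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, coarse classification of parameter types.
enum class ToyTypeKind { Integer, Floating, Pointer, Aggregate };

struct ToySignature {
  std::vector<ToyTypeKind> params;
};

// Assumption: "primitive" means integer, floating-point, or pointer.
bool IsPrimitive(ToyTypeKind kind) { return kind != ToyTypeKind::Aggregate; }

// True if the call could be attempted without the compiler-based evaluator.
// An empty parameter list trivially qualifies.
bool IsSimpleCall(const ToySignature &signature) {
  return std::all_of(signature.params.begin(), signature.params.end(), IsPrimitive);
}
```

Anything that fails the check (for example a struct passed by value) would be reported as unsupported, and the user could fall back to expr.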

Supporting different source languages

Everything above kind of assumes C++ as the source language of the target. However LLDB supports other languages like Swift and Objective-C (and ~Rust). Moreover, the target program can have modules of different languages, e.g. C++ code calling into Swift or the other way around.

DIL can have flavors (or dialects), which implement language specific features or semantics. By default, the interpreter would pick the flavor of the current frame, but it can be overridden. Implementation-wise DIL can have one parser that can branch based on the flavor or completely different parsers for different flavors. Since it’s not a goal to implement the whole source language, I believe the DIL parser can be relatively simple and easy to maintain.
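As a very small sketch of the "one parser that branches on the flavor" variant: the flavor defaults to the language of the current frame unless the user overrides it, and individual grammar rules consult it. ToyFlavor, PickFlavor and AllowsArrowMemberAccess are hypothetical names, and the arrow rule is just one example of a language-specific difference.

```cpp
#include <cassert>
#include <optional>

enum class ToyFlavor { Cpp, ObjC, Swift };

// Use the language of the current frame unless the user asked for another one.
ToyFlavor PickFlavor(ToyFlavor frame_language, std::optional<ToyFlavor> user_override) {
  return user_override.value_or(frame_language);
}

// Example of a rule a shared parser could branch on: `foo->bar` is member
// access in C++ and Objective-C, but Swift only uses `foo.bar`.
bool AllowsArrowMemberAccess(ToyFlavor flavor) {
  return flavor != ToyFlavor::Swift;
}
```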

Another option could be to incorporate features from different languages into one “true” implementation of DIL. This might be a good option from the user perspective (i.e. always works, don’t need to care about flavors). However I’m not sure whether it would be possible to properly differentiate between language-specific features in a single parser.

Comparison to other debuggers

GDB

GDB’s primary mechanism for data inspection is the print/inspect command. It uses an interpreter under the hood, which supports arithmetic operations, type casts, and function calls. This is similar to the proposal in this RFC.

GDB also has a capability to compile and execute arbitrary code. Here it uses gcc under the hood.

Visual Studio Debugger

The Visual Studio debugger supports evaluating expressions in the Immediate and Watch windows. It uses an interpreter-based approach, supports many C++ features like arithmetic, casts, and function calls, and provides many builtin intrinsics (like strlen or __findNonNull functions). This expression evaluator is extensively used in NatVis, which is a framework for defining custom data visualizers.

Milestones

Here’s the approximate implementation plan for DIL:

1. Implement the same set of operations supported by the frame variable command
   - This includes dynamic type resolution and synthetic children support
2. Replace the implementation of GetValueForExpressionPath() with DIL
   - This should be a no-op change from the user perspective
3. Implement type casts (e.g. (int)foo or static_cast<int>(foo))
4. Implement arithmetic operations (addition/subtraction, multiplication, bitwise operations, etc.)
5. Implement properties
6. Implement basic function calls
7. Disable the fallback to expr by default in the p/print command
   - At the moment p is aliased to dwim-print (commit)

Thanks @jingham @labath @cmtice @kastiglione @dblaikie for discussing and refining this proposal.

DavidSpickett · April 11, 2023, 12:55pm

It sounds like implementing a DIL is not without its own problems, however we’re already implementing a constrained DIL for the frame variable command. So we might as well do a more complete job of it and get some other benefits along the way.

Is that a fair assessment?

Which sounds great to me. I only learned there were 2 paths due to another discussion on this topic, so there’s one confused user.

So with this DIL in place, simple expressions like that would never have to go the compiler powered route. If the expression was too weird, the user could use expr instead, knowing that there’s some risk to it or at least, some performance hit.

As a user I’d assume time taken and risk of a crash increase with expression complexity anyway. So this aligns reality with the perception I had.

jingham · April 11, 2023, 9:04pm

What do you plan to do about operator overloads in the DIL? In the course of debugging a library that implements overloaded operators, you sometimes want to see the result of the overload (particularly for ->) but other times you want to see an actual field and NOT run the overload. This was handled in the current scheme because expr would always show the overloaded operator result, and frame var would always show you the memory dereference. So provided you understand how the system works, you can always get what you want.

Are we going to preserve this distinction, or do we need a way to tell the DIL which one we want?

Jim

werat · April 13, 2023, 10:00am

My suggestion would be to preserve the distinction. The expr command would be left unchanged and DIL (now used for print/frame-var) would always show the memory dereference (modulo synthetic children / dereference).

Maybe it’s worth thinking about some quality-of-life UX improvements, e.g. showing a note/warning when the overloaded operator exists but we didn’t use it:

(lldb) print foo->bar
note: foo (ns::MyPtr) has an overloaded operator->(), but it was ignored here
42
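For illustration, here is a hypothetical type for which the two readings of foo->bar differ. MyPtr and Payload are made-up names, and the sketch assumes that a debugger ignoring the overload treats -> on a non-pointer as plain member access on the wrapper itself.

```cpp
#include <cassert>

struct Payload {
  int bar = 1;
};

// Hypothetical smart-pointer-like wrapper that also has its own `bar` field.
struct MyPtr {
  Payload *target = nullptr;
  int bar = 42;
  Payload *operator->() const { return target; }
};

// In C++ source, `foo->bar` runs the overload and reads Payload::bar, while
// `foo.bar` reads the wrapper's own field directly from memory.
int ViaOverload(const MyPtr &foo) { return foo->bar; }
int ViaMemory(const MyPtr &foo) { return foo.bar; }
```

With the note in place, a user who expected the overloaded result would at least know why they got the wrapper's field and that expr is the command to reach for.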

werat · April 17, 2023, 1:05pm

Yep, that’s the idea. The expression language currently used in frame variable is essentially DIL, but it’s ad hoc and has no spec. The proposal is to make it official and better.
