A bytecode for (LLDB) data formatters

LLDB provides very rich customization options for displaying data types (see Variable Formatting in the LLDB documentation). To use custom data formatters, developers typically need to edit the global ~/.lldbinit file to make sure they are found and loaded. An example of this workflow is the llvm/utils/lldbDataFormatters.py script. Because of the manual configuration involved, this workflow doesn’t scale very well. What would be nice is if developers or library authors could ship data formatters with their code and LLDB automatically found them.

In Swift we added the DebugDescription macro that translates Swift string interpolation into LLDB summary strings, and puts them into a .lldbsummaries section, where LLDB can find them. This works well for simple summaries, but doesn’t yet scale to synthetic child providers or summaries that need to perform some kind of conditional logic or computation. The logical next step would be to store full Python formatters instead of summary strings, but Python code is larger and, more importantly, it is potentially dangerous to just load and execute untrusted Python code in LLDB.

To address these problems, I’m proposing a minimal bytecode tailored to running LLDB formatters. It defines a human-readable assembler representation for the language, an efficient binary encoding, a virtual machine for evaluating it, and a format for embedding formatters into binary containers.

Goals

Provide an efficient and secure encoding for data formatters that can be used as a compilation target from user-friendly representations (such as DIL, Swift DebugDescription, or perhaps even NatVis).

Non-goals

While humans could write the assembler syntax, making it user-friendly is not a goal.

Design of the virtual machine

The LLDB formatter virtual machine uses a stack-based bytecode, comparable to DWARF expressions, but with higher-level data types and functions.

The virtual machine has two stacks, a data and a control stack. The control stack is kept separate to make it easier to reason about the security aspects of the VM.

Data types

These data types are “host” data types, in LLDB parlance.

  • String (UTF-8)
  • Int (64 bit)
  • UInt (64 bit)
  • Object (Basically an SBValue)
  • Type (Basically an SBType)
  • Selector (One of the predefined functions)

Object and Type are opaque; they can only be used as parameters of call.

Instruction set

Stack operations

These manipulate the data stack directly.

  • dup (x -> x x)
  • drop (x y -> x)
  • pick (x ... UInt -> x ... x)
  • over (x y -> y)
  • swap (x y -> y x)
  • rot (x y z -> z x y)
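The stack effects above can be sketched in Python (a modeling assumption of mine: stacks are lists with the top at the end; `over` uses the corrected effect `(x y -> x y x)` that comes up later in the thread):

```python
# Data-stack operations from the proposal, modeled on Python lists.
# The top of the stack is the last element of the list.

def dup(s):                 # (x -> x x)
    s.append(s[-1])

def drop(s):                # (x y -> x)
    s.pop()

def swap(s):                # (x y -> y x)
    s[-1], s[-2] = s[-2], s[-1]

def over(s):                # (x y -> x y x), per the correction downthread
    s.append(s[-2])

def rot(s):                 # (x y z -> z x y)
    z, y, x = s.pop(), s.pop(), s.pop()
    s.extend([z, x, y])

def pick(s):                # (x ... UInt -> x ... x): copy the n-th element,
    n = s.pop()             # counted from the top, 0 being the current top
    s.append(s[-1 - n])
```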

Control flow

  • { pushes a code block address onto the control stack
  • } (technically not an opcode) denotes the end of a code block
  • if pops a block from the control stack and executes it if the top of the data stack is nonzero
  • ifelse pops two blocks from the control stack; if the top of the data stack is nonzero, it executes the first, otherwise the second
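A minimal sketch (all naming mine) of how the two stacks could interact for `if`/`ifelse`; blocks are modeled as nested Python lists, everything else as a pushed literal:

```python
# Toy evaluator for the control-flow subset: `{ ... }` blocks go onto the
# control stack; `if` and `ifelse` pop them and branch on the data stack.

def run(program, data):
    control = []
    for op in program:
        if isinstance(op, list):          # a `{ ... }` code block
            control.append(op)
        elif op == "if":
            block = control.pop()
            if data.pop() != 0:
                run(block, data)
        elif op == "ifelse":
            else_block = control.pop()    # pushed second
            then_block = control.pop()    # pushed first, runs on nonzero
            run(then_block if data.pop() != 0 else else_block, data)
        else:                             # any other token is a literal
            data.append(op)
    return data
```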

Literals for basic types

  • 123u ( -> UInt) an unsigned 64-bit host integer.
  • 123 ( -> Int) a signed 64-bit host integer.
  • "abc" ( -> String) a UTF-8 host string.
  • @strlen ( -> Selector) one of the predefined functions supported by the VM.

Arithmetic, logic, and comparison operations

  • + (x y -> [x+y])
  • - etc …
  • *
  • /
  • %
  • <<
  • >>
  • shra (arithmetic shift right)
  • ~
  • |
  • ^
  • =
  • !=
  • <
  • >
  • <=
  • >=

Function calls

For security reasons the list of functions callable with call is predefined. The supported functions are either existing methods on SBValue, or string formatting operations.

  • call (Object arg0 ... Selector -> retval)

The method is one of a predefined set of Selectors:

  • (Object @summary -> String)
  • (Object @type_summary -> String)
  • (Object @get_num_children -> UInt)
  • (Object UInt @get_child_at_index -> Object)
  • (Object String @get_child_index -> UInt)
  • (Object @get_type -> Type)
  • (Object UInt @get_template_argument_type -> Type)
  • (Object @get_value -> Object)
  • (Object @get_value_as_unsigned -> UInt)
  • (Object @get_value_as_signed -> Int)
  • (Object @get_value_as_address -> UInt)
  • (Object Type @cast -> Object)
  • (UInt @read_memory_byte -> UInt)
  • (UInt @read_memory_uint32 -> UInt)
  • (UInt @read_memory_int32 -> Int)
  • (UInt @read_memory_unsigned -> UInt)
  • (UInt @read_memory_signed -> Int)
  • (UInt @read_memory_address -> UInt)
  • (UInt Type @read_memory -> Object)
  • (String arg0 ... @fmt -> String)
  • (String arg0 ... @sprintf -> String)
  • (String @strlen -> UInt)
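One way to picture the security argument: `call` can only dispatch through a fixed selector table. A sketch (the selector-to-SBValue mapping and all naming are my assumptions):

```python
# Hypothetical dispatch table for `call`: each Selector maps to a fixed
# wrapper around an existing SBValue method, so no arbitrary code can run.
SELECTORS = {
    "@get_value_as_unsigned": lambda obj: obj.GetValueAsUnsigned(),
    "@get_num_children":      lambda obj: obj.GetNumChildren(),
    "@get_child_at_index":    lambda obj, i: obj.GetChildAtIndex(i),
}

def call(stack):
    selector = stack.pop()                        # Selector sits on top
    fn = SELECTORS[selector]                      # unknown selector: KeyError
    argc = fn.__code__.co_argcount
    args = [stack.pop() for _ in range(argc)][::-1]
    stack.append(fn(*args))
```

An unknown selector simply raises, which fits the version-1 error model where any error aborts the whole expression.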

Byte Code

Most instructions are just a single byte opcode. The only exceptions are the literals:

  • String: Length in bytes encoded as ULEB128, followed by length bytes
  • Int: LEB128
  • UInt: ULEB128
  • Selector: ULEB128
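For reference, ULEB128/SLEB128 here are the standard DWARF variable-length encodings; a sketch in Python:

```python
# Standard LEB128 encoders, as used for the literal operands above.

def uleb128(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append(byte | (0x80 if value else 0))
        if not value:
            return bytes(out)

def sleb128(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7                       # arithmetic shift in Python
        done = (value == 0 and not byte & 0x40) or \
               (value == -1 and byte & 0x40)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)
```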

Embedding

Expression programs are embedded into an .lldbformatters section (an evolution of the Swift .lldbsummaries section) that is a dictionary of type names/regexes and descriptions. It consists of a list of records. Each record starts with the following header:

  • version number (ULEB128)
  • remaining size of the record (minus the header) (ULEB128)

Space between two records may be padded with NULL bytes.

In version 1, a record begins with a dictionary key, which is a type name or regex:

  • length of the key in bytes (ULEB128)
  • the key (UTF-8)

A regex has to start with ^.
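A sketch (function name mine) of how a consumer could interpret dictionary keys under that rule:

```python
import re

def key_matches(key: str, type_name: str) -> bool:
    # Keys starting with ^ are regexes; everything else is an exact match.
    if key.startswith("^"):
        return re.match(key, type_name) is not None
    return key == type_name
```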

This is followed by one or more dictionary values that immediately follow each other and entirely fill out the record size from the header. Each expression program has the following layout:

  • function signature (1 byte)
  • length of the program (ULEB128)
  • the program bytecode

The possible function signatures are:

  • 0: @summary (Object -> String)
  • 1: @init (Object -> Object+)
  • 2: @get_num_children (Object+ -> UInt)
  • 3: @get_child_index (Object+ String -> UInt)
  • 4: @get_child_at_index (Object+ UInt -> Object)
  • FIXME: potentially also get_value? (Variable Formatting - 🐛 LLDB)

If not specified, the init function defaults to an empty function that just passes the Object along. Its results may be cached and allow common prep work to be done for an Object that can be reused by subsequent calls to the other methods. This way subsequent calls to @get_child_at_index can avoid recomputing shared information, for example.
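Putting the pieces together, a hedged sketch (helper names are mine) of serializing one version-1 record as laid out above:

```python
def uleb128(value: int) -> bytes:
    # Minimal unsigned LEB128 encoder for the header fields.
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append(byte | (0x80 if value else 0))
        if not value:
            return bytes(out)

def make_record(type_key, programs):
    """programs: (signature, bytecode) pairs, e.g. (0, ...) for @summary."""
    key = type_key.encode("utf-8")
    body = uleb128(len(key)) + key                 # dictionary key
    for signature, bytecode in programs:
        body += bytes([signature])                 # 1-byte function signature
        body += uleb128(len(bytecode)) + bytecode  # program length + program
    return uleb128(1) + uleb128(len(body)) + body  # version 1, record size
```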

While it is more efficient to store multiple programs per type key, this is not a requirement. LLDB will merge all entries. If there are conflicts the result is undefined.

Execution model

Execution begins at the first byte in the program. The program counter may never move outside the range of the program as defined in the header. The data stack starts with one Object or the result of the @init function (Object+ in the table above).

Error handling

In version 1, errors are unrecoverable: the entire expression fails if any kind of error is encountered.

Prototype

A prototype implementation of this concept can be found in PR #113398 ([lldb] Add a compiler/interpreter of LLDB data formatter bytecode to lldb/examples by adrian-prantl · Pull Request #113398 · llvm/llvm-project · GitHub). The implementation is entirely in Python. In compiler.py there is the assembler, disassembler, and interpreter. In the test directory there is a reimplementation of the llvm::Optional<> data formatter from llvm/utils/lldbDataFormatters.py to show the feasibility of the approach. The next step would be to reimplement the interpreter in C++ inside of LLDB.


I wholeheartedly support the intent of making formatters easier to consume by users.

However, I wonder what the long-term direction for such an interpreter is? If it aims to cover all the cases that existing formatters cover, then I’d expect more facilities to be present.

As an example, the existing set of selectors (which sounds like a rather foreign term to me as someone who didn’t have much exposure to ObjC) includes cast but barely includes ways to get a type to cast the value to. Missing pieces are at least SBTarget::FindTypes, bases, and SBType::FindDirectNestedType. One of the use cases is to teach LLDB a custom RTTI scheme.

This is a lot to digest, just some obvious questions for now.

So I’m a library writer and I want to add these formatters. My workflow is:

  • Install the bytecode compiler (or the DIL compiler?)
    • If this is the assembly to bytecode compiler, where is the DIL → assembly compiler coming from?
  • Write my formatter in the higher level language
  • Compile it to this bytecode
  • Arrange for my linker to insert it into a section in the object file

Is that the proposed workflow? (one of them at least)

So for example, rust could ship their standard library builds with this data inserted into it and then rust-lldb would not have to add all the type formatters manually (which seems like the main point of that wrapper script).

And the lack of higher level data types and functions is precisely the reason not to use DWARF for this?

If we could use DWARF (which we sort of already do by including it in the binary) we would not have extra data formatters at all in the first place.

I think I answered my own question but what you propose sounds so much like DWARF it’s the obvious reaction to have.

Sounds like BPF. Having not used it myself I can’t say whether bolting a BPF interpreter onto lldb would be possible or make any sense though.

llvm has a target for bpf but we’d still need to build the interpreter for it. So the time saved is also unknown.

Did you consider somehow having the formatter be a function written in the same language as the program? This is kinda like saying what if the dump functions of llvm types were used as their formatter.

Seems like a bad idea though because:

  • That’s code we can’t easily strip out for stripped binaries.
  • The result would just be a string, so it might be missing the details lldb needs to properly lay out the type on screen.
  • Library authors would need to build against some lldb headers, assuming their language even has a working API bridge.
  • Every library consumer that wants to use these also has to have those headers around.
  • Wouldn’t work for core files where we can’t execute natively.

So yes bad idea but it’s there to compare to on the other extreme of the “do almost nothing new” to “a whole new VM” scale.

Good to see this considered. I guess that one reason to include it in each record might be that you could have binaries linked together that were compiled with different versions?

You don’t have a header for the start of the dictionary (aside from the object file format’s header), so that means you can append one section onto the end of the other without worrying about what’s in the section.

And this will be part of the expression:

^
Start of string, or start of line in multi-line pattern

So if I wanted a regex equivalent to .*foo.* I would just write ^.*foo.* instead?

I think custom RTTI would be challenging to express in DWARF, but I agree with the sentiment that easy cases are covered by DWARF anyway. That’s why I brought up my own complicated one.

That would be a fantastic way of tackling my use case. On many occasions, what my formatters for Clang types do is reimplement C++ code in Python using the LLDB SB API. That is the only viable way to do this today without spending minutes waiting for the expression evaluator, but it is certainly brittle when the C++ code changes.

Yes, that would be one of the workflows this enables.

Yes, that’s correct. Most of the complexity in the proposal lies in the container format and how to manipulate value objects.

The compiler and interpreter for the language are so simple that both fit on a single screen; all the interesting parts are defining how to call out into ValueObject and string formatting utilities. Replacing the few lines that implement the core language with an adapter to another language that is designed to solve a different problem would make the whole design much more complex to reason about. Since this has a potential security surface, keeping the interpreter small is important.

To some degree, that is how the Swift DebugDescription macro works, but instead of being an LLDB-specific implementation, like you are describing, the macro takes a function that returns a string as input and translates it into a global symbol holding an LLDB summary string, without the developer having to know about any of the LLDB specifics.

That was exactly the idea!

Yes. This is micro-optimization to save a byte in the more common non-regex case.

I asked a GDB maintainer colleague about this topic.

Usual caveat here: if you are not allowed to look at GPL code, be careful. None of this seems to have landed in upstream GDB, though, and I will summarise what stood out to me.

GDB supports a .debug_gdb_scripts section that can contain the names of scripts for the debugger to autoload: dotdebug_gdb_scripts section (Debugging with GDB)

This is basically “what if we put parts of the init file inside each library”.

As for bytecodes and the like, there was a project trying to improve (or avoid) the handling of libthread_db: Infinity – gbenson.net / NoteFormat - GDB Wiki

The idea was to write bytecode to replace the functions provided by libthread_db in a way that would be compatible with different ABIs (I’m fuzzy on the details here).

This seems to be abandoned, but in a presentation on the topic (https://www.youtube.com/watch?v=dQ7HOjNDFxs) the author did mention pretty printers as another use case.

In that initial post they are leaning towards DWARF, and later they seem to have used it (Infinity progress update – gbenson.net), but they found some issues with the current interpreter in GDB. Notably, they suggest adding a function call to DWARF. Perhaps you have come to the same conclusions here.

Whether this would have worked for pretty printers, I wonder, because libthread_db seems like a very DWARF-like use case: following locations to find data. Pretty printers are more like, as Endill said, C++ programs.

Someone does ask them about BPF but they say they want to stick with DWARF, so no data on that.

Definitely a struggle between a thing that exists already and has existing tooling vs. the amount of extra features that thing will have that we don’t need.

Do we have to modify the linkers to do this or can existing objcopy features cope with it?

That’s a very platform-specific question, but generally, if the attributes are set correctly, the linker can just concatenate the section contents from all the object files and may not even need to know about it.

It seems like, in terms of usability and user experience, saving a vanilla Python script would be the most convenient and straightforward approach vs. introducing a new format and providing a whole infrastructure for creating bytecode and implementing a dedicated interpreter.

but Python code is larger and more importantly it is potentially dangerous to just load and execute untrusted Python code in LLDB.

If your main concern is about executing untrusted Python code, have you considered sandboxing Python script execution by providing custom global/local environment dictionaries to something like Python’s eval/exec and thus fully controlling/restricting the API surface of what such a Python script can access and do?

One thing to keep in mind is that this is meant to be a compilation target, not necessarily a language that developers would write.

Yes, this is an alternative we considered and rejected.

For context, if you want to ship entire python scripts, LLDB already has some mechanisms for this (on Darwin, for example, you can have your build system embed a Python script into your code’s dSYM bundle and that will get semi automatically loaded by LLDB).

One of the design goals is to have a format that is compact enough that it can (also) be embedded into binaries themselves. For example, a library author might want to ship a binary library without debug info, but with formatters for the public types. The bytecode reduces all identifiers to single bytes, so it’s a lot more compact than the equivalent Python code. You could gzip Python code, but that adds more complexity. As for the security aspect, yes, it would be possible to manually interpret or sandbox the Python code, but it’s much harder to prove that you’ve done that correctly than to audit 25 lines of interpreter.

Is that the proposed workflow? (one of them at least)

My preferred workflow (as a C++ developer) would be to define those pretty printers directly in my C++ code.

I would imagine something like

struct String {
    byte* data;
    size_t size;

    __attribute__((clang::lldbformatters))
    string_view __lldb_summary() { return string_view{data, size}; }
};

As mentioned by others before, directly calling the compiled function is not an option because of security and the slowness of expression evaluation.

But maybe we could teach clang to translate the __lldb_summary function into some bytecode in a separate section?

I have no clue whether eBPF support is even close to good enough for compiling C++ code (I have never used eBPF before). But if we use a bytecode format for which there already is an LLVM backend, we would be closer to that goal than if we use a completely new bytecode format.

Afaik, BPF was designed with the explicit goal of sandboxing, so BPF’s design goals should align well with our goals here. Not sure how we think about using third-party libraries, but afaik there are libraries available for running eBPF code in userland too.

An alternative bytecode representation which comes to mind would be WebAssembly. WebAssembly was also designed with security/sandboxing as a primary design goal, and there is already a WebAssembly backend in LLVM.

If there was a kind field as well, then we could leave the door open to some of the alternatives suggested here without having to invent new sections each time.

Will it likely only ever have 1 value? Yes, but that would mean this solution did well enough to have no competition which would be a good thing.

This is also what I would propose for a cross-debugger formatters section, you’re not proposing that here of course but maybe supporting many versions of lldb would be easier with a kind?

I think we can achieve this by simply renaming version to kind. At least for the first 127 dialects, this will be more efficient :)

That is roughly how the Swift DebugDescription macro works, and definitely within the possibilities where we could take this.

That’s a totally valid idea to explore, and as @DavidSpickett noted, the section design leaves the door open to try this in the future. One thing to consider, though, is that all the interesting/difficult work here is about how to interface with ValueObject (the call word in my proposal). The actual interpreter is trivial to implement. So you need to weigh whether pulling in a large dependency that you will need to interface with, just to replace 20 lines of interpreter, is the right trade-off.

Sure but storage aside, it’s more logical to have version X of format Y. Than say ok, this range of numbers is reserved for this format and this other range is for that one.

I see your point on storage though, and given the lldb project completely controls this list, we can have 0 = bytecode_1, 1 = bytecode_2, 2 = python_script_1, 3 = bytecode_3 without worrying about synchronizing it with anyone else.

(and a fork could start at <very big number> if they wanted to be safe from overlaps)

Should this be over (x y -> x y x)?

Yes, thanks for catching that!

I’ve written a minimal proof of concept Python “compiler” that handles enough Python syntax (using its ast module) to convert an example summary provider to the bytecode assembly. The example case is basically the same one that Adrian demonstrates (a stripped down llvm::Optional).

See [lldb] Proof of concept data formatter compiler for Python by kastiglione · Pull Request #113734 · llvm/llvm-project · GitHub


Assume we have everything proposed here implemented. I’m a library author and I want to provide this new section. What is the “best in class” way to target this bytecode?

I don’t see library authors being very receptive to having to write their own compiler, having battled N compilers to build the library in the first place. I would likely stick to Python scripts.

I see a few ideas like the one linked above and that’s great but what’s the vision for that layer of this process in future? Are the compilers third party things or do we end up with a few of them for the big languages we support?

Edit: where language is programming languages or this DIL you mentioned, I don’t know which, or all, or some?

Folks sticking to Python scripts isn’t the end of the world of course, this feature is not a replacement for that. I’m just not sure how widely you expect this to be used.