[RFC][LLDB] Moving libc++ data-formatters out of LLDB

In this RFC we propose rewriting the builtin libc++ LLDB data formatters in Python and moving them into the libc++ subproject.

Issues with builtin LLDB formatters

  • libc++ layouts/field names change frequently, which causes the LLDB data-formatters to break. We made efforts to run the formatters on libc++ CI, but there are gaps:

    • Only the bootstrapping libc++ runners run the formatter tests (i.e., the ones that build Clang/LLDB). This means if we wanted to test LLDB formatters against non-default libc++ configurations (e.g., hardened libc++), we’d have to set up a new bootstrapping builder with the changed libc++ configuration. That’s a resource/maintenance burden that the libc++ maintainers don’t want to carry.

    • When libc++ developers see that their pre-merge CI fails because of a formatter test, they have to reach out to LLDB developers to fix the formatters, or try to build/fix LLDB themselves. Both are sources of developer friction.

  • The main difficulty in maintaining/developing data-formatters is familiarity with the data types being formatted. Putting the burden on LLDB developers increases the likelihood of outdated and imprecise formatters.

  • If the libc++ community gets more familiar with writing formatters, new formatters could land around the same time that the corresponding libc++ type does. This would reduce bug reports to LLDB about missing formatter support for long-released data types.

  • Once an LLDB release is shipped, the formatters cannot be fixed until the next release. We try hard not to break the formatters but there are still occasionally issues in the field. At that point the workaround boils down to upgrading/downgrading lldb or libc++, neither of which is ideal.

Goals/Benefits

  1. the libc++ community maintains the formatters instead of LLDB developers

  2. lower barrier to contribution, since Python-based formatters tend to have less boilerplate and fewer foot-guns

  3. having the formatters written against just the public SBAPI lets us dog-food the public APIs

  4. formatters get tested against all libc++ configurations (since they could be run from the libc++ test-suite on all build bots)

  5. LLDB can drop the requirement of top-of-tree libc++ from the API test-suite (this was only ever enforced on Darwin and has been getting increasingly difficult to keep up with as libc++/libc++abi requirements evolve on macOS)

  6. fixing formatters could be done by patching the Python script on the user’s machine

Prior art for formatter distribution

  • libstdc++ GDB formatters

    • On Linux these get installed into /usr/share/gdb (or similar, depending on distribution)

    • GDB can be configured by packagers to auto-load from certain directories (using the --with-auto-load-dir GDB build option). Additionally, the GDB build option --with-auto-load-safe-path determines a list of “directories trusted for automatic loading and execution of scripts”.

    • Various `auto-load` settings control which kinds of scripts GDB should auto-load (GDB scripts vs. Python scripts vs. command scripts) and an auto-load safe-paths setting controls which paths are safe to load from (as mentioned above).

  • @DebugDescription macro

    • The Swift compiler will turn any class annotated with this attribute into an LLDB formatter bytecode program and embed it into a special section in the binary. LLDB knows how to load and interpret this bytecode. Summary providers and synthetic child providers are both supported.
  • libc++ GDB formatters

This proposal takes the “libstdc++ GDB formatters” approach since that’s what many are already familiar with and it is the more mature ecosystem. (Note: transitioning to formatter bytecode is still a possibility in the future and is compatible with the direction of this RFC.)

Proposal

  • Rewrite the libc++ formatters in Python and ship them alongside libc++ headers

    • On macOS the formatters would be placed somewhere inside the SDK (where the libc++ headers live)

    • On Linux this would be up to the distribution but should mimic the installation of libstdc++'s gdb formatters (which get installed into /usr/share/gdb)

    • On Windows this would be up to the toolchain maintainer

      • E.g., the Swift toolchain installer would place the formatters into a location that LLDB knows about.
  • On target launch, LLDB would auto-load formatters from these blessed locations.

    • For this we would introduce a setting (similar to gdb’s auto-load), set differently depending on the platform, but overridable by users/distributions/toolchains in their .lldbinit. Discussed further in the Auto-loading section below.
  • The formatters infrastructure shouldn’t require any other changes. Instead of adding type summaries/synthetic providers using C++ function pointers, we add them using the Python class names from the loaded formatter scripts.
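As a concrete sketch of that registration model: the snippet below shows roughly what an auto-loaded formatter script could look like, assuming the eventual scripts resemble today’s Python formatters. The class name, the module name `libcxx_formatters`, the type regex, and the vector field names are all illustrative, not the real libc++ formatter implementation.

```python
# Minimal sketch of a synthetic child provider for a hypothetical
# std::vector layout with __begin_/__end_ pointer members.
class StdVectorSyntheticProvider:
    """Presents a vector's elements as synthetic children."""

    def __init__(self, valobj, internal_dict):
        self.valobj = valobj
        self.begin = None
        self.elem_size = 0
        self.count = 0

    def update(self):
        # Re-read the members on every stop; these names track libc++'s layout.
        self.begin = self.valobj.GetChildMemberWithName("__begin_")
        end = self.valobj.GetChildMemberWithName("__end_")
        self.elem_size = self.begin.GetType().GetPointeeType().GetByteSize()
        byte_span = end.GetValueAsUnsigned() - self.begin.GetValueAsUnsigned()
        self.count = byte_span // self.elem_size if self.elem_size else 0
        return False  # tell LLDB to refetch children on the next stop

    def num_children(self):
        return self.count

    def get_child_at_index(self, index):
        if index < 0 or index >= self.count:
            return None
        offset = index * self.elem_size
        return self.begin.CreateChildAtOffset(
            "[%d]" % index, offset, self.begin.GetType().GetPointeeType())


def __lldb_init_module(debugger, internal_dict):
    # Called automatically when LLDB imports this script. Registration uses
    # the Python class name instead of a C++ function pointer.
    debugger.HandleCommand(
        'type synthetic add -l libcxx_formatters.StdVectorSyntheticProvider '
        '-x "^std::__[[:alnum:]]+::vector<.+>$"')
```

On auto-load, LLDB would import the script and invoke `__lldb_init_module`, so the formatter becomes active without any user-visible `command script import`.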

Considerations

Auto-loading

Since the location of the formatters is not up to LLDB and will vary between distributions/vendors/etc., we want a configurable way to specify where to load the formatters from. A natural precedent for this is GDB’s auto-load infrastructure (discussed above). The current proposal omits GDB’s auto-load booleans and implements the equivalent of a safe-paths setting for scripts specifically. If we want more granular control over this, we could provide separate “should auto-load this kind of script” and “should auto-load scripts from” settings. But we could also simply say: if you don’t want auto-loading, unset the paths setting.

Existing Settings

For scripts

target.load-script-from-symbol-file -- Allow LLDB to load scripting resources embedded in symbol files when available.

AFAIK, the main use of this is for formatters distributed in dSYMs. When a dSYM gets loaded it would only automatically load the contained scripts if this setting is set. From a security perspective, loading scripts distributed with dSYMs automatically is more of a risk than loading from a system path set up by the vendor. So explicitly opting into auto-loading from dSYMs makes sense. This proposal does not change the semantics or existence of this setting. It seems like a natural counterpart (auto-load from symbol file vs. auto-load from path).

For .lldbinit

--local-lldbinit
--no-lldbinit

These control automatic loading of .lldbinit files. This proposal would not affect these. If we wanted to create auto-load-paths counterparts for these we could add a target.auto-load-paths.init setting.

Build-time Setting

We would introduce a new CMake variable that takes a list of paths and gets embedded into LLDB (e.g., -DLLDB_AUTO_LOAD_PATHS_SCRIPTS, with the idea being that if we want other kinds of auto-load paths they would be called LLDB_AUTO_LOAD_PATHS_FOO). This would be the primary way we expect the auto-load paths to be configured. It would include the system path but also local build paths.

Runtime Setting

A setting like settings set target.auto-load-paths.scripts path/to/1;path/to/2 would allow users to modify the auto-load paths. The default value of this setting would be configured at LLDB build time. We want this to be configurable per target, since one could be debugging different targets, each with its own libc++ formatter setup. Turning off auto-load entirely could be done by setting it to an empty string. This would be useful for local development, testing, or more bespoke setups.

Formatter backwards compatibility

Ideally we would load the libc++ formatter script that matches the libc++ version the target was compiled against. This might be hard to check/enforce, so a better heuristic could be to load the formatters from the *newest* toolchain. If only a newer libc++ toolchain is available than the one the target was compiled against, the newer libc++ formatters should still be able to format the old layouts. I.e., libc++ formatter scripts should be backwards compatible. This is already true for the builtin LLDB formatters and will be enforced when porting them to Python (more on testing in the Testing section below).
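The backwards-compatibility requirement described above typically reduces to a field-probing pattern inside the formatter script: try the member names of each known layout, newest first. A rough sketch follows; only `__begin_` is a real libc++ member name here, while `__legacy_begin_` is a hypothetical stand-in for an older layout.

```python
def first_valid_child(valobj, names):
    """Return the first child member that exists on this layout, or None.

    valobj is expected to behave like an SBValue (GetChildMemberWithName
    returning an object with IsValid()).
    """
    for name in names:
        child = valobj.GetChildMemberWithName(name)
        if child is not None and child.IsValid():
            return child
    return None


def get_vector_begin(valobj):
    # Probe the current layout first, then fall back to the (hypothetical)
    # older layout's member name.
    return first_valid_child(valobj, ["__begin_", "__legacy_begin_"])
```

Because each probe degrades gracefully, one script can format binaries built against several libc++ releases, which is what makes the “load the newest formatters” heuristic workable.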

Other STLs

We also have MSVC STL and libstdc++ formatters in LLDB. Some are written in Python (many of the libstdc++ formatters) and some are written in C++ (all of the MSVC formatters).

This RFC only concerns itself with libc++, so we keep them untouched.

Rollout

To iron out any issues in the rewritten formatters, we could consider a period of time (a single LLVM release?) where we have both the C++ and Python formatters. These could be switched using an LLDB setting (the default being the Python formatters).

Testing

The tests will live in libc++ so they run on all the libc++ builders. Backwards compatibility of the formatters could be tested by copy-pasting the headers with the old layouts into the test-suite (this is what we already do in the LLDB test-suite for a handful of types).

We could also make all the LLDB formatter tests run against the built-in C++ formatters *and* the new auto-loaded Python formatters.

Some systems ship LLDB without Python support. On FreeBSD, I think the system-installed lldb only has Lua enabled. libc++ is FreeBSD’s default C++ standard library.

@emaste what would be the impact on your users?

How far do you think this should / can go back?

Clearly we have to set some limit, and another advantage of the scripts being separated from lldb is that I could go and get the formatters for that release of libcxx and load them instead.

Right now if I have this problem, I have to get a different copy of lldb itself.

This looks to be done by hand, is that right? In cases where the author knows that they have changed the layout.

Some systems ship LLDB without Python support. On FreeBSD, I think the system-installed lldb only has Lua enabled. libc++ is FreeBSD’s default C++ standard library.

That’s useful context, thanks! Yeah, it would be good to reach some consensus with the FreeBSD folks.

How far do you think this should / can go back?

The current plan is to support them as far back as the current formatters do (that’ll be just a by-product of the mostly 1-to-1 rewrite).

We haven’t had a concrete policy of how long we maintain backwards compatibility. Usually we’d maintain it until it became a burden for implementing support for current layouts. We’ve definitely dropped support for 8-10 year old layouts in the past.

The benefit of the libc++ community mostly writing/maintaining these is that they would be more comfortable with reasoning about the layout evolution and how to maintain backwards compatibility (and maintaining tests for those).

Clearly we have to set some limit, and another advantage of the scripts being separated from lldb is that I could go and get the formatters for that release of libcxx and load them instead.
Right now if I have this problem, I have to get a different copy of lldb itself.

Definitely one of the motivations of this RFC

This looks to be done by hand, is that right? In cases where the author knows that they have changed the layout.

Yup, that’s currently done by hand whenever a layout change occurs. They’re pretty gnarly for us to maintain, so it’s been done on a best-effort basis. During some offline conversation, @ldionne was receptive to mimicking such tests by just copy-pasting the old headers into the test-suite alongside the backward compatibility test.

Checking whether a script is allow-listed using the safe-paths makes sense to me.

But how would LLDB know which exact scripts it should autoload (and check against safe-paths) in the first place? Would we have a .debug_lldb_scripts section (similar to GDB’s debug_gdb_scripts)?

If the only way to configure the paths would be via -DLLDB_AUTO_LOAD_PATHS_SCRIPTS, would we even need the safe-paths, given that this macro would be set by the toolchain vendor, i.e. a trusted entity?

Will this setting also support relative file paths or glob patterns?

In our environment, the toolchain (including the clang binary, the libc++.so etc.) is downloaded via Bazel, as part of the build process. The toolchain is downloaded into the folder /home/<user_name>/bazel-cache/bazel_user_root/<hash-of-checkout-directory>/external/clang_linux. As such, it will be different for every user, and depending on the checkout directory where our main project gets checked out. (I guess that many users of Bazel might have a similar setup?)

From the top of my head, the first solution coming to my mind would be to allow glob expressions in the safe-path list. Then I could set it to /home/*/bazel-cache/bazel_user_root/*/external/clang_*.
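A minimal sketch of that idea, assuming fnmatch-style patterns (where `*` also matches path separators, which is what makes the Bazel-cache pattern above work). The `is_trusted` helper is purely illustrative, not an existing LLDB facility:

```python
import fnmatch
import os


def is_trusted(script_path, safe_path_patterns):
    """Check whether the script's directory matches any trusted glob pattern."""
    directory = os.path.dirname(os.path.abspath(script_path))
    # fnmatch translates "*" to ".*", so it happily spans "/" separators.
    return any(fnmatch.fnmatch(directory, pat) for pat in safe_path_patterns)
```

With `safe_path_patterns = ["/home/*/bazel-cache/bazel_user_root/*/external/clang_*"]`, a formatter script inside any user’s Bazel-downloaded toolchain would match, while scripts elsewhere would be rejected.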

On FreeBSD, I think the system-installed lldb only has Lua enabled.

That’s correct, we include a vendored copy of lldb in the base system and we have only Lua enabled because Lua is in the base system and Python is not. However, I think it’s likely that python will already be installed in practically all cases where someone’s looking to debug C++ code. Additionally, we have (many versions of) lldb available in the packages collection with python enabled, and it’s quite common for users to have at least one of them installed.

I know that LLDB can dynamically link against Python, but I assume it would still be a fatal error if no Python is found in that configuration. That’s likely a solvable problem, but it would need extra engineering work we would need to plan for.

My long-term preference would be to write as many of the libc++ data formatters as possible in a subset of Python we can translate into LLDB dataformatter bytecode. That is also engineering work that still needs to happen; we only have a tiny proof-of-concept implementation of this at the moment.

A transition strategy could look like this: We start implementing and shipping Python formatters in libc++, but don’t remove the existing C++ formatters. As our Python → formatter bytecode compiler becomes better over time, we retire more of the C++ formatters.

I think this proposal makes a lot of sense from the libc++ side of things. Libc++ owning its own formatters would reduce one of the most prominent points of friction with another LLVM project at the moment. Indeed, friction is currently caused both by changes to our internal data layouts which requires changes to the formatters but also by LLDB needing to build and test against a tip-of-trunk libc++, which is tricky to do and causes various issues.

Once libc++ owns the formatters, the cost of testing those in all supported libc++ configurations is going to be extremely small, and we’ll get more coverage than we currently do.

Overall, big +1 from me.

A lot of this proposal is about how to register/auto-load non-builtin formatters. But that’s a pre-existing problem, right - has lldb not had a solution to this already for third party libraries?

(& I guess as an example - how should/are LLVM’s ADT lldb formatters be distributed?)

But if it is a pre-existing problem - great to get that solved and have fewer things “getting by” by being built-in. It ensures standard library support drives general features in lldb so that other libraries can integrate just as easily.

We don’t support this in the FreeBSD base system. I think this is mainly because support did not exist when we initially brought in LLDB. For the base system our build infrastructure is bespoke, and it would be somewhat awkward to build (even run-time loadable) Python support without having Python available in the build environment. I’d like to solve this, but we are also planning to transition to using a packaged toolchain (and upstream build infrastructure) and it’s likely that will happen before we’d get around to supporting run-time-loadable Python.

A transition strategy could look like this: We start implementing and shipping Python formatters in libc++, but don’t remove the existing C++ formatters. As our Python → formatter bytecode compiler becomes better over time, we retire more of the C++ formatters.

If feasible, this sounds like a great path for us in FreeBSD.

But how would LLDB know which exact scripts it should autoload (and check against safe-paths) in the first place? Would we have a .debug_lldb_scripts section (similar to GDB’s debug_gdb_scripts)?

This could either be set to a specific script, or we could decide to load all scripts within that path. GDB’s heuristic is that the formatter script has the same name as the corresponding object file, plus a -gdb.py suffix. So one could locate these at module load time. That wouldn’t work for binaries statically linking libc++ though.
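Adapted to LLDB, that naming heuristic could be sketched as follows. The `-lldb.py` suffix and the helper name are hypothetical here, not an existing convention:

```python
import os


def find_formatter_script(module_path, auto_load_paths):
    """Look for "<module-file-name>-lldb.py" inside each auto-load path.

    Mirrors GDB's "<objfile>-gdb.py" convention; returns the first match
    found in a configured (trusted) directory, or None.
    """
    candidate = os.path.basename(module_path) + "-lldb.py"
    for root in auto_load_paths:
        script = os.path.join(root, candidate)
        if os.path.isfile(script):
            return script
    return None
```

E.g., when `libc++.so.1` gets loaded, LLDB would look for `libc++.so.1-lldb.py` in the auto-load paths; nothing matches for binaries that statically link libc++, which is the gap noted above.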

I wasn’t aware of .debug_gdb_scripts, thanks for pointing this out. We’ve talked about shipping the formatters embedded inside the binary before (since that makes the distribution issue, particularly the security question, simpler). I know @adrian.prantl’s preference is doing this via the formatter bytecode. But talking to @jingham, injecting Python into some formatters section may be a good alternative to the auto-load paths (I’ll elaborate in a follow-up comment).

If the only way to configure the paths would be via -DLLDB_AUTO_LOAD_PATHS_SCRIPTS, would we even need the safe-paths, given that this macro would be set by the toolchain vendor, i.e. a trusted entity?

My understanding is that safe-paths is just a name for a list of paths. There isn’t really anything “safe” about them apart from someone telling GDB that those are safe paths to load from. So the proposed setting is what GDB calls safe-paths; I just didn’t re-use that terminology.

A lot of this proposal is about how to register/auto-load non-builtin formatters. But that’s a pre-existing problem, right - has lldb not had a solution to this already for third party libraries?
(& I guess as an example - how should/are LLVM’s ADT lldb formatters be distributed?)

Apart from auto-loading scripts from dSYMs, I’m not aware of automatic registration. One would have to add an explicit command script import to the .lldbinit to load those LLVM formatters for example.

I think the remaining questions boil down to method of discovery (how LLDB finds the scripts) and security. Below I tried to summarize the options discussed so far and their shortcomings/remaining questions.

Discovery

  • Loading scripts from a list of user-provided directories (this RFC)
    • Cons:
      • burden on build-system/user to specify the right path. Likely needs special markers for relative paths/SDK roots/etc.
      • What scripts does LLDB load from a directory? All scripts or do we introduce naming convention for scripts? Or both?
  • Embedding Python scripts in special section inside binary (equivalent to formatter bytecode section or .debug_gdb_scripts)
    • Cons:
      • Can no longer hot-fix a broken formatter without rebuilding binary
        • If formatter bytecode is the future, we won’t be able to do that anyway. But it is nice to have.
      • Need dedicated build step (via objcopy e.g.) to place scripts into binary, making this platform dependent
      • Increased binary size (in the absence of tools like dsymutil)
      • We discussed loading Python formatters injected into binaries (specifically A bytecode for (LLDB) data formatters - #11 by adrian.prantl). There was pushback on this as a general solution because it (1) requires sandboxing and (2) needs to handle the binary size concerns
  • Scripts inside dSYM
    • Cons:
      • only supported on macOS

Security

All proposed solutions boil down to executing Python code. Without sandboxing Python (or using a more restricted language like the formatter bytecode), the best we can do is to decide on what sources we trust the Python to be loaded from.

  • Loading scripts from a list of user-provided directories
    • Via vendor-provided paths: trusted
    • Via user setting: untrusted
  • Embedding scripts inside binary
    • System dylibs: trusted
    • User binaries: untrusted
  • Scripts inside dSYM
    • In SDK: trusted (can additionally be signed)
    • Arbitrary dSYMs: LLDB already supports this but requires user to set an LLDB setting

Ultimately it seems to me that on non-Apple platforms (though I’m not very familiar with Windows), if we are to load arbitrary Python scripts with some form of protection, we’ll need to narrow the scope of what scripts we load (possibly even to just the libc++ formatters), relying on some special naming conventions/system paths. And if users want to load from a non-system path they’d have to add the import statement to their .lldbinit. This is what GDB does (by default the safe-paths would be system paths, and are trusted).

TL;DR
I think loading scripts from paths configured by vendors (and requiring users to manually import for more bespoke setups) seems like a reasonable middle-ground. While injecting scripts into binaries would make discovery simple, it does complicate the build-step. On Apple platforms we have an alternative which is to put the formatters inside a dSYM alongside libc++ in the SDK (that dSYM would just contain the formatters, no debug-info).

& libc++ doesn’t ship a dSYM that could have such an auto-loading script?

Ah, read your last message which summarized a lot of the options - thanks!

I’d be inclined to do something similar to your macOS distribution model - embedded wherever the debug info, if any, would go. So a .dSYM on macOS, and the .debug binary on Linux for instance (using a mechanism like .debug_gdb_scripts)

  • Need dedicated build step (via objcopy e.g.) to place scripts into binary, making this platform dependent

Not necessarily - the .debug_gdb_scripts section (see “Debugging with GDB”) has an example of how to embed the registration in the binary - you can embed the whole script there, or embed a path to the script - could maybe make that a build option? Though that’d complicate things.

But I’m not totally averse to “do it the way linux distros do it already” where the scripts are on-disk in some allow-listed/blessed directory and loaded from there by default - but yeah, probably more a question for distro maintainers than me.

Not necessarily - the .debug_gdb_scripts section (see “Debugging with GDB”) has an example of how to embed the registration in the binary - you can embed the whole script there, or embed a path to the script - could maybe make that a build option? Though that’d complicate things.

Would the increases in the individual .o debug section sizes matter? I might be misremembering, but I thought that was something Google was interested in not regressing too much. So if we only wanted to get the formatters into the final executable, we might have to use an extra build step. I guess one could ensure that the formatters section just ends up in the final dylib by avoiding putting those defines in the headers (maybe via a dedicated CU just for the formatters). GDB does acknowledge that it expects the linker to deduplicate these sections if placed in headers. So maybe this is all more of a question/problem for formatter authors, not for us who consume the section.

FWIW, we’d still have to introduce some sort of allow-list to ensure that we don’t allow loading the formatters section from any executable, just ones that we deem trusted (that’s what GDB does too).

Yeah, it would increase .o size - and that’s probably why people only put the file name in the embed rather than the whole pretty printer. So then you still need a place to go find them/allow-list that allows reading them, etc. Not sure how much it’d impact file size - there is a threshold under which we probably wouldn’t mind, but yeah, if it has to be in every header that defines the type to be printed, and every inclusion of that header pulls in the whole embedded pretty printer, that’s probably not super great for linker input size.

Putting it in a separate .o file might be possible - but then you have trouble with linking (you want to link it in only if the library is used, but if it’s in a separate .o file it may not have any symbols the linker needs to pull in, etc - and maybe you force it to be linked (essentially pass these to the linker as loose .o files, rather than as-needed .a files) but then it goes into every binary with a static dependency on the library regardless of need/use)

Pity about the tradeoffs :confused:

It looks like there are several orthogonal axes we need to decide on:

  1. If we decide to ship formatter contents in binaries, do we put it in the headers (which means duplicating them in the users’ .o files and deduplicating them in the linker) or in the libc++ dynamic library
  2. What do we embed: formatter bytecode, Python, paths to Python scripts
  3. Do the formatter contents go into the SDK, debug info, or the binary

As far as question (1) is concerned, I’m not excited by adding duplicated contents to every object file, even if the linker can deduplicate it. In the best case LLDB would still find a copy of the formatters in each loaded C++ image, which means we’d need to either deduplicate them in LLDB or make formatters a per-lldb::Module property. The only advantage of this approach is that it would allow us to drop backwards compatibility from the formatters. In many ways it is easier to put the formatter contents in the libc++ dynamic library; we control its build system and my assumption is that the vast majority of programs dynamically link libc++. We can document how to load the data formatters for those linking statically and/or even implement a targeted warning in LLDB if we detect this.

My preferred outcome for (2) is formatter bytecode, because it doesn’t have any security concerns and will work on FreeBSD and other platforms that don’t have Python support in LLDB. As I said above, the bytecode and its compiler aren’t quite ready yet, so in the very near future we should go with Python code, with the goal of gradually replacing it with bytecode once our compiler is ready for it.

The key insight for (3) is that it may be helpful to look holistically at one platform at a time. The distribution mechanism and the discovery are highly platform-specific.

As a next step we could write up a strawman proposal (basically a feature matrix) for what the best options for the major platforms are so we can see how compatible they are from an implementation standpoint?

Based on previous discussions I think the current consensus is to ship the formatters alongside debug-info (be it in the SDK or injected into object files). The approaches we’d be going for per-platform are as follows:

  • Linux: inject Python script (or path) into libc++.so
    • Assuming statically linking is the less common use-case. For those users we’d recommend loading the formatters by hand (via .lldbinit). If this is a common enough setup we could still try to make it work by injecting it into the final binary (or via headers).
  • macOS: ship it in a dSYM in the SDK
    • When injecting into the dylib, we would need some heuristic about what paths we trust to load from. On macOS, in the presence of the shared cache, that might not be straightforward. The dSYM approach side-steps this.
  • FreeBSD: same as linux
    • Only works if Python is available. Otherwise falls back to C++ builtin formatters.
  • Windows: Less certain about this for now (CC @charles-zablit @compnerd). But based on my reading of the libc++ build instructions it sounded like building it as a DLL is the common (recommended?) workflow. IIUC, building libc++ on Windows requires clang-cl, so we’d be in control of the build process. We could inject the Python scripts there into the DLL. As a side-note, NatVis ships in PDB. We might be able to do the same?

Once formatter bytecode is able to cover all existing formatters, we would drop the builtin C++ formatters (so FreeBSD works for users without Python). And in the cases where we injected Python, we would switch over to injecting bytecode, in which case we would maybe no longer need the “is the injected code trusted” heuristics.

Yes, Windows tends to prefer DLLs, but it does make sense to support the static library mode as well. That would allow building with both /MD and /MT.

There were a few clang specifics that we depended on when I ported libc++ to Windows, but that might’ve changed now. We shouldn’t design around clang, but rather assume an arbitrary compiler (i.e. cl).

I do really like the idea of shipping the python scripts in the PDB. If we absolutely cannot make that work for some reason, we could embed the content into the binary via rc. Perhaps this might motivate the work to truly bring llvm-rc to parity with MSVC’s tool.

However, the question on the size of the formatters becomes a concern. The resources would add to the runtime distribution rather than the developer components. Embedding the formatters in the PDB should be preferred here as we can create a vendor extension stream and embed that with no adverse effects for Microsoft’s tools and with no cost on the runtime components. The question becomes how do we embed that without emitting that information into all the object files.

At least with LLD and the MSVC’s link.exe, we could reuse the /NATVIS option. Both linkers will take any file extension and create a named stream in the PDB (relevant code in LLD). To identify the files, we can use an extension like .lldb.py.
