Proposal
Add hooks that a downstream implementation can use to collect telemetry. tl;dr: see the strawman proposal D131917, in which the “telemetry” just prints to stderr.
Motivation
Understanding how users debug with LLDB can be useful information to maintainers of LLDB distributions. Some questions LLDB developers and maintainers might be interested in:
- What is the performance of individual commands? Are there any commands or data formatters that are particularly slow or resource inefficient?
- What commands in LLDB are people using, and which commands are under-utilized?
- How many people are using LLDB, and how often?
Without telemetry, this information is difficult to obtain. Solutions are ad hoc, and often rely on self-provided user reports, which may be missing substantial relevant information. Some data might be easily measurable but not representative, e.g. performance for a benchmark test suite might differ from performance when debugging real production binaries.
Privacy
Telemetry can be a controversial topic, so this functionality should be off by default. Only downstream users who have the authority/permission to collect telemetry data should enable it. For instance, a company that ships LLDB in an internal toolchain could send usage data to an internal server for aggregation, or a distro could write logs somewhere under /var/log so it can ask users to attach them when filing bug reports.
Explicitly, there is no central server maintained by LLDB developers or the LLVM project.
Why upstream?
Since this feature is focused on downstream users, one could argue that it should live as a downstream-only patch. However, there are a few reasons we’d like to develop this as an extension point upstream:
- While we may be the first, others may be interested in having metrics in this space too. We would like to build this in a way that others can reuse.
- Even when restricted to “local” telemetry (just writing to a file/stream), it could be useful to developers as a lightweight reproducers mechanism to share for bug reports.
- The specifics of how telemetry is sent (e.g. whether it’s sent as an RPC somewhere, what protocol format is used, etc.) can largely be confined to separate source files, but they require some integration sprinkled across the LLDB codebase. For example, suppose a command handler needs a one-line change to call some telemetry method on startup or shutdown. If an upstream developer later renames that source file, the telemetry call is likely to break if maintained only as a downstream patch, but it is trivial to preserve in the new source location if it lives upstream.
Don’t we already have this?
Logging command
LLDB also has a general logging mechanism, as described by (lldb) help log. I believe this serves a different purpose: it is aimed more directly at giving LLDB developers a free-form, printf-style logging system for debugging general issues. It does not provide structured logging, which in turn makes it hard to decide whether a given log statement is safe to record.
Reproducers
As of Sept 2021, reproducers in LLDB are deprecated/discontinued, though the framework lives on in a different form. The goal of reproducers was full fidelity in capturing and replaying issues to reproduce LLDB bugs. This RFC proposes a lightweight alternative: often, the list of commands needed to get LLDB into a funky state is enough to clue an LLDB developer in to what the bug is.
Statistics command
The statistics dump command gives some high-level information about an LLDB debugging session. This can be useful, but it is also perhaps too high level for many use cases. As one example, it reports overall memory usage, but does not attribute memory growth to any particular command, so one would have to run statistics dump after every command to diagnose which command is causing LLDB to use more memory.
Enabling/configuring telemetry
By default, if you build LLDB from regular upstream sources, telemetry will not be enabled. Support might not even be built in.
There are two levels at which telemetry could be enabled. At the global level, we could put everything behind a build option, so that all telemetry-related code is guarded by #ifdef LLDB_TELEMETRY. The only telemetry-related code not guarded by this would be the messaging that telemetry is not built in (see later). At a finer-grained level, telemetry could be enabled or disabled with a command, e.g. (lldb) telemetry enable. If multiple telemetry destinations are registered, output can be redirected, e.g. (lldb) telemetry redirect stderr.
Specific telemetry destinations should have simple names so they can be configured as a destination in settings (e.g. "stderr" to log to stderr, "file" to log to a local file), and they can have suboptions if desired (e.g. "file" would need an option naming which file to log to).
If telemetry is enabled, there should be an easy way to discover this. For example, lldb --help or lldb --version could print “Telemetry is NOT built in”, “Telemetry is built in, but off”, or “Telemetry is enabled, sending data to <dest>” as appropriate. From within the command line interface, (lldb) telemetry show should show the current settings, or nothing if telemetry is not built in.
We could consider having categories within telemetry so that users can disable certain parts without turning it off entirely. For example, users might be OK with telemetry that logs basic usage, but not individual commands. Or we could go even finer and say that it’s OK to log which top-level commands are run, but not their full arguments – so (lldb) expr foo.stuff() would log just expr, and not foo.stuff(). How this would be configured is TBD, but it might look more generic, like the settings command.
Logged metrics
Startup info
This gives very high-level usage metrics, answering questions such as “how many people use LLDB, and how actively?”
- Username and/or hostname
- Version info (e.g. git sha, or other packaging stamps)
- Command line options (i.e. argv)
- Origin of launching LLDB, e.g. command line vs. lldb-vscode
- Startup time/memory used during startup
Command info
This gives a finer grained look into how various commands are used. This might be the most interesting to many people.
- The command, as the user typed it
- The parsed/canonical command, e.g. “v foo” is really “frame variable foo”
- The result of the command (whether or not it succeeded)
- If possible, we might want even more detail, e.g. whether “v foo” failed because “foo” is not a known variable versus “foo” was optimized out.
- Performance stats, namely time (wall & cpu) and memory
Privacy concerns for command logging
Generally speaking, logging is a privacy-sensitive area. LLDB developers may need fine-grained information to troubleshoot issues, but personally identifiable information (PII) should be avoided wherever possible. While a downstream vendor of LLDB may be comfortable logging data about the user of LLDB, they may not be comfortable extending that logging to information in the debuggee. Specifically, if we were to log the output of (lldb) expr request.user_data, we would be recording PII from an external user. For this reason, we want to take care to only record inputs to LLDB. This is most relevant to command logging, but it applies generally.
Data formatting/expression evaluation
When a user prints a value, collect:
- Whether this is through frame var, target variable, expr, or some other evaluation path
- The expression printed
- The type of the expression
- Which formatter we selected to print it
- Performance stats for this formatter (time/memory)
General performance
We should have a way to measure performance of important LLDB bottlenecks, such as time spent loading symbols from debug info or parsing the binary/coredump.
Shutdown info
Like startup info, this also gives high level usage metrics:
- How long the session lasted
- How many commands the user ran
- If the session ended gracefully, the return code.
- If the session ended due to a crash, a stack trace
Freeform logging
LLDB is highly extensible through the Python API, and authors of LLDB scripts may be interested in this telemetry mechanism as much as internal LLDB developers are. While it could be useful to have a general logging mechanism that anyone who runs import lldb can use to gather their own telemetry, as noted in the command logging privacy section, it’s important to weigh this against the risk of unintentionally recording PII. We should consider rejecting this type of logging entirely, or if allowing it, doing so in a way that it’s clear to the downstream implementer that it’s dangerous. For starters, we could have “unsafe” in the method names.
Testing/Support
Telemetry should not be enabled by default, and therefore should likely fall under LLVM’s Peripheral Tier of support. LLDB developers are not obligated to keep telemetry-enabled buildbots green, but would be welcome to do so.
Telemetry can be configured at the build level and at the LLDB settings level. To support testing, there should be at least one buildbot with this feature enabled, but configured to send logging to a local file that tests can inspect. This means there will be at least one in-tree implementation of telemetry for testing.
Other buildbots are welcome to enable this feature, although we don’t want to end up in a scenario where every buildbot enables it and leaves the telemetry-disabled codepaths untested.
Consistency
The demo provided is very oriented toward command-line LLDB. However, we are interested in telemetry across all of LLDB. In particular, this means:
- Telemetry for all entrypoints, whether via the command line, lldb-vscode, or some other integration.
- Uniform logging between command line and SB API methods
Internal format
In the demo provided, we use a very basic struct with a flat layout. Downstream implementations will have different requirements for what needs to be logged, and how it’s logged. There are two important choices a downstream vendor will want to make:
- The format of the data structure – a flat struct, json, proto, xml, etc. For example, the downstream vendor might want to send telemetry data via RPC, and the telemetry framework should be able to directly construct the request in whatever format the RPC wants.
- The schema – which fields need to get collected/logged for each flavor of telemetry. For example, one downstream vendor might want to collect the kernel version on startup, while others might not care.
Neither of these choices should live upstream, so the final product should have these two things decoupled from where logging is integrated with LLDB.
Feedback requested
This entire RFC is grounds for comments, but to get started, here are some general questions we would like to solicit feedback on:
- Does upstream agree to a two-level enabling mechanism (build guard and LLDB settings), with the default being off? Or is it good enough to build it in, but have it disabled by default in settings?
- Should we create a new telemetry command type or just re-use settings? (e.g. (lldb) settings set telemetry.enabled true vs (lldb) telemetry enable). Creating a new command pollutes the command space, but allows for more curated configurability.
- Is anyone else interested in this feature being available? Do you think distributions might be interested in having it log to a local file, so they can ask users to upload it when submitting bug reports?
- Are there other metrics that might be interesting to record?
- Are there other existing mechanisms this overlaps with?