RFC: LLDB Telemetry/metrics

Proposal

To add hooks that a downstream implementation can use to add telemetry. tl;dr see a strawman proposal D131917, with the “telemetry” just printing to stderr.

Motivation

Understanding how users debug with LLDB can be useful information to maintainers of LLDB distributions. Some questions LLDB developers and maintainers might be interested in:

  • What is the performance of individual commands? Are there any commands or data formatters that are particularly slow or resource inefficient?
  • What commands in LLDB are people using, and which commands are under-utilized?
  • How many people are using LLDB, and how often?

Without telemetry, this information is difficult to obtain. Solutions are ad hoc, and often rely on self-provided user reports, which may be missing substantial relevant information. Some data might be easily measurable but not representative, e.g. performance for a benchmark test suite might differ from performance when debugging real production binaries.

Privacy

Telemetry can be a controversial topic, and this functionality should be off by default. Only downstream users who have the authority/permission to collect telemetry data should enable it. For instance, a company that ships LLDB in an internal toolchain could send usage to an internal server for usage aggregation, or a distro could send logs to somewhere in /var/log so they can ask users to attach info when filing bug reports.

Explicitly, there is no central server maintained by LLDB developers or the LLVM project.

Why upstream?

Since this feature is focused on downstream users, there is a case that this could live as a downstream-only patch. However, there are a few reasons we’d like to develop this as an extension point upstream:

  • While we may be the first, there may be others interested in having metrics in this space. We would like to do this in a way that others might want as well.
  • Even when restricted to “local” telemetry (just writing to a file/stream), it could be useful to developers as a lightweight reproducer mechanism to share for bug reports.
  • The specifics of how telemetry is sent (e.g. if it’s sent as an RPC somewhere, what protocol format is used, etc.) can largely be confined to separate source files, but require some integration sprinkled across the LLDB codebase. For example, if a command handler requires a one line change to call some telemetry method on startup or shutdown and an upstream developer renames a source file, the telemetry call is likely to cause problems if maintained only as a downstream patch, but should be trivial to preserve in the new source location if it lives upstream.

Don’t we already have this?

Logging command

LLDB also has a general logging mechanism, as described with (lldb) help log. I believe this serves a different purpose, aimed more directly at providing LLDB developers with a free-form printf style logging system to debug general issues. It does not provide a way to do structured logging, which in turn presents challenges when deciding if it’s safe to log these statements.

Reproducers

As of Sept 2021, reproducers in LLDB are deprecated/discontinued, though the framework lives on in a different form. The goal of reproducers is full fidelity in capturing and replaying issues to reproduce LLDB bugs. This RFC provides a lightweight mechanism to do the same – often, the list of commands needed to get LLDB into a funky state is enough to clue an LLDB developer what the bug is.

Statistics command

The statistics dump command gives some high level information for an LLDB debugging session. This can be useful, but is also perhaps too high level for many use cases. As one example, it logs memory used overall, but doesn’t attribute memory increase by any particular command, so one would have to run statistics dump after every command to diagnose which command is causing LLDB to use more memory.

Enabling/configuring telemetry

By default, if you build LLDB from regular upstream sources, telemetry will not be enabled. Support might not even be built in.

There are two ways we might configure telemetry being enabled. At the global level, we could put everything behind a build option, so everything telemetry related is guarded by #ifdef LLDB_TELEMETRY. The only telemetry-related code not guarded by this would be related to messaging that telemetry is not built in (see later). At a finer grained level, this could be enabled or disabled with a command, e.g. (lldb) telemetry enable. If multiple telemetry destinations are registered, it can be redirected, e.g. (lldb) telemetry redirect stderr.

Specific telemetry destinations should have simple names so they can be configured as a destination in settings (e.g. "stderr" to log to stderr, "file" to log to a local file), and can have suboptions if desired (e.g. "file" would need an option to configure which file it should log to).

If telemetry is enabled, there should be an easy way to discover this. For example, lldb --help or lldb --version could print “Telemetry is NOT built in”, “Telemetry is built in, but off” or “Telemetry is enabled, sending data to <dest>” as appropriate. From within the command line interface, (lldb) telemetry show should show the current settings, or nothing if it is not built in.
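
The three status messages above could come from a single helper. The following is a minimal sketch, assuming the `LLDB_TELEMETRY` build guard proposed earlier; `TelemetryState`, `DescribeTelemetry` and `kTelemetryBuiltIn` are hypothetical names that do not exist in LLDB.

```cpp
#include <string>

// Hypothetical runtime state; not an existing LLDB type.
struct TelemetryState {
  bool enabled = false;     // toggled by e.g. "(lldb) telemetry enable"
  std::string destination;  // e.g. "stderr" or "file"
};

// Returns the line that "lldb --version" or "telemetry show" could print.
inline std::string DescribeTelemetry(bool built_in,
                                     const TelemetryState &state) {
  if (!built_in)
    return "Telemetry is NOT built in";
  if (!state.enabled)
    return "Telemetry is built in, but off";
  return "Telemetry is enabled, sending data to " + state.destination;
}

// Whether support is compiled in, following the proposed build guard.
#ifdef LLDB_TELEMETRY
constexpr bool kTelemetryBuiltIn = true;
#else
constexpr bool kTelemetryBuiltIn = false;
#endif
```

Keeping this helper outside the build guard is what would let a build without telemetry still report that fact.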

We could consider having categories within telemetry so that users can have only certain parts disabled without needing to turn it off entirely. For example, users might be OK with the telemetry that logs basic usage, but not individual commands. Or we could go even finer and say that it’s OK to log which top level commands are run, but not the full args – so (lldb) expr foo.stuff() would log just expr, and not foo.stuff(). The way this would be configured is TBD but might look more generic, like the settings command.
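
As a rough illustration of the "top level command only" idea, the sketch below reduces a command line to what would be logged at a given detail level. `CommandDetail` and `RedactCommand` are hypothetical names, and splitting on whitespace is a simplifying assumption; real LLDB command parsing is more involved.

```cpp
#include <string>

// Hypothetical detail levels for command telemetry.
enum class CommandDetail { None, TopLevelOnly, Full };

// Returns the text that would be logged for a command line at a given level.
inline std::string RedactCommand(const std::string &line,
                                 CommandDetail detail) {
  if (detail == CommandDetail::None)
    return "";
  if (detail == CommandDetail::Full)
    return line;
  // TopLevelOnly: keep the first whitespace-delimited word, drop the rest.
  const std::string::size_type start = line.find_first_not_of(" \t");
  if (start == std::string::npos)
    return "";
  const std::string::size_type end = line.find_first_of(" \t", start);
  return line.substr(start, end == std::string::npos ? end : end - start);
}
```

With this, `RedactCommand("expr foo.stuff()", CommandDetail::TopLevelOnly)` yields `expr`, so the expression text never reaches the telemetry destination.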

Logged metrics

Startup info

This gives a very high level usage metric, such as answering “how many people use LLDB, and how actively?”

  • Username and/or hostname
  • Version info (e.g. git sha, or other packaging stamps)
  • Command line options (i.e. argv)
  • Origin of launching LLDB, e.g. command line vs lldb-vscode
  • Startup time/memory used during startup

Command info

This gives a finer grained look into how various commands are used. This might be the most interesting to many people.

  • The command, as the user typed it
  • The parsed/canonical command, e.g. “v foo” is really “frame variable foo”
  • The result of the command (whether or not it succeeded)
    • If possible, we might want even more detail, e.g. whether “v foo” failed because “foo” is not a known variable versus “foo” was optimized out.
  • Performance stats, namely time (wall & cpu) and memory

Privacy concerns for command logging

Generally speaking, logging is a privacy sensitive area. LLDB developers may need fine grained information to troubleshoot issues, but personally identifiable information (PII) should be avoided wherever possible. While a downstream vendor of LLDB may be comfortable logging data about the user of LLDB, they may not be comfortable extending that logging to information in the debuggee. Specifically, if we were to log the output of (lldb) expr request.user_data, that would be including PII from an external user. For this reason, we want to take care to only record inputs to LLDB. This is most relevant with processing command logging, but applies generally.

Data formatting/expression evaluation

When a user prints a value, collect:

  • Whether this is through frame var, target variable, expr, or evaluated some other way
  • The expression printed
  • The type of the expression
  • Which formatter we selected to print it
  • Performance stats for this formatter (time/memory)

General performance

We should have a way to measure performance of important LLDB bottlenecks, such as time spent loading symbols from debug info or parsing the binary/coredump.

Shutdown info

Like startup info, this also gives high level usage metrics:

  • How long the session lasted
  • How many commands the user ran
  • If the session ended gracefully, the return code.
  • If the session ended due to a crash, a stack trace

Freeform logging

LLDB is highly extensible through the Python API, and authors of LLDB scripts may be interested in this telemetry mechanism as much as internal LLDB developers are. While it could be useful to have a general logging mechanism that anyone who runs import lldb can use to gather their own telemetry, as noted in the command logging privacy section, it’s important to weigh this against the risk of unintentionally recording PII. We should consider rejecting this type of logging entirely, or if allowing it, doing so in a way that it’s clear to the downstream implementer that it’s dangerous. For starters, we could have “unsafe” in the method names.

Testing/Support

Telemetry should not be enabled by default, and therefore should likely fall under LLVM’s Peripheral Tier of support. LLDB developers are not obligated to keep telemetry-enabled buildbots green, but would be welcome to do so.

Telemetry can be configured at the build level and at the LLDB settings level. To support testing, there should be at least one build bot with this feature enabled, but configured to send logging to a local file that the test can inspect. This means there will be at least one in-tree implementation of telemetry for testing.

Other build bots are welcome to enable this feature, although we also don’t want to be in a scenario where every build bot enables it and leaves the telemetry-disabled codepaths untested.

Consistency

The demo provided is very command-line LLDB oriented. However, we are interested in all LLDB statistics. In particular, this means:

  • Telemetry for all entrypoints, whether via the command line, lldb-vscode, or some other integration.
  • Uniform logging between command line and SB API methods

Internal format

In the demo provided, we use a very basic struct with a flat layout. Downstream implementations will have different requirements for what needs to be logged, and how it’s logged. There are two important choices a downstream vendor will want to make:

  1. The format of the data structure – a flat struct, json, proto, xml, etc. For example, the downstream vendor might want to send telemetry data via RPC, and the telemetry framework should be able to directly construct the request in whatever format the RPC wants.
  2. The schema – which fields need to get collected/logged for each flavor of telemetry. For example, one downstream vendor might want to collect the kernel version on startup, while others might not care.

Neither of these choices should live upstream, so the final product should have these two things decoupled from where logging is integrated with LLDB.

Feedback requested

This entire RFC is grounds for comments, but to get started, here are some general questions we would like to solicit feedback on:

  • Does upstream agree to a two-level enabling mechanism (build guard and LLDB settings), with the default being off? Or is it good enough to build it in, but have it disabled by default in settings?
  • Should we create a new telemetry command type or just re-use settings? (e.g. (lldb) settings set telemetry.enabled true vs (lldb) telemetry enable). Creating a new command pollutes the command space, but allows for more curated configurability.
  • Are there any other people interested in this feature being turned on? Do you think distributions might be interested in having it log to a local file, so they can ask users to upload it when submitting bug reports?
  • Are there other metrics that might be interesting to record?
  • Are there other existing mechanisms this overlaps with?

Thanks for putting this together Jordan. I concur with everything you said in the motivation section. A few releases ago, we added telemetry downstream for internal users. I’m excited to move (some of this) upstream so we can share metrics.

On the topic of privacy, I would argue we need finer grained controls than just turning telemetry on or off, at least from the developer’s/collector’s point of view. Concretely, I’d like our framework to be able to distinguish between levels of sensitivity. For example, the number of commands executed exposes no personal information, while the language of an expression is more input sensitive, but still pretty harmless, compared to say the expression itself, which could contain a snippet of source code. I could see how someone would want to collect the first type of data unconditionally, but limit the latter to internal users.

I’m supportive of doing this work upstream. As mentioned above, I think there’s value in having a common framework and sharing metrics.

Regarding the statistics command. When I looked into telemetry I originally wanted to integrate the new functionality with the existing statistics and came to roughly the same conclusions as you did here. That said, I don’t think they’re irreconcilable and I think we should at least try to unify the two, at least where it makes sense.

I don’t have a strong opinion on whether telemetry support should be possible to be disabled at compile time. I would assume that if it’s compiled out, the telemetry command shouldn’t even exist which seems to match what you are describing. For the runtime side, I like the idea of having finer grained controls because they match very closely to the sensitivity levels I described above, but I saw it more as something controlled by the collector (e.g. more info for everyone using LLDB within Google, less info for the lldb shipped with Android for example).

Regarding the metrics, at a high level, I’d like to have an easy way to opt-in or opt-out from certain metrics. For example, I personally wouldn’t want to collect the username/hostname, but I can see how someone would. I imagine something like:

RegisterMetric(Version);
RegisterMetric(CommandLine);
RegisterMetric(ProcName);

The key thing is that we should be able to have a custom configuration downstream (without merge conflict) and can cherry pick whatever metrics we want.
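
One way the opt-in list could be represented is a small registry that collection sites consult before generating a metric. This is only a sketch: `MetricKind`, `MetricRegistry` and `MakeExampleRegistry` are hypothetical names, and a real design would need a way for downstream metrics to extend the set without touching an upstream enum.

```cpp
#include <set>

// Hypothetical metric identifiers.
enum class MetricKind { Version, CommandLine, ProcName, Hostname };

// Tracks which metrics a distribution has opted in to.
class MetricRegistry {
public:
  void RegisterMetric(MetricKind kind) { m_enabled.insert(kind); }
  bool IsRegistered(MetricKind kind) const {
    return m_enabled.count(kind) != 0;
  }

private:
  std::set<MetricKind> m_enabled;
};

// Mirrors the three RegisterMetric calls above; Hostname stays unregistered.
inline MetricRegistry MakeExampleRegistry() {
  MetricRegistry registry;
  registry.RegisterMetric(MetricKind::Version);
  registry.RegisterMetric(MetricKind::CommandLine);
  registry.RegisterMetric(MetricKind::ProcName);
  return registry;
}
```

A collection site would check `IsRegistered(MetricKind::Hostname)` before gathering the hostname at all, so unregistered data is never produced in the first place.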

Personally I see no reason to make telemetry available through the SB API let alone “freeform” logging. That would certainly be a dealbreaker for us.

On the topic of testing, I’m not sure why this would fall under the “Peripheral Tier”. If we provide an upstream “backend” (like logging to file), I don’t see why someone would not have to care about this. I don’t imagine this being a huge maintenance burden.

For the format, I agree with that being a downstream decision. The way I imagine this working though is rather than a flat struct, which already combines multiple metrics, to have something more key-value based. To continue on top of the example I gave above, I might be interested in only a subset of the startup metrics and only want to record the Version, CommandLine and ProcName.

If my “scheme” bundles those 3, I see that as the responsibility of my telemetry “handler”:

class BundlingTelemetryHandler {
  // Each Emit* call stores its piece; the bundle is committed once complete.
  void EmitVersion(VersionTelemetry t) {
    m_startup_telemetry.version = t;
    EmitStartupTelemetryIfComplete();
  }
  void EmitProcName(ProcNameTelemetry t) {
    m_startup_telemetry.procname = t;
    EmitStartupTelemetryIfComplete();
  }
  void EmitStartupTelemetryIfComplete() {
    if (m_startup_telemetry.version && m_startup_telemetry.procname)
      commit(m_startup_telemetry);
  }
  ...
};

While another handler might emit them as separate metrics:

class IndividualTelemetryHandler {
  // Each metric is forwarded on its own as soon as it arrives.
  void EmitVersion(VersionTelemetry t) {
    emit(t);
  }
  void EmitProcName(ProcNameTelemetry t) {
    emit(t);
  }
  ...
};

Feedback

  • I think making it a runtime option is sufficient, but I don’t mind adding the ability to disable the feature at compile time.
  • My vote goes to a custom telemetry command because it allows us to extend it in the future. I’d like to be able to list the different metrics we’re collecting (similar to a help string)
  • We are interested in this feature and are willing to upstream our existing metrics to work with the new framework.

Excellent!

Personally I don’t have any need for it to be disabled via a compile time flag, but I wonder if people wanted to have peace of mind that they could build their own LLDB with a guarantee that it never logs by having the nuclear option. This is also why I suggested having a telemetry command included even when this compile time flag is off, but only so that telemetry show could say that it’s not enabled. I might be overthinking this. If nobody else has this concern, we can have it always built in but just default to not enabled.

I suppose it could be configured both ways, i.e. the default value is set at build time by the distributor, but configurable at runtime if the user wants to turn it off.

Totally fine with free form logging being off the table, and if we ever did want that downstream (I don’t think we do), it seems pretty easy to isolate. I guess I was just testing the waters to see if others were interested in that.
For telemetry via the SB API, it would be nice to have, but that is less important for us. I imagine SB API usage is often via scripts that are checked in and we have access to see, whereas usage via command line or lldb-vscode is entirely user-driven, and we essentially have no visibility into how our internal users are using LLDB via those entrypoints, which is why it’s more important there.

I initially disliked this idea because it sacrifices type safety, e.g. timestamps would need to be round tripped through a serialized format, but I like that it avoids over-complicating the design.


Considering your example of selecting which metrics you’d like to register and separately how your handler wants to bundle things together when it emits it, how would you handle downstream-only metrics that might exist? As a trivial example, our tools here are usually built with a process that stamps the binary with various stats about the build process (build invocation uuid, timestamp, etc.), and users can take these strings and look them up to see details of how the binary was built. So in addition to the regular git version string, we have many other build info pieces, which sort of fit into the “version” bucket. Would we write this?

RegisterMetric(Version); // Exists upstream
RegisterMetric(GoogleBuildInfo); // Downstream-only

and then patch all the telemetry handlers upstream (plus the one we have downstream) to have a void EmitGoogleVersion(...) method? Or would you expect that the set of RegisterMetric() calls is closed, and we should patch the RegisterMetric(Version) handler to also include these extra stats?

I would imagine that the emitter knows how to emit the metrics it knows about and simply discards the one it doesn’t. If we’re okay with forgoing type-safety (and use the LLVM-style RTTI instead), the Emit function would be implemented as a switch on the metric type and a default case that does nothing (and maybe asserts, to make sure you don’t have a metric enabled you don’t know how to emit). I’d expect that to be different from RegisterMetric though, which would control whether that metric is ever generated.

So to build on top of your example, let’s say we have 3 metrics: Version, Hostname and GoogleBuildInfo. Upstream LLDB knows about Version and Hostname. Only your downstream implementation knows about GoogleBuildInfo. You want to emit Version and GoogleBuildInfo, but for the sake of argument not Hostname.

So you have somewhere:

RegisterMetric(Version); // Exists upstream
RegisterMetric(GoogleBuildInfo); // Downstream-only

This means that LLDB will generate the Version and GoogleBuildInfo metric which will be passed to the handler. The handler then will do something like this:

void GoogleEmitterEmit(Metric m) {
  switch (m.GetType()) {
  case Metric::Version:
    EmitVersion(m);
    break;
  case Metric::GoogleBuildInfo:
    EmitGoogleBuildInfo(m);
    break;
  default:
    // Metrics this emitter does not know about are discarded.
    break;
  }
}

Maybe we can even template this where your downstream implementation has to implement a specialization. Alternatively, if we want type safety, we could use tablegen to define the metrics and have it generate all the code for us.
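
The templated variant might look roughly like the following. It is a sketch under assumptions: the metric structs and the `Sent()` accessor are hypothetical, and the emitter records strings in memory where a real one would send them somewhere. The primary template discards metrics the emitter does not know about, and each known metric gets an explicit specialization.

```cpp
#include <string>
#include <vector>

// Hypothetical metric payloads.
struct VersionMetric { std::string version; };
struct HostnameMetric { std::string hostname; };
struct GoogleBuildInfoMetric { std::string build_id; };

// Hypothetical downstream emitter that records what it would send.
class GoogleEmitter {
public:
  // Primary template: metrics without a specialization are discarded.
  template <typename MetricT> void Emit(const MetricT &) {}

  const std::vector<std::string> &Sent() const { return m_sent; }

private:
  std::vector<std::string> m_sent;
};

// Specializations for the metrics this emitter knows how to handle.
template <>
inline void GoogleEmitter::Emit<VersionMetric>(const VersionMetric &m) {
  m_sent.push_back("version=" + m.version);
}

template <>
inline void
GoogleEmitter::Emit<GoogleBuildInfoMetric>(const GoogleBuildInfoMetric &m) {
  m_sent.push_back("build_id=" + m.build_id);
}
```

Compared with the switch, dispatch here is resolved at compile time and each payload keeps its static type, at the cost of every call site needing to know the concrete metric type.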

So is the attack/problem here that someone writes a script that enables extra telemetry that includes PII, without the user of lldb knowing? If so I agree that simply not allowing scripts to do that is a good way to go.

Are you also volunteering to run said bots with telemetry?

I agree that we should avoid a situation where all bots have it enabled, or all bots have it disabled. For Linaro’s bots I’d leave it disabled initially.

I agree with having both. The build time option means there’s no chance for a startup script to enable it without me knowing.

I also think that having a visible sign that it has been disabled at build time is good. Either the command failing or the version string. Anything you could write a simple check for if you wanted to be 100% that your build was correct.

I am thinking here about what if I shipped my downstream toolchain to a customer of mine who had strict confidentiality requirements. I have seen this be an issue with other tools in the past.

I think it’s also useful to have a way to disable it at runtime just because maybe my logging server isn’t reachable and I’m sick of timeouts. Maybe it requires me to be on the company VPN, that sort of thing
(and in rare cases maybe doing the telemetry is crashing so you always need some escape hatch).

Do you have an idea what else this command might end up doing?

For a simple on/off obviously a setting is ideal but if you want to be adding sub commands (or have space for downstreams to do so at least) then a new command makes sense (and the name doesn’t clash with anything to do with the act of debugging).

Does this interact at all with the structured logging that was proposed a while back? (or maybe even was implemented, I forget)

I guess no because as you said logging is really for developers of lldb. So the only similarity is the structured part and for this effort you’ll want something more capable than what the relatively simple logging needs.

I am wondering if it would make more sense to disable (at build time) individual telemetry backends instead of the overall infrastructure. After all, if all you have is a simple backend that writes the data to a file, then this is no different than our existing logging infra, and we don’t give you a way to disable that. And that would mean we can run some basic tests unconditionally.

That never got past the proposal stage, but I don’t think it would matter much even if it did. That was about adding more structure to the existing log messages, but it was relatively basic structure, like simply being able to delimit individual log messages. I’d assume that the information gathered here would be a lot more structured.

We have been logging the ‘statistics dump’ output to our servers at the end of each LLDB run and it is quite informative.

What is the performance of individual commands?

Any LLDB command can take quite a long time if it is the first to trigger an expensive path in the core of the debugger. For example, the first “b main.cpp:12” will trigger all line table prologues to be parsed and any line tables with a “main.cpp” to be fully parsed. So this can make the first breakpoint command that uses file and line look more expensive than the rest. Any subsequent breakpoint command that sets things by file and line would be much cheaper as a result. So this means that even though you can try and get timings from individual LLDB commands, they might not always make sense or tell the truth about what happened internally in LLDB. The first expression to do a global name lookup might be very expensive as well, or any command that causes a global name lookup to happen, as the debug info might need to get indexed (if symbol preloading is off), and that might greatly contribute to the command’s timing. LLDB also parses debug info lazily so any command that causes a ton of debug info to have to be parsed might end up having a long run time, whereas subsequent commands would take advantage of this already parsed information and be really quick.

Are there any commands or data formatters that are particularly slow or resource inefficient?

It would be nice to keep track of these with metrics and timings, yes. But they would need to aggregate themselves in each data formatter as each time you stop they can add more time. It would be great to avoid a huge flurry of metrics being reported each time you run and stop somewhere as this can slow down the performance of the debugger, so any data gathering that is done needs to be very cheap to do and hopefully doesn’t produce a huge amount of metric packet traffic or produce any bottlenecks in the code.
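
The aggregation described here could be as simple as per-formatter running totals that are reported once, at a quiet moment, instead of once per stop. A minimal sketch, with hypothetical names (`FormatterTotals`, `FormatterStats`) and no thread safety:

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical per-formatter totals, accumulated in memory during a session.
struct FormatterTotals {
  std::uint64_t invocations = 0;
  std::chrono::nanoseconds total_time{0};
};

class FormatterStats {
public:
  // Hot-path call: one map lookup and two additions, no I/O.
  void Record(const std::string &formatter,
              std::chrono::nanoseconds elapsed) {
    FormatterTotals &totals = m_totals[formatter];
    ++totals.invocations;
    totals.total_time += elapsed;
  }

  // Hands the accumulated totals to the caller and resets the counters,
  // so one report can be sent at a quiet moment such as session end.
  std::map<std::string, FormatterTotals> TakeSnapshot() {
    std::map<std::string, FormatterTotals> snapshot;
    snapshot.swap(m_totals);
    return snapshot;
  }

private:
  std::map<std::string, FormatterTotals> m_totals;
};
```

The amount of data reported then scales with the number of distinct formatters rather than with the number of stops.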

What commands in LLDB are people using, and which commands are under-utilized?

That would be interesting. Though only for command line LLDB. Most IDE users don’t use many commands.

How many people are using LLDB, and how often?

We track this currently and it is really nice to know these numbers.

Things I don’t track yet via “statistics dump” but I want to:

  • list of expressions that are evaluated and the timings and or errors. This can help us see how expensive expression evaluation is
  • enumerating when we are unable to expand a type when displaying locals and parameters that should have been available but were removed from debug info due to -flimit-debug-info. We mark these types with special metadata and know when we have one and we attempt to find the debug info for them on the fly
  • step timings

The statistics dump command gives some high level information for an LLDB debugging session. This can be useful, but is also perhaps too high level for many use cases. As one example, it logs memory used overall, but doesn’t attribute memory increase by any particular command, so one would have to run statistics dump after every command to diagnose which command is causing LLDB to use more memory.

This is true, but it also allows us to quickly and quietly aggregate timings of things in the background and then fetch them when we want. I worry that telemetry that is constantly pushing data somewhere could affect performance by taking up needed CPU time during a debug session.

Enabling/configuring telemetry

I think enabling this with a command would be best. It would also allow us to possibly create a telemetry plug-in that could distribute the data in many different ways:

  • Send it to a server
  • Log the data as it comes in
  • Aggregate the data and send it out in chunks as JSON or other protocol

Since we won’t be able to tell exactly how people are going to want to distribute this telemetry, I would vote to make it pluggable, agree on a telemetry entry format, and allow different plug-ins to do what they want. It would also be nice to be able to opt in to the different telemetry that is available in case some is expensive to gather or maintain.
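
A pluggable design along these lines might pair a shared entry format with a small plug-in interface. The sketch below is an assumption about shape only: `TelemetryEntry`, `TelemetryPlugin` and `ChunkingPlugin` are hypothetical, and the chunking plug-in stores completed chunks in memory where a real one would serialize and send them.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical agreed-upon entry format: a name plus an opaque payload.
struct TelemetryEntry {
  std::string name;
  std::string payload;
};

// Hypothetical plug-in interface; each plug-in decides how to distribute.
class TelemetryPlugin {
public:
  virtual ~TelemetryPlugin() = default;
  virtual void Handle(const TelemetryEntry &entry) = 0;
};

// Example plug-in that buffers entries and releases them in chunks.
class ChunkingPlugin : public TelemetryPlugin {
public:
  explicit ChunkingPlugin(std::size_t chunk_size) : m_chunk_size(chunk_size) {}

  void Handle(const TelemetryEntry &entry) override {
    m_pending.push_back(entry);
    if (m_pending.size() >= m_chunk_size)
      Flush();
  }

  // Moves pending entries into a completed chunk; a real plug-in would
  // serialize and send the chunk here instead of storing it.
  void Flush() {
    if (m_pending.empty())
      return;
    m_chunks.push_back(std::move(m_pending));
    m_pending.clear();
  }

  const std::vector<std::vector<TelemetryEntry>> &Chunks() const {
    return m_chunks;
  }

private:
  std::size_t m_chunk_size;
  std::vector<TelemetryEntry> m_pending;
  std::vector<std::vector<TelemetryEntry>> m_chunks;
};
```

A "log as it comes in" plug-in would implement the same `Handle` method and write each entry immediately, so the choice of distribution strategy stays out of the code that produces entries.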

I like the data you mentioned in “Startup info”, “Command info” and “Data formatting/expression evaluation”, and would add an extra one for logging API calls.

We should have a way to measure performance of important LLDB bottlenecks, such as time spent loading symbols from debug info or parsing the binary/coredump.

We have this already in the code that helps back the “statistics dump”. Each module has “StatsDuration m_symtab_parse_time;” and “StatsDuration m_symtab_index_time;” member variables that start at zero and get incremented each time someone grabs them with a scoped variable like:

void do_some_work() {
  ElapsedTime elapsed(module_sp->GetSymtabParseTime());
  ...
}

This allows quick and easy measuring of time during potentially expensive areas of the code. I would suggest tapping into this same mechanism. We also have similar stuff in the SymbolFile class with:

virtual StatsDuration::Duration SymbolFile::GetDebugInfoParseTime();
virtual StatsDuration::Duration SymbolFile::GetDebugInfoIndexTime();

So we track the symbol table parse + index times and the debug info parse + index time. The debug info parse time constantly increases or stays the same over time.

So we can add more of these StatsDuration::Duration variables in places and aggregate results into this. You can also, in your command, grab all of these metrics up front, run your command and then get deltas from after the command is run if you want telemetry on each command, but this can take too much time to gather IMHO.
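
The "grab up front, run, take the delta" approach can be sketched as follows. The classes here are simplified, hypothetical stand-ins (`DurationCounter`, `ScopedElapsed`, `MeasureDelta`) written for illustration; they are not LLDB's actual StatsDuration or ElapsedTime implementations.

```cpp
#include <chrono>

using Duration = std::chrono::duration<double>;

// Hypothetical accumulating counter, standing in for a per-module timer.
class DurationCounter {
public:
  void Add(Duration d) { m_total += d; }
  Duration Get() const { return m_total; }

private:
  Duration m_total = Duration::zero();
};

// RAII helper: adds the lifetime of this object to a counter.
class ScopedElapsed {
public:
  explicit ScopedElapsed(DurationCounter &counter)
      : m_counter(counter), m_start(std::chrono::steady_clock::now()) {}
  ~ScopedElapsed() {
    m_counter.Add(std::chrono::steady_clock::now() - m_start);
  }

private:
  DurationCounter &m_counter;
  std::chrono::steady_clock::time_point m_start;
};

// Runs a command and returns how much the counter grew while it ran.
template <typename Fn>
Duration MeasureDelta(const DurationCounter &counter, Fn &&run_command) {
  const Duration before = counter.Get();
  run_command();
  return counter.Get() - before;
}
```

A command that triggers no parsing reports a delta of zero for that counter, which is how the first expensive "b main.cpp:12" could be told apart from later cheap ones.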

So there is a lot of overlap here with the “statistics dump” data, but the variables that track this data can easily be used by your new telemetry and we can even add more timings where we need them and expose them both via “statistics dump” and via telemetry.

You’re totally right. I glossed over the fact that rupprecht is proposing that the upstream backend be something like stderr and a local file.

I agree a build time option to remove all the telemetry code isn’t needed.

So you can also ignore this comment too. If there’s always a default backend it can always be enabled and tested.

Hi,

I’m coming from an outside perspective, in the sense that I don’t use LLDB and our company doesn’t ship it either to my knowledge. However, we do ship many other parts of the LLVM toolchain. Recently, I have been working on a downstream project that we have integrated into various downstream versions of the LLVM tools that we ship, to gather telemetry data from users of our tools. I therefore think that if you are looking to integrate telemetry into LLDB, I am strongly in favour of it being a) generic enough to work with other LLVM tools, and b) implemented in a way that we could easily attach our downstream telemetry code into it.

In our case, we have the concept of a telemetry “Session” which essentially means a run of a program for most tools. Client code configures this Session, such that telemetry data is ultimately written to a file. The data itself takes the form of events, of which we have about half a dozen different kinds (startup, durations, feature usage, errors, user interactions and the possibility for application-specific bespoke events). There is also some additional metadata like Host PC information and application information. The host PC info contains metrics such as RAM, cores, OS version, and a Machine-specific GUID (so that we can uniquely identify individual machines in our database, without directly identifying our users, unless they explicitly provide us with additional data). The application info contains things like application name and version. These events are represented in source code as distinct classes, with a common base class. This allows the event handling mechanism to process them in a homogeneous way. It would be ideal if any LLVM telemetry implementation could mirror ours, because it would make integration that much simpler. I’d be happy to provide more details of what goes in each of our events, if it is useful.

During development, it has become clear that a key/value approach is the most flexible one, even when not serializing to JSON. For example, I have implemented a serialization mechanism that uses a std::variant-style class for abstracting the different possible data types, and stores them alongside the keys, in an unordered_map. This can then be easily queried by unit tests. This class also conveniently interacts with implementing bespoke events, and with a wrapper layer for exposing to our C# clients. Unstructured data on the other hand would be largely unusable for us, since we couldn’t convert that into usable JSON. I therefore strongly recommend using a key/value approach for building up data.
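
For readers unfamiliar with the pattern, a key/value event built on std::variant and unordered_map might look like the sketch below. It requires C++17, and `TelemetryValue` and `TelemetryEvent` are hypothetical names; it is not the implementation described above, only an illustration of the general shape.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <variant>

// Hypothetical value type covering a few common telemetry field types.
using TelemetryValue = std::variant<bool, std::int64_t, double, std::string>;

// Hypothetical event: a bag of typed key/value pairs.
class TelemetryEvent {
public:
  // Pass std::string explicitly for text; with some compilers a raw string
  // literal can select the bool alternative instead.
  void Set(const std::string &key, TelemetryValue value) {
    m_fields[key] = std::move(value);
  }

  // Returns a pointer to the value if the key exists and holds a T,
  // otherwise nullptr.
  template <typename T> const T *Get(const std::string &key) const {
    auto it = m_fields.find(key);
    if (it == m_fields.end())
      return nullptr;
    return std::get_if<T>(&it->second);
  }

private:
  std::unordered_map<std::string, TelemetryValue> m_fields;
};
```

Because lookups are typed, a unit test can ask for a specific key and type and get `nullptr` on a mismatch, and a serializer can visit each value to produce JSON or any other format.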

With regards to enabling/disabling, the telemetry code is always present in the final product, but users are opted out by default. During installation, the user has the option to enable telemetry (following reading an appropriate legal agreement). Enabling telemetry sets an environment variable, which is checked by the runtime code. Presenting the telemetry opt-in at install time ensures visibility of it. Otherwise, I suspect it would be unlikely people would opt-in. I think the specific opt-in mechanism is probably best to be determined by the downstream vendors, because each will have their own mechanism for distribution and installation, and also their own requirements as to how to opt in or out of the process. For example, as we ship on Windows only, we also provide a registry kill switch that can be configured by sysadmins to disable telemetry across a set of machines, even if the user opts-in. Perhaps the approach should be for LLVM to provide a base interface that the downstream vendors implement. One question that I do have is if you are going to provide telemetry with an LLVM-level opt-in, do you need some sort of legal review of it, to ensure you have appropriate permission to gather data? Another point is that if enabling/disabling is not provided at build time, the “disabled” code path for telemetry needs to be extremely fast, so that it doesn’t impact users (indeed “enabled” should be pretty fast too, but it’s not quite so essential as “disabled”).
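
To illustrate what a fast "disabled" path with an environment-variable opt-in could look like, here is a minimal sketch. The variable name `EXAMPLE_TELEMETRY_OPT_IN` and the functions `ParseOptIn`, `TelemetryEnabled` and `IfTelemetryEnabled` are made up for this example, and treating only the value "1" as opted in is an arbitrary assumption.

```cpp
#include <cstdlib>
#include <cstring>

// Interprets the value of a hypothetical opt-in environment variable.
// Anything other than exactly "1" is treated as "not opted in".
inline bool ParseOptIn(const char *value) {
  return value != nullptr && std::strcmp(value, "1") == 0;
}

// Reads the environment on first use; later calls return the cached result.
// EXAMPLE_TELEMETRY_OPT_IN is a made-up variable name.
inline bool TelemetryEnabled() {
  static const bool enabled =
      ParseOptIn(std::getenv("EXAMPLE_TELEMETRY_OPT_IN"));
  return enabled;
}

// Hot-path guard: when telemetry is disabled, the callable that builds and
// sends the event is never invoked, so its cost is never paid.
template <typename BuildAndSend>
void IfTelemetryEnabled(BuildAndSend &&build_and_send) {
  if (TelemetryEnabled())
    build_and_send();
}
```

Wrapping event construction in the callable matters as much as the cached flag: the expensive part is usually gathering the data, not checking whether to send it.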

Aside from the user-level opt-in/out, there probably need to be hooks so that downstream vendors can configure what kinds of telemetry data are gathered. For example, some vendors may only support specific data kinds, and therefore there’s no point in generating the data for other kinds. I think it would be sufficient for this to be a build-time only option, or possibly something that can be determined by individual downstreams, in case they choose to expose different levels of telemetry gathering to their users.

One thing that does need considering is the volume of data produced by the tool, because the backends need to be able to handle it in a timely manner without it degrading the program performance. In practice, we’ve found we have to think carefully about aggregating data into single accumulated blocks, rather than lots of repeated events, because otherwise our servers can’t handle it fast enough.
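The aggregation idea can be sketched roughly as follows. `Aggregate` and `TelemetryAggregator` are hypothetical names, and the choice of a count plus a total duration per key is an assumption for illustration, not a description of any particular backend.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical accumulated record for one event key.
struct Aggregate {
  std::uint64_t count = 0;    // how many times the event occurred
  std::uint64_t total_us = 0; // summed duration in microseconds
};

// Accumulates repeated events in memory and hands them to the backend
// as a single block, instead of sending one message per occurrence.
class TelemetryAggregator {
public:
  void record(const std::string &key, std::uint64_t duration_us) {
    Aggregate &a = pending_[key];
    ++a.count;
    a.total_us += duration_us;
  }

  // Returns everything accumulated so far and resets the internal state.
  std::map<std::string, Aggregate> flush() {
    std::map<std::string, Aggregate> out;
    out.swap(pending_);
    return out;
  }

private:
  std::map<std::string, Aggregate> pending_;
};
```

With this shape, a thousand single-step operations become one entry with `count == 1000`, and the backend sees one block per flush (for example at session end) rather than a thousand events.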

Regarding the kind of data gathered, we don’t gather user IDs explicitly. This is because User IDs ARE PII. We do have plans to somehow have a unique user ID that is untraceable to the user without them explicitly providing it to us, so that we can track a user’s behaviour across multiple telemetry sessions, but the mechanism for that is TBD. We also avoid having any free-form strings (including file names, command-line arguments etc) in the data, where that string could potentially originate from user input. This is because it would be too easy for a user to mistype and accidentally leak sensitive information into the telemetry data (have you ever accidentally sent a password via a Slack or other IM chat window when you thought focus was elsewhere?). Feature usage is instead a set of hard-coded strings, usually taken directly from the option names that are used. In the LLDB use-case, it might be that these names are the command names, after validation has occurred (i.e. unknown commands are not reported as telemetry, or simply reported as “unknown command”).
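One way to keep user-typed text out of the data is to report only names from a fixed allowlist. The following is a minimal sketch assuming C++17; `sanitizeCommandName` and the four listed command names are illustrative only, and a real implementation would presumably derive the list from the commands actually registered with the interpreter.

```cpp
#include <string_view>
#include <unordered_set>

// Maps an already-resolved command name to a value that is safe to report.
// Only hard-coded literals are ever returned, so nothing typed by the user
// can reach the telemetry data.
inline std::string_view sanitizeCommandName(std::string_view resolved_name) {
  // Illustrative allowlist only; not the full set of LLDB commands.
  static const std::unordered_set<std::string_view> kKnown = {
      "breakpoint set", "expression", "frame variable", "thread backtrace"};

  auto it = kKnown.find(resolved_name);
  if (it == kKnown.end())
    return "unknown command";
  // Return the stored literal rather than the caller's buffer.
  return *it;
}
```

A password pasted at the prompt would simply be reported as “unknown command”, which still tells the maintainer that unrecognised input occurred without recording what it was.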

Regarding testing and build bots - I don’t think this should be peripheral tier. It’s not going to be that hard to maintain, if done well, and is going to have adoption by multiple users, I suspect. I don’t see why this would be any different. Any build bot that is configured to have telemetry enabled would need to be building the whole of LLVM ultimately, if we are going to go down the route of providing a general-purpose solution.

That’s about all my thoughts for now.

There are two knobs I would like users to have:

  1. The ability to turn bits of it on or off, not just the whole thing. If I’m doing sensitive work, I might be OK with logging the fact that I ran LLDB, but not the name of the binary I’m debugging or the commands I’m running.
  2. Destination-specific configuration, e.g. when logging to a file, the option to select which file it gets sent to.

I hope we can limit it to just a handful of log entries (usually just one) for every interaction. For anything that can be unbounded, it should be aggregated before logging.
For interactive debugging, I wouldn’t expect latency to be a problem if we get aggregation right. But for anyone using LLDB in a script, we would have to make sure logging is rate limited. That might be up to the downstream implementation – throttling would be important if you’re sending each telemetry item as an RPC, maybe not so much if you’re just appending to a file.
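For the scripted case, throttling could look something like a token bucket. This is only a sketch of one common rate-limiting technique, with hypothetical names (`TokenBucket`, `tryAcquire`); the time point is passed in explicitly so the behaviour can be tested without sleeping.

```cpp
#include <algorithm>
#include <chrono>

// Token bucket: allows bursts of up to `capacity` events, then at most
// `refill_per_second` events per second on average.
class TokenBucket {
public:
  using Clock = std::chrono::steady_clock;

  TokenBucket(double capacity, double refill_per_second,
              Clock::time_point now = Clock::now())
      : capacity_(capacity), refill_per_second_(refill_per_second),
        tokens_(capacity), last_(now) {}

  // Returns true if one event may be sent now; false means the caller
  // should drop the event or fold it into an aggregate instead.
  bool tryAcquire(Clock::time_point now = Clock::now()) {
    std::chrono::duration<double> elapsed = now - last_;
    last_ = now;
    // Refill for the elapsed time (never negative), capped at capacity.
    double refill = std::max(0.0, elapsed.count()) * refill_per_second_;
    tokens_ = std::min(capacity_, tokens_ + refill);
    if (tokens_ < 1.0)
      return false;
    tokens_ -= 1.0;
    return true;
  }

private:
  double capacity_;
  double refill_per_second_;
  double tokens_;
  Clock::time_point last_;
};
```

A downstream that sends each item as an RPC might gate the send on `tryAcquire()`, whereas one that appends to a local file could reasonably skip the limiter altogether.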

+1. I expect there will be overlap on the backend, I’ll take a closer look at how statistics dump tracks things so I can reuse/extract that.

Excellent! I didn’t consider users outside of LLDB would be interested in this too. For simplicity I will probably develop it in the LLDB tree, but I’ll make sure it stays generic and compartmentalized so that it can be moved out of the tree when the time comes.

Stringly-typed development is not my first choice, but it makes a lot of sense for this library. I agree this is the style we should go with.

IANAL, but I think the answer is yes. Not within LLVM, but within your own company that distributes LLVM-based binaries. For a major use case of a company shipping a toolchain for its own employees to use & gather data on how their employees use it, I think it will not be a lengthy review. If Sony wants to ship this to game developers, you would probably want to go through something more formal. But again, IANAL :stuck_out_tongue:

Yes, it is! Usually there is a distinction between types of PII, and corp PII is in a different bucket than end user PII, and has different technical requirements. My employer has authorization to log PII related to me specifically (my username, when I used a tool, how I used it, etc.), but if I use that tool to crack open a coredump from a server handling PII and some of that PII gets logged through my use of the tool, my employer might not have authorization to gather that user’s PII.

We do need to be careful with free-form strings, but it would really make the telemetry a lot less effective if we didn’t have some method of free-form string collection. We’re actually specifically interested in several string-based metrics, e.g. we want the filename so we can reproduce issues with LLDB being slow on large binaries, and we want the expression/data type/selected data formatter for evaluating an expression so we can investigate correctness and performance issues with that. My feeling is that these kinds of things are relatively low risk. But sure, having the LLDB prompt sitting there, waiting for you to accidentally type your password into the wrong window, is certainly a risk. I think that’s where configurability comes in – the amount of telemetry that gets collected should be controllable by the end user if you know that you’re prone to carelessly entering in passwords. And I think any entity that does telemetry collection has an obligation to provide a purge process where you can request your records to be expunged, specifically in cases like this.

My suggestion to put in the peripheral tier was back when I wasn’t sure anyone else would be interested in this and we would have to support this on our own as a unique configuration, but it sounds like that’s not the case :smiling_face_with_three_hearts:
Instead of anything being in the peripheral tier, it sounds like there will be a basic always-available (but off by default) implementation that just writes to a local file, and those tests can run on any build bot. The more advanced configuration (which metrics to collect & how to log it to your internal telemetry data destination) will live downstream anyway, so will be supported downstream-only.

This is fine for it to be available, but I think it’s important to make sure that it’s easy for downstream vendors to select which knobs they wish to have on/off by default/made available to users etc. In our case, for example, we’ll never want to expose the ability for users to capture input filenames, because of the PII issues discussed elsewhere.

More generally, I think it would be useful to have some sort of configuration hook that vendors can make use of. In that hook, vendors who wish to provide this sort of knob can easily do so, whilst other vendors can use it for other things. In our case, we’d probably use it to make sure the user has opted into telemetry and configure things accordingly, or something along those lines.
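A configuration hook of this kind might be shaped roughly like the sketch below. Everything here (`TelemetryConfig`, its fields, `ConfigHook`, `TelemetryConfigurator`) is hypothetical and only meant to show the idea of the vendor getting the final say over what the user requested.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical settings block; the fields are examples, not a proposal.
struct TelemetryConfig {
  bool enabled = false;                 // off unless explicitly turned on
  bool allow_free_form_strings = false; // e.g. file names, expressions
  std::string destination;              // meaning is vendor-specific
};

// A vendor-supplied callback that may inspect and override the settings.
using ConfigHook = std::function<void(TelemetryConfig &)>;

class TelemetryConfigurator {
public:
  void setVendorHook(ConfigHook hook) { hook_ = std::move(hook); }

  // Starts from what the user asked for, then lets the vendor hook
  // (if one is installed) adjust or veto individual settings.
  TelemetryConfig resolve(TelemetryConfig user_request) const {
    if (hook_)
      hook_(user_request);
    return user_request;
  }

private:
  ConfigHook hook_;
};
```

A vendor that never wants input file names captured would install a hook that forces `allow_free_form_strings` to `false`; another vendor could use the same hook to check that the user has opted in before leaving `enabled` set.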

Was this a typo (i.e. “strongly-typed”) or did you mean something about strings here?

Don’t worry, we’ve already got our downstream version working and have had the relevant lawyers look at it. I’m thinking more about if there’s ever a plan for upstream LLVM to gather the data itself, although that might not really be relevant at this time.

Right - there’s clearly a difference that depends on the use case. If you’re using telemetry to gather data about what’s going on within your own company, then the sensitivity of data is clearly different to when you’re using it to gather data transmitted by a third-party.

It depends on what you mean by low-risk. In our use case, where we’re gathering data from a game developer’s desktop machine, we clearly have to be careful we don’t accidentally gather a company’s secret intellectual property, or even anything that might hint at what they’re doing in terms of e.g. writing a sequel for a game or whatever. Filenames often have this sort of information in. Indeed, you may have to be careful even within a company, for example if different individuals/departments have different secrecy levels (e.g. Top Secret department is happy to have their feature usage information gathered, but not filenames because of issues). Either way, as you say this is where configuration controls come in, as discussed above.