There are two knobs I would like users to have:
- The ability to turn bits of it on or off, not just the whole thing. If I’m doing sensitive work, I might be OK with logging the fact that I ran LLDB, but not the name of the binary I’m debugging or the commands I’m running.
- Destination-specific configuration, e.g. when logging to a file, the option to select which file it gets sent to.
Keeping track of these with metrics and timings would be nice, yes. But they would need to aggregate themselves in each data formatter, since every time you stop they can add more time. It would be great to avoid a huge flurry of metrics being reported each time you run and stop somewhere, as this can slow down the debugger. So any data gathering that is done needs to be very cheap, and should not produce a huge amount of metric packet traffic or introduce any bottlenecks in the code.
I hope we can limit it to just a handful (usually just one) log for every interaction. For anything that can be unbounded, it should be aggregated before logging.
For interactive debugging, I wouldn’t expect latency to be a problem if we get aggregation right. But for anyone using LLDB in a script, we would have to make sure logging is rate limited. That might be up to the downstream implementation – throttling would be important if you’re sending each telemetry item as an RPC, maybe not so much if you’re just appending to a file.
So there is a lot of overlap here with the “statistics dump” data, but the variables that track this data can easily be used by your new telemetry, and we can even add more timings where we need them and expose them both via “statistics dump” and via telemetry.
+1. I expect there will be overlap on the backend; I’ll take a closer look at how statistics dump tracks things so I can reuse/extract that.
I therefore think that if you are looking to integrate telemetry into LLDB, I am strongly in favour of it being a) generic enough to work with other LLVM tools, and b) implemented in a way that we could easily attach our downstream telemetry code into it.
Excellent! I hadn’t considered that users outside of LLDB would be interested in this too. For simplicity I will probably develop it in the LLDB tree, but I’ll make sure it stays generic and compartmentalized so that it can be moved out of the tree when the time comes.
During development, it has become clear that a key/value approach is the most flexible one, even when not serializing to JSON.
Stringly-typed development is not my first choice, but it makes a lot of sense for this library. I agree this is the style we should go with.
One question that I do have is if you are going to provide telemetry with an LLVM-level opt-in, do you need some sort of legal review of it, to ensure you have appropriate permission to gather data?
IANAL, but I think the answer is yes – not within LLVM, but within your own company that distributes LLVM-based binaries. For the common case of a company shipping a toolchain for its own employees to use, and gathering data on how those employees use it, I don’t think it will be a lengthy review. If Sony wants to ship this to game developers, you would probably want to go through something more formal. But again, IANAL.
Regarding the kind of data gathered: we don’t gather user IDs explicitly, because user IDs are themselves PII.
Yes, it is! Usually there is a distinction between types of PII: corp PII is in a different bucket than end user PII, and has different technical requirements. My employer has authorization to log PII related to me specifically (my username, when I used a tool, how I used it, etc.), but if I use that tool to crack open a coredump from a server handling PII, and some of that PII gets logged through my use of the tool, my employer might not have authorization to gather that user’s PII.
We also avoid having any free-form strings (including file names, command-line arguments etc) in the data, where that string could potentially originate from user input.
We do need to be careful with free-form strings, but the telemetry would be a lot less effective if we didn’t have some method of free-form string collection. We’re actually specifically interested in several string-based metrics: e.g. we want the file name so we can reproduce issues with LLDB being slow on large binaries, and we want the expression/data type/selected data formatter when evaluating an expression so we can investigate correctness and performance issues there. My feeling is that these kinds of things are relatively low risk. But sure, having the LLDB prompt sitting there, waiting for you to accidentally type your password into the wrong window, is certainly a risk. I think that’s where configurability comes in – the amount of telemetry that gets collected should be controllable by the end user if you know you’re prone to carelessly entering passwords. And I think any entity that collects telemetry has an obligation to provide a purge process, where you can request that your records be expunged, specifically in cases like this.
Regarding testing and build bots: I don’t think this should be in the peripheral tier. It’s not going to be that hard to maintain, if done well, and I suspect it is going to be adopted by multiple users.
My suggestion to put it in the peripheral tier was from back when I wasn’t sure anyone else would be interested in this and we would have to support it on our own as a unique configuration, but it sounds like that’s not the case.
Instead of anything being in the peripheral tier, it sounds like there will be a basic always-available (but off by default) implementation that just writes to a local file, and those tests can run on any build bot. The more advanced configuration (which metrics to collect and how to log them to your internal telemetry data destination) will live downstream anyway, so it will be supported downstream-only.