RFC: Python callback for data formatters type matching

slackito · August 2, 2022, 7:08am

Introduction

lldb currently only supports matching data formatters with types based on the name of the type (either by plain string matching or regex). This is enough most of the time: when you’re writing a prettyprinter, you usually know the name of the type you’re dealing with. However, in some cases it’s useful to be able to inspect the type as a part of the matching logic.

The motivating example for this RFC is prettyprinting protobuf (Protocol Buffers) messages. The protobuf compiler takes a message definition like this:

message SomeData {
  optional uint32 field1 = 0;
  repeated string field2 = 1;
  ...
}

and generates a C++ struct with a member for each message field:

class SomeData : public protobuf::Message {
  // - fields from the original message
  // - presence bits to indicate if optional fields have been set
  // - API member functions: accessors, serialization...
  // - etc.
}

With gdb, we can have a single python script that can prettyprint any protobuf message. We can do this because gdb allows prettyprinters to inspect the value that is being printed before selecting a particular prettyprinter: messages can have any arbitrary name, so the way to detect if a value is a protobuf message is to look up the inheritance hierarchy looking for the protobuf::Message base class. This is currently unfeasible to do with lldb, because lldb chooses the data formatter based on the type name alone.

Proposal

We’d like to propose adding an optional Python callback that would be called during formatter matching. Once we have a string/regex match with a candidate type, if no matching function was specified we stop looking because we have a match (this is the current behavior). However, if we have a callback function, we would then run the callback and pass the type to it. The callback can then inspect the type and decide whether or not to keep the match. If it returns True, we’ll keep the match and run that formatter, if False, we’d ignore that match and continue looking in the usual order.

We have a working prototype consisting of the following modifications:

Add a matching_function_name member to SBTypeNameSpecifier to allow API users to set callbacks on formatter registration.
Add to FormattersMatchCandidate objects the TypeImpl for the type being printed, and a pointer to the current ScriptInterpreter. We need this in order to run the callback on matches, and pass the type to it.
Add a new method to ScriptInterpreter and some SWIG glue to actually run the callback.
Finally, some modifications across the formatter matching code to plumb the new data down to the point where matching is performed.

Here’s some example code that shows what a type summary using a Python matching function would look like in that prototype:

def has_member_x(t, internal_dict):
    return len([x for x in t.get_members_array() if x.GetName() == "x"]) > 0

def print_member_x(valobj, internal_dict):
    return "something with a member x=%s" % valobj.GetChildMemberWithName("x")

cat = lldb.debugger.CreateCategory("test")
t = lldb.SBTypeNameSpecifier(".*", True, "has_member_x")
cat.AddTypeSummary(t, lldb.SBTypeSummary.CreateWithFunctionName("print_member_x"))
cat.SetEnabled(True)

You can take a look at the (WIP, not ready for review yet) patch here: ⚙ Diff View

With this new feature, we could solve the example problem above by registering a formatter with a catch-all ".*" regex and an appropriate callback to see if the type we got derives from protobuf::Message.

Performance impact

This feature should have negligible performance impact when not used. However, adding a formatter that uses a catch-all regex with a matching function will cause the callback function to be run once for each new type that gets printed (the formatter will be cached after first use for each particular type).

In order to quantify the performance impact in this case, We’ve prepared a test program that defines a struct with 1000 indirect members, each of a different type (source). Then we’ve added in lldb a summary function like the example above, using a matching function. Finally, we’ve timed the execution of lldb.debugger.HandleCommand("fr var big") twice, to see any differences between the first and successive runs.

For the baseline we’ve added a summary function with a simple regex and timed that, twice.

We’ve run two separate experiments:

one where the formatters don’t match any of the struct members, to measure the overhead from calling a simple matching function repeatedly, not the formatter itself.
one where the formatter always matches, and returns a fixed string. This allows us to check the execution time when the formatters are cached.

The following tables show the average running time over 10 runs:

Non-matching formatters	First time	Second time
catch-all regex + matching function	0.193 s	0.154 s
non-matching regex	0.152s	0.113 s

Matching formatters	First time	Second time
catch-all regex + return True matching function	0.187 s	0.139 s
catch-all regex only	0.181 s	0.135 s

jingham · August 2, 2022, 7:16pm

| slackito
August 2 |

| - |

Introduction

lldb currently only supports matching data formatters with types based on the name of the type (either by plain string matching or regex). This is enough most of the time: when you’re writing a prettyprinter, you usually know the name of the type you’re dealing with. However, in some cases it’s useful to be able to inspect the type as a part of the matching logic.

The motivating example for this RFC is prettyprinting protobuf (Protocol Buffers) messages. The protobuf compiler takes a message definition like this:
message SomeData {
  optional uint32 field1 = 0;
  repeated string field2 = 1;
  ...
}
and generates a C++ struct with a member for each message field:
class SomeData : public protobuf::Message {
  // - fields from the original message
  // - presence bits to indicate if optional fields have been set
  // - API member functions: accessors, serialization...
  // - etc.
}
With gdb, we can have a single python script that can prettyprint any protobuf message. We can do this because gdb allows prettyprinters to inspect the value that is being printed before selecting a particular prettyprinter: messages can have any arbitrary name, so the way to detect if a value is a protobuf message is to look up the inheritance hierarchy looking for the protobuf::Message base class. This is currently unfeasible to do with lldb, because lldb chooses the data formatter based on the type name alone.

The part where we pass the type to some matcher function to complete the match seems okay, but the part of this change were we’re making it really attractive to write .* matchers does give me the willies. I’m a little worried about performance - 100 elements in the locals view is not a crazy number, and if there were not one but 10 of these matchers, that’s up to your 1000 matches and we’re adding .2 seconds to every stop for any UI that renders all the locals. That’s not trivial. Plus, once there are a whole bunch of .* matchers running around, figuring out what formatter applies to what type will get more complicated - from the standpoint of somebody trying to manage all these formatters that’s not a great addition!

In your case, you are introducing a pretty general mechanism but you really want something much simpler. You want to add an attribute to the formatter for the base class that says “use me for the full class.” You would have to add a scan for base classes if you didn’t get a match for the full type in the formatter matching, returning the base class formatter as the derived class summary if there was such a summary. I don’t think that would be terribly hard. If you required the summaries that override base classes to mark themselves as such, this would be the same 10 special formatters doing a simple match against the base class types of an object type, which should be pretty cheap. This also seems a fairly natural extension.

That’s this case, but maybe you have other use cases that aren’t this simple request that further motivates this addition? Adding this level of complexity to the matching - which people already struggle to figure out - worries me a bit, but if there’s no other way to do what you want, this isn’t a terrible way to do it.

Jim

slackito · August 3, 2022, 9:32pm

100 elements in the locals view is not a crazy number, and if there were not one but 10 of these matchers, that’s up to your 1000 matches and we’re adding .2 seconds to every stop for any UI that renders all the locals

I’m not sure if “adding .2 seconds to every stop” is a correct interpretation of the data I posted.

IIUC the case that is most relevant for latency is the non-matching case, because formatters that match a type will be cached after the first lookup, bypassing all successive invocations of the callback. If we look at the “Non-matching formatters” table, the “non-matching regex” row is the current state of things. It is testing lldb with the default set of formatters, plus a summary formatter that has a regex that doesn’t match anything in the program and doesn’t have a matching callback. Basically adding a plain regex matcher to the defaults so that the baseline has the same number of registered matchers to try to get a slightly less biased comparison.

The table compares it against the worst new case this proposal introduces, which is a .* regex with a callback. That is, a matcher where the callback is always called.

So we have 0.152 s increasing to 0.193 s when a matching function is introduced (first time), and 0.113 s increasing to 0.154 s (second time). That’s about 40ms added by running the callbacks for 1000 match attempts.

Does this sound right or am I overlooking something?

This is not to say it’s not a significant increase. Adding ~40 ms to something that takes ~110 ms (in the particular case we’re discussing) is still an almost 40% latency increase, and I agree this can a problem in interactive usage. But we should agree on the numbers so we can properly evaluate the trade-off.

By the way, please feel free to suggest any other performance test that you think would be important to evaluate this proposal and I’ll do my best to collect the data.

That’s this case, but maybe you have other use cases that aren’t this simple request that further motivates this addition? Adding this level of complexity to the matching - which people already struggle to figure out - worries me a bit, but if there’s no other way to do what you want, this isn’t a terrible way to do it.

We decided to go with this approach first because it seemed better to implement something flexible than a more ad-hoc solution, and something like this is closer to functionality gdb offers, so narrowing that gap looked like a good idea.

We don’t have any other specific need for the added flexibility right now. Talking to @labath a while ago, he mentioned that this could be a slightly better way to deal with a conflict between libc++ and libstdc++ prettyprinters, where to avoid claiming std::__cxx11::list (used by libstdc++), lldb currently registers the libc++ formatter with "^std::__([A-Zabd-z0-9]|cx?[A-Za-wyz0-9]|cxx1?[A-Za-z02-9]|cxx11[[:alnum:]])[[:alnum:]]*::list<.+>(( )?&)?$".

On one hand this is a pretty minor thing and probably falls into “making .* attractive will cause more .* formatters to be written” but, on the other hand, I feel like this tension suggests it can be actually a useful feature? In the sense that there is some unmet demand and users would supposedly jump at the chance to use it.

In any case, your suggestion would solve the problem with protobufs so I’ll give it a try and I’m happy to go with that one instead if it works and we all agree it’s the best outcome

jingham · August 3, 2022, 11:44pm

It seems like this feature is mostly going to be useful for .* matchers. After all, if you can reasonably write a regex that covers all the cases your formatter might want to format, then you could have the formatter do them all, the callback wouldn’t be needed. And so far as I can see there aren’t any great use cases for .* matches otherwise, so I was charging this feature with introducing .* matchers at all.

I didn’t see from you numbers how this would scale with number of .* matchers. If you add this either it won’t see any use or we’ll get a decent handful of these .* matchers, so we should know what that impact is.

I think Pavel’s use case would be better dealt with by having a human-readable way to set the std version explicitly on the formatter, then lldb would do the job of detecting the right std prefixes. As with the protobuf case, it seems like we’re fixing an explicit problem at too high a level, in this case that going from a reasonable description of the version of the STL you are targeting → a match string is gross and we should help people out there. In the protobuf case it’s having a base class formatter work for the full object not just for printing the base class.

I am a little worried about performance, but I doubt we’re much slower than gdb at this sort of thing, and it seemingly works there? It does help that you only give the callback one shot at any given type. I was thinking you would want to support things like “If I have debug information for type Y, then I can format type X” so you’d want to have the callback be live each time waiting to see if a debug information for Y shows up. That would require keeping the callback live. But it makes things more efficient if you only get one shot.

I also know that it’s already pretty confusing figuring out how the matching gets done from the outside. This adds a wildcard chooser to the picture, and I worry will overwhelm these meagre facilities.

But if you are excited about this as a general idea, and it looks from your numbers like ~10 .* formatters with callbacks and 1000 locals is still on the order of .1 sec, then go for it.

Jim

jingham · August 4, 2022, 1:18am

BTW, if after some thought it seems like this is really going to be a feature for .* formatters then you could just make it unconditional so you aren’t bothering with a pointless regex compare. Just keep these in a separate group and run them after the cached matches and exact & regex formatters. That probably won’t make the implementation much simpler, but it would remove unnecessary complexity from the UI.

Jim

slackito · August 13, 2022, 2:30am

Hi again!

I’ve been trying to figure out how I would implement the specific solution of adding an attribute to the formatter.

The most obvious way given how the code is structured would be to add a flag similar to, for example, eTypeOptionCascade. Then we would treat it similar to cascading through typedefs: modify FormatManager::GetPossibleMatches to recursively add all base classes to the vector of FormattersMatchCandidate and later use the flag to decide if a regex match on a candidate is actually a match or not. A strawman patch would look like this: ⚙ Diff View

However, this also seems potentially inefficient: it would require expanding the candidate set with all base classes for every thing that gets printed, which could be a lot if there’s a deep hierarchy.

So I came back to re-read the original response and now I think you might be suggesting something completely different:

You would have to add a scan for base classes if you didn’t get a match for the full type in the formatter matching, returning the base class formatter as the derived class summary if there was such a summary.

So, if I understand correctly, this means:

Check the cache. If it matches we’re done
Do the whole search (registered categories, applicable language categories, hardcoded formatters)
If it fails, check if we have any formatters that have this new flag
2.1 If there are any, start a whole new search looking for matches against all base classes.
Cache the result of the whole search at the end*

I can think of a couple of potential issues with this second approach:

It would add another layer to category priorities: we’d scan all categories in order looking for a formatter, then scan all of them again looking for a “matches derived classes” formatter. So a lower-priority but specific formatter would win against a higher-priority but only matching a base class formatter. I think this is not unreasonable, but it might be unintuitive for users.
We’d have to do it carefully to do the right thing in combination with the other flags (e.g. when the base class is a typedef and the formatter is set to cascade)

Am I understanding correctly? I don’t have a lot of experience with lldb code, and I also don’t have a clear mental model of what kind of code base a typical user has, so it’s hard for me to decide what is the right trade-off between the different options here.

* Aside: The formatter cache doesn’t cover all searches right now. Currently the cache is updated inside FormatManager::GetCached and then after returning from GetCached we try LanguageCategories and hardcoded formatters (link to the relevant part of the code). Is this something that should be fixed or is there a reason not to cache these results?

slackito · August 16, 2022, 11:45pm

Then we would treat it similar to cascading through typedef s: modify FormatManager::GetPossibleMatches to recursively add all base classes to the vector of FormattersMatchCandidate and later use the flag to decide if a regex match on a candidate is actually a match or not.

Never mind about this particular part of the post. The naive approach described here is even worse than I thought. It’s quadratic with the depth of the inheritance hierarchy when there’s no match: we loop through N base classes, then when the debugger tries to print the base class as a member it results in looping through N-1 base classes, and so on.

labath · August 17, 2022, 8:40am

FWIW, the same thing will happen with the original .*+python callback version, although the callback could theoretically do some internal caching to avoid traversing the hierarchy twice. If we wanted to avoid that in the eTypeOptionCascade version, we’d need to implement a similar cache inside lldb.

slackito · August 17, 2022, 10:40pm

FWIW, the same thing will happen with the original .*+python callback version

You’re right. I just went back to our gdb protobuf prettyprinter and it actually doesn’t recurse up the hierarchy. It just looks one level up.

The rationale is that if we print any derived class as a protobuf we’d omit extra fields added to a manually-created subclass, so it intentionally matches only direct subclasses.

So one way to sidestep the performance problem would be to make the new flag eTypeOptionAlsoMatchDirectlyDerivedClasses. It looks a little too ad-hoc for my taste, though

slackito · August 31, 2022, 12:08am

Trying to move this forwards once again, given the performance problems I’ve encountered with quadratic lookups in my attempt of adding a flag to make a formatter match derived classes, I’m more inclined to go with the Python callback: the pushback wasn’t very strong, the performance impact seems acceptable (and it should be minimal after the first lookup for each type given the result will be cached), and I think it’s worth the added flexibility.

For what it’s worth, I think the quadratic lookups with the flag solution should be fixable by adding caching to some intermediate steps, but that feels like a much more intrusive change: different parts of the matching process happen in different classes, and it’s no longer enough to cache just the type name and the end result of the lookup, we’d have to include in the cache whether any given match candidate skips typedefs/pointers/references/base classes and if we found a formatter for it in those conditions.

Assuming we go with the callback option, I’m not sure how to expose it from the API. Right now all the calls that add formatters to categories take a SBTypeNameSpecifier to specify what to match. So the easiest way (and what I did in my original patch) is to add the matching callback to SBTypeNameSpecifier as an optional member. Thus having a catch-all ".*" regex and a callback.

However, having a pointless regex is not ideal:

and I’m not sure what’s the best way to add this to the API.

If we skip the regex, it doesn’t make a lot of sense to put a lone callback in an SBTypeNameSpecifier, as we’re no longer specifying a name. So I’m thinking about introducing a new SBFormatterMatchingCallback (or similar) type, so we can introduce new overloads in SBTypeCategory:

category.AddTypeFormat(SBFormatterMatchingCallback cb, SBTypeFormat type_format)
category.AddTypeSummary(SBFormatterMatchingCallback cb, SBTypeSummary type_summary) 
category.AddTypeFilter(SBFormatterMatchingCallback cb, SBTypeFilter type_filter)
category.AddTypeSynthetic(SBFormatterMatchingCallback cb, SBTypeSynthetic type_synthetic)
# (and matching Delete* functions)

and then, for each category, have another separate map for callback-based formatters in FormatterContainerPair in addition to the existing ones for exact matchers and regex matchers (and rename it to FormatterContainerGroup`?)

Does this make sense? Any advice would be much appreciated, as I’m still not very familiar with the codebase and I would like to do it in the most idiomatic way possible, without “fighting” the existing design.

jingham · August 31, 2022, 1:27am

| slackito
August 31 |

| - |

Trying to move this forwards once again, given the performance problems I’ve encountered with quadratic lookups in my attempt of adding a flag to make a formatter match derived classes, I’m more inclined to go with the Python callback: the pushback wasn’t very strong, the performance impact seems acceptable (and it should be minimal after the first lookup for each type given the result will be cached), and I think it’s worth the added flexibility.

For what it’s worth, I think the quadratic lookups with the flag solution should be fixable by adding caching to some intermediate steps, but that feels like a much more intrusive change: different parts of the matching process happen in different classes, and it’s no longer enough to cache just the type name and the end result of the lookup, we’d have to include in the cache whether any given match candidate skips typedefs/pointers/references/base classes and if we found a formatter for it in those conditions.

Assuming we go with the callback option, I’m not sure how to expose it from the API. Right now all the calls that add formatters to categories take a SBTypeNameSpecifier to specify what to match. So the easiest way (and what I did in my original patch) is to add the matching callback to SBTypeNameSpecifier as an optional member. Thus having a catch-all ".*" regex and a callback.

However, having a pointless regex is not ideal:

jingham:

BTW, if after some thought it seems like this is really going to be a feature for .* formatters then you could just make it unconditional so you aren’t bothering with a pointless regex compare. Just keep these in a separate group and run them after the cached matches and exact & regex formatters. That probably won’t make the implementation much simpler, but it would remove unnecessary complexity from the UI.

and I’m not sure what’s the best way to add this to the API.

If we skip the regex, it doesn’t make a lot of sense to put a lone callback in an SBTypeNameSpecifier, as we’re no longer specifying a name. So I’m thinking about introducing a new SBFormatterMatchingCallback (or similar) type, so we can introduce new overloads in SBTypeCategory:
category.AddTypeFormat(SBFormatterMatchingCallback cb, SBTypeFormat type_format)
category.AddTypeSummary(SBFormatterMatchingCallback cb, SBTypeSummary type_summary) 
category.AddTypeFilter(SBFormatterMatchingCallback cb, SBTypeFilter type_filter)
category.AddTypeSynthetic(SBFormatterMatchingCallback cb, SBTypeSynthetic type_synthetic)
# (and matching Delete* functions)
and then, for each category, have another separate map for callback-based formatters in FormatterContainerPair in addition to the existing ones for exact matchers and regex matchers (and rename it to FormatterContainerGroup`?)

Does this make sense? Any advice would be much appreciated, as I’m still not very familiar with the codebase and I would like to do it in the most idiomatic way possible, without “fighting” the existing design.

To me, it depends a bit on whether you want to support C++ callback matchers at the SB layer or not. The rest of the SB API’s (both Python & C++) that deal with Type summaries/synthetic children. only let you supply python function data or a python function or class name for the actual summary providing code. They don’t take C++ function pointers. So it would be consistent to also only allow matchers in the current scripting language. In that case, at the SB API level, you really are specifying a name, but it has the meaning of “python function name” or “python class name” depending on how you want to implement the callback.

Note, it’s correct to grouse a bit that “NameSpecifier” isn’t really accurate, since these don’t specify anything, they are matchers. OTOH, that’s equally incorrect for regex matchers. So you wouldn’t be abusing the name any more than we already are.

So at the SB layer, using the SBTypeNameSpecifier makes sense, perhaps even more that with regex - there’s no sense in which a regex is a name… And it would avoid introducing another class and a bunch of similar overloads which would keep the API set more compact.

The only problem is that you have to invent an API to make these that doesn’t conflict with the extant ones. Maybe (though this is over specified) add:

SBTypeNameSpecifier::SBTypeNameSpecifier(const char *name, bool is_regex, bool is_function);

Or make an enum with {by_name, by_regex, by_function} and then use that in place of the booleans? Anyway, it seems like “type name”, “matcher regex” and “matcher function” are all just a single string, so using this class for them doesn’t seem particularly awkward.

For sure, at the lldb_private layer, you will want to separate these out and put them in different maps. But again, I’m not sure you need a separate class for that. But if you do, that’s less of an issue since at that point you are in lldb_private, and can be a little uglier for the sake of programming convenience.

Jim

Topic		Replies	Views
A bytecode for (LLDB) data formatters LLDB	62	984	November 18, 2024
Traversing member types of a type LLDB	13	454	February 12, 2024
[analyzer] gdb pretty printers for analyzer specific types Static Analyzer	13	292	August 31, 2020
[RFC][LLDB] Highlighting function names in LLDB backtraces LLDB debuginfo	5	139	March 27, 2025
Semantics of SBValue::CreateChildAtOffset LLDB	1	242	January 29, 2022

RFC: Python callback for data formatters type matching

Introduction

Proposal

Performance impact

Introduction

Related topics