UnicodeDecodeError when serializing an SBValue description

Follow-up for the previous question:

Our Python code calls json.dumps() to serialize the variable evaluation result into a string and send it to the IDE via RPC. However, it fails with “UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xc9 in position 10: invalid continuation byte” because SBValue.description seems to return a non-UTF-8 string:

(lldb) fr v
error: biggrep_master_server_async 0x10b9a91a: DW_TAG_member ‘_M_pod_data’ refers to type 0x10bb1e99 which extends beyond the bounds of 0x10b9a901
error: biggrep_master_server_async 0x10b98edc: DW_TAG_member ‘small_’ refers to type 0x10bb1d9f which extends beyond the bounds of 0x10b98ed3
error: biggrep_master_server_async 0x10baf034: DW_TAG_member ‘__size’ refers to type 0x10baf04d which extends beyond the bounds of 0x10baefae
(facebook::biggrep::BigGrepMasterAsync *) this = 0x00007fd14d374fd0
(const string &const) corpus = error: summary string parsing error: {
store_ = {
= {
small_ = {}
ml_ = (data_ = “��UH\x89�H�}�H\x8bE�]ÐUH\x89�H��H\x89}�H\x8bE�H\x89��~\xb4��\x90��UH\x89�SH\x83�H\x89}�H�u�H�E�H���\x9e���H\x8b\x18H\x8bE�H���O\xb4��H\x89ƿ\b”, size_ = 0, capacity_ = 1441151880758558720)
}
}
}

File “/data/users/jeffreytan/fbsource/fbobjc/Tools/Nuclide/pkg/nuclide-debugger-lldb-server/scripts/chromedebugger.py”, line 91, in received_message
response_in_json = json.dumps(response);
File “/usr/lib64/python2.6/json/__init__.py”, line 230, in dumps
return _default_encoder.encode(obj)
File “/usr/lib64/python2.6/json/encoder.py”, line 367, in encode
chunks = list(self.iterencode(o))
File “/usr/lib64/python2.6/json/encoder.py”, line 309, in _iterencode
for chunk in self._iterencode_dict(o, markers):
File “/usr/lib64/python2.6/json/encoder.py”, line 275, in _iterencode_dict
for chunk in self._iterencode(value, markers):
File “/usr/lib64/python2.6/json/encoder.py”, line 309, in _iterencode
for chunk in self._iterencode_dict(o, markers):
File “/usr/lib64/python2.6/json/encoder.py”, line 275, in _iterencode_dict
for chunk in self._iterencode(value, markers):
File “/usr/lib64/python2.6/json/encoder.py”, line 306, in _iterencode
for chunk in self._iterencode_list(o, markers):
File “/usr/lib64/python2.6/json/encoder.py”, line 204, in _iterencode_list
for chunk in self._iterencode(value, markers):
File “/usr/lib64/python2.6/json/encoder.py”, line 309, in _iterencode
for chunk in self._iterencode_dict(o, markers):
File “/usr/lib64/python2.6/json/encoder.py”, line 275, in _iterencode_dict
for chunk in self._iterencode(value, markers):
File “/usr/lib64/python2.6/json/encoder.py”, line 309, in _iterencode
for chunk in self._iterencode_dict(o, markers):
File “/usr/lib64/python2.6/json/encoder.py”, line 275, in _iterencode_dict
for chunk in self._iterencode(value, markers):
File “/usr/lib64/python2.6/json/encoder.py”, line 294, in _iterencode
yield encoder(o)
UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xc9 in position 10: invalid continuation byte

Question:
Is the non-UTF-8 string expected, or is it just garbage data caused by the DW_TAG_member errors? What is the proper way to find out the encoding of the string and serialize it using json.dumps()?

Jeffrey

Btw: after patching with Siva’s fix http://reviews.llvm.org/D18008, the first field ‘small_’ is fixed, but the second field ‘ml_’ still emits garbage:

(lldb) fr v corpus
(const string &const) corpus = error: summary string parsing error: {
store_ = {
= {
small_ = “www”
ml_ = (data_ = “��UH\x89�H�}�H\x8bE�]ÐUH\x89�H��H\x89}�H\x8bE�H\x89��~\xb4��\x90��UH\x89�SH\x83�H\x89}�H�u�H�E�H���\x9e���H\x8b\x18H\x8bE�H���O\xb4��H\x89ƿ\b”, size_ = 0, capacity_ = 1441151880758558720)
}
}
}

Thanks for any info regarding how to encode this string.

Jeffrey

Do you still see the DW_TAG_member related error?

A wild (and really wild at that) guess: Is it utf16 data that is being
decoded as utf8?

As David Blaikie mentioned on the other thread, it would really help
if you provided us with a minimal example to repro this, or at least
repro instructions.

Thanks Siva. All the DW_TAG_member related errors seem to have gone away after patching with your fix. The current problem is handling the decoding.

Here is the correct decoding from gdb, which might be useful:

(gdb) p corpus
$3 = (const std::string &) @0x7fd133cfb888: {
static npos = 18446744073709551615, store_ = {
static kIsLittleEndian = ,
static kIsBigEndian = , {
small_ = “www”, ‘\000’ <repeats 20 times>, “\024”, ml_ = {
data_ = 0x777777 <std::Any_data::M_access<void folly::fibers::Baton::waitFiber<folly::fibers::FirstArgOf<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::{lambda(folly::fibers::Promise<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::SelectionResult>)#1}, void>::type::value_type folly::fibers::await<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::{lambda(folly::fibers::Promise<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::SelectionResult>)#1}>(folly::fibers::FirstArgOf&&)::{lambda()#1}>(folly::fibers::FiberManager&, folly::fibers::FirstArgOf<folly::fibers::FirstArgOf<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::{lambda(folly::fibers::Promise<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::SelectionResult>)#1}, void>::type::value_type folly::fibers::await<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::{lambda(folly::fibers::Promise<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::SelectionResult>)#1}>(folly::fibers::FirstArgOf&&)::{lambda()#1}, void>::type::value_type)::{lambda(folly::fibers::Fiber&)#1}*>() const+25> 
“\311\303UH\211\345H\211}\370H\213E\370]ÐUH\211\345H\203\354\020H\211}\370H\213E\370H\211\307\350~\264\312\377\220\311\303UH\211\345SH\203\354\030H\211}\350H\211u\340H\213E\340H\211\307\350\236\377\377\377H\213\030H\213E\350H\211\307\350O\264\312\377H\211ƿ\b”, size = 0,
capacity
= 1441151880758558720}}}}

UTF-16 does not seem to decode it, while ‘latin-1’ does:

‘\xc9’.decode(‘utf-16’)
Traceback (most recent call last):
File “”, line 1, in
File “/mnt/gvfs/third-party2/python/55c1fd79d91c77c95932db31a4769919611c12bb/2.7.8/centos6-native/da39a3e/lib/python2.7/encodings/utf_16.py”, line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: ‘utf16’ codec can’t decode byte 0xc9 in position 0: truncated data

‘\xc9’.decode(‘latin-1’)
u’\xc9’

Instead of guessing what kind of decoding I should use, I would use ‘ensure_ascii=False’ to prevent the crash for now.
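
To make the difference between the codecs concrete, here is a small sketch. It is written for Python 3 (the session above is Python 2, where `str` is already a byte string), and the sample bytes are just the first four bytes of the `data_` dump; the variable names are illustrative only.

```python
import json

raw = b"\xc9\xc3UH"  # first bytes of the data_ dump shown above

# UTF-8 rejects the sequence: 0xc9 starts a two-byte sequence, but 0xc3
# is not a valid continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin-1 maps every byte 0x00-0xff to the code point with the same number,
# so decoding can never fail and the bytes can be recovered unchanged.
text = raw.decode("latin-1")
round_trip = text.encode("latin-1")

# With the default ensure_ascii=True the JSON output is pure ASCII.
payload = json.dumps({"description": text})
```

Note that latin-1 "working" only means the bytes survive the trip; it says nothing about whether they were ever meant to be Latin-1 text.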

I tried to reproduce this crash, but it seems to be related to an internal STL implementation we are using. I will see if I can narrow it down to a small repro later.

Thanks
Jeffrey

So you need to be prepared to escape any text that can have special characters. A "std::string" or any container can contain special characters. If you are encoding stuff into JSON, you will either need to escape any special characters, or hex encode the string into ASCII hex bytes.
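
As a rough illustration of those two options, here is a Python 3 sketch. The helper names `escape_bytes` and `hex_encode` are made up for this example and are not part of LLDB or the json module.

```python
import binascii
import json


def escape_bytes(raw):
    """Keep printable ASCII as-is and write every other byte as \\xNN."""
    out = []
    for b in raw:  # iterating over bytes yields ints in Python 3
        if 0x20 <= b < 0x7f and b != 0x5c:  # printable and not a backslash
            out.append(chr(b))
        else:
            out.append("\\x%02x" % b)
    return "".join(out)


def hex_encode(raw):
    """Encode the whole buffer as ASCII hex digits."""
    return binascii.hexlify(raw).decode("ascii")


raw = b"\xc9\xc3UH\x89"
escaped = escape_bytes(raw)
hexed = hex_encode(raw)
message = json.dumps({"escaped": escaped, "hex": hexed})
```

Escaping keeps the readable parts readable, while hex encoding is unambiguous but opaque; which one fits depends on what the IDE side wants to display.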

In debuggers we often get bogus data because variables are not initialized yet: the compiler tells us that a variable is valid in the address range [0x1000-0x2000), but it is actually valid only in [0x1200-0x2000). If we read the variable in this case, a std::string might contain bogus data and the bytes might not make sense. So you always have to be prepared for bad data.

If we look at:

  store_ = {
     = {
      small_ = "www"
      ml_ = (data_ =
"��UH\x89�H�}�H\x8bE�]ÐUH\x89�H��H\x89}�H\x8bE�H\x89��~\xb4��\x90��UH\x89�SH\x83�H\x89}�H�u�H�E�H���\x9e���H\x8b\x18H\x8bE�H���O\xb4��H\x89ƿ\b",
size_ = 0, capacity_ = 1441151880758558720)
    }
  }
}

We can see the "size_" is zero, and capacity_ is 1441151880758558720 (which is 0x1400000000000000). "data_" seems to be some random pointer.
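
One way to sanity-check these numbers, as a sketch only: gdb printed small_ as "www", 20 NUL bytes, then "\024", which is 24 bytes in total. Assuming ml_ overlays small_ in the same storage on a little-endian 64-bit target, reinterpreting those 24 bytes as three 64-bit words gives the following (Python 3):

```python
import struct

# small_ as printed by gdb: "www", 20 NUL bytes, then "\024" (octal 24 = 0x14).
small = b"www" + b"\x00" * 20 + b"\x14"

# Reinterpret the same 24 bytes as ml_ = {data_, size_, capacity_},
# assuming three little-endian 64-bit fields.
data_ptr, size, capacity = struct.unpack("<QQQ", small)
```

Under that assumption the three words come out as 0x777777, 0 and 0x1400000000000000, matching the data_, size_ and capacity_ values in the dump. That would mean data_ is not a real pointer here, just the bytes of "www" seen through the other representation.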

On MacOSX, we have special formatting code for displaying std::string. It lives in CPlusPlusLanguage.cpp and gets installed in the LoadLibCxxFormatters() or LoadLibStdcppFormatters() functions with code like:

    lldb::TypeSummaryImplSP std_string_summary_sp(new CXXFunctionSummaryFormat(stl_summary_flags, lldb_private::formatters::LibcxxStringSummaryProvider, "std::string summary provider"));
    cpp_category_sp->GetTypeSummariesContainer()->Add(ConstString("std::__1::string"), std_string_summary_sp);

Special flags are set on std::string to say "don't show children of this and just show a summary". So for the following code:

std::string h ("hello");

You should just see:

(lldb) fr var h
(std::__1::string) h = "hello"

If you look at the raw value instead, you see:

(lldb) fr var --raw h
(std::__1::string) h = {
  __r_ = {
    std::__1::__libcpp_compressed_pair_imp<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__rep, std::__1::allocator<char>, 2> = {
      __first_ = {
         = {
          __l = {
            __cap_ = 122511465736202
            __size_ = 0
            __data_ = 0x0000000000000000
          }
          __s = {
             = {
              __size_ = '\n'
              __lx = '\n'
            }
            __data_ = {
              [0] = 'h'
              [1] = 'e'
              [2] = 'l'
              [3] = 'l'
              [4] = 'o'
              [5] = '\0'
              [6] = '\0'
              [7] = '\0'
              [8] = '\0'
              [9] = '\0'
              [10] = '\0'
              [11] = '\0'
              [12] = '\0'
              [13] = '\0'
              [14] = '\0'
              [15] = '\0'
              [16] = '\0'
              [17] = '\0'
              [18] = '\0'
              [19] = '\0'
              [20] = '\0'
              [21] = '\0'
              [22] = '\0'
            }
          }
          __r = {
            __words = {
              [0] = 122511465736202
              [1] = 0
              [2] = 0
            }
          }
        }
      }
    }
  }
}

So the main question is why our "std::string" formatters are not kicking in for you. Either the type name doesn't match, or the layout of the string isn't what the formatter is expecting.

But again, since your std::string can contain anything, you will need to escape any and all text that is encoded into JSON to ensure it doesn't contain anything JSON can't deal with.

This is kind of orthogonal to your problem, but the reason you are not seeing the kind of simplified printing Greg is suggesting is that your std::string doesn’t look like any of the kinds we recognize.

Specifically, LLDB data formatters work by matching against type names. Once they recognize a typename, they inspect the variable in order to grab a summary.
In your example, your std::string exposes a layout that we are not handling, so we bail out of the formatter and fall back to the raw view.

If you want pretty printing to work, you’ll need to write a data formatter.

There are a few avenues. The obvious easy one is to extend the existing std::string formatter to recognize your type’s internal layout.
If one were signing up for more infrastructure work, they could try to detect shared library loads and load formatters that match whatever libraries are being loaded.

Thanks Greg for the detailed explanation, very helpful.

  1. Just to confirm, the weird string is displayed because ‘data_’ points to some random memory? So what gdb displays is also random memory content, not something more meaningful than what we show? I thought that we (lldb) did not display the std::string content well but gdb did it correctly.
  2. I guess the std::string formatter did not kick in because our company may link a special STL implementation. Let me share our binary so you can confirm.
  3. I dumped the content of the object we try to json.dumps() against, here is the content:
    response: {‘id’: 57, ‘result’: {‘result’: [{‘name’: ‘data_’, ‘value’: {‘type’: ‘object’, ‘description’: ‘(char *) “\xc9\xc3UH\x89\xe5H\x89}\xf8H\x8bE\xf8]\xc3\x90UH\x89\xe5H\x83\xec\x10H\x89}\xf8H\x8bE\xf8H\x89\xc7\xe8~\xb4\xca\xff\x90\xc9\xc3UH\x89\xe5SH\x83\xec\x18H\x89}\xe8H\x89u\xe0H\x8bE\xe0H\x89\xc7\xe8\x9e\xff\xff\xffH\x8b\x18H\x8bE\xe8H\x89\xc7\xe8O\xb4\xca\xffH\x89\xc6\xbf\b”’, ‘objectId’: ‘RemoteObjectManager.118’}}, {‘name’: ‘size_’, ‘value’: {‘type’: ‘object’, ‘description’: ‘(std::size_t) 0’}}, {‘name’: ‘capacity_’, ‘value’: {‘type’: ‘object’, ‘description’: ‘(std::size_t) 1441151880758558720’}}]}}
    So it seems that the problem is that json.dumps() tries to treat the raw byte array as UTF-8, which fails.

So we need to figure out how to escape the raw byte array into a string so that we can json.dumps() it. The key question is how we know the correct encoding of the byte array. Is my understanding correct that only the formatter has the knowledge to decode the byte array correctly? If we fail to find a type formatter (which is the case here) and get a raw field with a byte array, we have no knowledge of the encoding, so we either have to guess a default encoding and try it, or just display the raw byte array content instead of decoding it?

Jeffrey

Thanks Greg for the detailed explanation, very helpful.
1. Just to confirm, the weird string displayed is because 'data_' points to some random memory?

Yes.

So what gdb displays is also random memory content, not something more meaningful than what we show? I thought that we (lldb) did not display the std::string content well but gdb did it correctly.

The "size_" variable is zero, so anything that GDB displays is down to sheer luck: it is whatever happens to be in the memory that "data_" points to. You can't rely on any contents of "data_" since it is clearly bogus. What you really want to see is just the string that the "std::string" holds:

(std::string) my_string = "Hello"

Or for a std::string that contains 0, 1, and 2 as characters:

(std::string) my_string = "\x00\x01\x02"

2. I guess the std::string formatter did not kick in because our company may link some special stl implementation. Let me share our binary for you to confirm.

You can get some help from Enrico to see why things are not displaying correctly. My guess is this C++ standard library is different from the ones that we added support for.

3. I dumped the content of the object we try to json.dumps() against, here is the content:
response: {'id': 57, 'result': {'result': [{'name': 'data_', 'value': {'type': 'object', 'description': '(char *) "\xc9\xc3UH\\x89\xe5H\x89}\xf8H\\x8bE\xf8]\xc3\x90UH\\x89\xe5H\x83\xec\x10H\\x89}\xf8H\\x8bE\xf8H\\x89\xc7\xe8~\\xb4\xca\xff\\x90\xc9\xc3UH\\x89\xe5SH\\x83\xec\x18H\\x89}\xe8H\x89u\xe0H\x8bE\xe0H\x89\xc7\xe8\\x9e\xff\xff\xffH\\x8b\\x18H\\x8bE\xe8H\x89\xc7\xe8O\\xb4\xca\xffH\\x89\xc6\xbf\\b"', 'objectId': 'RemoteObjectManager.118'}}, {'name': 'size_', 'value': {'type': 'object', 'description': '(std::size_t) 0'}}, {'name': 'capacity_', 'value': {'type': 'object', 'description': '(std::size_t) 1441151880758558720'}}]}}
So it seems that the problem is that json.dumps() tries to treat the raw byte array as UTF-8, which fails.
So we need to figure out how to escape the raw byte array into a string so that we can json.dumps() it. The key question is how we know the correct encoding of the byte array.

It doesn't really matter. Just know that any of the strings from:

const char *SBValue::GetName();
const char *SBValue::GetTypeName ();
const char *SBValue::GetDisplayTypeName();
const char *SBValue::GetValue();
const char *SBValue::GetSummary();
const char *SBValue::GetObjectDescription();
const char *SBValue::GetLocation ();

Will need to be escaped.
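
On the Python side, one possible way to apply this to a whole response before calling json.dumps() is a recursive pass over the structure. This is a Python 3.5+ sketch: the `sanitize` helper is hypothetical, and the `response` dict is a shortened stand-in for the one dumped earlier in the thread.

```python
import json


def sanitize(obj):
    """Recursively convert byte strings into JSON-safe text."""
    if isinstance(obj, bytes):
        # backslashreplace turns undecodable bytes into literal \xNN text.
        return obj.decode("utf-8", errors="backslashreplace")
    if isinstance(obj, dict):
        return {sanitize(k): sanitize(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [sanitize(v) for v in obj]
    return obj


response = {
    "id": 57,
    "result": {"result": [
        {"name": "data_", "value": {"type": "object",
                                    "description": b'(char *) "\xc9\xc3UH"'}},
        {"name": "size_", "value": {"type": "object",
                                    "description": "(std::size_t) 0"}},
    ]},
}
response_in_json = json.dumps(sanitize(response))
```

Doing the conversion in one place means every string that came from an SBValue getter is handled the same way, whichever field it ended up in.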

Is my understanding correct that only the formatter has the knowledge to decode the byte array correctly?

We dump the values as strings. You won't get bytes out. You might get UTF8 bytes or other things that JSON might interpret as special characters and any C strings that you get from the above calls will just need to be escaped if needed.

If we fail to find a type formatter(which is this case) and get a raw field with byte array, we have no knowledge of the encoding so either we have to guess one default encoding and try it or just display the raw byte array content instead of decoding it?

Again, this is all C strings. I don't think anything else matters.

Our JSON.cpp has the following:

int
JSONParser::GetEscapedChar(bool &was_escaped)
{
    was_escaped = false;
    const char ch = GetChar();
    if (ch == '\\')
    {
        was_escaped = true;
        const char ch2 = GetChar();
        switch (ch2)
        {
            case '"':
            case '\\':
            case '/':
            default:
                break;

            case 'b': return '\b';
            case 'f': return '\f';
            case 'n': return '\n';
            case 'r': return '\r';
            case 't': return '\t';
            case 'u':
                {
                    const int hi_byte = DecodeHexU8();
                    const int lo_byte = DecodeHexU8();
                    if (hi_byte >=0 && lo_byte >= 0)
                        return hi_byte << 8 | lo_byte;
                    return -1;
                }
                break;
        }
        return ch2;
    }
    return ch;
}

You can see how it is used when the JSON parser is parsing in JSONParser::GetToken() in the '"' case.
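
For readers working on the Python side, the same decoding logic could be sketched roughly as follows. This is an illustrative Python 3 port, not LLDB code: `get_escaped_char` and `SIMPLE_ESCAPES` are names invented here, and unlike the C++ version it does no validation of truncated or malformed \u sequences.

```python
SIMPLE_ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t"}


def get_escaped_char(text, pos):
    """Decode one (possibly escaped) character starting at text[pos].

    Returns (char, next_pos, was_escaped).
    """
    ch = text[pos]
    if ch != "\\":
        return ch, pos + 1, False
    ch2 = text[pos + 1]
    if ch2 in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[ch2], pos + 2, True
    if ch2 == "u":
        # Four hex digits: the C++ code reads them as a high and a low byte.
        return chr(int(text[pos + 2:pos + 6], 16)), pos + 6, True
    # '"', '\\', '/' and anything else stand for themselves.
    return ch2, pos + 2, True
```

This is the parsing direction. When producing JSON for the IDE, the escaping has to go the other way, which json.dumps() already does once it is handed text it can process.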

Thanks, I will try this escape mechanism for the returned C string.

Hi Enrico,

Any suggestions or examples for how to add a data formatter for our own STL string? From the output below I can see we are using our own “fbstring_core”, so I assume I need to write a type summary for this type:

frame variable corpus -T
(const string &const) corpus = error: summary string parsing error: {
(std::fbstring_core) store_ = {
(std::fbstring_core::(anonymous union)) = {
(char [24]) small_ = “www”
(std::fbstring_core::MediumLarge) ml_ = {
(char *) data_ = 0x0000000000777777 "H\x89U\xa8H\x89M\xa0L\x89E\x98H\x8bE\xa8H\x89��_U��D\x88e�H\x8bE\xa0H\x89��]U��H\x89�H\x8dE�H\x89�H\x89�����L\x8dm�H\x8bE\x98H\x89��IU��\x88]�L\x8be\xb0L\x89��
(std::size_t) size_ = 0
(std::size_t) capacity_ = 1441151880758558720
}
}
}
}

Thanks.
Jeffrey

Admittedly, this is going to be a little vague since I haven’t really seen your code and I am only working off of one sample.

There are going to be two parts to getting this to work:

Part 1 - Formatting fbstring_core

At a glance, an fbstring_core can be backed by two representations: a “small” representation (a char array), and a “medium/large” representation (a char* + a size).
I assume that the way you tell one from the other is

if (size == 0) small
else medium-large

If my assumption is not correct, you’ll need to discover what the correct discriminator logic is - the class has to know, and so do you :-)

Armed with that knowledge, look in the lldb sources at source/Plugins/Language/CPlusPlus/Formatters/LibCxx.cpp
There’s a bunch of code there that deals with formatting llvm’s libc++ std::string, which follows a very similar logic to your class.

ExtractLibcxxStringInfo() is the function that handles discovering which layout the string uses, where the data lives, and how much data there is.

Once you know how much data there is (the size) and where it lives (array or pointer), LibcxxStringSummaryProvider() has the easy task: it sets up a StringPrinter, tells it how much data to print and where to get it from, and then delegates the grunt work to the StringPrinter.
StringPrinter is a nifty little tool. It can generate summaries for different kinds of strings (UTF8? UTF16? we got it - is a \0 a terminator? what quote character would you like? …): you point it at some data, set up a few options, and it will generate a printable representation for you. If your string type is doing anything out of the ordinary, let’s talk - I am definitely open to extending StringPrinter to handle even more magic.

Part 2 - Teaching std::string that it can be backed by an fbstring_core

At the end of part 1, you’ll probably end up with a FBStringCoreSummaryProvider() - now you need to teach LLDB about it.
The obvious thing you could do would be to go into CPlusPlusLanguage::GetFormatters(), add a LoadFBStringFormatter(g_category) to it, and then imitate - say - LoadLibCxxFormatters():

AddCXXSummary(cpp_category_sp, lldb_private::formatters::FBStringCoreSummaryProvider, "fbstringcore summary provider", ConstString("std::fbstring_core<.+>"), stl_summary_flags, true);

That will work - but what you would see is:

(const string &const) corpus = error: summary string parsing error: {
(std::fbstring_core) store_ = “www”

The outer std::string still has no working summary of its own. To fix that, you need the exact typename LLDB sees for it, so you wanna do

(lldb) log enable lldb formatters
(lldb) frame variable -T corpus

It will list one or more typenames - the most specific one is the one you want (e.g. for libc++ we get std::__1::string - this is how we tell ourselves this is the std::string from libc++).
Once you find that typename, you’ll make a new formatter - FBStringSummaryProvider() - and register that formatter with that very specific typename.

All that FBStringSummaryProvider() has to do is get the “store_” member (ValueObject::GetChildMemberWithName() is your friend) and pass it down to FBStringCoreSummaryProvider().

I understand this may seem a little convoluted and arcane at first - but feel free to ask more questions, and I’ll try to help out!

Thanks,
- Enrico

Thanks Enrico. This is very detailed! I will take a look.
Btw: originally, I was hoping that a data formatter could be added without changing the source code. For example, given an xml/json format file describing the memory layout/structure of the data type, lldb could parse the xml/json and deduce the formatting. This is the approach used by the data visualizer in the VS debugger: https://msdn.microsoft.com/en-us/library/jj620914.aspx
This would make adding data formatters more extensible/flexible. Is there any reason we did not take this approach?

Jeffrey

LLDB supports adding data formatters without modifying the source code, and I would strongly prefer to go that way, as we don’t want each user of LLDB to start adding data formatters for their own custom types to LLDB itself. We have a pretty detailed (but possibly a bit outdated) description of how they work and how you can add a new one here: http://lldb.llvm.org/varformats.html
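
As a very rough sketch of what a Python-based formatter for this type might look like (Python 3): the layout used here is an assumption inferred from the single sample in this thread, namely that small_ is 24 bytes and its last byte holds the unused capacity (20 for "www"). The function names are hypothetical, the type regex is the one suggested earlier in the thread, and only the small representation is handled.

```python
try:
    import lldb  # only importable inside LLDB's embedded Python
except ImportError:
    lldb = None

# Assumption: char small_[24], with the last byte storing (23 - length).
MAX_SMALL_SIZE = 23


def decode_small(raw):
    """Return the text of a small string given its 24 raw bytes."""
    length = MAX_SMALL_SIZE - raw[-1]
    return raw[:length].decode("utf-8", errors="backslashreplace")


def fbstring_core_summary(valobj, internal_dict):
    """Summary callback; handles only the small representation."""
    union = valobj.GetChildAtIndex(0)  # the anonymous union inside store_
    data = union.GetChildMemberWithName("small_").GetData()
    error = lldb.SBError()
    raw = data.ReadRawData(error, 0, data.GetByteSize())
    if error.Fail() or len(raw) != MAX_SMALL_SIZE + 1:
        return "<unreadable>"
    if raw[-1] > MAX_SMALL_SIZE:
        return "<medium/large string: not handled in this sketch>"
    return '"%s"' % decode_small(raw)


def __lldb_init_module(debugger, internal_dict):
    # -x treats the type name as a regular expression.
    debugger.HandleCommand(
        'type summary add -x "^std::fbstring_core<.+>$" '
        "-F %s.fbstring_core_summary" % __name__)
```

A file like this would be loaded with `command script import`, after which LLDB calls `__lldb_init_module` and registers the summary. The medium/large case would additionally need to read size_ bytes from data_ in the process memory.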

Enrico: Is there any reason you suggested the data formatters written inside LLDB over the python based ones?

I don't think Enrico was suggesting that we maintain a bunch of third party data formatters in the lldb source base. He was giving C++ examples (using the lldb_private API's) because the STL formatters are in C++, so that's what he had on hand to demonstrate the kinds of algorithms you would use to dig into these complex structures. For the most part the lldb_private API's used in Enrico's examples are mirrored in the SB API's pretty directly, so this isn't a terrible source for examples.

Note, it used to be possible to write C++ based data formatters, build them in a shared library and load them with the "plugin load" command. These have the advantage of working on systems that don't support Python. Not sure what the state of that is these days, however. But even if you were going to write C++ formatters you'd be better off using the SB API's not the lldb_private API's since then your plugins would have a longer useful life-cycle.

Jim

I don’t think Enrico was suggesting that we maintain a bunch of third party data formatters in the lldb source base.

That depends - if this std::string implementation is part of a publicly available STL implementation, it might make sense for us to “know about it” out of the box in the same way we know about libstdc++ and libc++
If it is an internal-only string class, then, yes, I would definitely not suggest putting this inside the LLDB core

He was giving C++ examples (using the lldb_private API’s) because the STL formatters are in C++, so that’s what he had on hand to demonstrate the kinds of algorithms you would use to dig into these complex structures. For the most part the lldb_private API’s used in Enrico’s examples are mirrored in the SB API’s pretty directly, so this isn’t a terrible source for examples.

Note, it used to be possible to write C++ based data formatters, build them in a shared library and load them with the “plugin load” command. These have the advantage of working on systems that don’t support Python. Not sure what the state of that is these days, however.

It might or might not work. If it didn’t work and somebody wanted to fix that, I suspect we would gladly accept their patches.

LLDB supports adding data formatters without modifying the source code and I would strongly prefer to go that way as we don’t want each user of LLDB to start adding data formatters to their own custom types. We have a pretty detailed (but possible a bit outdated) description about how they work and how you can add a new one here: http://lldb.llvm.org/varformats.html

Enrico: Is there any reason you suggested the data formatters written inside LLDB over the python based ones?

Thanks Enrico. This is very detailed! I will take a look.
Btw: originally, I was hoping that data formatter can be added without changing the source code. Like giving a xml/json format file telling lldb the memory layout/structure of the data structure, lldb can parse the xml/json and deduce the formatting. This is approach used by data visualizer in VS debugger: https://msdn.microsoft.com/en-us/library/jj620914.aspx
This will make adding data formatter more extensible/flexible. Any reason we did not take this approach?

Jeffrey

Hi Enrico,

Any suggestion/example how to add a data formatter for our own STL string? From the output below I can see we are using our own “fbstring_core” which I assume I need to write a type summary for this type:

frame variable corpus -T
(const string &const) corpus = error: summary string parsing error: {
(std::fbstring_core) store_ = {
(std::fbstring_core::(anonymous union)) = {
(char [24]) small_ = “www”
(std::fbstring_core::MediumLarge) ml_ = {
(char *) data_ = 0x0000000000777777 "H\x89U\xa8H\x89M\xa0L\x89E\x98H\x8bE\xa8H\x89��_U��D\x88e�H\x8bE\xa0H\x89��]U��H\x89�H\x8dE�H\x89�H\x89��� ��L\x8dm�H\x8bE\x98H\x89��IU��\x88]�L\x8be\xb0L\x89��
(std::size_t) size_ = 0
(std::size_t) capacity_ = 1441151880758558720
}
}
}
}

Admittedly, this is going to be a little vague since I haven’t really seen your code and I am only working off of one sample

There’s going to be two parts to getting this to work:

Part 1 - Formatting fbstring_core

At a glance, an fbstring_core can be backed by two representations. A “small” representation (a char array), and a “medium/large" representation (a char* + a size)
I assume that the way you tell one from the other is

if (size == 0) small
else medium-large

If my assumption is not correct, you’ll need to discover what the correct discriminator logic is - the class has to know, and so do you :slight_smile:

Armed with that knowledge, look in lldb source/Plugins/Language/CPlusPlus/Formatters/LibCxx.cpp
There’s a bunch of code that deals with formatting llvm’s libc++ std::string - which follows a very similar logic to your class

ExtractLibcxxStringInfo() is the function that handles discovering which layout the string uses - where the data lives - and how much data there is

Once you have told yourself how much data there is (the size) and where it lives (array or pointer), LibcxxStringSummaryProvider() has the easy task - it sets up a StringPrinter, tells it how much data to print, where to get it from, and then delegates the StringPrinter to do the grunt work
StringPrinter is a nifty little tool - it can handle generating summaries for different kinds of strings (UTF8? UTF16? we got it - is a \0 a terminator? what quote character would you like? …) - you point it at some data, set up a few options, and it will generate a printable representation for you - if your string type is doing anything out of the ordinary, let’s talk - I am definitely open to extending StringPrinter to handle even more magic

Part 2 - Teaching std::string that it can be backed by an fbstring_core

At the end of part 1, you’ll probably end up with a FBStringCoreSummaryProvider() - now you need to teach LLDB about it
The obvious thing you could do would be to go in CPlusPlusLanguage::GetFormatters() add a LoadFBStringFormatter(g_category) to it - and then imitate - say - LoadLibCxxFormatters()

AddCXXSummary(cpp_category_sp, lldb_private::formatters::FBStringCoreSummaryProvider, “fbstringcore summary provider", ConstString(“std::fbstring_core<.+>"), stl_summary_flags, true);

That will work - but what you would see is:

(const string &const) corpus = error: summary string parsing error: {
(std::fbstring_core) store_ = “www"

You wanna do

(lldb) log enable lldb formatters
(lldb) frame variable -T corpus

It will list one or more typenames - the most specific one is the one you like (e.g. for libc++ we get std::__1::string - this is how we tell ourselves this is the std::string from libc++)
Once you find that typename, you’ll make a new formatter - FBStringSummaryProvider() - and register that formatter with that very specific typename

All that FBStringSummaryProvider() has to do is get the “store_” member (ValueObject::GetChildMemberWithName() is your friend) - and pass it down to FBStringCoreSummaryProvider()

I understand this may seem a little convoluted and arcane at first - but feel free to ask more questions, and I’ll try to help out!

Thanks.
Jeffrey

This is kind of orthogonal to your problem, but the reason why you are not seeing the kind of simplified printing Greg is suggesting, is because your std::string doesn’t look like any of the kinds we recognize

Specifically, LLDB data formatters work by matching against type names, and once they recognize a typename, then they try to inspect the variable in order to grab a summary
In your example, your std::string exposes a layout that we are not handling - hence we bail out of the formatter and we fall back to the raw view

If you want pretty printing to work, you’ll need to write a data formatter

There are a few avenues. The obvious easy one is to extend the existing std::string formatter to recognize your type’s internal layout.
If one were signing up for more infrastructure work, they could decide to try and detect shared library loads and load formatters that match with whatever libraries are being loaded.

So you need to be prepared to escape any text that can have special characters. A “std::string” or any container can contain special characters. If you are encoding stuff into JSON, you will either need to escape any special characters, or hex encode the string into ASCII hex bytes.
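
As a rough sketch of those two options (written in Python 3 syntax, while the traceback in this thread comes from Python 2), the helpers below either escape undecodable bytes or hex encode everything. The function names and the sample bytes are made up for illustration.

```python
import binascii
import json

def to_json_safe(raw):
    """Decode bytes as UTF-8, rendering undecodable bytes as \\xNN escapes."""
    return raw.decode("utf-8", errors="backslashreplace")

def to_hex(raw):
    """Alternative: hex encode every byte so the payload is pure ASCII."""
    return binascii.hexlify(raw).decode("ascii")

# Sample bytes resembling the start of the garbage in the description.
garbage = b"\xc9\xc3UH\x89\xe5"

print(json.dumps({"description": to_json_safe(garbage)}))
print(json.dumps({"description_hex": to_hex(garbage)}))
```

The escaped form stays readable for mostly-ASCII data; the hex form is lossless and lets the receiving side recover the exact bytes.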

In debuggers we often get bogus data because variables are not initialized, but the compiler tells us that a variable is valid in address range [0x1000-0x2000), but it actually is [0x1200-0x2000). If we read a variable in this case, a std::string might contain bogus data and the bytes might not make sense. So you always have to be prepared for bad data.

If we look at:

store_ = {
= {
small_ = “www”
ml_ = (data_ =
“��UH\x89�H�}�H\x8bE�]ÐUH\x89�H��H\x89}�H\x8bE�H\x89��~\xb4��\x90��UH\x89�SH\x83�H\x89}�H�u�H�E�H���\x9e���H\x8b\x18H\x8bE�H���O\xb4��H\x89ƿ\b”,
size_ = 0, capacity_ = 1441151880758558720)
}
}
}

We can see the “size_” is zero, and capacity_ is 1441151880758558720 (which is 0x1400000000000000). “data_” seems to be some random pointer.

On Mac OS X, we have special formatting code in CPlusPlusLanguage.cpp that displays std::string. It gets installed in the LoadLibCxxFormatters() or LoadLibStdcppFormatters() functions with code like:

lldb::TypeSummaryImplSP std_string_summary_sp(new CXXFunctionSummaryFormat(stl_summary_flags, lldb_private::formatters::LibcxxStringSummaryProvider, "std::string summary provider"));
cpp_category_sp->GetTypeSummariesContainer()->Add(ConstString("std::__1::string"), std_string_summary_sp);

Special flags are set on std::string to say “don’t show children of this and just show a summary”. So for the following code:

std::string h (“hello”);

You should just see:

(lldb) fr var h
(std::__1::string) h = “hello”

If you take a look at the normal value in the raw we see:

(lldb) fr var --raw h
(std::__1::string) h = {
_r = {
std::__1::__libcpp_compressed_pair_imp<std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator >::__rep, std::__1::allocator, 2> = {
_first = {
= {
__l = {
_cap = 122511465736202
_size = 0
_data = 0x0000000000000000
}
__s = {
= {
_size = ‘\n’
__lx = ‘\n’
}
_data = {
[0] = ‘h’
[1] = ‘e’
[2] = ‘l’
[3] = ‘l’
[4] = ‘o’
[5] = ‘\0’
[6] = ‘\0’
[7] = ‘\0’
[8] = ‘\0’
[9] = ‘\0’
[10] = ‘\0’
[11] = ‘\0’
[12] = ‘\0’
[13] = ‘\0’
[14] = ‘\0’
[15] = ‘\0’
[16] = ‘\0’
[17] = ‘\0’
[18] = ‘\0’
[19] = ‘\0’
[20] = ‘\0’
[21] = ‘\0’
[22] = ‘\0’
}
}
__r = {
__words = {
[0] = 122511465736202
[1] = 0
[2] = 0
}
}
}
}
}
}
}

So the main question is why are our “std::string” formatters not kicking in for you. That comes down to a typename match, or the format of the string isn’t what the formatter is expecting.

But again, since your std::string can contain anything, you will need to escape any and all text that is encoded into JSON to ensure it doesn’t contain anything JSON can’t deal with.

Thanks Siva. All the DW_TAG_member related errors seem to go away after patching with your fix. The current problem is handling the decoding.

Here is the correct decoding from gdb, which might be useful:
(gdb) p corpus
$3 = (const std::string &) @0x7fd133cfb888: {
static npos = 18446744073709551615, store_ = {
static kIsLittleEndian = ,
static kIsBigEndian = , {
small_ = “www”, ‘\000’ <repeats 20 times>, “\024”, ml_ = {
data_ = 0x777777 <std::Any_data::M_access<void folly::fibers::Baton::waitFiber<folly::fibers::FirstArgOf<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::{lambda(folly::fibers::Promise<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::SelectionResult>)#1}, void>::type::value_type folly::fibers::await<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::{lambda(folly::fibers::Promise<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::SelectionResult>)#1}>(folly::fibers::FirstArgOf&&)::{lambda()#1}>(folly::fibers::FiberManager&, folly::fibers::FirstArgOf<folly::fibers::FirstArgOf<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::{lambda(folly::fibers::Promise<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::SelectionResult>)#1}, void>::type::value_type folly::fibers::await<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::{lambda(folly::fibers::Promise<facebook::servicerouter::RequestDispatcherBasefacebook::servicerouter::ThriftDispatcher::prepareForSelection(facebook::servicerouter::DispatchContext&)::SelectionResult>)#1}>(folly::fibers::FirstArgOf&&)::{lambda()#1}, void>::type::value_type)::{lambda(folly::fibers::Fiber&)#1}*>() const+25> 
“\311\303UH\211\345H\211}\370H\213E\370]ÐUH\211\345H\203\354\020H\211}\370H\213E\370H\211\307\350~\264\312\377\220\311\303UH\211\345SH\203\354\030H\211}\350H\211u\340H\213E\340H\211\307\350\236\377\377\377H\213\030H\213E\350H\211\307\350O\264\312\377H\211ƿ\b”, size = 0,
capacity
= 1441151880758558720}}}}
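
The gdb output is consistent with small_ and ml_ being two views of the same 24 bytes. A small sketch, assuming the two members overlap in an anonymous union on a little-endian 64-bit target, reproduces the ml_ numbers from the small_ bytes:

```python
import struct

# The 24 bytes gdb prints for small_: "www", 20 NUL bytes, then "\024" (0x14).
small = b"www" + b"\x00" * 20 + b"\x14"

# Reinterpret the same bytes as three little-endian 64-bit words (the ml_ view).
data_, size_, capacity_ = struct.unpack("<QQQ", small)

print(hex(data_))      # 0x777777
print(size_)           # 0
print(capacity_)       # 1441151880758558720
```

Under that assumption, data_ = 0x777777 is just the characters "www" read as a pointer, so whatever is printed for ml_.data_ is whatever memory happens to live at that address, not string contents.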

UTF-16 does not seem to decode it, while ‘latin-1’ does:

'\xc9'.decode('utf-16')

Traceback (most recent call last):
File “”, line 1, in
File “/mnt/gvfs/third-party2/python/55c1fd79d91c77c95932db31a4769919611c12bb/2.7.8/centos6-native/da39a3e/lib/python2.7/encodings/utf_16.py”, line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: ‘utf16’ codec can’t decode byte 0xc9 in position 0: truncated data

'\xc9'.decode('latin-1')

u'\xc9'
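
A short sketch of why latin-1 never fails, written in Python 3 syntax (the session above is Python 2) with made-up sample bytes: latin-1 maps every byte value to the code point with the same number, so decoding always succeeds whether or not the data really is latin-1 text.

```python
raw = b"\xc9\xc3UH\x89\xe5"

# latin-1 maps each byte 0x00-0xff to the code point with the same number,
# so decoding never fails and encoding restores the original bytes.
text = raw.decode("latin-1")
assert text.encode("latin-1") == raw

# UTF-8 rejects the same bytes: 0xc9 starts a two-byte sequence,
# but the next byte (0xc3) is not a continuation byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("utf-8 failed:", err.reason)
```

So a successful latin-1 decode says nothing about the real encoding; it is only a lossless way to carry arbitrary bytes in a text string.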

Instead of guessing what kind of decoding I should use, I would use ‘ensure_ascii=False’ to prevent the crash for now.

I tried to reproduce this crash, but it seems that the crash might be related to some internal STL implementation we are using. I will see if I can narrow it down to a small repro later.

Thanks
Jeffrey

Btw: after patching with Siva’s fix http://reviews.llvm.org/D18008, the
first field ‘small_’ is fixed, however the second field ‘ml_’ still emits
garbage:

(lldb) fr v corpus
(const string &const) corpus = error: summary string parsing error: {
store_ = {
= {
small_ = “www”
ml_ = (data_ =
“��UH\x89�H�}�H\x8bE�]ÐUH\x89�H��H\x89}�H\x8bE�H\x89��~\xb4��\x90��UH\x89�SH\x83�H\x89}�H�u�H�E�H���\x9e���H\x8b\x18H\x8bE�H���O\xb4��H\x89ƿ\b”,
size_ = 0, capacity_ = 1441151880758558720)
}
}
}

Do you still see the DW_TAG_member related error?

A wild (and really wild at that) guess: Is it utf16 data that is being
decoded as utf8?

As David Blaikie mentioned on the other thread, it would really help
if you provide us with a minimal example to repro this. At least, repro
instructions.


lldb-dev mailing list
lldb-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev



Thanks,
- Enrico
:envelope_with_arrow: egranata@.com :phone: 27683

Thanks Enrico. This is very detailed! I will take a look.
Btw: originally, I was hoping that a data formatter could be added without changing the source code, for example by giving lldb an XML/JSON file that describes the memory layout/structure of the data type, which lldb could parse to deduce the formatting. This is the approach used by the data visualizer in the VS debugger: https://msdn.microsoft.com/en-us/library/jj620914.aspx
This would make adding data formatters more extensible/flexible. Any reason we did not take this approach?

The way I understand the Natvis system, it allows one to provide a bunch of expressions that describe how the debugger would go about retrieving the interesting data bits
This has the bonus of being really easy, since you’re writing code in the same language/context of the types you’re formatting
On the other hand it has a few drawbacks, in terms of performance as well as safety (imagine trying to run code on an object when said object is in an incoherent state)
The LLDB approach, on the other hand, is that you should try to not run code when providing these data formatters. In order to do that, we vend an API that can do things such as retrieve child values, read memory, cast values, …, all without code execution
Once you have this kind of API that is not expressed in your source language, you might just as well describe it in a scripting language. Hence were born the Python data formatters.
In order for us to gain even more performance for native system types that we know we’re gonna run into all the time, we then switched a bunch of the “mission critical” formatters from Python to C++
The Python extension points are still available, as Jim pointed out, and you are more than welcome to use those instead of modifying the debugger core

Jeffrey

Thanks,
- Enrico

One quick question: do we support getting the type summary string from an inferior method call? After reading our own fbstring_core code, I found I need to mirror a lot of what the fbstring_core.c_str() method is doing in Python. I wonder if we can just call ${var.c_str()} as the type summary? I suspect one of the concerns is side effects (the inferior method may throw an exception or cause problems), but I do not see why this can’t be done. By allowing this we can keep a single source of truth for the data formatter (in source code) instead of risking it going out of sync (say the std::string author decides to change its implementation; the Python data formatter associated with it needs to be modified at the same time, which is a maintenance nightmare).

Jeffrey

I did a quick test calling SBFrame.EvaluateExpression(‘string.c_str()’) for the summary. The result shows valobj.GetFrame() returns an invalid frame (printed as “No value”), so does this mean this is not supported?

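def DoTest(valobj, internal_dict):
    # Summary callback: try to build the summary by calling c_str() in the inferior.
    print("valobj: %s" % valobj)
    frame = valobj.GetFrame()
    print("valobj.GetFrame(): %s" % frame)
    # Evaluate <name>.c_str() in the frame that owns the value.
    summaryValue = frame.EvaluateExpression(valobj.name + '.c_str()')
    print("summaryValue: %s" % summaryValue)
    return 'Summary from c_str(): %s ' % summaryValue.GetSummary()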

type summary add -F DoTest -x "std::fbstring_core"

Output:

valobj.GetFrame(): No value
summaryValue: No value
valobj: (std::string) $6 = {
store_ = Summary from c_str(): None
}

Jeffrey

One quick question: do we support getting the type summary string from an inferior method call?

No - for that you are going to need to write a Python formatter.

Running code in formatters is a risky endeavor for a bunch of reasons, so it is by design that it is not an easily accessible building block

After reading our own fbstring_core code, I found I need to mirror a lot of what the fbstring_core.c_str() method is doing in Python. I wonder if we can just call ${var.c_str()} as the type summary? I suspect one of the concerns is side effects (the inferior method may throw an exception or cause problems), but I do not see why this can’t be done.

Because, as you say, it has a high risk of side effects
Also, it is less efficient than direct memory reads (I have a comparison graph somewhere where some data formatter became an order of magnitude faster once it stopped running code)
Also, what happens when your object is in scope but not yet initialized, or you’re stopped in its destructor and it’s partially torn down? Are you going to make all your methods able to deal safely with states that should never happen in production because you might actually run into them in the debugger?

By allowing this we can keep a single source of truth for the data formatter (in source code) instead of risking it going out of sync (say the std::string author decides to change its implementation; the Python data formatter associated with it needs to be modified at the same time, which is a maintenance nightmare).

The model you’re describing is similar to the “po” model we have in ObjC and Swift. Those languages provide a sanctioned language-blessed way to create an object description in program code (see -description for ObjC, and the whole Mirror story for Swift)
Those are OK because they are only triggered by explicit user action (running the “po” command in the debugger), and yet we still occasionally see problems with them where the user didn’t realize the program state was corrupt enough that running code there was a bad idea

C++ has no such sanctioned mechanism to generate descriptions - if one came about, LLDB would support it, in the form of making “po” do the right thing for C++ objects

Jeffrey

Thanks,
- Enrico

In theory what you’re doing looks like it should be supported. I am not sure why your example is failing the way it is.