RFC: Moving debug info parsing out of process

Hi all,

We’ve got some internal efforts in progress, and one of those would benefit from debug info parsing being out of process (independently of whether or not the rest of LLDB is out of process).

There are a few advantages to this, which I’ll enumerate here:

  • It addresses one source of instability in LLDB that has long been known to be problematic – specifically, that debug info can be bad, and handling this gracefully is often difficult and can bring down the entire debug session. While other efforts have been made to address stability by moving things out of process, they have not been upstreamed, and even if they had, I think we would still want this anyway, for reasons that follow.
  • It becomes theoretically possible to move debug info parsing not just to another process, but to another machine entirely. In a broader sense, this decouples the physical debug info location (and for that matter, representation) from the debugger host.
  • It becomes testable as an independent component, because you can just send requests to it and dump the results and see if they make sense. Currently there is almost zero test coverage of this aspect of LLDB apart from what you can get after going through many levels of indirection via spinning up a full debug session and doing things that indirectly result in symbol queries.
The big win here, at least from my point of view, is the second one. Traditional symbol servers operate by copying entire symbol files (DSYM, DWP, PDB) from some machine to the debugger host. These can be very large – we’ve seen 12+ GB in some cases – which ranges from “slow bandwidth hog” to “complete non-starter” depending on the debugger host and network. In this kind of scenario, one could theoretically run the debug info process on the same NAS, cloud, or whatever as the symbol server. Then, rather than copying over an entire symbol file, it responds only to the query you issued – if you asked for a type, it just returns a packet describing the type you requested.
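To make the idea concrete, here is a rough sketch of what a single type-lookup exchange might look like. This uses llvm::json purely for illustration; the packet shape and field names (module_uuid, kind, and so on) are invented for this example and are not any existing LLDB or LLVM wire format.

```cpp
// Hypothetical illustration only: the packet shape and field names are made up
// for this sketch; nothing here is an existing LLDB/LLVM format.
#include "llvm/Support/JSON.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  // What the debugger might send: "describe type 'foo' in this module".
  llvm::json::Object Request{
      {"kind", "type-lookup"},
      {"module_uuid", "ABCD-1234"}, // identifies the binary's symbol file
      {"name", "foo"}};

  // What the server might send back: just the one type, not the whole DWP/PDB.
  llvm::json::Object Response{
      {"kind", "type"},
      {"name", "foo"},
      {"size_in_bits", 64},
      {"members",
       llvm::json::Array{llvm::json::Object{{"name", "x"}, {"type", "int"}},
                         llvm::json::Object{{"name", "y"}, {"type", "int"}}}}};

  llvm::outs() << llvm::json::Value(std::move(Request)) << "\n"
               << llvm::json::Value(std::move(Response)) << "\n";
}
```

The important property is that the reply is proportional to the query, not to the size of the symbol file.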

The API itself would be stateless (so that you could make queries for multiple targets in any order) as well as asynchronous (so that responses might arrive out of order). Blocking could be implemented in LLDB, but having the server be asynchronous means multiple clients could connect to the same server instance. This raises interesting possibilities. For example, one can imagine thousands of developers connecting to an internal symbol server on the network and being able to debug remote processes or core dumps over slow network connections or on machines with very little storage (e.g. chromebooks).
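As a rough sketch of what a stateless, asynchronous client-side API could look like (all of the names below – DebugInfoClient, TypeQuery, and friends – are invented for illustration, not existing LLDB classes):

```cpp
// Interface sketch only; every name here is hypothetical.
#include <cstdint>
#include <functional>
#include <string>

struct TypeQuery {
  std::string ModuleUUID; // each request names its module: no per-target session state
  std::string TypeName;
};

struct TypeDescription {
  std::string Name;
  uint64_t SizeInBits = 0;
  // ... members, decl context, etc.
};

class DebugInfoClient {
public:
  // Asynchronous: replies may arrive out of order, so the callback carries the
  // request id. Because no call depends on an earlier one, many clients (or
  // many targets in one client) can share a single server instance.
  using TypeCallback = std::function<void(uint64_t RequestID, TypeDescription)>;
  uint64_t AsyncFindType(const TypeQuery &Query, TypeCallback OnResult);

  // A blocking convenience wrapper that LLDB itself could layer on top.
  TypeDescription FindType(const TypeQuery &Query);
};
```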

On the LLDB side, all of this is hidden behind the SymbolFile interface, so most of LLDB doesn’t have to change at all. While this is in development, we could have SymbolFileRemote and keep the existing local codepath the default, until such time that it’s robust and complete enough that we can switch the default.
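For illustration, the plugin side might look roughly like the sketch below. The real lldb_private::SymbolFile interface has many more virtual methods with different signatures, so everything here is simplified and hypothetical; the point is only that a remote implementation slots in next to the existing local ones.

```cpp
// Simplified, hypothetical sketch; not the real lldb_private::SymbolFile API.
#include <string>
#include <vector>

struct TypeDescription { std::string Name; /* ... */ };
class DebugInfoClient; // the hypothetical remote-query client sketched earlier

class SymbolFileSketch {
public:
  virtual ~SymbolFileSketch() = default;
  virtual void FindTypes(const std::string &Name,
                         std::vector<TypeDescription> &Results) = 0;
  // ... plus line tables, functions, global variables, etc.
};

// Forwards every query to the out-of-process server. The rest of LLDB keeps
// calling the same SymbolFile-style API and never needs to know where the
// answers come from; the existing local plugins stay the default until this
// one is robust and complete enough.
class SymbolFileRemoteSketch : public SymbolFileSketch {
public:
  explicit SymbolFileRemoteSketch(DebugInfoClient &C) : Client(C) {}
  void FindTypes(const std::string &Name,
                 std::vector<TypeDescription> &Results) override {
    // Send a type-lookup request, wait for the reply, and translate it into
    // LLDB's internal objects. (Omitted in this sketch.)
  }

private:
  DebugInfoClient &Client;
};
```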

Thoughts?

Interesting idea.

Would you build the server using the pieces we have in the current SymbolFile implementations? What do you mean by “switching the default”? Do you expect LLDB to spin up a server if there’s none configured in the environment?

Fred

I would probably build the server by using mostly code from LLVM. Since it would contain all of the low-level debug info parsing libraries, I would expect that all knowledge of debug info (at least, in the form that compilers emit it) could eventually be removed from LLDB entirely.

So, for example, all of the efforts to merge LLDB and LLVM’s DWARF parsing libraries could happen by first implementing inside of LLVM whatever functionality is missing, and then using that from within the server. And yes, I would expect lldb to spin up a server, just as it does with lldb-server today if you try to debug something. It finds the lldb-server binary and runs it.

When I say “switching the default”, what I mean is that if someday this hypothetical server supports everything that the current in-process parsing codepath supports, we could just delete that entire codepath and switch everything to the out of process server, even if that server were running on the same physical machine as the debugger client (which would be functionally equivalent to what we have today).

I would probably build the server by using mostly code from LLVM. Since it would contain all of the low-level debug info parsing libraries, I would expect that all knowledge of debug info (at least, in the form that compilers emit it) could eventually be removed from LLDB entirely.

That’s quite an ambitious goal.

I haven’t looked at the SymbolFile API; what do you expect the exchange currency between the server and LLDB to be? Serialized compiler ASTs? If that’s the case, it seems like you need a strong rev-lock between the server and the client, which in turn adds quite a bit of complexity to the rollout of new versions of the debugger.

So, for example, all of the efforts to merge LLDB and LLVM’s DWARF parsing libraries could happen by first implementing inside of LLVM whatever functionality is missing, and then using that from within the server. And yes, I would expect lldb to spin up a server, just as it does with lldb-server today if you try to debug something. It finds the lldb-server binary and runs it.

When I say “switching the default”, what I mean is that if someday this hypothetical server supports everything that the current in-process parsing codepath supports, we could just delete that entire codepath and switch everything to the out of process server, even if that server were running on the same physical machine as the debugger client (which would be functionally equivalent to what we have today).

(I obviously knew what you meant by “switching the default”; I was trying to ask how… to which the answer is: by spinning up a local server.)

Do you envision LLDB being able to talk to more than one server at the same time? It seems like this could be useful to debug a local build while still having access to debug symbols for your dependencies that have their symbols in a central repository.

Fred

I would probably build the server by using mostly code from LLVM. Since it would contain all of the low-level debug info parsing libraries, I would expect that all knowledge of debug info (at least, in the form that compilers emit it) could eventually be removed from LLDB entirely.

That’s quite an ambitious goal.

I haven’t looked at the SymbolFile API; what do you expect the exchange currency between the server and LLDB to be? Serialized compiler ASTs? If that’s the case, it seems like you need a strong rev-lock between the server and the client, which in turn adds quite a bit of complexity to the rollout of new versions of the debugger.

Definitely not serialized ASTs, because you could be debugging some language other than C++. Probably something more like JSON, where you parse the debug info and send back some JSON representation of the type / function / variable the user requested, which maps almost directly onto LLDB’s internal symbol hierarchy (e.g. the Function, Type, etc. classes). You’d still need to build the AST on the client.
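For instance, a reply describing a function might look something like this (a hypothetical schema with invented field names; the point is just that each field lines up with something LLDB already stores in its Function/Type objects, so the client-side translation stays mechanical):

```cpp
// Hypothetical reply for a function lookup; the schema is invented for
// illustration and maps roughly onto what LLDB already keeps internally
// (name, address range, prototype, declaration location).
static const char *ExampleFunctionReply = R"json(
{
  "kind": "function",
  "name": "main",
  "mangled_name": "main",
  "range": { "file_address": "0x1000", "size": 192 },
  "decl": { "file": "main.cpp", "line": 12 },
  "prototype": {
    "return_type": "int",
    "parameters": [ { "name": "argc", "type": "int" },
                    { "name": "argv", "type": "char **" } ]
  }
}
)json";
```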

So, for example, all of the efforts to merge LLDB and LLVM’s DWARF parsing libraries could happen by first implementing inside of LLVM whatever functionality is missing, and then using that from within the server. And yes, I would expect lldb to spin up a server, just as it does with lldb-server today if you try to debug something. It finds the lldb-server binary and runs it.

When I say “switching the default”, what I mean is that if someday this hypothetical server supports everything that the current in-process parsing codepath supports, we could just delete that entire codepath and switch everything to the out of process server, even if that server were running on the same physical machine as the debugger client (which would be functionally equivalent to what we have today).

(I obviously knew what you meant by “switching the default”; I was trying to ask how… to which the answer is: by spinning up a local server.)

Do you envision LLDB being able to talk to more than one server at the same time? It seems like this could be useful to debug a local build while still having access to debug symbols for your dependencies that have their symbols in a central repository.

I hadn’t really thought of this, but it certainly seems possible. Since the API is stateless, it could send requests to any server it wanted, with some mechanism of selecting between them.

I would probably build the server by using mostly code from LLVM. Since it would contain all of the low-level debug info parsing libraries, I would expect that all knowledge of debug info (at least, in the form that compilers emit it) could eventually be removed from LLDB entirely.

That’s quite an ambitious goal.

I haven’t looked at the SymbolFile API; what do you expect the exchange currency between the server and LLDB to be? Serialized compiler ASTs? If that’s the case, it seems like you need a strong rev-lock between the server and the client, which in turn adds quite a bit of complexity to the rollout of new versions of the debugger.

Definitely not serialized ASTs, because you could be debugging some language other than C++. Probably something more like JSON, where you parse the debug info and send back some JSON representation of the type / function / variable the user requested, which maps almost directly onto LLDB’s internal symbol hierarchy (e.g. the Function, Type, etc. classes). You’d still need to build the AST on the client.

This seems fairly easy for Function or symbols in general, as it’s easy to abstract their few properties, but as soon as you get to the type system, I get worried.

Your representation needs to have the full expressivity of the underlying debug info format. Inventing something new in that space seems really expensive. For example, every piece of information we add to the debug info in the compiler would need to be handled in multiple places:

  • the server code
  • the client code that talks to the server
  • the current “local” code (for a pretty long while)
Not ideal. I wish there were a way to factor out at least the last two.

But maybe I’m misunderstanding exactly what you’d put in your JSON. If it’s very close to the debug format (basically a JSON representation of the DWARF or the PDB), then it becomes more tractable as the client code can be the same as the current local one with some refactoring.

Fred

When I see this “parsing DWARF and turning it into something else,” it is very reminiscent of what clayborg is trying to do with GSYM. You’re both talking about leveraging LLVM’s parser, which is great, but I have to wonder if there isn’t more commonality being left on the table. Just throwing that thought out there; I don’t have anything specific to suggest.

–paulr

I would probably build the server by using mostly code from LLVM. Since it would contain all of the low-level debug info parsing libraries, I would expect that all knowledge of debug info (at least, in the form that compilers emit it) could eventually be removed from LLDB entirely.

That’s quite an ambitious goal.

I haven’t looked at the SymbolFile API; what do you expect the exchange currency between the server and LLDB to be? Serialized compiler ASTs? If that’s the case, it seems like you need a strong rev-lock between the server and the client, which in turn adds quite a bit of complexity to the rollout of new versions of the debugger.

Definitely not serialized ASTs, because you could be debugging some language other than C++. Probably something more like JSON, where you parse the debug info and send back some JSON representation of the type / function / variable the user requested, which maps almost directly onto LLDB’s internal symbol hierarchy (e.g. the Function, Type, etc. classes). You’d still need to build the AST on the client.

This seems fairly easy for Function or symbols in general, as it’s easy to abstract their few properties, but as soon as you get to the type system, I get worried.

Your representation needs to have the full expressivity of the underlying debug info format. Inventing something new in that space seems really expensive. For example, every piece of information we add to the debug info in the compiler would need to be handled in multiple places:

  • the server code
  • the client code that talks to the server
  • the current “local” code (for a pretty long while)
Not ideal. I wish there were a way to factor out at least the last two.

How often does this actually happen though? The C++ type system hasn’t really undergone very many fundamental changes over the years. I mocked up a few samples of what some JSON descriptions would look like, and it didn’t seem terrible. It certainly is some work – there’s no denying that – but I think a lot of the “expressivity” of the underlying format is actually more accurately described as “flexibility”. What I mean by this is that there are both many different ways to express the same thing, as well as many entities that can express different things depending on how they’re used. An intermediate format gives us a way to eliminate all of that flexibility and instead offer consistency, which makes client code much simpler. In a way, this is a similar benefit to what one gets by compiling a source language down to LLVM IR and then operating on the LLVM IR, because you have a much simpler grammar to deal with, along with more semantic restrictions on the kinds of descriptions you can form with that grammar (to be clear: JSON itself is not restrictive, but we can make our schema restrictive).

For what it’s worth, in an earlier message I mentioned that I would probably build the server by using mostly code from LLVM, and making sure that it supported the union of things currently supported by LLDB’s and LLVM’s DWARF parsers. Doing that would naturally require merging the two (which has been talked about for a long time) as a prerequisite, and I would expect that for testing purposes we might want something like llvm-dwarfdump, but one that dumps a higher-level description of the information (if we change our DWARF emission code in LLVM to output the exact same type in slightly different ways in the underlying DWARF, for example, we wouldn’t want our tests to break). So imagine you could run something like lldb-dwarfdump -lookup-type=foo a.out and it would dump some description of the type that is resilient to insignificant changes in the underlying DWARF.

At that point you’re already 90% of the way towards what I’m proposing, and it’s useful independently.

GSYM, as I understand it, is basically just an evolution of Breakpad symbols. It doesn’t contain full-fidelity debug information (type information, function parameters, etc.).

Hi Zachary,

[…]
Thoughts?
Having a standalone symbols interface would open up many tooling possibilities; the available interfaces are too dwarfish and too primitive. This does not necessarily require an out-of-process symbol server, but I can see that it is appealing to you, especially given the problems you are facing.

I do not want to start bikeshedding on implementation details already, as it seems you have your own, but I suggest starting with a linetable interface. It has a simple and stable interface (addr2locs/loc2addrs), is complete on its own (no symbols required), is not prone to DWARF/PDB or language oddities, and IMHO is the most fundamental debug information. This would allow you to focus on the necessary details and still deliver a good portion of the functionality.
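A minimal sketch of what such a line-table-only interface might look like (all names invented here; the two entry points correspond to the addr2locs / loc2addrs directions):

```cpp
// Line-table-only interface sketch; no type-system or symbol knowledge needed.
#include <cstdint>
#include <string>
#include <vector>

struct SourceLocation {
  std::string File;
  uint32_t Line = 0;
  uint32_t Column = 0;
};

class LineTableService {
public:
  virtual ~LineTableService() = default;
  // addr2locs: which source locations correspond to this address?
  // (There can be several once inlining is involved.)
  virtual std::vector<SourceLocation> LocationsForAddress(uint64_t LoadAddress) = 0;
  // loc2addrs: which addresses implement this file/line? (Used for breakpoints.)
  virtual std::vector<uint64_t> AddressesForLocation(const SourceLocation &Loc) = 0;
};
```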
Out-of-process symbol servers do work but are less useful nowadays. I hope it solves the problems you are facing.

-Sanimir

I would probably build the server by using mostly code from LLVM. Since it would contain all of the low-level debug info parsing libraries, I would expect that all knowledge of debug info (at least, in the form that compilers emit it) could eventually be removed from LLDB entirely.

That’s quite an ambitious goal.

I haven’t looked at the SymbolFile API; what do you expect the exchange currency between the server and LLDB to be? Serialized compiler ASTs? If that’s the case, it seems like you need a strong rev-lock between the server and the client, which in turn adds quite a bit of complexity to the rollout of new versions of the debugger.

Definitely not serialized ASTs, because you could be debugging some language other than C++. Probably something more like JSON, where you parse the debug info and send back some JSON representation of the type / function / variable the user requested, which maps almost directly onto LLDB’s internal symbol hierarchy (e.g. the Function, Type, etc. classes). You’d still need to build the AST on the client.

This seems fairly easy for Function or symbols in general, as it’s easy to abstract their few properties, but as soon as you get to the type system, I get worried.

Your representation needs to have the full expressivity of the underlying debug info format. Inventing something new in that space seems really expensive. For example, every piece of information we add to the debug info in the compiler would need to be handled in multiple places:

  • the server code
  • the client code that talks to the server
  • the current “local” code (for a pretty long while)
Not ideal. I wish there were a way to factor out at least the last two.

How often does this actually happen though? The C++ type system hasn’t really undergone very many fundamental changes over the years.

I think over the last year we’ve made at least a couple of extensions to what we put in DWARF (for ObjC classes, and for ARM PAC support, which is not upstream yet). Adrian usually does those evolutions, so he might have a better idea. We plan on potentially adding a bunch more information to DWARF to more accurately represent the Obj-C type system.

I mocked up a few samples of what some JSON descriptions would look like, and it didn’t seem terrible. It certainly is some work – there’s no denying that – but I think a lot of the “expressivity” of the underlying format is actually more accurately described as “flexibility”. What I mean by this is that there are both many different ways to express the same thing, as well as many entities that can express different things depending on how they’re used. An intermediate format gives us a way to eliminate all of that flexibility and instead offer consistency, which makes client code much simpler. In a way, this is a similar benefit to what one gets by compiling a source language down to LLVM IR and then operating on the LLVM IR, because you have a much simpler grammar to deal with, along with more semantic restrictions on the kinds of descriptions you can form with that grammar (to be clear: JSON itself is not restrictive, but we can make our schema restrictive).

What I’m worried about is not exactly the amount of work, just the scope of the new abstraction. It needs to be good enough for any language and any debug information format. It needs efficient implementations of at least symbols, types, decl contexts, frame information, location expressions, target register mappings… And it’ll require the equivalent of the various ASTParser implementations. That’s a lot of new and forked code. I’d feel way better if we were able to reuse some of the existing code. I’m not sure how feasible this is, though.

For what it’s worth, in an earlier message I mentioned that I would probably build the server by using mostly code from LLVM, and making sure that it supported the union of things currently supported by LLDB’s and LLVM’s DWARF parsers. Doing that would naturally require merging the two (which has been talked about for a long time) as a prerequisite, and I would expect that for testing purposes we might want something like llvm-dwarfdump, but one that dumps a higher-level description of the information (if we change our DWARF emission code in LLVM to output the exact same type in slightly different ways in the underlying DWARF, for example, we wouldn’t want our tests to break). So imagine you could run something like lldb-dwarfdump -lookup-type=foo a.out and it would dump some description of the type that is resilient to insignificant changes in the underlying DWARF.

At which level do you consider the “DWARF parser” to stop and the debugger policy to start? In my view, the DWARF parser stops at the DwarfDIE boundary. Replacing it wouldn’t get us closer to a higher-level abstraction.

At that point you’re already 90% of the way towards what I’m proposing, and it’s useful independently.

I think that “90%” figure is a little off :) But please don’t take my questions as opposition to the general idea. I find the idea very interesting, and we could maybe use something similar internally, so I am interested. That’s why I’m asking questions.

Fred

For what it’s worth, in an earlier message I mentioned that I would probably build the server by using mostly code from LLVM, and making sure that it supported the union of things currently supported by LLDB’s and LLVM’s DWARF parsers. Doing that would naturally require merging the two (which has been talked about for a long time) as a prerequisite, and I would expect that for testing purposes we might want something like llvm-dwarfdump, but one that dumps a higher-level description of the information (if we change our DWARF emission code in LLVM to output the exact same type in slightly different ways in the underlying DWARF, for example, we wouldn’t want our tests to break). So imagine you could run something like lldb-dwarfdump -lookup-type=foo a.out and it would dump some description of the type that is resilient to insignificant changes in the underlying DWARF.

At which level do you consider the “DWARF parser” to stop and the debugger policy to start? In my view, the DWARF parser stops at the DwarfDIE boundary. Replacing it wouldn’t get us closer to a higher-level abstraction.

At the level where you have an alternative representation, such that you no longer need access to the debug info. In LLDB today, this “representation” is a combination of LLDB’s own internal symbol hierarchy (e.g. lldb_private::Type, lldb_private::Function, etc.) and the Clang AST. Once you have constructed those two things, the DWARF parser is out of the picture.

A lot of the complexity in processing raw DWARF comes from handling different versions of the DWARF spec (e.g. supporting DWARF 4 & DWARF 5), collecting and interpreting the subset of attributes which happens to be present, following references to other parts of the DWARF, and then at the end of all this (or perhaps during all of this), dealing with “partial information” (e.g. something that would have saved me a lot of trouble was missing, so now I have to do extra work to find it).

I’m treating DWARF expressions as an exception, though, because it would be somewhat tedious and provide little value to convert them into some text format and then evaluate the text representation of the expression, since it’s already in a format suitable for processing. So for this case, you could just encode the byte sequence as a hex string and send that.

I hinted at this already, but part of the problem (at least in my mind) is that our “DWARF parser” is intermingled with the code that interprets the parsed DWARF. We parse a little bit, build something, parse a little bit more, add on to the thing we’re building, and so on. This design is fragile and makes error handling difficult, so part of what I’m proposing is a separation here: “parse as much as possible, and return an intermediate representation that is as finished as we are able to make it”.

This part is independent of whether DWARF parsing is out of process, however. That’s still useful even if DWARF parsing is in process, and we’ve talked about something like that for a long time, whereby we have some kind of API that says “give me the thing, handle all errors internally, and either return me a thing which I can trust or an error”. I’m viewing “thing which I can trust” as some representation which is separate from the original DWARF, and which we could test – for example – by writing a tool which dumps this representation.
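A rough sketch of that API shape, using llvm::Expected (ParsedTypeRecord and ParseTypeByName are invented names; the only point is that all raw-DWARF error handling lives behind the call, and the caller gets either a sanitized, self-consistent record or an error it must handle):

```cpp
// Sketch of a "trusted thing or an error" boundary; names are hypothetical.
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Error.h"
#include <cstdint>
#include <string>
#include <vector>

struct ParsedTypeRecord {
  std::string Name;
  uint64_t SizeInBits = 0;
  std::vector<ParsedTypeRecord> Members; // already resolved; no dangling DIE refs
};

llvm::Expected<ParsedTypeRecord> ParseTypeByName(llvm::StringRef Name) {
  // Stand-in body: a real implementation would walk the (merged) DWARF
  // parser's output and validate everything before returning.
  return ParsedTypeRecord{Name.str(), 32, {}};
}

void Example() {
  if (auto RecordOrErr = ParseTypeByName("foo")) {
    // Safe to build a Clang AST (or another TypeSystem's types) from this.
    (void)RecordOrErr->SizeInBits;
  } else {
    // Bad debug info surfaces here instead of taking down the debug session.
    llvm::consumeError(RecordOrErr.takeError());
  }
}
```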

At that point you’re already 90% of the way towards what I’m proposing, and it’s useful independently.

I think that “90%” figure is a little off :) But please don’t take my questions as opposition to the general idea. I find the idea very interesting, and we could maybe use something similar internally, so I am interested. That’s why I’m asking questions.

Hmm, well, I think the 90% figure is pretty accurate. If we envision a hypothetical command line tool which ingests DWARF from a binary or set of binaries, has some command line interface that allows you to query it in the same way our SymbolFile plugins can be queried, and dumps its output in some intermediate format (maybe JSON, maybe something else) that is sufficiently descriptive to build a Clang AST or LLDB’s internal symbol & type hierarchy out of it, then the only thing missing from my original proposal is a socket to send that over the wire and something on the other end to make the Clang AST and LLDB type / symbol hierarchy.

I’m aware that GSYM doesn’t have full info, but if you’re both looking at symbol-server kinds of mechanics and protocols, it would be silly to separate them into Clayborg-servers and Zach-servers just because GSYM cares mainly about line info.

But whatever. You guys are designing this, go for it.

–paulr

For what it’s worth, in an earlier message I mentioned that I would probably build the server by using mostly code from LLVM, and making sure that it supported the union of things currently supported by LLDB’s and LLVM’s DWARF parsers. Doing that would naturally require merging the two (which has been talked about for a long time) as a prerequisite, and I would expect that for testing purposes we might want something like llvm-dwarfdump, but one that dumps a higher-level description of the information (if we change our DWARF emission code in LLVM to output the exact same type in slightly different ways in the underlying DWARF, for example, we wouldn’t want our tests to break). So imagine you could run something like lldb-dwarfdump -lookup-type=foo a.out and it would dump some description of the type that is resilient to insignificant changes in the underlying DWARF.

At which level do you consider the “DWARF parser” to stop and the debugger policy to start? In my view, the DWARF parser stops at the DwarfDIE boundary. Replacing it wouldn’t get us closer to a higher-level abstraction.

At the level where you have an alternative representation, such that you no longer need access to the debug info. In LLDB today, this “representation” is a combination of LLDB’s own internal symbol hierarchy (e.g. lldb_private::Type, lldb_private::Function, etc.) and the Clang AST. Once you have constructed those two things, the DWARF parser is out of the picture.

A lot of the complexity in processing raw DWARF comes from handling different versions of the DWARF spec (e.g. supporting DWARF 4 & DWARF 5), collecting and interpreting the subset of attributes which happens to be present, following references to other parts of the DWARF, and then at the end of all this (or perhaps during all of this), dealing with “partial information” (e.g. something that would have saved me a lot of trouble was missing, so now I have to do extra work to find it).

I’m treating DWARF expressions as an exception, though, because it would be somewhat tedious and provide little value to convert them into some text format and then evaluate the text representation of the expression, since it’s already in a format suitable for processing. So for this case, you could just encode the byte sequence as a hex string and send that.

I hinted at this already, but part of the problem (at least in my mind) is that our “DWARF parser” is intermingled with the code that interprets the parsed DWARF. We parse a little bit, build something, parse a little bit more, add on to the thing we’re building, and so on. This design is fragile and makes error handling difficult, so part of what I’m proposing is a separation here: “parse as much as possible, and return an intermediate representation that is as finished as we are able to make it”.

This part is independent of whether DWARF parsing is out of process, however. That’s still useful even if DWARF parsing is in process, and we’ve talked about something like that for a long time, whereby we have some kind of API that says “give me the thing, handle all errors internally, and either return me a thing which I can trust or an error”. I’m viewing “thing which I can trust” as some representation which is separate from the original DWARF, and which we could test – for example – by writing a tool which dumps this representation.

OK, here we are talking about something different (which you might have been saying since the beginning and I misinterpreted). If you want to decouple dealing with DIEs from creating ASTs as a preliminary step, then I think that would be super valuable, and it addresses my concerns about duplicating the AST creation logic.

I’m sure Greg would have comments about the challenges of lazily parsing the DWARF in such a design.

At that point you’re already 90% of the way towards what I’m proposing, and it’s useful independently.

I think that “90%” figure is a little off :) But please don’t take my questions as opposition to the general idea. I find the idea very interesting, and we could maybe use something similar internally, so I am interested. That’s why I’m asking questions.

Hmm, well, I think the 90% figure is pretty accurate. If we envision a hypothetical command line tool which ingests DWARF from a binary or set of binaries, has some command line interface that allows you to query it in the same way our SymbolFile plugins can be queried, and dumps its output in some intermediate format (maybe JSON, maybe something else) that is sufficiently descriptive to build a Clang AST or LLDB’s internal symbol & type hierarchy out of it, then the only thing missing from my original proposal is a socket to send that over the wire and something on the other end to make the Clang AST and LLDB type / symbol hierarchy.

A more accurate reflection of my feelings would have been “those 90% seem harder to achieve than you think”. I obviously have no data to back this up, so please prove me wrong!

Fred

Well, I was originally talking about both lumped into one thing, because this is a necessary precursor to having it be out of process :)

Since we definitely agree on this portion, the question then becomes: Suppose we have this firm API boundary across which we either return errors or things that can be trusted. What are the things which can be trusted? Are they DIEs? I’m not sure they should be, because we’d have to synthesize DIEs on the fly in the case where we got something that was bad but we tried to “fix” it (in order to sanitize the debug info into something the caller can make basic assumptions about). And additionally, it doesn’t really make the client’s job much easier as far as parsing goes.

So, I think it should build up a somewhat higher-level representation of the debug info, perhaps by piecing together information from multiple DIEs and sources, and return that. Laziness will definitely have to be maintained, but I don’t think that’s inherently more difficult with a design where we return something higher level than DIEs.
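One way to preserve that laziness with a higher-than-DIE representation is to hand out cheap handles and only stitch the full description together when a handle is actually resolved – a sketch with invented names, not a worked-out design:

```cpp
// Lazy resolution above the DIE level; all names are hypothetical.
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct TypeHandle { uint64_t ID = 0; }; // cheap, copyable token

struct FullTypeDescription {
  std::string Name;
  uint64_t SizeInBits = 0;
  std::vector<TypeHandle> Members; // members stay lazy too
};

class DebugInfoIndex {
public:
  // Fast path: consults only indexes / accelerator tables.
  std::vector<TypeHandle> FindTypes(const std::string &Name);
  // On-demand path: pieces together information from multiple DIEs and
  // sanitizes the result, or returns nullopt for unusable debug info.
  std::optional<FullTypeDescription> Resolve(TypeHandle H);
};
```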

Thoughts?

Hi Guys,

I hope you don’t mind me chiming in; I’ve been following this thread. I am a little familiar with the Eclipse Target Communication Framework (TCF).

Hi all,

We’ve got some internal efforts in progress, and one of those would benefit from debug info parsing being out of process (independently of whether or not the rest of LLDB is out of process).

There are a few advantages to this, which I’ll enumerate here:

  • It addresses one source of instability in LLDB that has long been known to be problematic – specifically, that debug info can be bad, and handling this gracefully is often difficult and can bring down the entire debug session. While other efforts have been made to address stability by moving things out of process, they have not been upstreamed, and even if they had, I think we would still want this anyway, for reasons that follow.

Where do you draw the line between debug info and the in-process part of LLDB? I’m asking because I have never seen the mechanical parsing of DWARF be a source of instability; most crashes in LLDB happen when reconstructing Clang ASTs, because we’re breaking some subtle and badly enforced invariants in Clang’s Sema. Perhaps parsing PDBs is less stable? If you do mean at the AST level, then I agree with the sentiment that it is a common source of crashes, but I don’t see a good way of moving that component out of process. Serializing ASTs or types in general is a hard problem, and I’d find the idea of inventing yet another serialization format for types that we would have to develop, test, and maintain quite scary.

  • It becomes theoretically possible to move debug info parsing not just to another process, but to another machine entirely. In a broader sense, this decouples the physical debug info location (and for that matter, representation) from the debugger host.

I can see how that can be useful in some settings. You’d need a really low-latency network connection to make interactive debugging work, but I expect you’ve got that covered :)

  • It becomes testable as an independent component, because you can just send requests to it and dump the results and see if they make sense. Currently there is almost zero test coverage of this aspect of LLDB apart from what you can get after going through many levels of indirection via spinning up a full debug session and doing things that indirectly result in symbol queries.

You are right that the type system debug info ingestion and AST reconstruction are primarily tested end-to-end.

The big win here, at least from my point of view, is the second one. Traditional symbol servers operate by copying entire symbol files (DSYM, DWP, PDB) from some machine to the debugger host. These can be very large – we’ve seen 12+ GB in some cases – which ranges from “slow bandwidth hog” to “complete non-starter” depending on the debugger host and network.

12 GB sounds suspiciously large. Do you know how this breaks down between line tables, types, and debug locations? If it’s types, are you deduplicating them? For comparison, the debug info of LLDB (which contains two compilers and a debugger) compresses to under 500 MB, but perhaps the binaries you are working with really are just that much larger.

In this kind of scenario, one could theoretically run the debug info process on the same NAS, cloud, or whatever as the symbol server. Then, rather than copying over an entire symbol file, it responds only to the query you issued – if you asked for a type, it just returns a packet describing the type you requested.

The API itself would be stateless (so that you could make queries for multiple targets in any order) as well as asynchronous (so that responses might arrive out of order). Blocking could be implemented in LLDB, but having the server be asynchronous means multiple clients could connect to the same server instance. This raises interesting possibilities. For example, one can imagine thousands of developers connecting to an internal symbol server on the network and being able to debug remote processes or core dumps over slow network connections or on machines with very little storage (e.g. chromebooks).

You could just run LLDB remotely ;)

That sounds all cool, but in my opinion you are leaving out the really important part: what is the abstraction level of the API going to be?

To be blunt, I’m against inventing yet another serialization format for types, not just because of the considerable engineering effort it will take to get this right, but also because of the maintenance burden it would impose. We already have to support loading types from DWARF, PDB, Clang modules, the Objective-C runtime, Swift modules, and probably more sources; all of these operate to some degree at different levels of abstraction. Adding another source or abstraction layer into the mix needs to be really well thought out and justified.

On the LLDB side, all of this is hidden behind the SymbolFile interface, so most of LLDB doesn’t have to change at all. While this is in development, we could have SymbolFileRemote and keep the existing local codepath the default, until such time that it’s robust and complete enough that we can switch the default.

The SymbolFile interface ultimately vends compiler types so now I’m really curious what kind of data you are planning to send over the wire.

thanks for sharing,
Adrian

We recently ran some testing and found lldb crashing while parsing DWARF (or, sometimes, failing to parse allegedly valid DWARF, returning some default-constructed object, and crashing later on). See, e.g., https://bugs.llvm.org/show_bug.cgi?id=40827
Qirun did his testing on Linux, FWIW. I would like to point out that the problems we ended up finding exercise some less-stressed (but, IMHO, equally important) configurations, namely older compilers (clang 3.8/clang 4.0/clang 5.0, etc.) and optimized code (-O1/-O2/-O3/-Os).

Hi all,

We’ve got some internal efforts in progress, and one of those would benefit from debug info parsing being out of process (independently of whether or not the rest of LLDB is out of process).

There are a few advantages to this, which I’ll enumerate here:

  • It addresses one source of instability in LLDB that has long been known to be problematic – specifically, that debug info can be bad, and handling this gracefully is often difficult and can bring down the entire debug session. While other efforts have been made to address stability by moving things out of process, they have not been upstreamed, and even if they had, I think we would still want this anyway, for reasons that follow.

Where do you draw the line between debug info and the in-process part of LLDB? I’m asking because I have never seen the mechanical parsing of DWARF be a source of instability; most crashes in LLDB happen when reconstructing Clang ASTs, because we’re breaking some subtle and badly enforced invariants in Clang’s Sema. Perhaps parsing PDBs is less stable? If you do mean at the AST level, then I agree with the sentiment that it is a common source of crashes, but I don’t see a good way of moving that component out of process. Serializing ASTs or types in general is a hard problem, and I’d find the idea of inventing yet another serialization format for types that we would have to develop, test, and maintain quite scary.

If anything, I think parsing PDBs is more stable. There is close to zero flexibility in how types and symbols can be represented in PDB / CodeView, and on top of that, there are very few producers. Combined, this means we can assume almost everything about the structure of the records.

Yes, the crashes happen at the AST level (most of them anyway, not all – there are definitely examples of crashes in the actual parsing code), but the fact that there is so much flexibility in how records can be specified in DWARF exacerbates the problem by complicating the parsing code, which is then not well tested because of all the different code paths.

  • It becomes testable as an independent component, because you can just send requests to it and dump the results and see if they make sense. Currently there is almost zero test coverage of this aspect of LLDB apart from what you can get after going through many levels of indirection via spinning up a full debug session and doing things that indirectly result in symbol queries.

You are right that the type system debug info ingestion and AST reconstruction are primarily tested end-to-end.

Do you consider this something worth addressing by testing the debug info ingestion in isolation?

The big win here, at least from my point of view, is the second one. Traditional symbol servers operate by copying entire symbol files (DSYM, DWP, PDB) from some machine to the debugger host. These can be very large – we’ve seen 12+ GB in some cases – which ranges from “slow bandwidth hog” to “complete non-starter” depending on the debugger host and network.

12 GB sounds suspiciously large. Do you know how this breaks down between line tables, types, and debug locations? If it’s types, are you deduplicating them? For comparison, the debug info of LLDB (which contains two compilers and a debugger) compresses to under 500 MB, but perhaps the binaries you are working with really are just that much larger.

They really are that large.

In this kind of scenario, one could theoretically run the debug info process on the same NAS, cloud, or whatever as the symbol server. Then, rather than copying over an entire symbol file, it responds only to the query you issued – if you asked for a type, it just returns a packet describing the type you requested.

The API itself would be stateless (so that you could make queries for multiple targets in any order) as well as asynchronous (so that responses might arrive out of order). Blocking could be implemented in LLDB, but having the server be asynchronous means multiple clients could connect to the same server instance. This raises interesting possibilities. For example, one can imagine thousands of developers connecting to an internal symbol server on the network and being able to debug remote processes or core dumps over slow network connections or on machines with very little storage (e.g. chromebooks).

You could just run LLDB remotely ;)

That sounds all cool, but in my opinion you are leaving out the really important part: what is the abstraction level of the API going to be?

To be blunt, I’m against inventing yet another serialization format for types, not just because of the considerable engineering effort it will take to get this right, but also because of the maintenance burden it would impose. We already have to support loading types from DWARF, PDB, Clang modules, the Objective-C runtime, Swift modules, and probably more sources; all of these operate to some degree at different levels of abstraction. Adding another source or abstraction layer into the mix needs to be really well thought out and justified.

Let’s ignore whether the format can be serialized and instead focus on the abstraction level of the API. Personally, I think the format should be higher level than DWARF DIEs but lower level than an AST. By making it higher level than DWARF DIEs, we could use the same abstraction to represent PDB types and symbols as well, and by making it lower level than ASTs, we could support non-Clang TypeSystems. This way, you have one API which gives you “something” that you can trust and that works with any underlying debug info format, and one codepath that builds the AST from it, regardless of which debug info format and programming language it describes.

In a way, this is like separating DWARFASTParserClang / SymbolFileDWARF and PDBASTBuilder / SymbolFileNativePDB, and instead having some library called DebugInfoParser, and a single ASTParser class which says DIParser->ParseTypes() and then builds an AST from it without knowing what format it originated from.
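Roughly like the following sketch (all names invented here, including DebugInfoParser itself): one format-agnostic parsing interface with DWARF and PDB implementations behind it, and a single AST-building codepath that consumes the common representation without knowing which format it came from.

```cpp
// Layering sketch; every name below is hypothetical.
#include <memory>
#include <string>
#include <vector>

struct ParsedType { std::string Name; /* format-neutral description */ };

class DebugInfoParser {
public:
  virtual ~DebugInfoParser() = default;
  virtual std::vector<ParsedType> ParseTypes(const std::string &Name) = 0;
};

class DWARFDebugInfoParser : public DebugInfoParser {
public: // would wrap LLVM's DWARF parsing libraries
  std::vector<ParsedType> ParseTypes(const std::string &Name) override { return {}; }
};

class PDBDebugInfoParser : public DebugInfoParser {
public: // would wrap the native PDB reader
  std::vector<ParsedType> ParseTypes(const std::string &Name) override { return {}; }
};

// One AST-building codepath instead of DWARFASTParserClang plus PDB equivalents.
class ASTBuilder {
public:
  explicit ASTBuilder(std::unique_ptr<DebugInfoParser> P) : DIParser(std::move(P)) {}
  void BuildType(const std::string &Name) {
    for (const ParsedType &T : DIParser->ParseTypes(Name)) {
      // Construct the Clang AST (or another TypeSystem's types) from T.
      (void)T;
    }
  }

private:
  std::unique_ptr<DebugInfoParser> DIParser;
};
```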

On the LLDB side, all of this is hidden behind the SymbolFile interface, so most of LLDB doesn’t have to change at all. While this is in development, we could have SymbolFileRemote and keep the existing local codepath the default, until such time that it’s robust and complete enough that we can switch the default.

The SymbolFile interface ultimately vends compiler types so now I’m really curious what kind of data you are planning to send over the wire.

So again, let’s ignore “the wire” for the sake of this discussion. SymbolFile does vend compiler types, but that doesn’t mean we can’t have a single “master” SymbolFile implementation which a) calls into DebugInfoParser (which need not be out of process), and then b) uses the result of these library calls to construct an AST.

Wanted to bump this thread for visibility. If nothing else, I’m interested in an answer to this question, because if people agree that it would be valuable to test this going forward, we should work out a plan for what such tests would look like and how to refactor the code appropriately to make it possible.