Trying to understand symbol importing and its relationship to ASTs

I’m chasing a crash in lldb, and my current “that doesn’t seem right” has to do with a conflict between a decl and its origin decl (the transformation done at the beginning of tools/lldb/source/Expression/ClangASTSource.cpp:ClangASTSource::layoutRecordType()). So I’m trying to understand how decls and origin decls get set up during the symbol import process. Can anyone give me a sketch/hand? Specific questions include:

  • There are multiple ASTContexts involved (e.g. the src and dst contexts in the signature of tools/lldb/source/Symbol/ClangASTImporter.cpp:ClangASTImporter::CopyType); do those map to compilation units, or to shared library modules? Is there a simple way to tell what CU/.so an ASTContext maps to?
  • Does a decl always have an origin decl, even if it was loaded from an ASTContext (?) that has a complete definition?
  • When an origin decl is looked up, should all the types in it be completed, or might it have incomplete types? It seems as if there is code assuming that these types will always be complete.

Context (warning, gets detailed, possibly with irrelevant details because newbie): lldb is crashing in clang::ASTContext::getASTRecordLayout with the assertion “Cannot get layout of forward declarations!”. The type in question is an incomplete type (string16, aka. basic_string<unsigned short, …>). Normally clang::ASTContext::getASTRecordLayout() would call getExternalSource()->CompleteType() to complete the type, but in this case it isn’t because the type is marked as !hasExternalLexicalStorage().

The weird thing is that the type has previously been completed, further up the stack, but in a different AST node (same name). In more detail: class A contains an instance of class B, which contains an instance of class C (==string16). I’m seeing getASTRecordLayout called on class A, which then calls it (indirectly, through the EmptySubobjectMap constructor) on class B, which then calls it (ditto) on class C (all works). Then the stack unwinds up to the B call, which proceeds to the Builder.Layout() line in that function. It ends up (through the transformation mentioned above in ClangASTSource::layoutRecordType()) calling getASTRecordLayout() on the origin decl. When it recurses down to class C, that node isn’t complete, isn’t completed, and causes an assertion. So I’m trying to figure out whether the problem is that any decl hanging off an origin_decl should be complete, or that that node shouldn’t be marked as !hasExternalLexicalStorage(). (Or something else; I’ve already gone through several twists and turns debugging this problem :-}.)

The crash is reproducible, but one of the reproduction steps is “Build chrome”, so I figured I’d work on it some myself to teach myself lldb rather than try to file a bug on it. The wisdom of that choice is in question :-}.

Any thoughts anyone has would be welcome.

– Randy

This sounds like it might be a symptom of the -fstandalone-debug issue. There was some recent discussion on lldb-dev about it and in the past on cfe-dev.

First, are you building Chromium with Clang?

If you are, can you rebuild Chromium (or just the TU that has A’s debug info) with -fstandalone-debug and still reproduce the problem?

If so, we should still try to fix this in LLDB. It mostly just narrows down what kind of problem this is.

> I'm chasing a crash in lldb, and my current "that doesn't seem right" has to do with a conflict between a decl and its origin decl (the transformation done at the beginning of tools/lldb/source/Expression/ClangASTSource.cpp:ClangASTSource::layoutRecordType()). So I'm trying to understand how decls and origin decls get set up during the symbol import process. Can anyone give me a sketch/hand? Specific questions include:
> * There are multiple ASTContexts involved (e.g. the src and dst contexts in the signature of tools/lldb/source/Symbol/ClangASTImporter.cpp:ClangASTImporter::CopyType); do those map to compilation units, or to shared library modules? Is there a simple way to tell what CU/.so an ASTContext maps to?

Every executable file is represented by a lldb_private::Module (this includes both executables and shared libraries), and each lldb_private::Module has its own ASTContext: one per module, with all of the module's compilation units represented in that one big ASTContext. The DWARF debug info is parsed to create types in the ASTContext of the corresponding lldb_private::Module.

> * Does a decl always have an origin decl, even if it was loaded from an ASTContext (?) that has a complete definition?

The origin decl exists so we know where a decl originally came from, because the definition might not yet be complete (think "class Foo;") and might need to be completed. First, a little background on how we lazily parse classes.

When someone needs a type, we parse the type (SymbolFileDWARF::ParseType). If that type is a class we always just parse a forward decl to the class ("class Foo;"). The DWARF parser (SymbolFileDWARF) implements clang::ExternalASTSource so it can complete a type only when the compiler needs to know more. When the compiler or ClangASTType needs to know more about a type it asks the type to get a complete version of itself and SymbolFileDWARF::CompleteTagDecl is called to complete the type. We then parse all ivars, methods, and everything else about a type. We also assist in laying out the CXXRecordDecl by another callback SymbolFileDWARF::LayoutRecordType (which is part of the clang::ExternalASTSource). We need to assist in laying things out because the DWARF debug info doesn't always include all required attributes or #pragma information in order for us to create the types correctly. So this SymbolFileDWARF::LayoutRecordType allows us to tell the compiler about the offsets of ivars so they are always correct.
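To make the lazy-completion flow above a bit more concrete, here is a toy sketch in plain C++. It is not LLDB or clang code: ToyRecord, ToyDebugInfo, parse_type and complete are hypothetical names invented for illustration, standing in loosely for the record decl, the DWARF-backed external source, SymbolFileDWARF::ParseType and SymbolFileDWARF::CompleteTagDecl. The has_external_storage flag is only a rough analogue of what hasExternalLexicalStorage() signals.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for a record decl (hypothetical, not clang::RecordDecl).
struct ToyRecord {
    std::string name;
    bool is_complete = false;           // do we have the members yet?
    bool has_external_storage = false;  // can an external source fill it in?
    std::vector<std::string> fields;
};

// Toy stand-in for the debug-info-backed external source (hypothetical).
struct ToyDebugInfo {
    // Full definitions present in the debug info, keyed by type name.
    std::map<std::string, std::vector<std::string>> definitions;

    // Parsing a type only ever creates a forward declaration.
    ToyRecord parse_type(const std::string &name) const {
        ToyRecord r;
        r.name = name;
        r.has_external_storage = definitions.count(name) != 0;
        return r;
    }

    // Completion happens on demand, and only if a definition exists.
    bool complete(ToyRecord &r) const {
        if (r.is_complete) return true;
        if (!r.has_external_storage) return false;  // nobody to ask
        auto it = definitions.find(r.name);
        if (it == definitions.end()) return false;
        r.fields = it->second;
        r.is_complete = true;
        return true;
    }
};
```

In this model a record whose has_external_storage flag is false is never completed, which is the same shape as the situation described in the original question: the layout code has no external source to ask, so it is left holding a forward declaration.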

Back to origin decls: When running an expression we create a new ASTContext that is for the expression only. Decls are copied from the ASTContext for the lldb_private::Module over into the ASTContext for the expression. When they are copied, only forward decls are copied, and they may need to be completed. When this happens we might need to ask the type in the original ASTContext to complete itself so that we can copy a complete definition over into the expression ASTContext. This is the reason we track the origin decls. Sometimes you have a type that is only a forward decl, and that is ok as we don't always have the full definition of a class.
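The origin bookkeeping can be sketched with a small self-contained model. Everything here is hypothetical (ToyDecl, ToyContext, ToyImporter and the two helper callbacks are made up for this example and are not the ClangASTImporter API); the point is only to show a copy starting life as a forward decl and being completed by going back to the decl it was copied from.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct ToyDecl {
    std::string name;
    bool is_complete = false;
    std::vector<std::string> fields;
};

// One context per module, plus one for the expression (hypothetical model).
struct ToyContext {
    std::map<std::string, ToyDecl> decls;
};

// Hypothetical callbacks standing in for "the module completes its own decl".
bool fill_one_field(ToyDecl &d) {
    d.fields = {"m_x"};
    d.is_complete = true;
    return true;
}
bool never_completes(ToyDecl &) { return false; }  // a true forward decl

struct ToyImporter {
    // Maps each copied decl back to the decl it was copied from.
    std::map<const ToyDecl *, ToyDecl *> origins;

    // Copy only a forward declaration and remember where it came from.
    ToyDecl *copy_forward(ToyContext &dst, ToyDecl &src) {
        ToyDecl &copy = dst.decls[src.name];  // std::map nodes have stable addresses
        copy.name = src.name;
        origins[&copy] = &src;
        return &copy;
    }

    // Complete a copy by first completing its origin, then copying the members.
    bool complete(ToyDecl &copy, bool (*complete_in_module)(ToyDecl &)) {
        auto it = origins.find(&copy);
        if (it == origins.end()) return false;  // no origin recorded
        ToyDecl &origin = *it->second;
        if (!origin.is_complete && !complete_in_module(origin)) return false;
        copy.fields = origin.fields;
        copy.is_complete = true;
        return true;
    }
};
```

Passing never_completes models the "only a forward decl, and that is ok" case: the copy simply stays incomplete rather than being treated as an error.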

> * When an origin decl is looked up, should all the types in it be completed, or might it have incomplete types? It seems as if there is code assuming that these types will always be complete.

There are two forms of incomplete types:
1 - incomplete types that have full definitions and just haven't been completed (and might have to find the original decl, ask it to complete itself, then copy the origin decl when the current decl needs to be copied from one AST to another)
2 - types that are actually forward declarations and will be told they are just forward decls

So we sometimes do run into cases where we don't have the debug info for something because the compiler pulled it out trying to minimize the debug info.

> Context (warning, gets detailed, possibly with irrelevant details because newbie): lldb is crashing in clang::ASTContext::getASTRecordLayout with the assertion "Cannot get layout of forward declarations!". The type in question is an incomplete type (string16, aka. basic_string<unsigned short, ...>). Normally clang::ASTContext::getASTRecordLayout() would call getExternalSource()->CompleteType() to complete the type, but in this case it isn't because the type is marked as !hasExternalLexicalStorage().

That means the type was not complete in the DWARF for the lldb_private::Module it originates from.

> The *weird* thing is that the type has previously been completed, further up the stack, but in a different AST node (same name). In more detail: class A contains an instance of class B, which contains an instance of class C (==string16). I'm seeing getASTRecordLayout called on class A, which then calls it (indirectly, through the EmptySubobjectMap constructor) on class B, which then calls it (ditto) on class C (all works). Then the stack unwinds up to the B call, which proceeds to the Builder.Layout() line in that function. It ends up (through the transformation mentioned above in ClangASTSource::layoutRecordType()) calling getASTRecordLayout() on the origin decl. When it recurses down to class C, that node isn't complete, isn't completed, and causes an assertion. So I'm trying to figure out whether the problem is that any decl hanging off an origin_decl should be complete, or that that node shouldn't be marked as !hasExternalLexicalStorage(). (Or something else; I've already gone through several twists and turns debugging this problem :-}.)

We have a problem in the compiler currently where for classes like:

class A : public B
{
    ...
};

The compiler says "ahh, you didn't use class B so I am not going to emit debug info for it." This can really hose us up, because we now create an ASTContext for the expression and we want a definition for "A", and the user wants to call a method that is in class "B", but we can't because the compiler removed the definition. What we currently do is figure out that we have only a forward declaration of "B", and when we create type "A" in the module's ASTContext, we say "B" is an empty class with no ivars and no methods. To fix this, you can pass "-fstandalone-debug" to clang to tell it not to remove debug info for things that are inherited from.

The other problem we have: say you have two modules, "foo.dylib" and "bar.dylib", both with debug info. "foo.dylib" has debug info with a complete "A" and a complete "B" definition, but "bar.dylib" has a complete "A" definition and only a forward "B" declaration. The ASTContext for "foo.dylib" believes class "A" looks like it really is, while "bar.dylib" has a definition for "A" that believes it inherits from an empty class with no ivars and no methods. Now we write an expression that uses a variable from "foo.dylib" whose type is "A" and one from "bar.dylib" whose type is "A". We try to copy the definition of "A" from the source ASTContext in "foo.dylib" over into the expression AST (this works), and then we try to copy the version from "bar.dylib" into the expression context, and the AST copying code notices that the definitions for class "A" don't match. The copy would have worked if the copies of "A" were the same (nothing would have needed to be copied), but it fails when they are different. This is a known limitation of using the clang ASTContext classes to represent our types, and is also the reason "-fstandalone-debug" is the default setting for clang on Darwin, and probably should be for anyone else wanting to use lldb to debug.
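The foo.dylib/bar.dylib conflict can be modelled in a few lines. This is a deliberately simplified sketch, assuming a made-up representation (ToyClass, ToyModuleAST, structurally_equal, import_class are all hypothetical names); the real clang AST importer compares far more than field and base names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct ToyClass {
    std::string name;
    std::vector<std::string> bases;   // names of base classes
    std::vector<std::string> fields;  // names of data members
};

// One toy AST per module (or for the expression), keyed by class name.
using ToyModuleAST = std::map<std::string, ToyClass>;

// Compare a class as seen in two ASTs, recursing into base classes.
bool structurally_equal(const ToyModuleAST &lhs, const ToyModuleAST &rhs,
                        const std::string &name) {
    auto l = lhs.find(name);
    auto r = rhs.find(name);
    if (l == lhs.end() || r == rhs.end()) return false;
    if (l->second.fields != r->second.fields) return false;
    if (l->second.bases != r->second.bases) return false;
    for (const std::string &base : l->second.bases)
        if (!structurally_equal(lhs, rhs, base)) return false;
    return true;
}

enum class ImportResult { Copied, AlreadyPresent, Mismatch, NotFound };

// Copy `name` (and its bases) from a module AST into the expression AST.
ImportResult import_class(ToyModuleAST &expr_ast, const ToyModuleAST &src_ast,
                          const std::string &name) {
    if (expr_ast.count(name) != 0)
        return structurally_equal(expr_ast, src_ast, name)
                   ? ImportResult::AlreadyPresent
                   : ImportResult::Mismatch;
    auto it = src_ast.find(name);
    if (it == src_ast.end()) return ImportResult::NotFound;
    for (const std::string &base : it->second.bases) {
        ImportResult r = import_class(expr_ast, src_ast, base);
        if (r == ImportResult::Mismatch || r == ImportResult::NotFound) return r;
    }
    expr_ast[name] = it->second;
    return ImportResult::Copied;
}
```

With "foo" holding a B that has one field and "bar" holding a B that was forced to be empty, importing A from foo succeeds, importing it again from foo is a no-op, and importing it from bar reports a mismatch because the two views of the base class disagree.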

> The crash is reproducible, but one of the reproduction steps is "Build chrome", so I figured I'd work on it some myself to teach myself lldb rather than try to file a bug on it. The wisdom of that choice is in question :-}.
>
> Any thoughts anyone has would be welcome.

So try things out with -fstandalone-debug and see if that fixes your problems. If it does, it gives us a workaround for now, but in the long run we should really be fixing any crashing bugs in LLDB that occur due to this kind of issue.

I hope this helps you understand a bit more and gives you enough to go on.

Greg

> This sounds like it might be a symptom of the -fstandalone-debug
> issue. There was some recent discussion on lldb-dev about it and in the
> past on cfe-dev.

Yep. I thought it might be, but (as I read that initial discussion) it
sounded like building chrome with -fstandalone-debug wasn't a great idea
because it would blow out the linker. So I figured I'd dive in and try and
understand the problem in more detail and maybe fix it.

> First, are you building Chromium with Clang?
>
> If you are, can you rebuild Chromium (or just the TU that has A's debug
> info) with -fstandalone-debug and still reproduce the problem?

Yes, and no. So it is -fstandalone-debug related. I'll respond on Greg's
response with more questions; I'd like to understand what would be required
to fix this properly, and if it looks tractable, maybe take it on. (From a
naive, debug user perspective, it was really easy to run into debugging
chrome.)

-- Randy

Greg: Thanks very much for the detailed explanation! As I mentioned in my response to Reid, this does indeed seem related to the known -fstandalone-debug issue. I’d still like to dig down to the floor (i.e. to the point where I understand this specific issue), with a vague hope that it may be a reasonable thing for me to try and fix. So I’d like to ask you a couple of questions on your summary.

First question: Is there a tool to probe for symbol information (forward decl vs. full information) in a shared library? I see llvm-dwarfdump, but it looks to be just dumping symbols rather than interpreting them.

This really comes down to dumping the DWARF. We have a dwarfdump command on MacOSX; if you have access to a Mac, I can help you see just the information you want, as llvm-dwarfdump doesn't have the tools we need (look up a DWARF debug info entry (DIE) by name or by offset, dump a single DIE with its children/parents, etc.).

> So that sounds like it could be my situation (with A (defined in liba) containing B (defined in libb) rather than inheriting from it, but I'd think that'd be identical from a layout perspective). But I'm not quite seeing how that maps to the execution flow I'm seeing in my debugging. If I understand your description above correctly, what I was seeing was CompleteType called on the forward decl of my A, and called successfully; both A & B were fully populated. But then later we got the origin decl for A, and CompleteType was called on it, and B was not filled out in that.

If this is the case where A and B were complete in the source AST and copied to a destination AST and B wasn't able to be completed, it might be just a need to complete the inherited class B in the source AST prior to copying it to the dest AST. I would be very surprised if this is the issue though since we wouldn't be able to complete class A without first having completed class B in the source AST.

> Is it that the first CompleteType was done in the expression ASTContext (which presumably has access to search all the library ASTContexts) and the second one was done in the context of the liba ASTContext, and so didn't have access to the libb information? And if so, why isn't the first one strictly better?

So currently everything _only_ has visibility into its own AST when making types within an AST. So if liba has a complete A but a forward B, that is how the type would be represented in liba. When we are displaying a type later, we are able to grab the type from any AST if we know it is a forward decl, but if liba has a complete A and it inherits from a forward decl B, we will tell B within liba that it is complete and has no ivars or methods; otherwise the clang code that we use to build the module's AST will assert and kill your program because it is unhappy with the class you are trying to create...

> The crash is reproducible, but one of the reproduction steps is "Build chrome", so I figured I'd work on it some myself to teach myself lldb rather than try to file a bug on it. The wisdom of that choice in question :-}.
>
> Any thoughts anyone has would be welcome.

> So try things out with -fstandalone-debug and see if that fixes your problems. If it does, it gives us a workaround for now, but in the long run we should really be fixing any crashing bugs in LLDB that occur due to this kind of issue.

> Do you have a sense of what the proper fix would be?

Just make sure LLDB does the best it can with the information it is given. In the above case as described, if we have a full A and forward decl B, we end up with the notion that we have:

class B {};

class A : public B {
    ... all ivars and methods for A
};

So we lose debugging fidelity because all debug info for B is not around.
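One way to see what is lost: if a debugger only knows B as an empty class, a layout computed naively from that view no longer matches what the compiler actually produced. The sketch below is ordinary C++ with two hypothetical namespaces (as_compiled and as_seen_by_debugger are labels invented here, and m_int and m_a are placeholder members). It assumes a mainstream ABI where an empty base class takes no space in the derived object.

```cpp
#include <cassert>
#include <cstddef>

namespace as_compiled {             // what the program was really built with
struct B { int m_int; };
struct A : public B { int m_a; };
}

namespace as_seen_by_debugger {     // B was only a forward decl in the debug info
struct B {};                        // ...so it was completed as an empty class
struct A : public B { int m_a; };
}

// Size of A as the compiler laid it out.
std::size_t real_size() { return sizeof(as_compiled::A); }

// Size of A if it were laid out from the "empty B" view alone.
std::size_t guessed_size() { return sizeof(as_seen_by_debugger::A); }
```

The two sizes differ, which gives a feel for why having the debug info supply member offsets (the job described earlier for SymbolFileDWARF::LayoutRecordType) matters so much when part of a class hierarchy is missing.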

> In the previous thread I think you indicated that the compiler should emit debug information à la -fstandalone-debug, and the linker should collapse the information back down, but in this case it seems like the debugger should be able to find the information in the other shared library (though I do understand that there's a more general problem that doesn't solve, when the debugging information isn't emitted anywhere for a particular class).

If there is a full definition for B _somewhere_ in liba, then we are good and this should work. If it isn't working this is the bug we need to fix. But if B is in another library like libb, then as far as we know for the type of A within liba, B is a forward declaration or just an empty base class.

Everything within a module is self contained, so all types are only derived from types from the current module. We have to keep things this way because you might unload libb.dylib and reload a newer version of libb.dylib. If we allowed modules to grab information from other modules, then we would have a large dependency graph to follow when a module is replaced... So if we copied a copy of B from libb.dylib before it was rebuilt, then we start debugging something that uses liba.dylib, and then libb.dylib gets reloaded... Which version of "B" do you want if A hasn't been updated? The old "B" or the new "B"? And who is to say that the version of "B" that we imported from libb.dylib was correct in the first place? Maybe someone built liba.dylib when B looked like:

class B {
public:
    int m_int;
};

but libb.dylib was rebuilt so it now looks like:

class B {
public:
    int m_int[32];
};

But you still start a debug session with the liba.dylib that was built with the old B, but you pull in the debug info from the new libb.dylib.... You see where I am going with this? The only thing we can trust as far as debug information goes is the binary itself and its debug info. That guarantees we are as correct as possible, keeps us from having to try and track dependencies between modules.
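The skew described above is easy to demonstrate with the two versions of B. In this sketch the namespaces (liba_build, libb_rebuilt, mixed) and the containing struct A with its tail member are hypothetical, chosen only to show how an offset inside A moves if the debugger borrows the newer B while the binary was built against the older one.

```cpp
#include <cassert>
#include <cstddef>

namespace liba_build {        // B as it looked when liba.dylib was compiled
struct B { int m_int; };
struct A { B b; int tail; };
}

namespace libb_rebuilt {      // B after libb.dylib was rebuilt
struct B { int m_int[32]; };
}

namespace mixed {             // A described using the *new* B by mistake
struct A { libb_rebuilt::B b; int tail; };
}

// Where `tail` really lives in objects created by liba.dylib.
std::size_t tail_offset_in_binary() { return offsetof(liba_build::A, tail); }

// Where a debugger would look if it pulled B from the rebuilt libb.dylib.
std::size_t tail_offset_if_b_borrowed() { return offsetof(mixed::A, tail); }
```

The two offsets disagree by the size of 31 ints, so every member of A after the B subobject would be read from the wrong place. Trusting only the module's own debug info avoids that class of error.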

One thing that is important to understand: when you display variables, we can pull information from any module. So if you have a class C:

class C {
public:
    B *m_b;
};

C c;

When we display this using "frame variable c" or "expression c" and try to display "B *m_b", we ask the class B if it is a forward decl. If it is, the frame variable code will search all modules from the target we are debugging (usually a couple of hundred different shared libraries) for the real definition of "B", and then use that when we expand "m_b" so we can view its ivars. So the variable display code knows how to always look for the real definition of things, but the type within each clang AST will only have visibility into its own module.
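The display-time lookup can be sketched as a linear search over the target's modules. The types and functions here (ToyType, ToyModule, find_complete_definition, type_for_display) are hypothetical and only model the behaviour described above; they are not LLDB's actual lookup code.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct ToyType {
    std::string name;
    bool is_forward_decl;
    std::vector<std::string> fields;
};

struct ToyModule {
    std::string file;                      // e.g. "liba.dylib"
    std::map<std::string, ToyType> types;  // this module's own view of each type
};

// Return the first complete definition of `name` in any module, or nullptr.
const ToyType *find_complete_definition(const std::vector<ToyModule> &modules,
                                        const std::string &name) {
    for (const ToyModule &m : modules) {
        auto it = m.types.find(name);
        if (it != m.types.end() && !it->second.is_forward_decl)
            return &it->second;
    }
    return nullptr;
}

// Pick the type to expand when displaying a variable: the local one if it is
// complete, otherwise the best definition found anywhere in the target.
const ToyType *type_for_display(const ToyType &local,
                                const std::vector<ToyModule> &modules) {
    if (!local.is_forward_decl) return &local;
    const ToyType *full = find_complete_definition(modules, local.name);
    return full ? full : &local;
}
```

Note that the search only happens at display time; each module's own map is never modified, which matches the rule that a module's AST stays self contained.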

One unfortunate side effect of having to complete "B" for class "A" when it looks like:

class A : public B {
    ... all ivars and methods for A
};

We told B it was complete and has no ivars or methods, to keep clang happy so it doesn't assert and kill the debugger. So any other variables within that same module that have a "B *" ivar that was just a forward decl will think they have the complete definition of "B". Part of the solution to the issues you are running into is to mark the record decl for "B" in a way that says "I had to complete this type by telling it that it has no ivars or methods, but it was really a forward decl". That way, when we try to display a type C from above (if C comes from a module with a full A that inherits from a forward B), we know to still try to find the full definition of "B" somewhere else.
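A minimal sketch of that marker idea, assuming a made-up record type: ToyRecord, forcefully_completed, force_complete_as_empty and should_search_other_modules are all hypothetical names used for illustration, not existing LLDB or clang members.

```cpp
#include <cassert>
#include <string>

struct ToyRecord {
    std::string name;
    bool is_complete = false;
    // Hypothetical marker: "complete" only because we had to pretend.
    bool forcefully_completed = false;
};

// Used when a forward-declared record is needed as a base class and must
// look complete so that building the derived class does not assert.
void force_complete_as_empty(ToyRecord &r) {
    if (r.is_complete) return;  // a real definition is never downgraded
    r.is_complete = true;
    r.forcefully_completed = true;
}

// Display code would consult this before trusting the local definition.
bool should_search_other_modules(const ToyRecord &r) {
    return !r.is_complete || r.forcefully_completed;
}
```

The design choice in this sketch is to keep the record "complete" as far as the AST builder is concerned, while giving display code a separate bit to check, so the two consumers do not have to agree on what complete means.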

I hope this clears up some of the reasons for the way things are and helps you understand more the scope of the problem.

Greg

> This comes down to really dumping the DWARF. We have a dwarfdump command
> on MacOSX, if you have access to a Mac I can help you with how to just see
> the information you want to as llvm-dwarfdump doesn't have the tools we
> need (lookup a DWARF debug info entry (DIE) by name, or by offset, dump a
> single DIE with children/parents, etc).

I do--my laptop's a mac, and while it's not as beefy as my linux box, it's
serviceable. I'll try and take a look with that utility (which I'm hoping
will work on linux binaries).

>
> >
> >
> > I'm chasing a crash in lldb, and my current "that doesn't seem right"
has to do with a conflict between a decl and its origin decl (the
transformation done at the beginning of
tools/lldb/source/Expression/ClangASTSource.cpp:ClangASTSource::layoutRecordType()).
So I'm trying to understand how decls and origin decls get setup during
the symbol import process. Can anyone give me a sketch/hand? Specific
questions include:
> > * There are multiple ASTContexts involved (e.g. the src and dst
contexts in the signature of
tools/lldb/source/Symbol/ClangASTImporter.cpp:ClangASTImporter::CopyType);
do those map to compilation units, or to shared library modules? Is there
a simple way to tell what CU/.so an ASTContext maps to?
>
> Every executable file is represented by a lldb_private::Module (this
includes both executables and shared libraries) and each
lldb_private::Module has its own ASTContext (one per module, and all
compilation units are all represented in one big ASTContext). The DWARF
debug info is parsed and it creates types in the ASTContext in the
corresponding lldb_private::Module.
>
> > * Does a decl always have an origin decl, even if it was loaded from
an ASTContext (?) that has a complete definition?
>
> Origin decl is so we know where a decl originally came from because the
definition might not yet be complete (think "class Foo;") and might need to
be completed. A little background on how we lazily parse classes.
>
> When someone needs a type, we parse the type
(SymbolFileDWARF::ParseType). If that type is a class we always just parse
a forward decl to the class ("class Foo;"). The DWARF parser
(SymbolFileDWARF) implements clang::ExternalASTSource so it can complete a
type only when the compiler needs to know more. When the compiler or
ClangASTType needs to know more about a type it asks the type to get a
complete version of itself and SymbolFileDWARF::CompleteTagDecl is called
to complete the type. We then parse all ivars, methods, and everything else
about a type. We also assist in laying out the CXXRecordDecl by another
callback SymbolFileDWARF::LayoutRecordType (which is part of the
clang::ExternalASTSource). We need to assist in laying things out because
the DWARF debug info doesn't always include all required attributes or
#pragma information in order for us to create the types correctly. So this
SymbolFileDWARF::LayoutRecordType allows us to tell the compiler about the
offsets of ivars so they are always correct.
>
> Back to origin decls: When running an expression we create a new
ASTContext that is for the expression only. decls are copied from the
ASTContext for the lldb_private::Module over into the ASTContext for the
expression. When they are copied, only a forward decls are copied, and they
may need to be completed. When this happens we might need to ask the type
in the original ASTContext to complete itself so that we can copy a
complete definition over into the expression ASTContext. This is the reason
we track the origin decls. Sometimes you have a type that is only a forward
decl, and that is ok as we don't always have the full definition of a class.
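
The origin tracking described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the ClangASTImporter: ToyDecl, ToyContext, ToyImporter and the definition_available flag are hypothetical, and "completing" the origin is reduced to flipping a flag.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model only; none of these names are lldb or clang APIs.
struct ToyDecl {
    std::string name;
    bool is_complete = false;           // false: only "class Foo;" so far
    bool definition_available = false;  // debug info holds a full definition
    std::vector<std::string> members;   // stands in for parsed ivars/methods
};

// One context per module, plus one for the expression.
struct ToyContext {
    std::map<std::string, ToyDecl> decls;
};

class ToyImporter {
public:
    // Copy only a forward declaration and remember where it came from.
    ToyDecl *CopyForwardDecl(ToyContext &dst, ToyContext &src,
                             const std::string &name) {
        assert(&dst != &src);
        auto it = src.decls.find(name);
        if (it == src.decls.end()) return nullptr;
        ToyDecl &copy = dst.decls[name];
        copy.name = name;
        origins_[&copy] = &it->second;  // std::map nodes have stable addresses
        return &copy;
    }

    // Complete the copy by going back to its origin decl.
    bool Complete(ToyDecl &copy) {
        auto it = origins_.find(&copy);
        if (it == origins_.end()) return false;  // no origin recorded
        ToyDecl &origin = *it->second;
        if (!origin.is_complete) {
            if (!origin.definition_available) return false;  // true forward decl
            origin.is_complete = true;  // stands in for completing in the module
        }
        copy.members = origin.members;
        copy.is_complete = true;
        return true;
    }

private:
    std::map<const ToyDecl *, ToyDecl *> origins_;
};
```

The point of the map is only that the expression-side copy can find its way back to the module-side decl; a copy whose origin has no full definition stays a forward decl.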
>
> > * When an origin decl is looked up, should all the types in it be
completed, or might it have incomplete types? It seems as if there is code
assuming that these types will always be complete.
>
> There are two forms of incomplete types:
> 1 - incomplete types that have full definitions and just haven't been
completed (and might have to find the original decl, ask it to complete
itself, then copy the origin decl when the current decl needs to be copied
from one AST to another)
> 2 - types that are actually forward declarations and will be told they
are just forward decls
>
> So we sometimes do run into cases where we don't have the debug info for
something because the compiler pulled it out trying to minimize the debug
info.
>
> >
> > Context (warning, gets detailed, possibly with irrelevant details
because newbie): lldb is crashing in clang::ASTContext::getASTRecordLayout
with the assertion "Cannot get layout of forward declarations!". The type
in question is an incomplete type (string16, aka. basic_string<unsigned
short, ...>). Normally clang::ASTContext::getASTRecordLayout() would call
getExternalSource()->CompleteType() to complete the type, but in this case
it isn't because the type is marked as !hasExternalLexicalStorage().
>
> That means the type was not complete in the DWARF for the
lldb_private::Module it originates from.
> >
> > The *weird* thing is that the type has previously been completed,
further up the stack, but in a different AST node (same name). In more
detail: Class A contains an instance of class B contains an instance of
class C (==string16). I'm seeing getASTRecordLayout called on class A,
which then calls it (indirectly, through the EmptySubobjectMap constructor)
on class B, which then calls it (ditto) on class C (all works). Then the
stack unwinds up to the B call, which proceeds to the Builder.Layout() line
in that function. It ends up (through the transformation mentioned above
in clang::ClangASTSource::LayoutRecordType()) calling getASTRecordLayout()
on the origin decl. When it recurses down to class C, that node isn't
complete, isn't completed, and causes an assertion. So I'm trying to
figure out whether the problem is that any decl hanging off an origin_decl
should be complete, or that that node shouldn't be marked as
!hasExternalLexicalStorage(). (Or something else; I've already gone
through several twists and turns debugging this problem :-}.)
>
> We have a problem in the compiler currently where for classes like:
>
> class A : public B
> {
> ...
> };
>
> The compiler says "ahh, you didn't use class B so I am not going to emit
debug info for it." This really can hose us up because we now create an
ASTContext for the expression and we want a definition for "A" and the user
wants to call a method that is in class "B", but we can't because the
compiler removed the definition. What we currently do is figure out that we
have a forward declaration to "B" only, and when we create type "A" in the
module's ASTContext, we say "B" is an empty class with no ivars and no
methods. To fix this, you can specify "-fstandalone-debug" to the clang
compiler to tell it not to do this removal of debug info for things that
are inherited from.
>
>
> The other problem we have is, say you have two modules "foo.dylib" and
"bar.dylib", both have debug info, and "foo.dylib" has debug info with a
complete "A" and complete "B" definition, but "bar.dylib" has a complete
"A" definition, but only a forward "B" definition. The ASTContext for
foo.dylib believes class "A" to look like it really is, and "bar.dylib" has
a definition for "A" that believes it inherits from an empty class with no
ivars and no methods. Now we write an expression that uses a variable in
"foo.dylib" whose type is "A" and one from "bar.dylib" whose type is "A"
and we try to copy the definitions for "A" from the source ASTContext in
"foo.dylib" over into the expression AST (this works) and then we try to
copy the version from "bar.dylib" into the expression context and the AST
copying code notices that the definitions for class "A" don't match. The
copy would have worked if the copies of "A" were the same, and nothing would
have been copied, but it fails when they are different. This is a known
limitation of using the clang ASTContext classes to represent our types and
is also the reason "-fstandalone-debug" is the default setting for
clang on Darwin, and probably should be for anyone else wanting to use lldb
to debug.
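
A minimal sketch of the "foo.dylib" versus "bar.dylib" conflict, assuming a toy importer that compares classes structurally by name and member lists. ToyClass, ImportResult and ToyImport are hypothetical names for illustration; the real comparison in clang is far more detailed than this.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model only; not the clang AST importing code.
struct ToyClass {
    std::string name;
    std::vector<std::string> base_members;  // members inherited from the base
    std::vector<std::string> own_members;

    bool SameStructure(const ToyClass &other) const {
        return name == other.name && base_members == other.base_members &&
               own_members == other.own_members;
    }
};

enum class ImportResult { Copied, AlreadyPresent, Conflict };

// Import `from` into the expression context, keyed by class name.
ImportResult ToyImport(std::map<std::string, ToyClass> &expr_ctx,
                       const ToyClass &from) {
    assert(!from.name.empty());
    auto it = expr_ctx.find(from.name);
    if (it == expr_ctx.end()) {
        expr_ctx[from.name] = from;  // first version seen wins
        return ImportResult::Copied;
    }
    // Same structure: nothing to copy. Different structure: the import fails.
    return it->second.SameStructure(from) ? ImportResult::AlreadyPresent
                                          : ImportResult::Conflict;
}
```

Importing the "foo.dylib" version of "A" (base with real members) and then the "bar.dylib" version (base treated as empty) yields Copied followed by Conflict in this model.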
>
> So that sounds like it could be my situation (with A (defined in liba)
containing B (defined in libb) rather than inheriting from it, but I'd
think that'd be identical from a layout perspective). But I'm not quite
seeing how that maps to the execution flow I'm seeing in my debugging. If
I understand your description above correctly, what I was seeing was
CompleteType called on the forward decl of my A, and called successfully;
both A & B were fully populated. But then later we got the origin decl for
A, and CompleteType was called on it, and B was not filled out in that.

If this is the case where A and B were complete in the source AST and
copied to a destination AST and B wasn't able to be completed, it might be
just a need to complete the inherited class B in the source AST prior to
copying it to the dest AST. I would be very surprised if this is the issue
though since we wouldn't be able to complete class A without first having
completed class B in the source AST.

> Is it that the first CompleteType was done in the expression ASTContext
(which presumably has access to search all the library ASTContexts) and the
second one was done in the context of the liba ASTContext, and so didn't
have access to the libb information? And if so, why isn't the first one
strictly better?

So currently everything _only_ has visibility in their own AST when making
types within an AST. So if liba has a complete A but a forward B, that is
how the type would be represented in liba. When we are displaying a type
later, we are able to grab the type from any AST if we know it is a forward
decl, but if liba has a complete A and it inherits from a forward decl B,
we will tell B within liba that it is complete and has no ivars or methods,
otherwise the clang code that we use to build the module's AST will assert
and kill your program because it is unhappy with the class you are trying to
create...

>
>
> > The crash is reproducible, but one of the reproduction steps is "Build
chrome", so I figured I'd work on it some myself to teach myself lldb
rather than try to file a bug on it. The wisdom of that choice in
question :-}.
> >
> > Any thoughts anyone has would be welcome.
>
> So try things out with -fstandalone-debug and see if that fixes your
problems. If it does it gives us a work around for now, but we should
really be fixing any crashing bugs that occur due to this kind of issue in
LLDB in the long run.
>
> Do you have a sense of what the proper fix would be?

Just make sure LLDB does the best it can with the information it is given.
In the above case as described, if we have a full A and forward decl B, we
end up with the notion that we have:

class B {};

class A : public B {
    ... all ivars and methods for A
};

So we lose debugging fidelity because all debug info for B is not around.

> In the previous thread I think you indicated that the compiler should
emit debug information a la' -fstandalone-debug, and the linker should
collapse the information back down, but in this case it seems like the
debugger should be able to find the information in the other shared library
(though I do understand that there's a more general problem that doesn't
solve, when the debugging information isn't emitted anywhere for a
particular class).

If there is a full definition for B _somewhere_ in liba, then we are good
and this should work. If it isn't working this is the bug we need to fix.
But if B is in another library like libb, then as far as we know for the
type of A within liba, B is a forward declaration or just an empty base
class.

Ok, that seems like the key issue, then, and I should be able to figure out
relatively easily which case we're in.

Everything within a module is self contained, so all types are only
derived from types from the current module. We have to keep things this way
because you might unload libb.dylib and reload a newer version of
libb.dylib. If we allowed modules to grab information from other modules,
then we would have a large dependency graph to follow when a module is
replaced... So if we copied a copy of B from libb.dylib before it was
rebuilt, then we start debugging something that uses liba.dylib, and then
libb.dylib gets reloaded... Which version of "B" do you want if A hasn't
been updated? The old "B" or the new "B"? And who is to say that the
version of "B" that we imported from libb.dylib was correct in the first
place? Maybe someone built liba.dylib when B looked like:

class B {
public:
    int m_int;
};

but libb.dylib was rebuilt so it now looks like:

class B {
public:
    int m_int[32];
};

But you still start a debug session with the liba.dylib that was built
with the old B, but you pull in the debug info from the new libb.dylib....
You see where I am going with this? The only thing we can trust as far as
debug information goes is the binary itself and its debug info. That
guarantees we are as correct as possible and keeps us from having to try and
track dependencies between modules.
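
To see why mixing the two versions of B would be wrong, here is a small sketch that puts both definitions side by side. The namespaces and the containing struct A (a B followed by one more member) are assumptions made only for illustration; they do not come from any real library.

```cpp
#include <cassert>
#include <cstddef>

namespace built_against {  // B as liba.dylib saw it when it was built
struct B { int m_int; };
struct A { B b; int after_b; };  // hypothetical user of B
}  // namespace built_against

namespace reloaded {  // B after libb.dylib was rebuilt
struct B { int m_int[32]; };
struct A { B b; int after_b; };  // same source text, different layout
}  // namespace reloaded

// Byte offset of the member that follows B under each definition.
constexpr std::size_t OffsetWithOldB() {
    return offsetof(built_against::A, after_b);
}
constexpr std::size_t OffsetWithNewB() {
    return offsetof(reloaded::A, after_b);
}

static_assert(sizeof(reloaded::B) == 32 * sizeof(built_against::B),
              "the rebuilt B is 32 times larger");
```

If the debugger described liba.dylib's A using the rebuilt B, it would read after_b from the offset given by OffsetWithNewB() while the code in liba.dylib actually stores it at OffsetWithOldB().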

One thing that is important to understand: when you display variables, we
can pull information from any module. So if you have a class C:

class C {
public:
    B *m_b;
};

C c;

When we display this using "frame variable c" or using "expression c",
and we try to display "B *m_b", we will ask the class B if it is a forward
decl, and if it is, the frame variable code will search all modules from the
target we are using to debug (usually a couple of hundred different shared
libraries) for the real definition of "B" and then use that when we try to
expand "m_b" so we can view its ivars. So the variable display code knows
how to always look for the real definition of things, but the type within
each clang AST will only have visibility into its own module.
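
The lookup done by the variable display code can be pictured with a toy sketch like the one below. ToyType, ToyModule, FindCompleteDefinition and TypeForDisplay are hypothetical names; the actual search in lldb is not shown in this thread, so treat this only as an illustration of the idea.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model only; not lldb's variable display code.
struct ToyType {
    std::string name;
    bool is_forward_decl = true;
    std::vector<std::string> ivars;
};

struct ToyModule {
    std::string file_name;
    std::map<std::string, ToyType> types;  // this module's own view of each type
};

// Return the first complete definition of `name` in any module of the target,
// or nullptr if no module has more than a forward declaration.
const ToyType *FindCompleteDefinition(
    const std::vector<ToyModule> &target_modules, const std::string &name) {
    for (const ToyModule &m : target_modules) {
        auto it = m.types.find(name);
        if (it != m.types.end() && !it->second.is_forward_decl)
            return &it->second;
    }
    return nullptr;
}

// What the display code would use when expanding a "B *" member seen in `home`.
const ToyType *TypeForDisplay(const ToyModule &home,
                              const std::vector<ToyModule> &target_modules,
                              const std::string &name) {
    assert(!name.empty());
    auto it = home.types.find(name);
    if (it != home.types.end() && !it->second.is_forward_decl)
        return &it->second;  // the home module already has the definition
    return FindCompleteDefinition(target_modules, name);  // look elsewhere
}
```

Note that the home module's own table is never modified here; only the display path looks across modules, which matches the separation described above.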

Amusingly, I'm doing almost exactly that. I'm stopped in a method of class
A, printing a member variable b of A which is of class B*, and both B & A
are defined in a single library. But the printing of b requires laying A
out, which requires completing A's type, which means completing a type also
defined in the same library for another member variable, which then
requires completing a type I believe is defined in a different library. My
presumption is that it's the laying out that gets me in trouble (because
it's important to keep the types separate between the shared libs).

One unfortunate side effect of having to complete "B" for class "A" when
it looks like:

class A : public B {
    ... all ivars and methods for A
};

We told B it was complete and has no ivars or methods to keep clang happy
so it doesn't assert and kill the debugger. So any other variables within
that same module that have a "B *" ivar that was just a forward decl will
think they have the complete definition of "B". Part of the solution to the
issues you are running into is to mark the record decl for "B" in a way
that says "I had to complete this type by telling it that it has no ivars
or methods, but it was really a forward decl". That way when we try to
display a type C from above (if C comes from a module with a full A that
inherits from a forward B), we know to still try and find the full
definition of "B" from somewhere else.
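
One way to picture the proposed marking is a single extra flag on the record, as in this sketch. The forcefully_completed field and both helper functions are hypothetical; they are not existing clang or lldb members, just an illustration of the idea in the paragraph above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only; `forcefully_completed` is a hypothetical flag.
struct ToyRecord {
    std::string name;
    bool is_complete = false;
    bool forcefully_completed = false;
    std::vector<std::string> ivars;
};

// Called when a forward-declared record must serve as a base class.
void CompleteAsEmptyBase(ToyRecord &base) {
    assert(!base.name.empty());
    if (base.is_complete) return;      // a real definition is left untouched
    base.is_complete = true;           // keeps the AST builder from asserting
    base.forcefully_completed = true;  // remembers that this is a placeholder
}

// True when display or import code should still look for the real
// definition in other modules.
bool NeedsRealDefinition(const ToyRecord &r) {
    return !r.is_complete || r.forcefully_completed;
}
```

With the flag in place, a "B *" ivar elsewhere in the same module would no longer mistake the empty placeholder for the full definition of "B".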

I hope this clears up some of the reasons for the way things are and helps
you understand more the scope of the problem.

It does; thank you. I want to keep digging until I'm certain that this
maps to the same issue. If it does, I'm probably not going to want to take
on fixing the linker to collapse redundant debug information (which sounds
like what's necessary), at least at this stage of my lldb/llvm engagement.
Maybe after I've gotten a couple more small changes under my belt.

-- Randy

Just to report back with my final analysis on this thread, in case it’s useful for other folks:

  • This was indeed a problem of a type not being fully defined in the shared library in which it was used. I couldn’t use the mac dwarfdump on linux binaries, but pyelftools (https://github.com/eliben/pyelftools) was pretty easy to hack to do what I wanted. Slow, but it worked.

  • I did a bit of investigation as to how painful using -fstandalone-debug would be for me as a workaround. In my usual, shared library build configuration, it increases build time by 245s = 11%. It’s a bit tricky for me to tell how much of that is in the compilation phase and how much in the linking phase (there are a lot of separate linking phases in the chrome shared library build), but in one shared library (net) build times went from 70.8s to 76.5s and linking went from 3.1 to 4.0s. Perfectly acceptable, but not ideal long-term.

  • I also looked at static linking; in this case the final link went from 41.8s to 92.3s. Possibly more worrisomely, the Max RSS went from 12GB to 26GB (i.e. the size of the machine required to successfully link chrome may have doubled, though maybe the RSS would have been smaller with more memory pressure–my machine main memory is 32GB).

  • Part of the reason I did that analysis is that IIUC, the currently proposed solution to this problem is to make compiling -fstandalone-debug the default, and then change the linker to eliminate duplicate symbol information. In other words, compilation (and compilation times) will be the same as what we observe today with -fstandalone-debug, and the linker will have more work to do, though possibly we can keep the RSS down if we’re clever. So for compile/link performance, I’m not sure the currently proposed solution is ideal. (Though I’m sure it would help binary size, and likely debugger symbol read performance.)
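
For comparing the measurements above on one scale, the relative increases can be worked out with a couple of small helpers. This is only arithmetic on the numbers quoted in the list; the function names are made up. The results come out to roughly 8% for the net compile, 29% for the net link, 121% for the static link time, 117% for Max RSS, and a baseline build time of about 2227s implied by "245s = 11%".

```cpp
#include <cassert>

// Relative increase from `before` to `after`, in percent.
double PercentIncrease(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

// Baseline implied by "an increase of `delta` is `percent` percent".
double ImpliedBaseline(double delta, double percent) {
    assert(percent > 0.0);
    return delta / percent * 100.0;
}
```

For example, PercentIncrease(41.8, 92.3) gives the static link slowdown and ImpliedBaseline(245.0, 11.0) gives the approximate build time without -fstandalone-debug.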

Just musing: How bad would it be to make unloading a shared library a very expensive (in the debugger) operation? Specifically, just nuke all the symbol information already read in? I think that would solve the interdependency problem you mentioned in your last email (with a sledgehammer, but it’s a solution :-J) and IIUC it would allow us to look up types between shared libraries without any worries.

– Randy

> Just to report back with my final analysis on this thread, in case it's useful for other folks:
>
> * This was indeed a problem of a type not being fully defined in the shared library in which it was used. I couldn't use the mac dwarfdump on linux binaries, but pyelftools (https://github.com/eliben/pyelftools) was pretty easy to hack to do what I wanted. Slow, but it worked.
>
> * I did a bit of investigation as to how painful using -fstandalone-debug would be for me as a workaround. In my usual, shared library build configuration, it increases build time by 245s = 11%. It's a bit tricky for me to tell how much of that is in the compilation phase and how much in the linking phase (there are a lot of separate linking phases in the chrome shared library build), but in one shared library (net) build times went from 70.8s to 76.5s and linking went from 3.1 to 4.0s. Perfectly acceptable, but not ideal long-term.

Agreed, ok for now but we should solve this in the long term.

> * I also looked at static linking; in this case the final link went from 41.8s to 92.3s. Possibly more worrisomely, the Max RSS went from 12GB to 26GB (i.e. the size of the machine required to successfully link chrome may have doubled, though maybe the RSS would have been smaller with more memory pressure--my machine main memory is 32GB).
>
> * Part of the reason I did that analysis is that IIUC, the currently proposed solution to this problem is to make compiling -fstandalone-debug the default, and then change the linker to eliminate duplicate symbol information. In other words, compilation (and compilation times) will be the same as what we observe today with -fstandalone-debug, and the linker will have *more* work to do, though possibly we can keep the RSS down if we're clever. So for compile/link performance, I'm not sure the currently proposed solution is ideal. (Though I'm sure it would help binary size, and likely debugger symbol read performance.)
>
> Just musing: How bad would it be to make unloading a shared library a very expensive (in the debugger) operation?

Not sure what this would solve. If a shared library changes, we should reload it as soon as we detect this. On MacOSX we have a UUID in each binary that changes when any important bits change.

> Specifically, just nuke *all* the symbol information already read in? I think that would solve the interdependency problem you mentioned in your last email (with a sledgehammer, but it's a solution :-J) and IIUC it would allow us to look up types between shared libraries without any worries.

That works for single binary debugging, but we strive to be able to share the debug info in each module. If we make any modifications it should be to the ClangASTImporter class and have it be able to recognize when it imported an incomplete class that was completed only to keep clang happy, and have it "do the right thing" by finding the real definition and importing the right version. Then -fno-standalone-debug can work. This would involve marking the classes/structs in the clang ASTs that were incorrectly completed to keep clang happy, and being able to track that during imports.

>
> Just to report back with my final analysis on this thread, in case it's
useful for other folks:
>
> * This was indeed a problem of a type not being fully defined in the
shared library in which it was used. I couldn't use the mac dwarfdump on
linux binaries, but pyelftools (https://github.com/eliben/pyelftools) was
pretty easy to hack to do what I wanted. Slow, but it worked.
>
> * I did a bit of investigation as to how painful using
-fstandalone-debug would be for me as a workaround. In my usual, shared
library build configuration, it increases build time by 245s = 11%. It's a
bit tricky for me to tell how much of that is in the compilation phase and
how much in the linking phase (there are a lot of separate linking phases
in the chrome shared library build), but in one shared library (net) build
times went from 70.8s to 76.5s and linking went from 3.1 to 4.0s.
Perfectly acceptable, but not ideal long-term.

Agreed, ok for now but we should solve this in the long term.

> * I also looked at static linking; in this case the final link went from
41.8s to 92.3s. Possibly more worrisomely, the Max RSS went from 12GB to
26GB (i.e. the size of the machine required to successfully link chrome may
have doubled, though maybe the RSS would have been smaller with more memory
pressure--my machine main memory is 32GB).
>
> * Part of the reason I did that analysis is that IIUC, the currently
proposed solution to this problem is to make compiling -fstandalone-debug
the default, and then change the linker to eliminate duplicate symbol
information. In other words, compilation (and compilation times) will be
the same as what we observe today with -fstandalone-debug, and the linker
will have *more* work to do, though possibly we can keep the RSS down if
we're clever. So for compile/link performance, I'm not sure the currently
proposed solution is ideal. (Though I'm sure it would help binary size,
and likely debugger symbol read performance.)
>
> Just musing: How bad would it be to make unloading a shared library a
very expensive (in the debugger) operation?

Not sure what this would solve. If a shared library changes, we should
reload it as soon as we detect this. On MacOSX we have a UUID in each
binary that changes when any important bits change.

> Specifically, just nuke *all* the symbol information already read in? I
think that would solve the interdependency problem you mentioned in your
last email (with a sledgehammer, but it's a solution :-J) and IIUC it would
allow us to look up types between shared libraries without any worries.

That works for single binary debugging, but we strive to be able to share
the debug info in each module.

I don't think I'm following your terminology. Presuming "single binary"
means "statically linked, no shared libraries", I think my suggestion would
work for multiple binaries (i.e. "dynamically linked, several shared
libraries"); it just (sic?) involves wiping out all in-memory debugging
information whenever any shared library is unloaded. That might not be
worth it if there's a common use case of debugging programs that often
unload and reload shared libraries (and if it's expensive for lldb to load
debugging information in from disk). But I suspect I'm not following what
you're saying on a more basic level.

If we make any modifications it should be to the ClangASTImporter class and
have it be able to recognize when it imported an incomplete class that was
completed only to keep clang happy, and have it "do the right thing" by
finding the real definition and importing the right version. Then
-fno-standalone-debug can work. This would involve marking the
classes/structs in the clang ASTs that were incorrectly completed to keep
clang happy, and being able to track that during imports.

I thought that we didn't want to complete types between libraries. If type
1 is defined in library A but contains a copy of type 2 which is incomplete
in library A, we didn't want to import it from library B (where a complete
definition existed) because that meant that if a new version of library B was
loaded, we'd have to untangle the A->B (1->2) pointers created when we
looked up type 2 in library B. Are you suggesting that we make those
pointers go through some type of intermediate data structure that we can
dynamically update? If so, I certainly see the value of that approach over
the one I'm suggesting. It's also more complex coding (based on my current
limited knowledge of the internals of lldb), but still probably the right
long-term solution to aim for.

-- Randy