[DebugInfo] Enabling constructor homing by default

Hi debug-info folks,

I've recently been experimenting with the -debug-info-kind=constructor
model for debug-info creation, which is leading to some significant
reductions in .debug_info on our large C++ benchmarks, which is great!
I see in PR46537 that there's a plan to eventually enable this by
default -- is this something we can target for LLVM12, or are there
outstanding issues?

While experimenting I was also interested to see that for DWARF and
constructor homing, we emit skeleton type definitions if functions
have an inlined copy in the translation unit. For example in [0] where
I've uploaded a couple of dexter tests for constructor homing, in
partial-type/main.cpp we get:

DW_TAG_class_type
  DW_AT_name ("foo")
  DW_AT_declaration (true)

  DW_TAG_subprogram
    DW_AT_linkage_name ("_ZNK3foo8asStringB5cxx11Ev")
    DW_AT_name ("asString")
    DW_AT_decl_file ("./theclass.h")
    DW_AT_decl_line (12)
    DW_AT_type (0x000014e6 "string")
    DW_AT_declaration (true)
    DW_AT_external (true)
    DW_AT_accessibility (DW_ACCESS_public)

    DW_TAG_formal_parameter
      DW_AT_type (0x0000371b "const foo*")
      DW_AT_artificial (true)

    NULL

And as expected no further type information (aside from the
destructor, also inlined). It seems gdb and lldb are able to find the
full type definition even when there's a skeleton type. When exploring
this with Paul, we worried a bit that LTO could de-duplicate to the
skeleton type definition rather than the full one, is there protection
against that happening somewhere?

[0] https://reviews.llvm.org/D91648

> Hi debug-info folks,
>
> I've recently been experimenting with the -debug-info-kind=constructor
> model for debug-info creation, which is leading to some significant
> reductions in .debug_info on our large C++ benchmarks, which is great!
> I see in PR46537 that there's a plan to eventually enable this by
> default -- is this something we can target for LLVM12, or are there
> outstanding issues?

There's some discussion around the issue found/patch proposed here:
https://reviews.llvm.org/D90719 - I hope we can fix libc++ instead of
adding a workaround in ctor homing itself.

> While experimenting I was also interested to see that for DWARF and
> constructor homing, we emit skeleton type definitions if functions
> have an inlined copy in the translation unit. For example in [0] where
> I've uploaded a couple of dexter tests for constructor homing, in
> partial-type/main.cpp we get:
>
> DW_TAG_class_type
>   DW_AT_name ("foo")
>   DW_AT_declaration (true)
>
>   DW_TAG_subprogram
>     DW_AT_linkage_name ("_ZNK3foo8asStringB5cxx11Ev")
>     DW_AT_name ("asString")
>     DW_AT_decl_file ("./theclass.h")
>     DW_AT_decl_line (12)
>     DW_AT_type (0x000014e6 "string")
>     DW_AT_declaration (true)
>     DW_AT_external (true)
>     DW_AT_accessibility (DW_ACCESS_public)
>
>     DW_TAG_formal_parameter
>       DW_AT_type (0x0000371b "const foo*")
>       DW_AT_artificial (true)
>
>     NULL
>
> And as expected no further type information (aside from the
> destructor, also inlined). It seems gdb and lldb are able to find the
> full type definition even when there's a skeleton type.

Yep, this kind of DWARF is already generated for a number of other
cases - the other two forms of type homing that are implemented in
clang already:

* vtable based type homing (gcc implements this as well):
    struct t1 { virtual void f1(); };
    t1 v1;             // use.cpp
    void t1::f1() { }  // definition.cpp
  (The file with the use will not have a definition of t1; the
  definition of t1 will appear in the file containing the definition
  of f1.)
* explicit template instantiation decl/def (gcc doesn't implement this):
    template<typename T> struct t1 { };
    // Any use of t1<int> that is covered by this decl will have
    // t1<int> as a declaration, not a definition:
    extern template struct t1<int>;
    // This will force the definition of t1<int> to be emitted, even
    // if it's otherwise unreferenced:
    template struct t1<int>;

All sorts of members can appear in these skeletal definitions. Even
non-inline members can appear there - if the ctor isn't defined in
this translation unit, for instance (eg: if you have an inline ctor -
and your implementation file defines the non-inline members, as it
should, then the type may not be defined in the DWARF for that
translation unit).
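To make that case concrete, here is a minimal sketch of the kind of
source layout being described. The class name foo and the member
asString come from the DWARF dump quoted above; the file contents, the
text_ field and the describe() helper are assumptions made up for
illustration, not the actual dexter test. The three "files" are shown
in one block so it compiles on its own:

```cpp
#include <string>
#include <utility>

// --- theclass.h (hypothetical contents) ---
class foo {
public:
  // Inline constructor: it is only emitted in translation units that
  // actually call it.
  explicit foo(std::string s) : text_(std::move(s)) {}

  // Non-inline member, defined in the implementation file.
  std::string asString() const;

private:
  std::string text_;
};

// --- theclass.cpp (hypothetical contents) ---
// No constructor is emitted here, so with constructor homing this
// translation unit may describe foo only as a declaration that carries
// the asString member.
std::string foo::asString() const { return text_; }

// --- main.cpp (hypothetical contents) ---
// The constructor is called (and so emitted) here, which would make
// this the translation unit expected to carry the full definition.
std::string describe() {
  foo f("hello");
  return f.asString();
}
```

Splitting the three parts into real files and inspecting each object
with a DWARF dumper should show which translation unit ends up with the
definition and which ones only have the skeletal form.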

This sort of behavior is also seen with type units - where the type is
declared in the CU (DW_TAG_structure_type with DW_AT_declaration true
and DW_AT_signature) but then any members that need to be referenced
(eg: member function declarations that need to be referenced from
member function definitions outside the type definition/type unit - or
nested types that need to be referenced, etc) are included in this
skeletal type declaration. GCC implements this the same way.

Because GCC implements the vtable homing and the type unit member
situation the same way (not a coincidence, I copied both of these from
GCC when working on reducing Clang's debug info size), it's a pretty
solid foundation to build other homing strategies on top of.

> When exploring
> this with Paul, we worried a bit that LTO could de-duplicate to the
> skeleton type definition rather than the full one, is there protection
> against that happening somewhere?

Yep - this is the same logic that's used for the simpler cases (eg:
one file contains "struct x; x *y;" and another file contains "struct
x { }; x z;" - LTO is already designed to deduplicate those two and
prefer the definition).

For more detail, consider the LLVM IR metadata representation of these
"skeleton types" - actually they're closer to a pure declaration (as
would be produced by the 'y' example above):

Type definitions in LLVM IR debug info metadata include a list of
members, but this list is not exhaustive (even without any interesting
type homing) - for instance implicit special members, member/nested
types, and instantiations of member function templates - all those
kinds of members do not appear in the member list, but instead they
appear separately (held alive by the llvm::Function or similar that
refers to them) and declare that their scope is the type they are a
member of. Effectively they insert themselves into the type.

This means that when two LLVM IR modules are linked together, the
types can be deduplicated based on the ODR (using the type's mangled
name as the key) without trying to merge the member lists - but if one
module has mem<int> defined and another has mem<float> defined, they
naturally merge - the type is deduplicated and so the "scope" of those
members that aren't in the member list naturally ends up referring to
the singular chosen type definition.
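As a source-level illustration of the mem<int> / mem<float> situation,
here is a small sketch. The type name t1, the function names useInt and
useFloat, and the file split are assumptions for the example; only the
idea of a member function template instantiated differently in two
modules comes from the text above:

```cpp
// --- shared.h (hypothetical) ---
struct t1 {
  // Instantiations of this template are not part of the type's member
  // list in the debug info metadata; each one is attached to t1 through
  // its scope.
  template <typename T> T mem(T v) { return v; }
};

// --- a.cpp (hypothetical): instantiates t1::mem<int> ---
int useInt() {
  t1 t;
  return t.mem<int>(1);
}

// --- b.cpp (hypothetical): instantiates t1::mem<float> ---
float useFloat() {
  t1 t;
  return t.mem<float>(2.5f);
}
```

After linking, both instantiations would be expected to name the single
surviving t1 as their scope, without the linker having to reconcile two
different member lists.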

Type homing adds the possibility that these non-member-list
definitions can also be plain member functions, not in that special
list of 3 kinds of entities, and that these non-member-list
definitions can refer to a declaration of a type rather than a
definition. Same merging happens - declaration gets deduplicated with
the definition, and all these non-member-list definitions now refer to
the definition.
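The merging rule can be sketched as a toy model. To be clear, this is
not LLVM's implementation: TypeNode, LooseMember, Module, LinkedModule
and linkModules are all hypothetical names invented here, and the only
things taken from the description above are that types are keyed by
mangled name, that a definition is preferred over a declaration, and
that member lists are not merged.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy stand-in for a composite type node in debug info metadata.
struct TypeNode {
  std::string identifier;            // mangled name, used as the ODR key
  bool isDefinition;                 // false models a plain declaration
  std::vector<std::string> members;  // explicit (possibly partial) member list
};

// A member kept outside the member list; it names its scope by key.
struct LooseMember {
  std::string name;
  std::string scopeIdentifier;
};

struct Module {
  std::vector<TypeNode> types;
  std::vector<LooseMember> looseMembers;
};

struct LinkedModule {
  std::map<std::string, TypeNode> types;  // exactly one node per key
  std::vector<LooseMember> looseMembers;
};

// Link modules: deduplicate types by identifier and prefer a definition
// over a declaration. Member lists are never merged.
LinkedModule linkModules(const std::vector<Module> &modules) {
  LinkedModule out;
  for (const Module &m : modules) {
    for (const TypeNode &t : m.types) {
      auto it = out.types.find(t.identifier);
      if (it == out.types.end()) {
        out.types.emplace(t.identifier, t);
      } else if (!it->second.isDefinition && t.isDefinition) {
        it->second = t;  // the definition replaces the declaration
      }
    }
    // Loose members refer to their scope by key, so after linking they
    // resolve to whichever node was chosen for that key.
    out.looseMembers.insert(out.looseMembers.end(), m.looseMembers.begin(),
                            m.looseMembers.end());
  }
  return out;
}
```

Linking a module that holds only a declaration of foo plus a loose
asString member with a module that holds the definition of foo leaves a
single node for that key, and asString's scope resolves to the
definition. The key string used for foo in such an experiment is just an
illustrative value.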

(this would be important even if there were no member functions, eg in
C code one file might have "struct t1; void f1(t1*) {}" and another
might have "struct t1 { }; void f2(t1) {}" and we would want to ensure
that when those modules are linked together, both the type of the
pointer in f1 and the type of the parameter in f2 refer to the same
type - and that that type is the definition, not the declaration)