[Proposal][Debuginfo] dsymutil-like tool for ELF.

dblaikie · September 2, 2020, 6:44pm

Fair enough - thanks for clarifying the differences! (I’d still lean a bit towards this being dwz-esque, as you say “an extension of classic dwz”

I doubt a little about “llvm-dwz” since it might confuse people who would expect exactly the same behavior.
But if we think of it as “an extension of classic dwz” and the possible confusion is not a big deal then
I would be fine with “llvm-dwz”.

using a bit more domain knowledge (of terminators and C++ odr - though I’m not sure dsymutil does rely on the ODR, does it? It relies on it to know that two names represent the same type, I suppose, but doesn’t assume they’re already identical, instead it merges their members))

if dsymutil is able to find a full definition then it would remove all other definitions(which matched by name) and set all references to that found definition. If it is not able to find a full definition then it would do nothing. i.e. if there are two incomplete definitions(DW_AT_declaration (true)) with the same name then they would not be merged. That is a possible improvement - to teach dsymutil to merge incomplete types.

Huh, what does it do with extra member function definitions found in later definitions? (eg: struct x { template void f(); }; - in one translation unit x::f is instantiated, in another x::f is instantiated - how are the two represented with dsymutil?)

Alexey · September 2, 2020, 10:26pm

    Fair enough - thanks for clarifying the differences! (I'd still
    lean a bit towards this being dwz-esque, as you say "an extension
    of classic dwz"

    I doubt a little about "llvm-dwz" since it might confuse people
    who would expect exactly the same behavior.
    But if we think of it as "an extension of classic dwz" and the
    possible confusion is not a big deal then
    I would be fine with "llvm-dwz".

    using a bit more domain knowledge (of terminators and C++ odr -
    though I'm not sure dsymutil does rely on the ODR, does it? It
    relies on it to know that two names represent the same type, I
    suppose, but doesn't assume they're already identical, instead it
    merges their members))

    if dsymutil is able to find a full definition then it would remove
    all other definitions(which matched by name) and set all
    references to that found definition. If it is not able to find a
    full definition then it would do nothing. i.e. if there are two
    incomplete definitions(DW_AT_declaration (true)) with the same
    name then they would not be merged. That is a possible improvement
    - to teach dsymutil to merge incomplete types.

Huh, what does it do with extra member function definitions found in later definitions? (eg: struct x { template<typename T> void f(); }; - in one translation unit x::f<int> is instantiated, in another x::f<float> is instantiated - how are the two represented with dsymutil?)

They would be considered as two not matched types. dsymutil would not merge them somehow and thus would not use single type description. There would be two separate types called "x" which would have mostly matched members but differ with x::f<int> and x::f<float>. No any de-duplication in that case.

dblaikie · September 2, 2020, 10:36pm

Fair enough - thanks for clarifying the differences! (I’d still lean a bit towards this being dwz-esque, as you say “an extension of classic dwz”

I doubt a little about “llvm-dwz” since it might confuse people who would expect exactly the same behavior.
But if we think of it as “an extension of classic dwz” and the possible confusion is not a big deal then
I would be fine with “llvm-dwz”.

using a bit more domain knowledge (of terminators and C++ odr - though I’m not sure dsymutil does rely on the ODR, does it? It relies on it to know that two names represent the same type, I suppose, but doesn’t assume they’re already identical, instead it merges their members))

if dsymutil is able to find a full definition then it would remove all other definitions(which matched by name) and set all references to that found definition. If it is not able to find a full definition then it would do nothing. i.e. if there are two incomplete definitions(DW_AT_declaration (true)) with the same name then they would not be merged. That is a possible improvement - to teach dsymutil to merge incomplete types.

Huh, what does it do with extra member function definitions found in later definitions? (eg: struct x { template void f(); }; - in one translation unit x::f is instantiated, in another x::f is instantiated - how are the two represented with dsymutil?)

They would be considered as two not matched types. dsymutil would not merge them somehow and thus would not use single type description. There would be two separate types called “x” which would have mostly matched members but differ with x::f and x::f. No any de-duplication in that case.

Oh, that’s unfortunate. It’d be nice for C++ at least, to implement a potentially faster dsymutil mode that could get this right and not have to actually check for type equivalence, instead relying on the name of the type to determine that it must be identical.

The first instance of the type that’s encountered has its fully qualified name or mangled name recorded in a map pointing to the DIE. Any future instance gets downgraded to a declaration, and /certain/ members get dropped, but other members get stuck on the declaration (same sort of DWARF you see with “struct foo { virtual void f1(); template void f2() { } }; void test(foo& f) { f.f2(); }”). Recording all the member functions of the type/static member variable types might be needed in cases where some member functions are defined in one translation unit and some defined in another - though I guess that infrastructure is already in place/that just works today.

Dave

Alexey · September 3, 2020, 11:07am

Not that we do not interested in strip-debug/only-keep-debug kind of debug info distribution model. But our customers also found the model, when optimized debug info is already put into the binary, useful. It is a bit more convenient to pass a single binary to someone other to debug. Another thing is that it is a bit more convenient to manage/keep a single binary with debug info for daily builds to be able to quickly evaluate possible problems. Using a stripped debug info file assumes some process to work with it(how it is stored/how is distributed). Such a process makes sense when binaries shared with customers. But when debug builds are shared inside an organization it might be more convenient to share just a single file. Thus, it would be convenient if tools would support both scenarios.

Alexey · September 3, 2020, 12:15pm

My understanding, is that there is not such infrastructure currently. Current infrastructure allows to reference single existing type declaration(canonical) from other units. It does not allow to reference different parts(in different units) of incomplete type. I think it would be necessary to change the order of how compilation units are processed to implement such types merging. Currently, after the compilation unit is analyzed(scanned for types and dead info) it started to be emitted. It looks like, to support merging, it would be necessary to analyze all CUs first(to create canonical representation) and then start to emit them. I am going to start to work on a prototype of parallel per-compilation unit implementation of DWARFLinker. (basing on the scenario which Jonas described in other letter in that thread). The types merging could be the next step… Following is the result of compilation of your example on darwin(showing that dsymutil does not merge such types): $ cat struct.h #ifndef MY_H #define MY_H struct foo { template int fff () { return sizeof(T); } }; #endif // MY_H $ cat mod1.cpp #include “struct.h” int test1 ( ) { foo var; return var.fff(); } $ cat mod2.cpp #include “struct.h” int test2 ( ) { foo var; return var.fff(); } $ cat main.cpp #include “struct.h” int test1(); int test2(); int main ( void ) { test1(); test2(); return 0; } $ clang++ main.cpp mod1.cpp mod2.cpp -O -g -fno-inline $ llvm-dwarfdump -a a.out.dSYM/Contents/Resources/DWARF/a.out | less 0x00000056: DW_TAG_compile_unit DW_AT_language (DW_LANG_C_plus_plus) DW_AT_name (“mod1.cpp”) 0x000000ae: DW_TAG_structure_type <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< DW_AT_name (“foo”) DW_AT_byte_size (0x01) 0x000000b7: DW_TAG_subprogram DW_AT_linkage_name (“_ZN3foo3fffIiEEiv”) DW_AT_name (“fff”) 0x0000011f: DW_TAG_compile_unit DW_AT_language (DW_LANG_C_plus_plus) DW_AT_name (“mod2.cpp”) 0x00000177: DW_TAG_structure_type <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< DW_AT_name (“foo”) DW_AT_byte_size (0x01) 0x00000180: DW_TAG_subprogram DW_AT_linkage_name (“_ZN3foo3fffIfEEiv”) DW_AT_name (“fff”)

dblaikie · September 3, 2020, 5:56pm

Fair enough - thanks for clarifying the differences! (I’d still lean a bit towards this being dwz-esque, as you say “an extension of classic dwz”

I doubt a little about “llvm-dwz” since it might confuse people who would expect exactly the same behavior.
But if we think of it as “an extension of classic dwz” and the possible confusion is not a big deal then
I would be fine with “llvm-dwz”.

using a bit more domain knowledge (of terminators and C++ odr - though I’m not sure dsymutil does rely on the ODR, does it? It relies on it to know that two names represent the same type, I suppose, but doesn’t assume they’re already identical, instead it merges their members))

if dsymutil is able to find a full definition then it would remove all other definitions(which matched by name) and set all references to that found definition. If it is not able to find a full definition then it would do nothing. i.e. if there are two incomplete definitions(DW_AT_declaration (true)) with the same name then they would not be merged. That is a possible improvement - to teach dsymutil to merge incomplete types.

Huh, what does it do with extra member function definitions found in later definitions? (eg: struct x { template void f(); }; - in one translation unit x::f is instantiated, in another x::f is instantiated - how are the two represented with dsymutil?)

They would be considered as two not matched types. dsymutil would not merge them somehow and thus would not use single type description. There would be two separate types called “x” which would have mostly matched members but differ with x::f and x::f. No any de-duplication in that case.

Oh, that’s unfortunate. It’d be nice for C++ at least, to implement a potentially faster dsymutil mode that could get this right and not have to actually check for type equivalence, instead relying on the name of the type to determine that it must be identical.

Right. That would result in even more size reduction.

The first instance of the type that’s encountered has its fully qualified name or mangled name recorded in a map pointing to the DIE. Any future instance gets downgraded to a declaration, and /certain/ members get dropped, but other members get stuck on the declaration (same sort of DWARF you see with “struct foo { virtual void f1(); template void f2() { } }; void test(foo& f) { f.f2(); }”). Recording all the member functions of the type/static member variable types might be needed in cases where some member functions are defined in one translation unit and some defined in another - though I guess that infrastructure is already in place/that just works today.

My understanding, is that there is not such infrastructure currently. Current infrastructure allows to reference single existing type declaration(canonical) from other units. It does not allow to reference different parts(in different units) of incomplete type.

Huh, so what does the DWARF look like when you define one member function in one file, and another member function (common with inline functions) in another file?

I think it would be necessary to change the order of how compilation units are processed to implement such types merging.

Oh, I wasn’t suggesting merging them - or didn’t mean to suggest that. I meant doing something like what we do in LLVM for type homed (no-standalone) DWARF, where we attach function declarations to type declarations, eg:

struct x {

void f1();

void f2();

template

static void f3();

};

#ifdef HOME

void x::f1() {

}

#endif

#ifdef AWAY

void x::f2() {

}

#endif

#ifdef TEMPL

template

void x::f3() {

}

template void x::f3();

#endif

Building “HOME” would show the DWARF I’d expect to see the first time a type definition is encountered during dsym.
Building “AWAY” raises the question of - what does dsymutil do with this DWARF? Does it deduplicate the type, and make the definition of ‘f2’ point to the ‘f2’ declaration in the original type described in the prior CU defined in “HOME”? If it doesn’t do that, it could/that would be good to reduce the DWARF size.
Building “TEMPL” would show the DWARF I’d expect to see if a future use of that type definition was encountered but the original/home definition had no declaration of this function: we should then emit maybe an “extension” to the type (could be a straight declaration, or maybe some newer/weirder hybrid that points to the definition with some attribute) & then inject the declaration of the template/other new member into this extension definition, etc.

Currently, after the compilation unit is analyzed(scanned for types and dead info) it started to be emitted.
It looks like, to support merging, it would be necessary to analyze all CUs first(to create canonical representation) and then start to emit them.

I am going to start to work on a prototype of parallel per-compilation unit implementation of DWARFLinker.
(basing on the scenario which Jonas described in other letter in that thread).
The types merging could be the next step…

Following is the result of compilation of your example on darwin(showing that dsymutil does not merge such types):

Ah, yeah, that is unfortunate - so if there were other members of “x” they would be duplicated in this case, right?

This is a pretty common issue in C++ - there are 3 reasons I know of where LLVM would produce distinct descriptions:

member function templates, like this
member/nested types
implicit special members (not present unless instantiated - so if you copy construct an object in one file and not in another, two different types)

Alexey · September 4, 2020, 10:42am

Please check the reduced DWARF, generated by current dsymutil for above example : 0x0000000b: DW_TAG_compile_unit DW_AT_language (DW_LANG_C_plus_plus) DW_AT_name (“home.cpp”) DW_AT_stmt_list (0x00000000) DW_AT_low_pc (0x0000000100000f80) DW_AT_high_pc (0x0000000100000f8b) 0x0000002a: DW_TAG_structure_type DW_AT_name (“x”) DW_AT_byte_size (0x01) 0x00000033: DW_TAG_subprogram DW_AT_linkage_name (“_ZN1x2f1Ev”) DW_AT_name (“f1”) DW_AT_type (0x000000000000005e “int”) DW_AT_declaration (true) DW_AT_external (true) DW_AT_APPLE_optimized (true) 0x00000047: NULL 0x00000048: DW_TAG_subprogram DW_AT_linkage_name (“_ZN1x2f2Ev”) DW_AT_name (“f2”) DW_AT_type (0x000000000000005e “int”) DW_AT_declaration (true) DW_AT_external (true) DW_AT_APPLE_optimized (true) 0x0000005c: NULL 0x0000005d: NULL 0x0000006a: DW_TAG_subprogram DW_AT_low_pc (0x0000000100000f80) DW_AT_high_pc (0x0000000100000f8b) DW_AT_specification (0x0000000000000033 “_ZN1x2f1Ev”) 0x000000a0: DW_TAG_compile_unit DW_AT_language (DW_LANG_C_plus_plus) DW_AT_name (“away.cpp”) DW_AT_stmt_list (0x00000048) DW_AT_low_pc (0x0000000100000f90) DW_AT_high_pc (0x0000000100000f9b) 0x000000c6: DW_TAG_subprogram DW_AT_low_pc (0x0000000100000f90) DW_AT_high_pc (0x0000000100000f9b) DW_AT_specification (0x0000000000000048 “_ZN1x2f2Ev”) 0x000000fc: DW_TAG_compile_unit DW_AT_language (DW_LANG_C_plus_plus) DW_AT_name (“templ.cpp”) DW_AT_stmt_list (0x00000090) DW_AT_low_pc (0x0000000100000fa0) DW_AT_high_pc (0x0000000100000fab) 0x0000011b: DW_TAG_structure_type DW_AT_name (“x”) DW_AT_byte_size (0x01) 0x00000124: DW_TAG_subprogram DW_AT_linkage_name (“_ZN1x2f1Ev”) DW_AT_name (“f1”) DW_AT_type (0x0000000000000168 “int”) DW_AT_declaration (true) DW_AT_external (true) DW_AT_APPLE_optimized (true) 0x00000138: NULL 0x00000139: DW_TAG_subprogram DW_AT_linkage_name (“_ZN1x2f2Ev”) DW_AT_name (“f2”) DW_AT_type (0x0000000000000168 “int”) DW_AT_declaration (true) DW_AT_external (true) DW_AT_APPLE_optimized (true) 0x0000014d: NULL 0x0000014e: DW_TAG_subprogram DW_AT_linkage_name (“_ZN1x2f3IiEEiv”) DW_AT_name (“f3”) DW_AT_type (0x0000000000000168 “int”) DW_AT_declaration (true) DW_AT_external (true) DW_AT_APPLE_optimized (true) 0x00000166: NULL 0x00000167: NULL 0x00000174: DW_TAG_subprogram DW_AT_low_pc (0x0000000100000fa0) DW_AT_high_pc (0x0000000100000fab) DW_AT_specification (0x000000000000014e “_ZN1x2f3IiEEiv”) 0x00000190: NULL >Building “HOME” would show the DWARF I’d expect to see the first time a type definition is encountered during dsym. compile unit “home.cpp” contains the type definition(0x0000002a) and reference to its member(DW_AT_specification (0x0000000000000033 “_ZN1x2f1Ev”)). >Building “AWAY” raises the question of - what does dsymutil do with this DWARF? Does it deduplicate the type, and make the definition of ‘f2’ point to the ‘f2’ declaration in the original type described in the prior CU defined in “HOME”? If it doesn’t do that, it could/that would be good to reduce the DWARF size. compile unit “away.cpp” does not contain type definition and contains reference to type definition from compile unit “home.cpp” (DW_AT_specification (0x0000000000000048 “_ZN1x2f2Ev”)). i.e. dsymutil deduplicates the type and makes the definition of ‘f2’ point to the ‘f2’ declaration in the original type described in the prior CU “home.cpp”. >Building “TEMPL” would show the DWARF I’d expect to see if a future use of that type definition was encountered but the original/home definition had no declaration of this function: we should then emit maybe an “extension” to the type (could be a straight declaration, or maybe some newer/weirder hybrid that points to the definition with some attribute) & then inject the declaration of the template/other new member into this extension definition, etc. compile unit “templ.cpp” contains the type definition(0x0000011b) which matches with (0x0000002a) plus defines the new member 0x0000014e. It also references this new member by DW_AT_specification (0x000000000000014e “_ZN1x2f3IiEEiv”). In this case type description is not de-duplicated. Do you suggest that 0x0000011b should be transformed into something like that: 0x000000fc: DW_TAG_compile_unit DW_AT_language (DW_LANG_C_plus_plus) DW_AT_name (“templ.cpp”) DW_AT_stmt_list (0x00000090) DW_AT_low_pc (0x0000000100000fa0) DW_AT_high_pc (0x0000000100000fab) 0x0000011b: DW_TAG_structure_type DW_AT_specification (0x0000002a “x”) 0x00000124: DW_TAG_subprogram DW_AT_linkage_name (“_ZN1x2f3IiEEiv”) DW_AT_name (“f3”) DW_AT_type (0x000000000000005e “int”) DW_AT_declaration (true) DW_AT_external (true) DW_AT_APPLE_optimized (true) 0x00000138: NULL 0x00000139: NULL 0x00000140: DW_TAG_subprogram DW_AT_low_pc (0x0000000100000fa0) DW_AT_high_pc (0x0000000100000fab) DW_AT_specification (0x0000000000000124 “_ZN1x2f3IiEEiv”) 0x00000155: NULL Did I correctly get the idea?

Alexey · September 14, 2020, 11:17am

Debuginfo folks,

What is your opinion on this proposal?

Do we need to work on better DWARFLinker library for now? or Can we start to integrate llvm-dwarfutil as a series of small patches?

If it is OK to start integrating of llvm-dwarfutil, Is it OK to move llvm-objcopy implementation into separate library ObjCopy ?

Thank you, Alexey.

jh7370 · September 14, 2020, 12:02pm

I’d be happy with the body of llvm-objcopy being moved into a library regardless of the outcome of llvm-dwarfutil. The big advantage from that would be that we can write gtest unit tests for the library components rather than having to coerce an input to trigger the exact behaviour we want with lit. One of last year’s GSOC projects, worked on by Alex Brachet-Mialot (CC’ed), was to create a more general-purpose object editing library that objcopy would use. See also https://bugs.llvm.org/show_bug.cgi?id=41044.

James

Alexey · September 15, 2020, 10:24am

I'd be happy with the body of llvm-objcopy being moved into a library regardless of the outcome of llvm-dwarfutil. The big advantage from that would be that we can write gtest unit tests for the library components rather than having to coerce an input to trigger the exact behaviour we want with lit. One of last year's GSOC projects, worked on by Alex Brachet-Mialot (CC'ed), was to create a more general-purpose object editing library that objcopy would use. See also 41044 – Librarify llvm-objcopy.

Ok. I would prepare the patch for it then(so that it could be discussed whether it fits the project).

Alexey.

dblaikie · October 22, 2020, 9:22pm

Fair enough - thanks for clarifying the differences! (I’d still lean a bit towards this being dwz-esque, as you say “an extension of classic dwz”

I doubt a little about “llvm-dwz” since it might confuse people who would expect exactly the same behavior.
But if we think of it as “an extension of classic dwz” and the possible confusion is not a big deal then
I would be fine with “llvm-dwz”.

using a bit more domain knowledge (of terminators and C++ odr - though I’m not sure dsymutil does rely on the ODR, does it? It relies on it to know that two names represent the same type, I suppose, but doesn’t assume they’re already identical, instead it merges their members))

if dsymutil is able to find a full definition then it would remove all other definitions(which matched by name) and set all references to that found definition. If it is not able to find a full definition then it would do nothing. i.e. if there are two incomplete definitions(DW_AT_declaration (true)) with the same name then they would not be merged. That is a possible improvement - to teach dsymutil to merge incomplete types.

Huh, what does it do with extra member function definitions found in later definitions? (eg: struct x { template void f(); }; - in one translation unit x::f is instantiated, in another x::f is instantiated - how are the two represented with dsymutil?)

They would be considered as two not matched types. dsymutil would not merge them somehow and thus would not use single type description. There would be two separate types called “x” which would have mostly matched members but differ with x::f and x::f. No any de-duplication in that case.

Oh, that’s unfortunate. It’d be nice for C++ at least, to implement a potentially faster dsymutil mode that could get this right and not have to actually check for type equivalence, instead relying on the name of the type to determine that it must be identical.

Right. That would result in even more size reduction.

The first instance of the type that’s encountered has its fully qualified name or mangled name recorded in a map pointing to the DIE. Any future instance gets downgraded to a declaration, and /certain/ members get dropped, but other members get stuck on the declaration (same sort of DWARF you see with “struct foo { virtual void f1(); template void f2() { } }; void test(foo& f) { f.f2(); }”). Recording all the member functions of the type/static member variable types might be needed in cases where some member functions are defined in one translation unit and some defined in another - though I guess that infrastructure is already in place/that just works today.

My understanding, is that there is not such infrastructure currently. Current infrastructure allows to reference single existing type declaration(canonical) from other units. It does not allow to reference different parts(in different units) of incomplete type.

Huh, so what does the DWARF look like when you define one member function in one file, and another member function (common with inline functions) in another file?

I think it would be necessary to change the order of how compilation units are processed to implement such types merging.

Oh, I wasn’t suggesting merging them - or didn’t mean to suggest that. I meant doing something like what we do in LLVM for type homed (no-standalone) DWARF, where we attach function declarations to type declarations, eg:

struct x {

void f1();

void f2();

template

static void f3();

};

#ifdef HOME

void x::f1() {

}

#endif

#ifdef AWAY

void x::f2() {

}

#endif

#ifdef TEMPL

template

void x::f3() {

}

template void x::f3();

#endif

Building “HOME” would show the DWARF I’d expect to see the first time a type definition is encountered during dsym.
Building “AWAY” raises the question of - what does dsymutil do with this DWARF? Does it deduplicate the type, and make the definition of ‘f2’ point to the ‘f2’ declaration in the original type described in the prior CU defined in “HOME”? If it doesn’t do that, it could/that would be good to reduce the DWARF size.
Building “TEMPL” would show the DWARF I’d expect to see if a future use of that type definition was encountered but the original/home definition had no declaration of this function: we should then emit maybe an “extension” to the type (could be a straight declaration, or maybe some newer/weirder hybrid that points to the definition with some attribute) & then inject the declaration of the template/other new member into this extension definition, etc.

Please check the reduced DWARF, generated by current dsymutil for above example :

0x0000000b: DW_TAG_compile_unit
DW_AT_language (DW_LANG_C_plus_plus)
DW_AT_name (“home.cpp”)
DW_AT_stmt_list (0x00000000)
DW_AT_low_pc (0x0000000100000f80)
DW_AT_high_pc (0x0000000100000f8b)

0x0000002a: DW_TAG_structure_type
DW_AT_name (“x”)
DW_AT_byte_size (0x01)

0x00000033: DW_TAG_subprogram
DW_AT_linkage_name (“_ZN1x2f1Ev”)
DW_AT_name (“f1”)
DW_AT_type (0x000000000000005e “int”)
DW_AT_declaration (true)
DW_AT_external (true)
DW_AT_APPLE_optimized (true)

0x00000047: NULL

0x00000048: DW_TAG_subprogram
DW_AT_linkage_name (“_ZN1x2f2Ev”)
DW_AT_name (“f2”)
DW_AT_type (0x000000000000005e “int”)
DW_AT_declaration (true)
DW_AT_external (true)
DW_AT_APPLE_optimized (true)

0x0000005c: NULL
0x0000005d: NULL

0x0000006a: DW_TAG_subprogram
DW_AT_low_pc (0x0000000100000f80)
DW_AT_high_pc (0x0000000100000f8b)
DW_AT_specification (0x0000000000000033 “_ZN1x2f1Ev”)

0x000000a0: DW_TAG_compile_unit
DW_AT_language (DW_LANG_C_plus_plus)
DW_AT_name (“away.cpp”)
DW_AT_stmt_list (0x00000048)
DW_AT_low_pc (0x0000000100000f90)
DW_AT_high_pc (0x0000000100000f9b)

0x000000c6: DW_TAG_subprogram
DW_AT_low_pc (0x0000000100000f90)
DW_AT_high_pc (0x0000000100000f9b)
DW_AT_specification (0x0000000000000048 “_ZN1x2f2Ev”)

0x000000fc: DW_TAG_compile_unit
DW_AT_language (DW_LANG_C_plus_plus)
DW_AT_name (“templ.cpp”)
DW_AT_stmt_list (0x00000090)
DW_AT_low_pc (0x0000000100000fa0)
DW_AT_high_pc (0x0000000100000fab)

0x0000011b: DW_TAG_structure_type
DW_AT_name (“x”)
DW_AT_byte_size (0x01)

0x00000124: DW_TAG_subprogram
DW_AT_linkage_name (“_ZN1x2f1Ev”)
DW_AT_name (“f1”)
DW_AT_type (0x0000000000000168 “int”)
DW_AT_declaration (true)
DW_AT_external (true)
DW_AT_APPLE_optimized (true)
0x00000138: NULL

0x00000139: DW_TAG_subprogram
DW_AT_linkage_name (“_ZN1x2f2Ev”)
DW_AT_name (“f2”)
DW_AT_type (0x0000000000000168 “int”)
DW_AT_declaration (true)
DW_AT_external (true)
DW_AT_APPLE_optimized (true)
0x0000014d: NULL

0x0000014e: DW_TAG_subprogram
DW_AT_linkage_name (“_ZN1x2f3IiEEiv”)
DW_AT_name (“f3”)
DW_AT_type (0x0000000000000168 “int”)
DW_AT_declaration (true)
DW_AT_external (true)
DW_AT_APPLE_optimized (true)
0x00000166: NULL
0x00000167: NULL

0x00000174: DW_TAG_subprogram
DW_AT_low_pc (0x0000000100000fa0)
DW_AT_high_pc (0x0000000100000fab)
DW_AT_specification (0x000000000000014e “_ZN1x2f3IiEEiv”)
0x00000190: NULL

Building “HOME” would show the DWARF I’d expect to see the first time a type definition is encountered during dsym.

compile unit “home.cpp” contains the type definition(0x0000002a) and reference to its member(DW_AT_specification (0x0000000000000033 “_ZN1x2f1Ev”)).

Building “AWAY” raises the question of - what does dsymutil do with this DWARF? Does it deduplicate the type, and make the definition of ‘f2’ point to the ‘f2’ declaration in the original type described in the prior CU defined in “HOME”? If it doesn’t do that, it could/that would be good to reduce the DWARF size.

compile unit “away.cpp” does not contain type definition and contains reference to type definition from compile unit “home.cpp” (DW_AT_specification (0x0000000000000048 “_ZN1x2f2Ev”)).
i.e. dsymutil deduplicates the type and makes the definition of ‘f2’ point to the ‘f2’ declaration in the original type described in the prior CU “home.cpp”.

Building “TEMPL” would show the DWARF I’d expect to see if a future use of that type definition was encountered but the original/home definition had no declaration of this function: we should then emit maybe an “extension” to the type (could be a straight declaration, or maybe some newer/weirder hybrid that points to the definition with some attribute) & then inject the declaration of the template/other new member into this extension definition, etc.

compile unit “templ.cpp” contains the type definition(0x0000011b) which matches with (0x0000002a) plus defines the new member 0x0000014e.
It also references this new member by DW_AT_specification (0x000000000000014e “_ZN1x2f3IiEEiv”). In this case type description is not de-duplicated.

Ah, yeah - that seems like a missed opportunity - duplicating the whole type DIE. LTO does this by making monolithic types - merging all the members from different definitions of the same type into one, but that’s maybe too expensive for dsymutil (might still be interesting to know how much more expensive, etc). But I think the other way to go would be to produce a declaration of the type, with the relevant members - and let the DWARF consumer identify this declaration as matching up with the earlier definition. That’s the sort of DWARF you get from the non-MachO default -fno-standalone-debug anyway, so it’s already pretty well tested/supported (support in lldb’s a bit younger/more work-in-progress, admittedly). I wonder how much dsym size there is that could be reduced by such an implementation.

Do you suggest that 0x0000011b should be transformed into something like that:

0x000000fc: DW_TAG_compile_unit
DW_AT_language (DW_LANG_C_plus_plus)
DW_AT_name (“templ.cpp”)
DW_AT_stmt_list (0x00000090)
DW_AT_low_pc (0x0000000100000fa0)
DW_AT_high_pc (0x0000000100000fab)

0x0000011b: DW_TAG_structure_type
DW_AT_specification (0x0000002a “x”)

0x00000124: DW_TAG_subprogram
DW_AT_linkage_name (“_ZN1x2f3IiEEiv”)
DW_AT_name (“f3”)
DW_AT_type (0x000000000000005e “int”)
DW_AT_declaration (true)
DW_AT_external (true)
DW_AT_APPLE_optimized (true)
0x00000138: NULL
0x00000139: NULL

0x00000140: DW_TAG_subprogram
DW_AT_low_pc (0x0000000100000fa0)
DW_AT_high_pc (0x0000000100000fab)
DW_AT_specification (0x0000000000000124 “_ZN1x2f3IiEEiv”)
0x00000155: NULL

Did I correctly get the idea?

Yep, more or less. It’d be “safer” if 11b didn’t use DW_AT_specification to refer to 2a, but instead was only a completely independent declaration of “x” - that path is already well supported/tested (well, it’s the work-in-progress stuff for lldb to support -fno-standalone-debug, but gdb’s been consuming DWARF like this for years, Clang and GCC both produce DWARF like this (if the type is “homed” in another file, then Clang/GCC produce DWARF that emits a declaration with just the members needed to define any member functions defined/inlined/referenced in this CU)) for years.

But using DW_AT_specification, or maybe some other extension attribute might make the consumers task a bit easier (could do both - use an extension attribute to tie them up, leave DW_AT_declaration/DW_AT_name here for consumers that don’t understand the extension attribute) in finding that they’re all the same type/pieces of teh same type.

Alexey · October 23, 2020, 3:27pm

                Fair enough - thanks for clarifying the
                differences! (I'd still lean a bit towards this
                being dwz-esque, as you say "an extension of
                classic dwz"

                I doubt a little about "llvm-dwz" since it might
                confuse people who would expect exactly the same
                behavior.
                But if we think of it as "an extension of classic
                dwz" and the possible confusion is not a big deal then
                I would be fine with "llvm-dwz".

                using a bit more domain knowledge (of terminators
                and C++ odr - though I'm not sure dsymutil does
                rely on the ODR, does it? It relies on it to know
                that two names represent the same type, I suppose,
                but doesn't assume they're already identical,
                instead it merges their members))

                if dsymutil is able to find a full definition then
                it would remove all other definitions(which matched
                by name) and set all references to that found
                definition. If it is not able to find a full
                definition then it would do nothing. i.e. if there
                are two incomplete definitions(DW_AT_declaration
                (true)) with the same name then they would not be
                merged. That is a possible improvement - to teach
                dsymutil to merge incomplete types.

            Huh, what does it do with extra member function
            definitions found in later definitions? (eg: struct x {
            template<typename T> void f(); }; - in one translation
            unit x::f<int> is instantiated, in another x::f<float>
            is instantiated - how are the two represented with
            dsymutil?)

            They would be considered as two not matched types.
            dsymutil would not merge them somehow and thus would not
            use single type description. There would be two separate
            types called "x" which would have mostly matched members
            but differ with x::f<int> and x::f<float>. No any
            de-duplication in that case.

        Oh, that's unfortunate. It'd be nice for C++ at least, to
        implement a potentially faster dsymutil mode that could get
        this right and not have to actually check for type
        equivalence, instead relying on the name of the type to
        determine that it must be identical.

        Right. That would result in even more size reduction.

        The first instance of the type that's encountered has its
        fully qualified name or mangled name recorded in a map
        pointing to the DIE. Any future instance gets downgraded to
        a declaration, and /certain/ members get dropped, but other
        members get stuck on the declaration (same sort of DWARF you
        see with "struct foo { virtual void f1(); template<typename
        > void f2() { } }; void test(foo& f) { f.f2<int>(); }").
        Recording all the member functions of the type/static member
        variable types might be needed in cases where some member
        functions are defined in one translation unit and some
        defined in another - though I guess that infrastructure is
        already in place/that just works today.

        My understanding, is that there is not such infrastructure
        currently. Current infrastructure allows to reference single
        existing type declaration(canonical) from other units. It
        does not allow to reference different parts(in different
        units) of incomplete type.

    Huh, so what does the DWARF look like when you define one member
    function in one file, and another member function (common with
    inline functions) in another file?

        I think it would be necessary to change the order of how
        compilation units are processed to implement such types merging.

    Oh, I wasn't suggesting merging them - or didn't mean to suggest
    that. I meant doing something like what we do in LLVM for type
    homed (no-standalone) DWARF, where we attach function
    declarations to type declarations, eg:

    struct x {

    void f1();

    void f2();

    template<typename T>

    static void f3();

    };

    #ifdef HOME

    void x::f1() {

    }

    #endif

    #ifdef AWAY

    void x::f2() {

    }

    #endif

    #ifdef TEMPL

    template<typename T>

    void x::f3() {

    }

    template void x::f3<int>();

    #endif

    Building "HOME" would show the DWARF I'd expect to see the first
    time a type definition is encountered during dsym.
    Building "AWAY" raises the question of - what does dsymutil do
    with this DWARF? Does it deduplicate the type, and make the
    definition of 'f2' point to the 'f2' declaration in the original
    type described in the prior CU defined in "HOME"? If it doesn't
    do that, it could/that would be good to reduce the DWARF size.
    Building "TEMPL" would show the DWARF I'd expect to see if a
    future use of that type definition was encountered but the
    original/home definition had no declaration of this function: we
    should then emit maybe an "extension" to the type (could be a
    straight declaration, or maybe some newer/weirder hybrid that
    points to the definition with some attribute) & then inject the
    declaration of the template/other new member into this extension
    definition, etc.

    Please check the reduced DWARF, generated by current dsymutil for
    above example :

    0x0000000b: DW_TAG_compile_unit
     DW_AT_language (DW_LANG_C_plus_plus)
     DW_AT_name ("home.cpp")
     DW_AT_stmt_list (0x00000000)
     DW_AT_low_pc (0x0000000100000f80)
     DW_AT_high_pc (0x0000000100000f8b)

    0x0000002a: DW_TAG_structure_type
     DW_AT_name ("x")
     DW_AT_byte_size (0x01)

    0x00000033: DW_TAG_subprogram
     DW_AT_linkage_name ("_ZN1x2f1Ev")
     DW_AT_name ("f1")
     DW_AT_type (0x000000000000005e "int")
     DW_AT_declaration (true)
     DW_AT_external (true)
     DW_AT_APPLE_optimized (true)

    0x00000047: NULL

    0x00000048: DW_TAG_subprogram
     DW_AT_linkage_name ("_ZN1x2f2Ev")
     DW_AT_name ("f2")
     DW_AT_type (0x000000000000005e "int")
     DW_AT_declaration (true)
     DW_AT_external (true)
     DW_AT_APPLE_optimized (true)

    0x0000005c: NULL
    0x0000005d: NULL

    0x0000006a: DW_TAG_subprogram
     DW_AT_low_pc (0x0000000100000f80)
     DW_AT_high_pc (0x0000000100000f8b)
     DW_AT_specification (0x0000000000000033 "_ZN1x2f1Ev")

    0x000000a0: DW_TAG_compile_unit
     DW_AT_language (DW_LANG_C_plus_plus)
     DW_AT_name ("away.cpp")
     DW_AT_stmt_list (0x00000048)
     DW_AT_low_pc (0x0000000100000f90)
     DW_AT_high_pc (0x0000000100000f9b)

    0x000000c6: DW_TAG_subprogram
     DW_AT_low_pc (0x0000000100000f90)
     DW_AT_high_pc (0x0000000100000f9b)
     DW_AT_specification (0x0000000000000048 "_ZN1x2f2Ev")

    0x000000fc: DW_TAG_compile_unit
     DW_AT_language (DW_LANG_C_plus_plus)
     DW_AT_name ("templ.cpp")
     DW_AT_stmt_list (0x00000090)
     DW_AT_low_pc (0x0000000100000fa0)
     DW_AT_high_pc (0x0000000100000fab)

    0x0000011b: DW_TAG_structure_type
     DW_AT_name ("x")
     DW_AT_byte_size (0x01)

    0x00000124: DW_TAG_subprogram
     DW_AT_linkage_name ("_ZN1x2f1Ev")
     DW_AT_name ("f1")
     DW_AT_type (0x0000000000000168 "int")
     DW_AT_declaration (true)
     DW_AT_external (true)
     DW_AT_APPLE_optimized (true)
    0x00000138: NULL

    0x00000139: DW_TAG_subprogram
     DW_AT_linkage_name ("_ZN1x2f2Ev")
     DW_AT_name ("f2")
     DW_AT_type (0x0000000000000168 "int")
     DW_AT_declaration (true)
     DW_AT_external (true)
     DW_AT_APPLE_optimized (true)
    0x0000014d: NULL

    0x0000014e: DW_TAG_subprogram
     DW_AT_linkage_name ("_ZN1x2f3IiEEiv")
     DW_AT_name ("f3<int>")
     DW_AT_type (0x0000000000000168 "int")
     DW_AT_declaration (true)
     DW_AT_external (true)
     DW_AT_APPLE_optimized (true)
    0x00000166: NULL
    0x00000167: NULL

    0x00000174: DW_TAG_subprogram
     DW_AT_low_pc (0x0000000100000fa0)
     DW_AT_high_pc (0x0000000100000fab)
     DW_AT_specification (0x000000000000014e
    "_ZN1x2f3IiEEiv")
    0x00000190: NULL

    >Building "HOME" would show the DWARF I'd expect to see the first
    time a type definition is encountered during dsym.

    compile unit "home.cpp" contains the type definition(0x0000002a)
    and reference to its member(DW_AT_specification
    (0x0000000000000033 "_ZN1x2f1Ev")).

    >Building "AWAY" raises the question of - what does dsymutil do
    with this DWARF? Does it deduplicate the type, and make the
    definition of 'f2' point to the 'f2' declaration in the original
    type described in the prior CU defined in "HOME"? If it doesn't do
    that, it could/that would be good to reduce the DWARF size.

    compile unit "away.cpp" does not contain type definition and
    contains reference to type definition from compile unit "home.cpp"
    (DW_AT_specification (0x0000000000000048 "_ZN1x2f2Ev")).
    i.e. dsymutil deduplicates the type and makes the definition of
    'f2' point to the 'f2' declaration in the original type described
    in the prior CU "home.cpp".

    >Building "TEMPL" would show the DWARF I'd expect to see if a
    future use of that type definition was encountered but the
    original/home definition had no declaration of this function: we
    should then emit maybe an "extension" to the type (could be a
    straight declaration, or maybe some newer/weirder hybrid that
    points to the definition with some attribute) & then inject the
    declaration of the template/other new member into this extension
    definition, etc.

    compile unit "templ.cpp" contains the type definition(0x0000011b)
    which matches with (0x0000002a) plus defines the new member
    0x0000014e.
    It also references this new member by DW_AT_specification
    (0x000000000000014e "_ZN1x2f3IiEEiv"). In this case type
    description is not de-duplicated.

Ah, yeah - that seems like a missed opportunity - duplicating the whole type DIE. LTO does this by making monolithic types - merging all the members from different definitions of the same type into one, but that's maybe too expensive for dsymutil (might still be interesting to know how much more expensive, etc). But I think the other way to go would be to produce a declaration of the type, with the relevant members - and let the DWARF consumer identify this declaration as matching up with the earlier definition. That's the sort of DWARF you get from the non-MachO default -fno-standalone-debug anyway, so it's already pretty well tested/supported (support in lldb's a bit younger/more work-in-progress, admittedly). I wonder how much dsym size there is that could be reduced by such an implementation.

I see. Yes, that could be done and I think it would result in noticeable size reduction(I do not know exact numbers at the moment).

I work on multi-thread DWARFLinker now and it`s first version will do exactly the same type processing like current dsymutil.

Above scheme could be implemented as a next step and it would result in better size reduction(better than current state).

But I think the better scheme could be done also and it would result in even bigger size reduction and in faster execution. This scheme is something similar to what you`ve described above: "LTO does - making monolithic types - merging all the members from different definitions of the same type into one".

DWARFLinker could create additional artificial compile unit and put all merged types there. Later patch all type references to point into this additional compilation unit. No any bits would be duplicated in that case. The performance improvement could be achieved due to less amount of the copied DWARF and due to the fact that type references could be updated when DWARF is cloned(no need in additional pass for that).

Anyway, that might be the next step after multi-thread DWARFLinker would be ready.

    Do you suggest that 0x0000011b should be transformed into
    something like that:

    0x000000fc: DW_TAG_compile_unit
     DW_AT_language (DW_LANG_C_plus_plus)
     DW_AT_name ("templ.cpp")
     DW_AT_stmt_list (0x00000090)
     DW_AT_low_pc (0x0000000100000fa0)
     DW_AT_high_pc (0x0000000100000fab)

    0x0000011b: DW_TAG_structure_type
     DW_AT_specification (0x0000002a "x")

    0x00000124: DW_TAG_subprogram
     DW_AT_linkage_name ("_ZN1x2f3IiEEiv")
     DW_AT_name ("f3<int>")
     DW_AT_type (0x000000000000005e "int")
     DW_AT_declaration (true)
     DW_AT_external (true)
     DW_AT_APPLE_optimized (true)
    0x00000138: NULL
    0x00000139: NULL

    0x00000140: DW_TAG_subprogram
     DW_AT_low_pc (0x0000000100000fa0)
     DW_AT_high_pc (0x0000000100000fab)
     DW_AT_specification (0x0000000000000124
    "_ZN1x2f3IiEEiv")
    0x00000155: NULL

    Did I correctly get the idea?

Yep, more or less. It'd be "safer" if 11b didn't use DW_AT_specification to refer to 2a, but instead was only a completely independent declaration of "x" - that path is already well supported/tested (well, it's the work-in-progress stuff for lldb to support -fno-standalone-debug, but gdb's been consuming DWARF like this for years, Clang and GCC both produce DWARF like this (if the type is "homed" in another file, then Clang/GCC produce DWARF that emits a declaration with just the members needed to define any member functions defined/inlined/referenced in this CU)) for years.

But using DW_AT_specification, or maybe some other extension attribute might make the consumers task a bit easier (could do both - use an extension attribute to tie them up, leave DW_AT_declaration/DW_AT_name here for consumers that don't understand the extension attribute) in finding that they're all the same type/pieces of teh same type.

yes. would try this solution.

Thank you, Alexey.

dblaikie · October 23, 2020, 4:43pm

Ah, yeah - that seems like a missed opportunity - duplicating the whole type DIE. LTO does this by making monolithic types - merging all the members from different definitions of the same type into one, but that’s maybe too expensive for dsymutil (might still be interesting to know how much more expensive, etc). But I think the other way to go would be to produce a declaration of the type, with the relevant members - and let the DWARF consumer identify this declaration as matching up with the earlier definition. That’s the sort of DWARF you get from the non-MachO default -fno-standalone-debug anyway, so it’s already pretty well tested/supported (support in lldb’s a bit younger/more work-in-progress, admittedly). I wonder how much dsym size there is that could be reduced by such an implementation.

I see. Yes, that could be done and I think it would result in noticeable size reduction(I do not know exact numbers at the moment).

I work on multi-thread DWARFLinker now and it`s first version will do exactly the same type processing like current dsymutil.

Yeah, best to keep the behavior the same through that

Above scheme could be implemented as a next step and it would result in better size reduction(better than current state).

But I think the better scheme could be done also and it would result in even bigger size reduction and in faster execution. This scheme is something similar to what you`ve described above: “LTO does - making monolithic types - merging all the members from different definitions of the same type into one”.

I believe the reason that’s probably not been done is that it can’t be streamed - it’d lead to buffering more of the output (if two of these expandable types were in one CU - the start of the second type couldn’t be known until the end because it might keep getting pushed later due to expansion of the first type) and/or having to revisit all the type references (the offset to the second type wouldn’t be known until the end - so writing the offsets to refer to the type would have to be deferred until then).

DWARFLinker could create additional artificial compile unit and put all merged types there. Later patch all type references to point into this additional compilation unit. No any bits would be duplicated in that case. The performance improvement could be achieved due to less amount of the copied DWARF and due to the fact that type references could be updated when DWARF is cloned(no need in additional pass for that).

“later patch all type references to point into this additional compilation unit” - that’s the additional pass that people are probably talking/concerned about. Rewalking all the DWARF. The current dsymutil approach, as far as I know, is single pass - it knows the final, absolute offset to the type from the moment it emits that type/needs to refer to it.

Anyway, that might be the next step after multi-thread DWARFLinker would be ready.

Yep, be interesting to see how it all goes!

Alexey · October 25, 2020, 4:31pm

    Ah, yeah - that seems like a missed opportunity - duplicating the
    whole type DIE. LTO does this by making monolithic types -
    merging all the members from different definitions of the same
    type into one, but that's maybe too expensive for dsymutil (might
    still be interesting to know how much more expensive, etc). But I
    think the other way to go would be to produce a declaration of
    the type, with the relevant members - and let the DWARF consumer
    identify this declaration as matching up with the earlier
    definition. That's the sort of DWARF you get from the non-MachO
    default -fno-standalone-debug anyway, so it's already pretty well
    tested/supported (support in lldb's a bit younger/more
    work-in-progress, admittedly). I wonder how much dsym size there
    is that could be reduced by such an implementation.

    I see. Yes, that could be done and I think it would result in
    noticeable size reduction(I do not know exact numbers at the moment).

    I work on multi-thread DWARFLinker now and it`s first version will
    do exactly the same type processing like current dsymutil.

Yeah, best to keep the behavior the same through that

    Above scheme could be implemented as a next step and it would
    result in better size reduction(better than current state).

    But I think the better scheme could be done also and it would
    result in even bigger size reduction and in faster execution. This
    scheme is something similar to what you`ve described above: "LTO
    does - making monolithic types - merging all the members from
    different definitions of the same type into one".

I believe the reason that's probably not been done is that it can't be streamed - it'd lead to buffering more of the output

yes. The fact that DWARF should be streamed into AsmPrinter complicates parallel dwarf generation. In my prototype, I generate
several resulting files(each for one source compilation unit) and then sequentially glue them into the final resulting file.

(if two of these expandable types were in one CU - the start of the second type couldn't be known until the end because it might keep getting pushed later due to expansion of the first type) and/or having to revisit all the type references (the offset to the second type wouldn't be known until the end - so writing the offsets to refer to the type would have to be deferred until then).

That is the second problem: offsets are not known until the end of file.
dsymutil already has that situation for inter-CU references, so it has extra pass to
fixup offsets. With multi-thread implementation such situation would arise more often
for type references and so more offsets should be fixed during additional pass.

    DWARFLinker could create additional artificial compile unit and
    put all merged types there. Later patch all type references to
    point into this additional compilation unit. No any bits would be
    duplicated in that case. The performance improvement could be
    achieved due to less amount of the copied DWARF and due to the
    fact that type references could be updated when DWARF is cloned(no
    need in additional pass for that).

"later patch all type references to point into this additional compilation unit" - that's the additional pass that people are probably talking/concerned about. Rewalking all the DWARF. The current dsymutil approach, as far as I know, is single pass - it knows the final, absolute offset to the type from the moment it emits that type/needs to refer to it.

Right. Current dsymutil approach is single pass. And from that point of view, solution
which you`ve described(to produce a declaration of the type, with the relevant members)
allows to keep that single pass implementation.

But there is a restriction for current dsymutil approach: To process inter-CU references
it needs to load all DWARF into the memory(While it analyzes which part of DWARF is live,
it needs to have all CUs loaded into the memory). That leads to huge memory usage.
It is less important when source is a set of object files(like in dsymutil case) and this
become a real problem for llvm-dwarfutil utility when source is a single file(With current
implementation it needs 30G of memory for compiling clang binary).

Without loading all CU into the memory it would require two passes solution. First to analyze
which part of DWARF relates to live code and then second pass to generate the result.
If we would have a two passes solution then we could create a compilation unit with all
types at first pass and at the second pass we could generate result with correct offsets(no
need to fix up them as it is currently required by dsymutil for forward inter-CU references).
The open question currently: how expensive this two passes approach is.

Thank you, Alexey.

dblaikie · October 26, 2020, 7:38pm

Ah, yeah - that seems like a missed opportunity - duplicating the whole type DIE. LTO does this by making monolithic types - merging all the members from different definitions of the same type into one, but that’s maybe too expensive for dsymutil (might still be interesting to know how much more expensive, etc). But I think the other way to go would be to produce a declaration of the type, with the relevant members - and let the DWARF consumer identify this declaration as matching up with the earlier definition. That’s the sort of DWARF you get from the non-MachO default -fno-standalone-debug anyway, so it’s already pretty well tested/supported (support in lldb’s a bit younger/more work-in-progress, admittedly). I wonder how much dsym size there is that could be reduced by such an implementation.

I see. Yes, that could be done and I think it would result in noticeable size reduction(I do not know exact numbers at the moment).

I work on multi-thread DWARFLinker now and it`s first version will do exactly the same type processing like current dsymutil.

Yeah, best to keep the behavior the same through that

Above scheme could be implemented as a next step and it would result in better size reduction(better than current state).

But I think the better scheme could be done also and it would result in even bigger size reduction and in faster execution. This scheme is something similar to what you`ve described above: “LTO does - making monolithic types - merging all the members from different definitions of the same type into one”.

I believe the reason that’s probably not been done is that it can’t be streamed - it’d lead to buffering more of the output

yes. The fact that DWARF should be streamed into AsmPrinter complicates parallel dwarf generation. In my prototype, I generate
several resulting files(each for one source compilation unit) and then sequentially glue them into the final resulting file.

How does that help? Do you use relocations in those intermediate object files so the DWARF in them can refer across files?

(if two of these expandable types were in one CU - the start of the second type couldn’t be known until the end because it might keep getting pushed later due to expansion of the first type) and/or having to revisit all the type references (the offset to the second type wouldn’t be known until the end - so writing the offsets to refer to the type would have to be deferred until then).

That is the second problem: offsets are not known until the end of file.
dsymutil already has that situation for inter-CU references, so it has extra pass to
fixup offsets.

Oh, it does? I figured it was one-pass, and that it only ever refers back to types in previous CUs? So it doesn’t have to go back and do a second pass. But I guess if sees a declaration of T1 in CU1, then later on sees a definition of T1 in CU2, does it somehow go back to CU1 and remove the declaration/make references refer to the definition in CU2? I figured it’d just leave the declaration and references to it as-is, then add the definition and use that from CU2 onwards?

With multi-thread implementation such situation would arise more often
for type references and so more offsets should be fixed during additional pass.

DWARFLinker could create additional artificial compile unit and put all merged types there. Later patch all type references to point into this additional compilation unit. No any bits would be duplicated in that case. The performance improvement could be achieved due to less amount of the copied DWARF and due to the fact that type references could be updated when DWARF is cloned(no need in additional pass for that).

“later patch all type references to point into this additional compilation unit” - that’s the additional pass that people are probably talking/concerned about. Rewalking all the DWARF. The current dsymutil approach, as far as I know, is single pass - it knows the final, absolute offset to the type from the moment it emits that type/needs to refer to it.

Right. Current dsymutil approach is single pass. And from that point of view, solution
which you`ve described(to produce a declaration of the type, with the relevant members)
allows to keep that single pass implementation.

But there is a restriction for current dsymutil approach: To process inter-CU references
it needs to load all DWARF into the memory(While it analyzes which part of DWARF is live,
it needs to have all CUs loaded into the memory).

All DWARF for a single file (which for dsymutil is mostly a single CU, except with LTO I guess?), not all DWARF for all inputs in memory at once, yeah?

That leads to huge memory usage.
It is less important when source is a set of object files(like in dsymutil case) and this
become a real problem for llvm-dwarfutil utility when source is a single file(With current
implementation it needs 30G of memory for compiling clang binary).

Yeah, that’s where I think you’d need a fixup pass one way or another - because cross-CU references can mean that when you figure out a new layout for CU5 (because it has a duplicate type definition of something in CU1) then you might have to touch CU4 that had an absolute/cross-CU forward reference to CU5. Once you’ve got such a fixup pass (if dsymutil already has one? Which, like I said, I’m confused why it would have one/that doesn’t match my very vague understanding) then I think you could make dsymutil work on a per-CU basis streaming things out, then fixing up a few offsets.

Without loading all CU into the memory it would require two passes solution. First to analyze
which part of DWARF relates to live code and then second pass to generate the result.

Not sure it’d require any more second pass than a “fixup” pass, which it sounds like you’re saying it already has?

Alexey · October 27, 2020, 8:23am

        Ah, yeah - that seems like a missed opportunity -
        duplicating the whole type DIE. LTO does this by making
        monolithic types - merging all the members from different
        definitions of the same type into one, but that's maybe too
        expensive for dsymutil (might still be interesting to know
        how much more expensive, etc). But I think the other way to
        go would be to produce a declaration of the type, with the
        relevant members - and let the DWARF consumer identify this
        declaration as matching up with the earlier definition.
        That's the sort of DWARF you get from the non-MachO default
        -fno-standalone-debug anyway, so it's already pretty well
        tested/supported (support in lldb's a bit younger/more
        work-in-progress, admittedly). I wonder how much dsym size
        there is that could be reduced by such an implementation.

        I see. Yes, that could be done and I think it would result in
        noticeable size reduction(I do not know exact numbers at the
        moment).

        I work on multi-thread DWARFLinker now and it`s first version
        will do exactly the same type processing like current dsymutil.

    Yeah, best to keep the behavior the same through that

        Above scheme could be implemented as a next step and it would
        result in better size reduction(better than current state).

        But I think the better scheme could be done also and it would
        result in even bigger size reduction and in faster execution.
        This scheme is something similar to what you`ve described
        above: "LTO does - making monolithic types - merging all the
        members from different definitions of the same type into one".

    I believe the reason that's probably not been done is that it
    can't be streamed - it'd lead to buffering more of the output

    yes. The fact that DWARF should be streamed into AsmPrinter
    complicates parallel dwarf generation. In my prototype, I generate
    several resulting files(each for one source compilation unit) and
    then sequentially glue them into the final resulting file.

How does that help? Do you use relocations in those intermediate object files so the DWARF in them can refer across files?

It does not help with referring across the file. It helps to parallel the generation of CU bodies.
It is not possible to write two CUs in parallel into AsmPrinter. To make possible parallel generation I stream them into different AsmPrinters(this comment is for "I believe the reason that's probably not been done is that it can't be streamed". which initially was about referring across the file, but it seems I added another direction).

    (if two of these expandable types were in one CU - the start of
    the second type couldn't be known until the end because it might
    keep getting pushed later due to expansion of the first type)
    and/or having to revisit all the type references (the offset to
    the second type wouldn't be known until the end - so writing the
    offsets to refer to the type would have to be deferred until then).

    That is the second problem: offsets are not known until the end of
    file.
    dsymutil already has that situation for inter-CU references, so it
    has extra pass to
    fixup offsets.

Oh, it does? I figured it was one-pass, and that it only ever refers back to types in previous CUs? So it doesn't have to go back and do a second pass. But I guess if sees a declaration of T1 in CU1, then later on sees a definition of T1 in CU2, does it somehow go back to CU1 and remove the declaration/make references refer to the definition in CU2? I figured it'd just leave the declaration and references to it as-is, then add the definition and use that from CU2 onwards?

For the processing of the types, it do not go back.
This "I figured it was one-pass, and that it only ever refers back to types in previous CUs"
and this "I figured it'd just leave the declaration and references to it as-is, then add the definition and use that from CU2 onwards" are correct.

    With multi-thread implementation such situation would arise more
    often
    for type references and so more offsets should be fixed during
    additional pass.

        DWARFLinker could create additional artificial compile unit
        and put all merged types there. Later patch all type
        references to point into this additional compilation unit.
        No any bits would be duplicated in that case. The performance
        improvement could be achieved due to less amount of the
        copied DWARF and due to the fact that type references could
        be updated when DWARF is cloned(no need in additional pass
        for that).

    "later patch all type references to point into this additional
    compilation unit" - that's the additional pass that people are
    probably talking/concerned about. Rewalking all the DWARF. The
    current dsymutil approach, as far as I know, is single pass - it
    knows the final, absolute offset to the type from the moment it
    emits that type/needs to refer to it.

    Right. Current dsymutil approach is single pass. And from that
    point of view, solution
    which you`ve described(to produce a declaration of the type, with
    the relevant members)
    allows to keep that single pass implementation.

    But there is a restriction for current dsymutil approach: To
    process inter-CU references
    it needs to load all DWARF into the memory(While it analyzes which
    part of DWARF is live,
    it needs to have all CUs loaded into the memory).

All DWARF for a single file (which for dsymutil is mostly a single CU, except with LTO I guess?), not all DWARF for all inputs in memory at once, yeah?

right. In dsymutil case - all DWARF for a single file(not all DWARF for all inputs in memory at once).
But in llvm-dwarfutil case single file contains DWARF for all original input object files and it all becomes
loaded into memory.

    That leads to huge memory usage.
    It is less important when source is a set of object files(like in
    dsymutil case) and this
    become a real problem for llvm-dwarfutil utility when source is a
    single file(With current
    implementation it needs 30G of memory for compiling clang binary).

Yeah, that's where I think you'd need a fixup pass one way or another - because cross-CU references can mean that when you figure out a new layout for CU5 (because it has a duplicate type definition of something in CU1) then you might have to touch CU4 that had an absolute/cross-CU forward reference to CU5. Once you've got such a fixup pass (if dsymutil already has one? Which, like I said, I'm confused why it would have one/that doesn't match my very vague understanding) then I think you could make dsymutil work on a per-CU basis streaming things out, then fixing up a few offsets.

When dsymutil deduplicates types it changes local CU reference into inter-CU reference(so that CU2(next) could reference type definition from CU1(prev)). To do this change it does not need to do any fixups currently.

When dsymutil meets already existed(located in the input object file) inter-CU reference pointing into the CU which has not been processed yet(and then its offset is unknown) it marks it as "forward reference" and patches later during additional pass "fixup forward references" at a time when offsets are known.

If CUs would be processed in parallel their offsets would not be known at the moment when local type reference would be changed into inter-CU reference. So we would need to do the same fix-up processing for all references to the types like we already do for other inter-CU references.

    Without loading all CU into the memory it would require two passes
    solution. First to analyze
    which part of DWARF relates to live code and then second pass to
    generate the result.

Not sure it'd require any more second pass than a "fixup" pass, which it sounds like you're saying it already has?

It looks like it would need an additional pass to process inter-CU references(existed in incoming file) if we do not want to load all CUs into memory.
When the input file contains inter-CU references, DWARFLinker needs to follow them while doing liveness marking. i.e. if the original CU has a live part which references another CU we need to follow this new CU and mark the referenced part as life. At the current moment, while doing liveness analysis, we have all CUs in memory. That allows us to load all CUs once and analyze them all. In case llvm-dwarfutil(which loads all DWARF for input file) it leads to huge memory usage.

Let's say CU1 references CU100. And CU100 references CU1. We could not start generation for CU1 until we analyzed CU100 and marked the corresponding part of CU1 as life. At the same time, we could not load DWARF for all CUs. Then processing(in simplified form) could look like this:

1: for (CU : CU1...CU100)
load CU, do liveness analysis, remember references, unload CU

2: for (all references)
load CU, do liveness analysis, unload CU

3: for (CU : CU1...CU100)
load CU, clone CU

That is a simplified scheme, but I think it is enough to show the idea. In this scheme we have 1 and 2 which should be done before 3.

Alexey.

dblaikie · October 27, 2020, 5:32pm

Ah, yeah - that seems like a missed opportunity - duplicating the whole type DIE. LTO does this by making monolithic types - merging all the members from different definitions of the same type into one, but that’s maybe too expensive for dsymutil (might still be interesting to know how much more expensive, etc). But I think the other way to go would be to produce a declaration of the type, with the relevant members - and let the DWARF consumer identify this declaration as matching up with the earlier definition. That’s the sort of DWARF you get from the non-MachO default -fno-standalone-debug anyway, so it’s already pretty well tested/supported (support in lldb’s a bit younger/more work-in-progress, admittedly). I wonder how much dsym size there is that could be reduced by such an implementation.

I see. Yes, that could be done and I think it would result in noticeable size reduction(I do not know exact numbers at the moment).

I work on multi-thread DWARFLinker now and it`s first version will do exactly the same type processing like current dsymutil.

Yeah, best to keep the behavior the same through that

Above scheme could be implemented as a next step and it would result in better size reduction(better than current state).

But I think the better scheme could be done also and it would result in even bigger size reduction and in faster execution. This scheme is something similar to what you`ve described above: “LTO does - making monolithic types - merging all the members from different definitions of the same type into one”.

I believe the reason that’s probably not been done is that it can’t be streamed - it’d lead to buffering more of the output

yes. The fact that DWARF should be streamed into AsmPrinter complicates parallel dwarf generation. In my prototype, I generate
several resulting files(each for one source compilation unit) and then sequentially glue them into the final resulting file.

How does that help? Do you use relocations in those intermediate object files so the DWARF in them can refer across files?

It does not help with referring across the file. It helps to parallel the generation of CU bodies.
It is not possible to write two CUs in parallel into AsmPrinter. To make possible parallel generation I stream them into different AsmPrinters(this comment is for “I believe the reason that’s probably not been done is that it can’t be streamed”. which initially was about referring across the file, but it seems I added another direction).

Oh, I see - thanks for explaining, essentially buffering on-disk.

(if two of these expandable types were in one CU - the start of the second type couldn’t be known until the end because it might keep getting pushed later due to expansion of the first type) and/or having to revisit all the type references (the offset to the second type wouldn’t be known until the end - so writing the offsets to refer to the type would have to be deferred until then).

That is the second problem: offsets are not known until the end of file.
dsymutil already has that situation for inter-CU references, so it has extra pass to
fixup offsets.

Oh, it does? I figured it was one-pass, and that it only ever refers back to types in previous CUs? So it doesn’t have to go back and do a second pass. But I guess if sees a declaration of T1 in CU1, then later on sees a definition of T1 in CU2, does it somehow go back to CU1 and remove the declaration/make references refer to the definition in CU2? I figured it’d just leave the declaration and references to it as-is, then add the definition and use that from CU2 onwards?

For the processing of the types, it do not go back.
This “I figured it was one-pass, and that it only ever refers back to types in previous CUs”
and this “I figured it’d just leave the declaration and references to it as-is, then add the definition and use that from CU2 onwards” are correct.

Great - thanks for explaining/confirming!

With multi-thread implementation such situation would arise more often
for type references and so more offsets should be fixed during additional pass.

DWARFLinker could create additional artificial compile unit and put all merged types there. Later patch all type references to point into this additional compilation unit. No any bits would be duplicated in that case. The performance improvement could be achieved due to less amount of the copied DWARF and due to the fact that type references could be updated when DWARF is cloned(no need in additional pass for that).

“later patch all type references to point into this additional compilation unit” - that’s the additional pass that people are probably talking/concerned about. Rewalking all the DWARF. The current dsymutil approach, as far as I know, is single pass - it knows the final, absolute offset to the type from the moment it emits that type/needs to refer to it.

Right. Current dsymutil approach is single pass. And from that point of view, solution
which you`ve described(to produce a declaration of the type, with the relevant members)
allows to keep that single pass implementation.

But there is a restriction for current dsymutil approach: To process inter-CU references
it needs to load all DWARF into the memory(While it analyzes which part of DWARF is live,
it needs to have all CUs loaded into the memory).

All DWARF for a single file (which for dsymutil is mostly a single CU, except with LTO I guess?), not all DWARF for all inputs in memory at once, yeah?

right. In dsymutil case - all DWARF for a single file(not all DWARF for all inputs in memory at once).
But in llvm-dwarfutil case single file contains DWARF for all original input object files and it all becomes
loaded into memory.

Yeha, would be great to try to go CU-by-CU.

That leads to huge memory usage.
It is less important when source is a set of object files(like in dsymutil case) and this
become a real problem for llvm-dwarfutil utility when source is a single file(With current
implementation it needs 30G of memory for compiling clang binary).

Yeah, that’s where I think you’d need a fixup pass one way or another - because cross-CU references can mean that when you figure out a new layout for CU5 (because it has a duplicate type definition of something in CU1) then you might have to touch CU4 that had an absolute/cross-CU forward reference to CU5. Once you’ve got such a fixup pass (if dsymutil already has one? Which, like I said, I’m confused why it would have one/that doesn’t match my very vague understanding) then I think you could make dsymutil work on a per-CU basis streaming things out, then fixing up a few offsets.

When dsymutil deduplicates types it changes local CU reference into inter-CU reference(so that CU2(next) could reference type definition from CU1(prev)). To do this change it does not need to do any fixups currently.

When dsymutil meets already existed(located in the input object file) inter-CU reference pointing into the CU which has not been processed yet(and then its offset is unknown) it marks it as “forward reference” and patches later during additional pass “fixup forward references” at a time when offsets are known.

OK, so limited 2 pass system. (does it do that second pass once at the end of the whole dsymutil run, or at the end of each input file? (so if an input file has two CUs and the first CU references a type in the second CU - it could write the first CU with a “forward reference”, then write the second CU, then fixup the forward reference - and then go on to the next file and its CUs - this could improve performance by touching recently used memory/disk pages only, rather than going all the way back to the start later on when those pages have become cold)

If CUs would be processed in parallel their offsets would not be known at the moment when local type reference would be changed into inter-CU reference. So we would need to do the same fix-up processing for all references to the types like we already do for other inter-CU references.

Yeah - though the existence of this second “fixup forward references” system - yeah, could just use it much more generally as you say. Not an extra pass, just the existing second pass but having way more fixups to fixup in that pass.

Without loading all CU into the memory it would require two passes solution. First to analyze
which part of DWARF relates to live code and then second pass to generate the result.

Not sure it’d require any more second pass than a “fixup” pass, which it sounds like you’re saying it already has?

It looks like it would need an additional pass to process inter-CU references(existed in incoming file) if we do not want to load all CUs into memory.

Usually inter-CU references aren’t used, except in LTO - and in LTO all the DWARF deduplication and function discarding is already done by the IR linker anyway. (ThinLTO is a bit different, but really we’d be better off teaching it the extra tricks anyway (some can’t be fixed in ThinLTO - like emitting a “Home” definition of an inline function, only to find out other ThinLTO backend/shards managed to optimize away all uses of the function… so some cleanup may be useful there)). It might be possible to do a more dynamic/rolling cache - keep only the CUs with unresolved cross-CU references alive and only keep them alive until their cross-CU references are found/marked alive. This should make things no worse than the traditional dsymutil case - since cross-CU references are only effective/generally used within a single object file (it’s possible to create relocations for them into other files - but I know LLVM doesn’t currently do this and I don’t think GCC does it) with multiple CUs anyway - so at most you’d keep all the CUs from a single original input file alive together.

Alexey · October 27, 2020, 7:34pm

If we would be able to change the algorithm in such way : 1. analyse all CUs. 2. clone all CUs. Then we could create a merged type table(artificial CU containing types) during step1. If that type table would be written first, then all following CUs could use known offsets to the types and we would not need additional fix-up processing for type references. It would still be necessary to fix-up other inter-CU references. But it would not be necessary to fix-up type references (which constitute the vast majority). But, since it is a DWARF documented case the tool should be ready for such case(when inter-CU references are heavily used). Moreover, llvm-dwarfutil would be the tool producing exactly such situation. The resulting file(produced by llvm-dwarfutil) would contain a lot of inter-CU references. Probably, there is no practical reasons to apply llvm-dwarfutil to the same file twice but it would be a good test for the tool. Generally, I think we should not assume that inter-CU references would be used in a limited way. Anyway, if this scheme: 1. analyse all CUs. 2. clone all CUs.

dblaikie · October 27, 2020, 10:49pm

Ah, yeah - that seems like a missed opportunity - duplicating the whole type DIE. LTO does this by making monolithic types - merging all the members from different definitions of the same type into one, but that’s maybe too expensive for dsymutil (might still be interesting to know how much more expensive, etc). But I think the other way to go would be to produce a declaration of the type, with the relevant members - and let the DWARF consumer identify this declaration as matching up with the earlier definition. That’s the sort of DWARF you get from the non-MachO default -fno-standalone-debug anyway, so it’s already pretty well tested/supported (support in lldb’s a bit younger/more work-in-progress, admittedly). I wonder how much dsym size there is that could be reduced by such an implementation.

I see. Yes, that could be done and I think it would result in noticeable size reduction(I do not know exact numbers at the moment).

I work on multi-thread DWARFLinker now and it`s first version will do exactly the same type processing like current dsymutil.

Yeah, best to keep the behavior the same through that

Above scheme could be implemented as a next step and it would result in better size reduction(better than current state).

But I think the better scheme could be done also and it would result in even bigger size reduction and in faster execution. This scheme is something similar to what you`ve described above: “LTO does - making monolithic types - merging all the members from different definitions of the same type into one”.

I believe the reason that’s probably not been done is that it can’t be streamed - it’d lead to buffering more of the output

yes. The fact that DWARF should be streamed into AsmPrinter complicates parallel dwarf generation. In my prototype, I generate
several resulting files(each for one source compilation unit) and then sequentially glue them into the final resulting file.

How does that help? Do you use relocations in those intermediate object files so the DWARF in them can refer across files?

It does not help with referring across the file. It helps to parallel the generation of CU bodies.
It is not possible to write two CUs in parallel into AsmPrinter. To make possible parallel generation I stream them into different AsmPrinters(this comment is for “I believe the reason that’s probably not been done is that it can’t be streamed”. which initially was about referring across the file, but it seems I added another direction).

Oh, I see - thanks for explaining, essentially buffering on-disk.

(if two of these expandable types were in one CU - the start of the second type couldn’t be known until the end because it might keep getting pushed later due to expansion of the first type) and/or having to revisit all the type references (the offset to the second type wouldn’t be known until the end - so writing the offsets to refer to the type would have to be deferred until then).

That is the second problem: offsets are not known until the end of file.
dsymutil already has that situation for inter-CU references, so it has extra pass to
fixup offsets.

Oh, it does? I figured it was one-pass, and that it only ever refers back to types in previous CUs? So it doesn’t have to go back and do a second pass. But I guess if sees a declaration of T1 in CU1, then later on sees a definition of T1 in CU2, does it somehow go back to CU1 and remove the declaration/make references refer to the definition in CU2? I figured it’d just leave the declaration and references to it as-is, then add the definition and use that from CU2 onwards?

For the processing of the types, it do not go back.
This “I figured it was one-pass, and that it only ever refers back to types in previous CUs”
and this “I figured it’d just leave the declaration and references to it as-is, then add the definition and use that from CU2 onwards” are correct.

Great - thanks for explaining/confirming!

With multi-thread implementation such situation would arise more often
for type references and so more offsets should be fixed during additional pass.

DWARFLinker could create additional artificial compile unit and put all merged types there. Later patch all type references to point into this additional compilation unit. No any bits would be duplicated in that case. The performance improvement could be achieved due to less amount of the copied DWARF and due to the fact that type references could be updated when DWARF is cloned(no need in additional pass for that).

“later patch all type references to point into this additional compilation unit” - that’s the additional pass that people are probably talking/concerned about. Rewalking all the DWARF. The current dsymutil approach, as far as I know, is single pass - it knows the final, absolute offset to the type from the moment it emits that type/needs to refer to it.

Right. Current dsymutil approach is single pass. And from that point of view, solution
which you`ve described(to produce a declaration of the type, with the relevant members)
allows to keep that single pass implementation.

But there is a restriction for current dsymutil approach: To process inter-CU references
it needs to load all DWARF into the memory(While it analyzes which part of DWARF is live,
it needs to have all CUs loaded into the memory).

All DWARF for a single file (which for dsymutil is mostly a single CU, except with LTO I guess?), not all DWARF for all inputs in memory at once, yeah?

right. In dsymutil case - all DWARF for a single file(not all DWARF for all inputs in memory at once).
But in llvm-dwarfutil case single file contains DWARF for all original input object files and it all becomes
loaded into memory.

Yeha, would be great to try to go CU-by-CU.

That leads to huge memory usage.
It is less important when source is a set of object files(like in dsymutil case) and this
become a real problem for llvm-dwarfutil utility when source is a single file(With current
implementation it needs 30G of memory for compiling clang binary).

Yeah, that’s where I think you’d need a fixup pass one way or another - because cross-CU references can mean that when you figure out a new layout for CU5 (because it has a duplicate type definition of something in CU1) then you might have to touch CU4 that had an absolute/cross-CU forward reference to CU5. Once you’ve got such a fixup pass (if dsymutil already has one? Which, like I said, I’m confused why it would have one/that doesn’t match my very vague understanding) then I think you could make dsymutil work on a per-CU basis streaming things out, then fixing up a few offsets.

When dsymutil deduplicates types it changes local CU reference into inter-CU reference(so that CU2(next) could reference type definition from CU1(prev)). To do this change it does not need to do any fixups currently.

When dsymutil meets already existed(located in the input object file) inter-CU reference pointing into the CU which has not been processed yet(and then its offset is unknown) it marks it as “forward reference” and patches later during additional pass “fixup forward references” at a time when offsets are known.

OK, so limited 2 pass system. (does it do that second pass once at the end of the whole dsymutil run, or at the end of each input file? (so if an input file has two CUs and the first CU references a type in the second CU - it could write the first CU with a “forward reference”, then write the second CU, then fixup the forward reference - and then go on to the next file and its CUs - this could improve performance by touching recently used memory/disk pages only, rather than going all the way back to the start later on when those pages have become cold)

yes, It does it in the end of each input file.

If CUs would be processed in parallel their offsets would not be known at the moment when local type reference would be changed into inter-CU reference. So we would need to do the same fix-up processing for all references to the types like we already do for other inter-CU references.

Yeah - though the existence of this second “fixup forward references” system - yeah, could just use it much more generally as you say. Not an extra pass, just the existing second pass but having way more fixups to fixup in that pass.

If we would be able to change the algorithm in such way :

analyse all CUs.

clone all CUs.

Then we could create a merged type table(artificial CU containing types) during step1.
If that type table would be written first, then all following CUs could use known offsets
to the types and we would not need additional fix-up processing for type references.
It would still be necessary to fix-up other inter-CU references. But it would not be necessary
to fix-up type references (which constitute the vast majority).

To me, that sounds more expensive than the fixup forward references pass.

Without loading all CU into the memory it would require two passes solution. First to analyze
which part of DWARF relates to live code and then second pass to generate the result.

Not sure it’d require any more second pass than a “fixup” pass, which it sounds like you’re saying it already has?

It looks like it would need an additional pass to process inter-CU references(existed in incoming file) if we do not want to load all CUs into memory.

Usually inter-CU references aren’t used, except in LTO - and in LTO all the DWARF deduplication and function discarding is already done by the IR linker anyway. (ThinLTO is a bit different, but really we’d be better off teaching it the extra tricks anyway (some can’t be fixed in ThinLTO - like emitting a “Home” definition of an inline function, only to find out other ThinLTO backend/shards managed to optimize away all uses of the function… so some cleanup may be useful there)). It might be possible to do a more dynamic/rolling cache - keep only the CUs with unresolved cross-CU references alive and only keep them alive until their cross-CU references are found/marked alive. This should make things no worse than the traditional dsymutil case - since cross-CU references are only effective/generally used within a single object file (it’s possible to create relocations for them into other files - but I know LLVM doesn’t currently do this and I don’t think GCC does it) with multiple CUs anyway - so at most you’d keep all the CUs from a single original input file alive together.

But, since it is a DWARF documented case the tool should be ready for such case(when inter-CU
references are heavily used).

Sure - but by implementing a CU liveness window like that (keeping CUs live only so long as they need to be rather than an all-or-nothing approach) only especially quirky inputs would hit the worst case while the more normal inputs could perform better.

Moreover, llvm-dwarfutil would be the tool producing
exactly such situation. The resulting file(produced by llvm-dwarfutil) would contain a lot of
inter-CU references. Probably, there is no practical reasons to apply llvm-dwarfutil to the same
file twice but it would be a good test for the tool.

It’d be a good stress test, but not necessarily something that would need to perform the best because it wouldn’t be a common use case.

Generally, I think we should not assume that inter-CU references would be used in a limited way.

Anyway, if this scheme:

analyse all CUs.

clone all CUs.

would work slow then we would need to continue with one-pass solution and not support complex closely coupled inputs.

yeah, certainly seeing the data/experiments will be interesting, if you end up implementing some different strategies, etc.

I guess one possibility for parallel generation could be something more like Microsoft’s approach with a central debug info server that compilers communicate with - not that exact model, I mean, but if you’ve got parallel threads generating reduced DWARF into separate object files - they could communicate with a single thread responsible for type emission - the type emitter would be given types from the separate threads and compute their size, queue them up to be streamed out to the type CU (& keep the source CU alive until that work was done) - such a central type emitter could quickly determine the size of the type to be emitted and compute future type offsets (eg: if 5 types were in the queue, it could’ve figured out the offset of those types already) to answer type offset queries quickly and unblock the parallel threads to continue emitting their CUs containing type references.

Dave

Alexey · October 28, 2020, 1:01pm

                Ah, yeah - that seems like a missed opportunity -
                duplicating the whole type DIE. LTO does this by
                making monolithic types - merging all the members
                from different definitions of the same type into
                one, but that's maybe too expensive for dsymutil
                (might still be interesting to know how much more
                expensive, etc). But I think the other way to go
                would be to produce a declaration of the type,
                with the relevant members - and let the DWARF
                consumer identify this declaration as matching up
                with the earlier definition. That's the sort of
                DWARF you get from the non-MachO default
                -fno-standalone-debug anyway, so it's already
                pretty well tested/supported (support in lldb's a
                bit younger/more work-in-progress, admittedly). I
                wonder how much dsym size there is that could be
                reduced by such an implementation.

                I see. Yes, that could be done and I think it would
                result in noticeable size reduction(I do not know
                exact numbers at the moment).

                I work on multi-thread DWARFLinker now and it`s
                first version will do exactly the same type
                processing like current dsymutil.

            Yeah, best to keep the behavior the same through that

                Above scheme could be implemented as a next step
                and it would result in better size reduction(better
                than current state).

                But I think the better scheme could be done also
                and it would result in even bigger size reduction
                and in faster execution. This scheme is something
                similar to what you`ve described above: "LTO does -
                making monolithic types - merging all the members
                from different definitions of the same type into one".

            I believe the reason that's probably not been done is
            that it can't be streamed - it'd lead to buffering more
            of the output

            yes. The fact that DWARF should be streamed into
            AsmPrinter complicates parallel dwarf generation. In my
            prototype, I generate
            several resulting files(each for one source compilation
            unit) and then sequentially glue them into the final
            resulting file.

        How does that help? Do you use relocations in those
        intermediate object files so the DWARF in them can refer
        across files?

        It does not help with referring across the file. It helps to
        parallel the generation of CU bodies.
        It is not possible to write two CUs in parallel into
        AsmPrinter. To make possible parallel generation I stream
        them into different AsmPrinters(this comment is for "I
        believe the reason that's probably not been done is that it
        can't be streamed". which initially was about referring
        across the file, but it seems I added another direction).

    Oh, I see - thanks for explaining, essentially buffering on-disk.

            (if two of these expandable types were in one CU - the
            start of the second type couldn't be known until the
            end because it might keep getting pushed later due to
            expansion of the first type) and/or having to revisit
            all the type references (the offset to the second type
            wouldn't be known until the end - so writing the
            offsets to refer to the type would have to be deferred
            until then).

            That is the second problem: offsets are not known until
            the end of file.
            dsymutil already has that situation for inter-CU
            references, so it has extra pass to
            fixup offsets.

        Oh, it does? I figured it was one-pass, and that it only
        ever refers back to types in previous CUs? So it doesn't
        have to go back and do a second pass. But I guess if sees a
        declaration of T1 in CU1, then later on sees a definition of
        T1 in CU2, does it somehow go back to CU1 and remove the
        declaration/make references refer to the definition in CU2?
        I figured it'd just leave the declaration and references to
        it as-is, then add the definition and use that from CU2
        onwards?

        For the processing of the types, it do not go back.
        This "I figured it was one-pass, and that it only ever refers
        back to types in previous CUs"
        and this "I figured it'd just leave the declaration and
        references to it as-is, then add the definition and use that
        from CU2 onwards" are correct.

    Great - thanks for explaining/confirming!

            With multi-thread implementation such situation would
            arise more often
            for type references and so more offsets should be fixed
            during additional pass.

                DWARFLinker could create additional artificial
                compile unit and put all merged types there. Later
                patch all type references to point into this
                additional compilation unit. No any bits would be
                duplicated in that case. The performance
                improvement could be achieved due to less amount of
                the copied DWARF and due to the fact that type
                references could be updated when DWARF is cloned(no
                need in additional pass for that).

            "later patch all type references to point into this
            additional compilation unit" - that's the additional
            pass that people are probably talking/concerned about.
            Rewalking all the DWARF. The current dsymutil approach,
            as far as I know, is single pass - it knows the final,
            absolute offset to the type from the moment it emits
            that type/needs to refer to it.

            Right. Current dsymutil approach is single pass. And
            from that point of view, solution
            which you`ve described(to produce a declaration of the
            type, with the relevant members)
            allows to keep that single pass implementation.

            But there is a restriction for current dsymutil
            approach: To process inter-CU references
            it needs to load all DWARF into the memory(While it
            analyzes which part of DWARF is live,
            it needs to have all CUs loaded into the memory).

        All DWARF for a single file (which for dsymutil is mostly a
        single CU, except with LTO I guess?), not all DWARF for all
        inputs in memory at once, yeah?

        right. In dsymutil case - all DWARF for a single file(not all
        DWARF for all inputs in memory at once).
        But in llvm-dwarfutil case single file contains DWARF for all
        original input object files and it all becomes
        loaded into memory.

    Yeha, would be great to try to go CU-by-CU.

            That leads to huge memory usage.
            It is less important when source is a set of object
            files(like in dsymutil case) and this
            become a real problem for llvm-dwarfutil utility when
            source is a single file(With current
            implementation it needs 30G of memory for compiling
            clang binary).

        Yeah, that's where I think you'd need a fixup pass one way
        or another - because cross-CU references can mean that when
        you figure out a new layout for CU5 (because it has a
        duplicate type definition of something in CU1) then you
        might have to touch CU4 that had an absolute/cross-CU
        forward reference to CU5. Once you've got such a fixup pass
        (if dsymutil already has one? Which, like I said, I'm
        confused why it would have one/that doesn't match my very
        vague understanding) then I think you could make dsymutil
        work on a per-CU basis streaming things out, then fixing up
        a few offsets.

        When dsymutil deduplicates types it changes local CU
        reference into inter-CU reference(so that CU2(next) could
        reference type definition from CU1(prev)). To do this change
        it does not need to do any fixups currently.

        When dsymutil meets already existed(located in the input
        object file) inter-CU reference pointing into the CU which
        has not been processed yet(and then its offset is unknown) it
        marks it as "forward reference" and patches later during
        additional pass "fixup forward references" at a time when
        offsets are known.

    OK, so limited 2 pass system. (does it do that second pass once
    at the end of the whole dsymutil run, or at the end of each input
    file? (so if an input file has two CUs and the first CU
    references a type in the second CU - it could write the first CU
    with a "forward reference", then write the second CU, then fixup
    the forward reference - and then go on to the next file and its
    CUs - this could improve performance by touching recently used
    memory/disk pages only, rather than going all the way back to the
    start later on when those pages have become cold)

    yes, It does it in the end of each input file.

        If CUs would be processed in parallel their offsets would not
        be known at the moment when local type reference would be
        changed into inter-CU reference. So we would need to do the
        same fix-up processing for all references to the types like
        we already do for other inter-CU references.

    Yeah - though the existence of this second "fixup forward
    references" system - yeah, could just use it much more generally
    as you say. Not an extra pass, just the existing second pass but
    having way more fixups to fixup in that pass.

    If we would be able to change the algorithm in such way :

    1. analyse all CUs.
    2. clone all CUs.

    Then we could create a merged type table(artificial CU containing
    types) during step1.
    If that type table would be written first, then all following CUs
    could use known offsets
    to the types and we would not need additional fix-up processing
    for type references.
    It would still be necessary to fix-up other inter-CU references.
    But it would not be necessary
    to fix-up type references (which constitute the vast majority).

To me, that sounds more expensive than the fixup forward references pass.

If we would speak about direct comparison then yes loading DWARF one more time looks more expensive than fixup forward references pass. But if we would speak about the general picture then it could probably be beneficial:

1. merging types would lead to a smaller size of resulting DWARF. This would speed up the process.
f.e. If we would switch "odr types deduplication" off in current implementation then it would increase execution time two times. That is because more DWARF should be cloned and written in the result. Implementation of "merging types" would probably have a similar effect
- It would speed-up the overall process. So from one side additional step for loading DWARF would
decrease performance but a smaller amount of resulting data would increase performance.

2. When types would be put in the first CU then we would have a simple strategy for our liveness analysis algorithm: just always keep the first CU in memory. This allows us to speed up our liveness analysis step.

Anyway, all the above is just an idea for future work. Currently, I am going to implement multithread processing for CUs loaded into memory and having the same type of processing as it currently is(Which assumes that "fixup forward references pass" started to do more work by fixing types references).

            Without loading all CU into the memory it would require
            two passes solution. First to analyze
            which part of DWARF relates to live code and then second
            pass to generate the result.

        Not sure it'd require any more second pass than a "fixup"
        pass, which it sounds like you're saying it already has?

        It looks like it would need an additional pass to process
        inter-CU references(existed in incoming file) if we do not
        want to load all CUs into memory.

    Usually inter-CU references aren't used, except in LTO - and in
    LTO all the DWARF deduplication and function discarding is
    already done by the IR linker anyway. (ThinLTO is a bit
    different, but really we'd be better off teaching it the extra
    tricks anyway (some can't be fixed in ThinLTO - like emitting a
    "Home" definition of an inline function, only to find out other
    ThinLTO backend/shards managed to optimize away all uses of the
    function... so some cleanup may be useful there)). It might be
    possible to do a more dynamic/rolling cache - keep only the CUs
    with unresolved cross-CU references alive and only keep them
    alive until their cross-CU references are found/marked alive.
    This should make things no worse than the traditional dsymutil
    case - since cross-CU references are only effective/generally
    used within a single object file (it's possible to create
    relocations for them into other files - but I know LLVM doesn't
    currently do this and I don't think GCC does it) with multiple
    CUs anyway - so at most you'd keep all the CUs from a single
    original input file alive together.

    But, since it is a DWARF documented case the tool should be ready
    for such case(when inter-CU
    references are heavily used).

Sure - but by implementing a CU liveness window like that (keeping CUs live only so long as they need to be rather than an all-or-nothing approach) only especially quirky inputs would hit the worst case while the more normal inputs could perform better.

It is not clear what should be put in such CU liveness window. If CU100 references CU1 - how could we know that we need to put CU1 into CU liveness window before we processed CU100?

    Moreover, llvm-dwarfutil would be the tool producing
    exactly such situation. The resulting file(produced by
    llvm-dwarfutil) would contain a lot of
    inter-CU references. Probably, there is no practical reasons to
    apply llvm-dwarfutil to the same
    file twice but it would be a good test for the tool.

It'd be a good stress test, but not necessarily something that would need to perform the best because it wouldn't be a common use case.

I agree that we should not slow down the DWARFLinker in common cases only because we need to support the worst cases.
But we also need to implement a solution which works in some acceptable manner for the worst case. The current solution - loading everything in memory - makes it hard to use in a non-dsymutil scenario(llvm-dwarfutil).

There could be several things which could be used to decide whether we need to go on a light or heavy path:

1. If the input contains only a single CU we do not need to unload it from memory. Thus - we would not need to do an extra DWARF loading pass.
2. If abbreviations from the whole input file do not contain inter-CU references then while doing liveness analysis, we do not need to wait until other CUs are processed.

Then that scheme would be used for worst cases:

1. for (CU : CU1...CU100) {
load CU.
analyse CU.
unload CU.
}
2. for (CU : CU1...CU100) {
load CU.
clone CU.
unload CU.
}
3. fixup forward references.

and that scheme for light cases:

1. for (CU : CU1...CU100) {
load CU.
analyse CU.
clone CU.
unload CU.
}
2. fixup forward references.

    Generally, I think we should not assume that inter-CU references
    would be used in a limited way.

    Anyway, if this scheme:

    1. analyse all CUs.
    2. clone all CUs.

    would work slow then we would need to continue with one-pass
    solution and not support complex closely coupled inputs.

yeah, certainly seeing the data/experiments will be interesting, if you end up implementing some different strategies, etc.

I guess one possibility for parallel generation could be something more like Microsoft's approach with a central debug info server that compilers communicate with - not that exact model, I mean, but if you've got parallel threads generating reduced DWARF into separate object files - they could communicate with a single thread responsible for type emission - the type emitter would be given types from the separate threads and compute their size, queue them up to be streamed out to the type CU (& keep the source CU alive until that work was done) - such a central type emitter could quickly determine the size of the type to be emitted and compute future type offsets (eg: if 5 types were in the queue, it could've figured out the offset of those types already) to answer type offset queries quickly and unblock the parallel threads to continue emitting their CUs containing type references.

yes. Thank you. Would think about it.

Alexey.

Topic		Replies	Views
Reimplementing Darwin's dsymutil as an lld helper LLVM Dev List Archives	15	94	November 18, 2015
DWARF parsing code LLDB	7	83	June 23, 2014
Adding DWARF5 accelerator table support to llvm LLVM Dev List Archives	44	184	June 18, 2018
llvm-dwarfdump improvements LLVM Dev List Archives	4	95	September 3, 2014
DebugInfo work contribution and update. LLVM Dev List Archives	17	91	October 18, 2019

[Proposal][Debuginfo] dsymutil-like tool for ELF.

Related Topics