Ah, yeah - that seems like a missed opportunity -
duplicating the whole type DIE. LTO does this by making
monolithic types - merging all the members from different
definitions of the same type into one, but that's maybe too
expensive for dsymutil (might still be interesting to know
how much more expensive, etc). But I think the other way to
go would be to produce a declaration of the type, with the
relevant members - and let the DWARF consumer identify this
declaration as matching up with the earlier definition.
That's the sort of DWARF you get from the non-MachO default
-fno-standalone-debug anyway, so it's already pretty well
tested/supported (support in lldb's a bit younger/more
work-in-progress, admittedly). I wonder how much dsym size
there is that could be reduced by such an implementation.
I see. Yes, that could be done, and I think it would result in
a noticeable size reduction (I do not know exact numbers at the
moment).
I am working on a multi-threaded DWARFLinker now, and its first version
will do exactly the same type processing as the current dsymutil.
Yeah, best to keep the behavior the same through that
The above scheme could be implemented as a next step, and it would
result in a better size reduction than the current state.
But I think an even better scheme is possible, one that would
give a bigger size reduction and faster execution.
This scheme is similar to what you've described
above: "LTO does - making monolithic types - merging all the
members from different definitions of the same type into one".
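As a rough illustration of the "monolithic type" idea quoted above, here is a minimal Python sketch. It models each type definition as a name plus a list of member names; this model and the helper name merge_type_definitions are hypothetical and are not how LTO or dsymutil represent DIEs.

```python
def merge_type_definitions(definitions):
    """Merge same-named type definitions into one definition per type.

    `definitions` is a list of (type_name, [member_name, ...]) pairs, one
    pair per definition seen in some compile unit (hypothetical model).
    """
    merged = {}
    for type_name, members in definitions:
        seen = merged.setdefault(type_name, [])
        for member in members:
            if member not in seen:  # keep the first occurrence, preserve order
                seen.append(member)
    return merged


per_cu_definitions = [
    ("Foo", ["x", "Foo()"]),  # CU1 only needed the constructor
    ("Foo", ["x", "get()"]),  # CU2 only needed get()
    ("Bar", ["y"]),
]
merged_types = merge_type_definitions(per_cu_definitions)
```

Note that the merged definition of Foo is only complete after every CU has been seen, which is the buffering concern raised in the reply that follows.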
I believe the reason that's probably not been done is that it
can't be streamed - it'd lead to buffering more of the output
Yes. The fact that DWARF has to be streamed into AsmPrinter
complicates parallel DWARF generation. In my prototype, I generate
several output files (one per source compilation unit) and
then sequentially glue them into the final output file.
How does that help? Do you use relocations in those intermediate object files so the DWARF in them can refer across files?
It does not help with referring across files. It helps to parallelize the generation of CU bodies.
It is not possible to write two CUs in parallel into one AsmPrinter. To make parallel generation possible, I stream them into different AsmPrinters. (This comment is a reply to "I believe the reason that's probably not been done is that it can't be streamed", which was initially about referring across files, but it seems I added another direction.)
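The "separate streams, then glue" approach can be sketched in Python as follows. This is only an illustration under simplifying assumptions: emit_cu_body is a hypothetical stand-in for cloning one CU into its own output stream, and the real prototype uses AsmPrinter instances and intermediate files rather than byte strings.

```python
from concurrent.futures import ThreadPoolExecutor


def emit_cu_body(cu_name):
    # Hypothetical stand-in for cloning one CU into its own private stream.
    return f"<{cu_name}>".encode()


def link_in_parallel(cu_names):
    # Generate every CU body independently; map() keeps the input order.
    with ThreadPoolExecutor() as pool:
        bodies = list(pool.map(emit_cu_body, cu_names))
    # Glue the bodies sequentially. The final start offset of a CU is only
    # known at this point, not while its body was being generated.
    offsets, output = {}, bytearray()
    for name, body in zip(cu_names, bodies):
        offsets[name] = len(output)
        output += body
    return bytes(output), offsets


linked, cu_offsets = link_in_parallel(["CU1", "CU2", "CU3"])
```

The glue step shows why this helps with parallel generation but not with cross-CU references: a CU's final offset is unknown while other CUs are still being generated.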
(if two of these expandable types were in one CU - the start of
the second type couldn't be known until the end because it might
keep getting pushed later due to expansion of the first type)
and/or having to revisit all the type references (the offset to
the second type wouldn't be known until the end - so writing the
offsets to refer to the type would have to be deferred until then).
That is the second problem: offsets are not known until the end of
the file.
dsymutil already has that situation for inter-CU references, so it
has an extra pass to
fix up offsets.
Oh, it does? I figured it was one-pass, and that it only ever refers back to types in previous CUs? So it doesn't have to go back and do a second pass. But I guess if it sees a declaration of T1 in CU1, then later on sees a definition of T1 in CU2, does it somehow go back to CU1 and remove the declaration/make references refer to the definition in CU2? I figured it'd just leave the declaration and references to it as-is, then add the definition and use that from CU2 onwards?
For the processing of types, it does not go back.
This "I figured it was one-pass, and that it only ever refers back to types in previous CUs"
and this "I figured it'd just leave the declaration and references to it as-is, then add the definition and use that from CU2 onwards" are correct.
With a multi-threaded implementation, such a situation would arise more
often
for type references, so more offsets would have to be fixed during the
additional pass.
DWARFLinker could create an additional artificial compile unit
and put all merged types there, then later patch all type
references to point into this additional compilation unit.
No bits would be duplicated in that case. The performance
improvement would come from the smaller amount of
copied DWARF and from the fact that type references could
be updated when the DWARF is cloned (no additional pass is needed
for that).
"later patch all type references to point into this additional
compilation unit" - that's the additional pass that people are
probably talking/concerned about. Rewalking all the DWARF. The
current dsymutil approach, as far as I know, is single pass - it
knows the final, absolute offset to the type from the moment it
emits that type/needs to refer to it.
Right. The current dsymutil approach is single pass. And from that
point of view, the solution
which you've described (to produce a declaration of the type, with
the relevant members)
allows keeping that single-pass implementation.
But there is a restriction in the current dsymutil approach: to
process inter-CU references
it needs to load all DWARF into memory (while it analyzes which
part of the DWARF is live,
it needs to have all CUs loaded into memory).
All DWARF for a single file (which for dsymutil is mostly a single CU, except with LTO I guess?), not all DWARF for all inputs in memory at once, yeah?
Right. In the dsymutil case, all DWARF for a single file (not all DWARF for all inputs in memory at once).
But in the llvm-dwarfutil case, a single file contains the DWARF for all the original input object files, and all of it gets
loaded into memory.
That leads to huge memory usage.
It is less important when the source is a set of object files (as in
the dsymutil case), but it
becomes a real problem for the llvm-dwarfutil utility when the source is a
single file (with the current
implementation it needs 30G of memory when processing the clang binary).
Yeah, that's where I think you'd need a fixup pass one way or another - because cross-CU references can mean that when you figure out a new layout for CU5 (because it has a duplicate type definition of something in CU1) then you might have to touch CU4 that had an absolute/cross-CU forward reference to CU5. Once you've got such a fixup pass (if dsymutil already has one? Which, like I said, I'm confused why it would have one/that doesn't match my very vague understanding) then I think you could make dsymutil work on a per-CU basis streaming things out, then fixing up a few offsets.
When dsymutil deduplicates types, it changes a local CU reference into an inter-CU reference (so that CU2 (next) can reference the type definition from CU1 (prev)). It currently does not need any fixups to make this change.
When dsymutil meets a pre-existing inter-CU reference (one already present in the input object file) pointing into a CU which has not been processed yet (so its offset is unknown), it marks it as a "forward reference" and patches it later, during the additional "fixup forward references" pass, at a time when the offsets are known.
If CUs were processed in parallel, their offsets would not be known at the moment when a local type reference is changed into an inter-CU reference. So we would need to do the same fixup processing for all references to types as we already do for other inter-CU references.
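To make the forward-reference handling concrete, here is a minimal Python sketch. It assumes a toy model in which each CU is a name, a size, and a list of (target CU, offset within target) references; the function link_with_fixups and this data layout are hypothetical and are not dsymutil's actual data structures.

```python
def link_with_fixups(cus):
    """cus: list of (name, size, [(target_cu, offset_in_target), ...])."""
    cu_start = {}   # final start offset of every CU emitted so far
    resolved = []   # (source_cu, absolute_offset), None while still unknown
    forward = []    # (index into resolved, target_cu, offset_in_target)
    position = 0
    for name, size, refs in cus:
        cu_start[name] = position
        for target, local in refs:
            if target in cu_start:
                # Backward (or self) reference: the offset is already known.
                resolved.append((name, cu_start[target] + local))
            else:
                # Forward reference: emit a placeholder and remember it.
                forward.append((len(resolved), target, local))
                resolved.append((name, None))
        position += size
    # Fixup pass: every CU offset is known now, so patch the placeholders.
    for index, target, local in forward:
        source, _ = resolved[index]
        resolved[index] = (source, cu_start[target] + local)
    return resolved, len(forward)


references, fixup_count = link_with_fixups([
    ("CU1", 100, [("CU2", 8)]),   # CU2 is not emitted yet: forward reference
    ("CU2", 50, [("CU1", 16)]),   # CU1 is already emitted: resolved directly
])
```

In this model only references into not-yet-emitted CUs need patching. With parallel processing, no CU start offset would be known up front, so every cross-CU reference would go through the fixup list.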
Without loading all CUs into memory, it would require a two-pass
solution: a first pass to analyze
which part of the DWARF relates to live code, and then a second pass to
generate the result.
Not sure it'd require any more second pass than a "fixup" pass, which it sounds like you're saying it already has?
It looks like it would need an additional pass to process inter-CU references (those present in the incoming file) if we do not want to load all CUs into memory.
When the input file contains inter-CU references, DWARFLinker needs to follow them while doing liveness marking, i.e. if the original CU has a live part which references another CU, we need to follow the reference into that other CU and mark the referenced part as live. At the moment, while doing liveness analysis, we have all CUs in memory. That allows us to load all CUs once and analyze them all. In the llvm-dwarfutil case (which loads all DWARF for the input file) it leads to huge memory usage.
Let's say CU1 references CU100, and CU100 references CU1. We could not start generation for CU1 until we had analyzed CU100 and marked the corresponding part of CU1 as live. At the same time, we could not load the DWARF for all CUs. Then the processing (in simplified form) could look like this:
1: for (CU : CU1...CU100)
     load CU, do liveness analysis, remember references, unload CU
2: for (all references)
     load CU, do liveness analysis, unload CU
3: for (CU : CU1...CU100)
     load CU, clone CU
That is a simplified scheme, but I think it is enough to show the idea. In this scheme we have 1 and 2 which should be done before 3.
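The three steps can be sketched in Python as below. This is only an illustration under assumed simplifications: a CU is modeled as a dict mapping each DIE name to the references it makes, "loading" is a dict lookup that is merely counted, and the names link_low_memory, roots and loads are hypothetical.

```python
def link_low_memory(cus, roots):
    """cus: {cu_name: {die_name: [(target_cu, target_die), ...]}}
    roots: {cu_name: [die_name, ...]}, the DIEs known to be live up front."""
    loads = 0                            # how many times a CU gets loaded
    live = {name: set() for name in cus}
    pending = []                         # remembered cross-CU references

    def mark(cu_name, start_dies):
        nonlocal loads
        loads += 1
        dies = cus[cu_name]              # "load CU"
        stack = list(start_dies)
        while stack:
            die = stack.pop()
            if die in live[cu_name]:
                continue
            live[cu_name].add(die)
            for target_cu, target_die in dies[die]:
                if target_cu == cu_name:
                    stack.append(target_die)
                else:
                    pending.append((target_cu, target_die))
        # "unload CU": nothing from `dies` is kept after this point

    for cu_name in cus:                  # step 1
        mark(cu_name, roots.get(cu_name, []))
    while pending:                       # step 2
        target_cu, target_die = pending.pop()
        if target_die not in live[target_cu]:
            mark(target_cu, [target_die])
    output = []
    for cu_name in cus:                  # step 3: "load CU, clone CU"
        loads += 1
        output.append((cu_name, sorted(live[cu_name])))
    return output, loads


cloned, load_count = link_low_memory(
    {"CU1": {"main": [("CU100", "T")], "U": [], "dead": []},
     "CU100": {"T": [("CU1", "U")], "unused": []}},
    roots={"CU1": ["main"]},
)
```

In this toy run each of the two CUs is loaded three times, which illustrates the trade-off: peak memory stays at one CU, at the cost of reloading CUs while cross-CU references are followed.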
Alexey.