I have some ideas to allow the BitcodeReader to lazy-load debug info
metadata, and wanted to air this on llvm-dev before getting too deep
into the code.
Motivation
I have some ideas to allow the BitcodeReader to lazy-load debug info
metadata, and wanted to air this on llvm-dev before getting too deep
into the code.
Motivation
+pcc, who had some other ideas/patch out for improving memory usage of debug info
+Reid, who’s responsible for the windows/CodeView/PDB debug info which is motivating some of the ideas about changes to type emission
So how does this relate, or not, to Peter’s (pcc) work trying to reduce the DIE overhead during code gen? Are you folks chasing different memory bottlenecks? Are they both relevant (perhaps in different scenarios)?
Baking into the IR more about types as units has pretty direct overlap with Reid/CodeView/etc - so, yeah, that’ll takes ome discussion (but, as you say, it’s not in your immediate plan anyway, so we can come back to that - but would be good for whoever gets there first to discuss it with the others)
+pcc, who had some other ideas/patch out for improving memory usage of debug info
+Reid, who’s responsible for the windows/CodeView/PDB debug info which is motivating some of the ideas about changes to type emission
So I discussed this with Adrian and Mehdi at the social last Thursday and I’m getting set to finish the write up. I think it’ll have some bearing on this proposal as I think it’ll change how we want to take a look at the format of DISubprogram metadata a bit more. That said, most of it is orthogonal to the changes Duncan is talking about here. Just puts the pressure on to get the other proposal written up.
-eric
+pcc, who had some other ideas/patch out for improving memory usage of debug info
+Reid, who’s responsible for the windows/CodeView/PDB debug info which is motivating some of the ideas about changes to type emissionSo how does this relate, or not, to Peter’s (pcc) work trying to reduce the DIE overhead during code gen? Are you folks chasing different memory bottlenecks? Are they both relevant (perhaps in different scenarios)?
I think this is orthogonal to Peter’s work, although (2) may help DIE overhead, I’m not sure. The primary goal is actually CPU speedup when lazy-loading only a very small number of functions from a module (e.g, a module that gets imported into 1000 others needs to be lazy loaded 1000 times), but I’m using the memory stats as rough guidance for which metadata exists and takes time to parse. The memory is also a problem, but the peak memory scales linearly so it’s not as bad.
+pcc, who had some other ideas/patch out for improving memory usage of
debug info
+Reid, who's responsible for the windows/CodeView/PDB debug info which is
motivating some of the ideas about changes to type emissionSo how does this relate, or not, to Peter's (pcc) work trying to reduce
the DIE overhead during code gen? Are you folks chasing different memory
bottlenecks? Are they both relevant (perhaps in different scenarios)?I think this is orthogonal to Peter's work, although (2) may help DIE
overhead, I'm not sure. The primary goal is actually CPU speedup when
lazy-loading only a very small number of functions from a module (e.g, a
module that gets imported into 1000 others needs to be lazy loaded 1000
times), but I'm using the memory stats as rough guidance for which metadata
exists and takes time to parse. The memory is also a problem, but the peak
memory scales linearly so it's not as bad.
I was mostly curious why you & Peter ended up with different bottlenecks if
you're both going after the same scenario. But I see you're specifically
targeting /Thin/LTO whereas he's more interested in traditional LTO. So not
surprising you're hitting the memory bottlenecks in modules linking, etc,
but less so in CodeGen.
Did you investigate much further the straight LTO memory issues before
prioritizing ThinLTO? (just curious if you got LTO to something acceptable
usable for your needs, if you knew what the next biggest problem was, etc -
not too important though, just curious about any institutional knowledge
lying around on these different axes)
We have the same issue using ThinLTO that exist with FullLTO during CodeGen, maybe even more difficult to scale than FullLTO (more redundant work because of all the imported debug info). However by design ThinLTO partitioning helps to handle the scalability somehow, provided that we apply a few fixes.
However the IR loading/linking itself is on another level worse than FullLTO here (the number of IR lazy load is ~quadratic in the number of modules instead of linear for FullLTO). So this something that needs to be addressed on top of what Peter does, and with higher priority considering the impact.
Ultimately it’ll need some bitcode format redesign to be really efficient, but that’s not gonna happen in the short term (we aiming at a couple of weeks milestone right now).
We’d like to have ThinLTO “good enough” to obsolete FullLTO for most use cases. So our effort are focused on ThinLTO right now.
I have some ideas to allow the BitcodeReader to lazy-load debug info
metadata, and wanted to air this on llvm-dev before getting too deep
into the code.Motivation
Based on some analysis Mehdi ran (ping him for details), there are three
(related) compile-time bottlenecks we're seeing with `-flto=thin -g`:a) Reading the large number of Metadata bitcode records in the global
metadata block. I'm talking about raw `BitStreamer` calls here.b) Creating unnecessary `DI*` instances (that aren't relevant to code).
Creating in the source module, or in the dest module during linking?
c) Emitting unnecessary `DI*` instances (that aren't relevant to code).
Here is my recollection of some peak memory stats on a small testcase
during thin-LTO, which should be a decent indicator of (b):- ~150MB: DILocation
- ~100MB: DISubprogram
- ~70MB: DILocalVariable
- ~50MB: (cumulative) DIType descendentsIt looks, suprisingly, like types are not the primary bottleneck.
There are caveats:
- `DISubprogram` declarations -- member function descriptors -- are
part of the type hierarchy.
- Most of the type hierarchy gets uniqued at parse time.
- As a result, these data are a poor indicator for (a).Even so, non-types are substantial.
Related work
Teresa has some post-processing in-place/in-review to avoid importing
metadata unnecessarily, but IIUC: it won't address (a) and (b), only
(c) (maybe I'm wrong?); and it only helps -flto=thin, not other
lazy-loaders.
That is D16440. It reduces the metadata imported into the dest module (not
sure whether that falls into (b) or just (c)).
It could actually help full LTO as well since I also added support for not
linking in unneeded DISubprogram for full LTO at the same time as ThinLTO
in r256003. But right now the changes in the patch are guarded so they only
happen under ThinLTO since some of the other things we prune from the
imported DICompileUnit only applies to ThinLTO. I could restructure that a
bit to get the reduced retained types importing to occur for full LTO as
well.
I heard a rumour that Eric has a grand plan to factor away the type
hierarchy -- awesome if true -- but I think most of this is worthwhile
regardless.Proposal
Short version
-------------1. Serialize metadata in Function blocks where possible.
2. Reverse the `DISubprogram`/`DICompileUnit` link.
3. Create a `METADATA_SUBPROGRAM_BLOCK`.Type-related work Eric will make unnecessary if he's fast:
4. Remove `DICompositeType`s from `retainedTypes:`, similar to (2).
5. Create a `METADATA_COMPOSITE_TYPE_BLOCK`, similar to (3).Long version
------------1. If a piece of metadata is referenced from only a single `Function`,
serialize that metadata in the function's metadata block instead of
the global metadata block.This addresses problems (a) and (b), primarily targeting
`DILocation`s. It should pick up lots of other stuff, depending on
how much inlining has happened.(I have a draft of the writer side, still working on the reader.)
2. Reverse the `DISubprogram`/`DICompileUnit` link (David and I have
talked about this in the past in barely-related threads). The
direct effect is that subprograms that are not pointed at by any
code (!dbg attachments or @llvm.dbg.value intrinsics) get dropped.This addresses problem (c). If a consumer is only linking/loading a
subset of a module's functions, this naturally filters subprograms
to the relevant ones. Also, with limited inlining (and assuming
(1)), it addresses problems (a) and (b), too.Adrian volunteered to implement this and is apparently almost ready
to post a patch (still working on testcase update script logic I
believe (probably other details, don't let me oversell it)).
As noted in the review thread for my D16440, I'll need to adjust that
handling once this link reversal goes in.
3. Create a special `METADATA_SUBPROGRAM_BLOCK` for each `DISubprogram`
in the global metadata block. Store the relevant `DISubprogram` and
all of the subprogram's `DILexicalBlock`s and `DILocalVariable`s.
The block can be lazy-loaded on an all-or-nothing basis.In combination with (2), this addresses (a) and (b) in cases that
(1) doesn't catch. A lazy-loading module will only load the
subprogram blocks that get referenced.
I'm not sure I understand this part - if the debug info for each subprogram
can be divided into separate blocks, why can't it be moved into the
function's metadata block? I.e. what happens for debug metadata that is
referenced by multiple functions, which I thought was all that was going to
remain in the global metadata block? Oh - the DISubprogram may be
referenced in other places within the global metadata so cannot move into
the function metadata block. So debug metadata only reached from that
DISubprogram is included in its block, but any debug metadata referenced by
multiple DISubprograms would not be located within one of these blocks?
(I have a basic design for this that accounts for references into
the middle of block; I'll see what happens when I flesh it out.)I think this will solve the non-type bottlenecks.
If Eric hasn't solved types by then, we can do similar things to the IR
for the debug info type hierarchy.4. Implement my proposal to remove the `DICompositeType` name map from
`retainedTypes:`.http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20160125/327936.html
Similar to (2) above, this will naturally filter the types that get
linked in to the ones actually used by the code being linked.It should also allow the reader to skip records for types that have
already been loaded in the main module.
The ValueMapper or something will need to figure out which types referenced
by UUID to map/link in to the dest module. Currently the ValueMapper does
not follow UUID references, but these are brought in when the DICompileUnit
is mapped since they are all in the retained types list.
5. Create a special `METADATA_COMPOSITE_TYPE_BLOCK`, similar to (3) but
for composite types and their members. This avoids the raw bitcode
reading overhead. (This is totally undesigned at this point.)
Ditto here - any metadata referenced by multiple composite types does not
go into a block, right?
Thanks,
Teresa
The problem is that the same subprogram may be referenced from multiple
functions. Consider:
>
>
>
> I have some ideas to allow the BitcodeReader to lazy-load debug info
> metadata, and wanted to air this on llvm-dev before getting too deep
> into the code.
>
> Motivation
> ==========
>
> Based on some analysis Mehdi ran (ping him for details), there are three
> (related) compile-time bottlenecks we're seeing with `-flto=thin -g`:
>
> a) Reading the large number of Metadata bitcode records in the global
> metadata block. I'm talking about raw `BitStreamer` calls here.
>
> b) Creating unnecessary `DI*` instances (that aren't relevant to code).
>
> Creating in the source module, or in the dest module during linking?
>
>
> c) Emitting unnecessary `DI*` instances (that aren't relevant to code).
>
> Here is my recollection of some peak memory stats on a small testcase
> during thin-LTO, which should be a decent indicator of (b):
>
> - ~150MB: DILocation
> - ~100MB: DISubprogram
> - ~70MB: DILocalVariable
> - ~50MB: (cumulative) DIType descendents
>
> It looks, suprisingly, like types are not the primary bottleneck.
>
> There are caveats:
>
> - `DISubprogram` declarations -- member function descriptors -- are
> part of the type hierarchy.
> - Most of the type hierarchy gets uniqued at parse time.
> - As a result, these data are a poor indicator for (a).
>
> Even so, non-types are substantial.
>
> Related work
> ============
>
> Teresa has some post-processing in-place/in-review to avoid importing
> metadata unnecessarily, but IIUC: it won't address (a) and (b), only
> (c) (maybe I'm wrong?); and it only helps -flto=thin, not other
> lazy-loaders.
>
> That is D16440. It reduces the metadata imported into the dest module
(not sure whether that falls into (b) or just (c)).
>
> It could actually help full LTO as well since I also added support for
not linking in unneeded DISubprogram for full LTO at the same time as
ThinLTO in r256003. But right now the changes in the patch are guarded so
they only happen under ThinLTO since some of the other things we prune from
the imported DICompileUnit only applies to ThinLTO. I could restructure
that a bit to get the reduced retained types importing to occur for full
LTO as well.
>
>
> I heard a rumour that Eric has a grand plan to factor away the type
> hierarchy -- awesome if true -- but I think most of this is worthwhile
> regardless.
>
> Proposal
> ========
>
> Short version
> -------------
>
> 1. Serialize metadata in Function blocks where possible.
> 2. Reverse the `DISubprogram`/`DICompileUnit` link.
> 3. Create a `METADATA_SUBPROGRAM_BLOCK`.
>
> Type-related work Eric will make unnecessary if he's fast:
>
> 4. Remove `DICompositeType`s from `retainedTypes:`, similar to (2).
> 5. Create a `METADATA_COMPOSITE_TYPE_BLOCK`, similar to (3).
>
> Long version
> ------------
>
> 1. If a piece of metadata is referenced from only a single `Function`,
> serialize that metadata in the function's metadata block instead of
> the global metadata block.
>
> This addresses problems (a) and (b), primarily targeting
> `DILocation`s. It should pick up lots of other stuff, depending on
> how much inlining has happened.
>
> (I have a draft of the writer side, still working on the reader.)
>
> 2. Reverse the `DISubprogram`/`DICompileUnit` link (David and I have
> talked about this in the past in barely-related threads). The
> direct effect is that subprograms that are not pointed at by any
> code (!dbg attachments or @llvm.dbg.value intrinsics) get dropped.
>
> This addresses problem (c). If a consumer is only linking/loading a
> subset of a module's functions, this naturally filters subprograms
> to the relevant ones. Also, with limited inlining (and assuming
> (1)), it addresses problems (a) and (b), too.
>
> Adrian volunteered to implement this and is apparently almost ready
> to post a patch (still working on testcase update script logic I
> believe (probably other details, don't let me oversell it)).
>
> As noted in the review thread for my D16440, I'll need to adjust that
handling once this link reversal goes in.
>
>
> 3. Create a special `METADATA_SUBPROGRAM_BLOCK` for each `DISubprogram`
> in the global metadata block. Store the relevant `DISubprogram` and
> all of the subprogram's `DILexicalBlock`s and `DILocalVariable`s.
> The block can be lazy-loaded on an all-or-nothing basis.
>
> In combination with (2), this addresses (a) and (b) in cases that
> (1) doesn't catch. A lazy-loading module will only load the
> subprogram blocks that get referenced.
>
> I'm not sure I understand this part - if the debug info for each
subprogram can be divided into separate blocks, why can't it be moved into
the function's metadata block? I.e. what happens for debug metadata that is
referenced by multiple functions, which I thought was all that was going to
remain in the global metadata block? Oh - the DISubprogram may be
referenced in other places within the global metadata so cannot move into
the function metadata block. So debug metadata only reached from that
DISubprogram is included in its block, but any debug metadata referenced by
multiple DISubprograms would not be located within one of these blocks?The problem is that the same subprogram may be referenced from multiple
functions. Consider:
--
void sink();
__attribute__((always_inline)) static inline void foo() { sink(); }
void caller1() { foo(); }
void caller2() { foo(); }
void unrelated() {}
--The IR for caller1 and caller2 will both reference the subprogram for
foo, something like:
--
define void @caller1() !dbg !2 {
call void @sink(), !dbg !5
ret void, !dbg !7
}
define void @caller1() !dbg !3 {
call void @sink(), !dbg !8
ret void, !dbg !10
}
define void @unrelated() !dbg !4 {
ret void, !dbg !11
}!1 = distinct !DISubprogram(name: "foo")
!2 = distinct !DISubprogram(name: "caller1")
!3 = distinct !DISubprogram(name: "caller2")
!4 = distinct !DISubprogram(name: "unrelated")!5 = !DILocation(line: 2, scope: !1, inlinedAt: !6)
!6 = distinct !DILocation(line: 3, scope: !2)
!7 = !DILocation(line: 3, scope: !2)!8 = !DILocation(line: 2, scope: !1, inlinedAt: !9)
!9 = distinct !DILocation(line: 4, scope: !3)
!10 = !DILocation(line: 4, scope: !3)!11 = !DILocation(line: 5, scope: !4)
--
In this example, !1 (subprogram for foo) is referenced from both
@caller1 and @caller2, so we need some common place to save it.We still want to lazy-load it (and the local variables and lexical
blocks underneath it), so that if the loader only wants @unrelated
!1 (and its vars/blocks) never gets read.
Right, that's what I realized halfway through writing the above paragraph.
But my remaining question (the last sentence) is what happens to other
debug metadata nodes that are referenced perhaps indirectly by multiple
DISubprograms? Some would be composite types, I guess they would be placed
in the METADATA_COMPOSITE_TYPE_BLOCK you mention below. But I believe there
are going to be some other uniqued nodes that are not composite types and
are referenced via multiple DISubprogram or other composite types?
Thanks,
Teresa
>
>
>
> I have some ideas to allow the BitcodeReader to lazy-load debug info
> metadata, and wanted to air this on llvm-dev before getting too deep
> into the code.
>
> Motivation
> ==========
>
> Based on some analysis Mehdi ran (ping him for details), there are three
> (related) compile-time bottlenecks we're seeing with `-flto=thin -g`:
>
> a) Reading the large number of Metadata bitcode records in the global
> metadata block. I'm talking about raw `BitStreamer` calls here.
>
> b) Creating unnecessary `DI*` instances (that aren't relevant to code).
>
> Creating in the source module, or in the dest module during linking?
>
>
> c) Emitting unnecessary `DI*` instances (that aren't relevant to code).
>
> Here is my recollection of some peak memory stats on a small testcase
> during thin-LTO, which should be a decent indicator of (b):
>
> - ~150MB: DILocation
> - ~100MB: DISubprogram
> - ~70MB: DILocalVariable
> - ~50MB: (cumulative) DIType descendents
>
> It looks, suprisingly, like types are not the primary bottleneck.
>
> There are caveats:
>
> - `DISubprogram` declarations -- member function descriptors -- are
> part of the type hierarchy.
> - Most of the type hierarchy gets uniqued at parse time.
> - As a result, these data are a poor indicator for (a).
>
> Even so, non-types are substantial.
>
> Related work
> ============
>
> Teresa has some post-processing in-place/in-review to avoid importing
> metadata unnecessarily, but IIUC: it won't address (a) and (b), only
> (c) (maybe I'm wrong?); and it only helps -flto=thin, not other
> lazy-loaders.
>
> That is D16440. It reduces the metadata imported into the dest module (not sure whether that falls into (b) or just (c)).
>
> It could actually help full LTO as well since I also added support for not linking in unneeded DISubprogram for full LTO at the same time as ThinLTO in r256003. But right now the changes in the patch are guarded so they only happen under ThinLTO since some of the other things we prune from the imported DICompileUnit only applies to ThinLTO. I could restructure that a bit to get the reduced retained types importing to occur for full LTO as well.
>
>
> I heard a rumour that Eric has a grand plan to factor away the type
> hierarchy -- awesome if true -- but I think most of this is worthwhile
> regardless.
>
> Proposal
> ========
>
> Short version
> -------------
>
> 1. Serialize metadata in Function blocks where possible.
> 2. Reverse the `DISubprogram`/`DICompileUnit` link.
> 3. Create a `METADATA_SUBPROGRAM_BLOCK`.
>
> Type-related work Eric will make unnecessary if he's fast:
>
> 4. Remove `DICompositeType`s from `retainedTypes:`, similar to (2).
> 5. Create a `METADATA_COMPOSITE_TYPE_BLOCK`, similar to (3).
>
> Long version
> ------------
>
> 1. If a piece of metadata is referenced from only a single `Function`,
> serialize that metadata in the function's metadata block instead of
> the global metadata block.
>
> This addresses problems (a) and (b), primarily targeting
> `DILocation`s. It should pick up lots of other stuff, depending on
> how much inlining has happened.
>
> (I have a draft of the writer side, still working on the reader.)
>
> 2. Reverse the `DISubprogram`/`DICompileUnit` link (David and I have
> talked about this in the past in barely-related threads). The
> direct effect is that subprograms that are not pointed at by any
> code (!dbg attachments or @llvm.dbg.value intrinsics) get dropped.
>
> This addresses problem (c). If a consumer is only linking/loading a
> subset of a module's functions, this naturally filters subprograms
> to the relevant ones. Also, with limited inlining (and assuming
> (1)), it addresses problems (a) and (b), too.
>
> Adrian volunteered to implement this and is apparently almost ready
> to post a patch (still working on testcase update script logic I
> believe (probably other details, don't let me oversell it)).
>
> As noted in the review thread for my D16440, I'll need to adjust that handling once this link reversal goes in.
>
>
> 3. Create a special `METADATA_SUBPROGRAM_BLOCK` for each `DISubprogram`
> in the global metadata block. Store the relevant `DISubprogram` and
> all of the subprogram's `DILexicalBlock`s and `DILocalVariable`s.
> The block can be lazy-loaded on an all-or-nothing basis.
>
> In combination with (2), this addresses (a) and (b) in cases that
> (1) doesn't catch. A lazy-loading module will only load the
> subprogram blocks that get referenced.
>
> I'm not sure I understand this part - if the debug info for each subprogram can be divided into separate blocks, why can't it be moved into the function's metadata block? I.e. what happens for debug metadata that is referenced by multiple functions, which I thought was all that was going to remain in the global metadata block? Oh - the DISubprogram may be referenced in other places within the global metadata so cannot move into the function metadata block. So debug metadata only reached from that DISubprogram is included in its block, but any debug metadata referenced by multiple DISubprograms would not be located within one of these blocks?The problem is that the same subprogram may be referenced from multiple
functions. Consider:
--
void sink();
__attribute__((always_inline)) static inline void foo() { sink(); }
void caller1() { foo(); }
void caller2() { foo(); }
void unrelated() {}
--The IR for caller1 and caller2 will both reference the subprogram for
foo, something like:
--
define void @caller1() !dbg !2 {
call void @sink(), !dbg !5
ret void, !dbg !7
}
define void @caller1() !dbg !3 {
call void @sink(), !dbg !8
ret void, !dbg !10
}
define void @unrelated() !dbg !4 {
ret void, !dbg !11
}!1 = distinct !DISubprogram(name: "foo")
!2 = distinct !DISubprogram(name: "caller1")
!3 = distinct !DISubprogram(name: "caller2")
!4 = distinct !DISubprogram(name: "unrelated")!5 = !DILocation(line: 2, scope: !1, inlinedAt: !6)
!6 = distinct !DILocation(line: 3, scope: !2)
!7 = !DILocation(line: 3, scope: !2)!8 = !DILocation(line: 2, scope: !1, inlinedAt: !9)
!9 = distinct !DILocation(line: 4, scope: !3)
!10 = !DILocation(line: 4, scope: !3)!11 = !DILocation(line: 5, scope: !4)
--
In this example, !1 (subprogram for foo) is referenced from both
@caller1 and @caller2, so we need some common place to save it.We still want to lazy-load it (and the local variables and lexical
blocks underneath it), so that if the loader only wants @unrelated
!1 (and its vars/blocks) never gets read.Right, that's what I realized halfway through writing the above paragraph. But my remaining question (the last sentence) is what happens to other debug metadata nodes that are referenced perhaps indirectly by multiple DISubprograms? Some would be composite types, I guess they would be placed in the METADATA_COMPOSITE_TYPE_BLOCK you mention below. But I believe there are going to be some other uniqued nodes that are not composite types and are referenced via multiple DISubprogram or other composite types?
It looks roughly like most things are either: (i) DILocations;
(ii) DISubprogram (definitions), DILexicalBocks, and
DILocalVariables; or (iii) DICompositeType, DISubprogram
declarations, and other parts of DICompositeType. I was thinking
we'd see what was left at that point, and then if necessary, find a
way to structure the rest.
It's possible we'll want to emit some uniqued nodes multiple times,
sacrificing bitcode size for lazy-loading performance. We already
do this for constants in the IR: if a constant is not referenced
globally, it's emitted in each Function that references it. We
could do the same with, e.g., DISubroutineType.
>
>
>
>
> >
> >
> >
> > I have some ideas to allow the BitcodeReader to lazy-load debug info
> > metadata, and wanted to air this on llvm-dev before getting too deep
> > into the code.
> >
> > Motivation
> > ==========
> >
> > Based on some analysis Mehdi ran (ping him for details), there are
three
> > (related) compile-time bottlenecks we're seeing with `-flto=thin -g`:
> >
> > a) Reading the large number of Metadata bitcode records in the global
> > metadata block. I'm talking about raw `BitStreamer` calls here.
> >
> > b) Creating unnecessary `DI*` instances (that aren't relevant to
code).
> >
> > Creating in the source module, or in the dest module during linking?
> >
> >
> > c) Emitting unnecessary `DI*` instances (that aren't relevant to
code).
> >
> > Here is my recollection of some peak memory stats on a small testcase
> > during thin-LTO, which should be a decent indicator of (b):
> >
> > - ~150MB: DILocation
> > - ~100MB: DISubprogram
> > - ~70MB: DILocalVariable
> > - ~50MB: (cumulative) DIType descendents
> >
> > It looks, suprisingly, like types are not the primary bottleneck.
> >
> > There are caveats:
> >
> > - `DISubprogram` declarations -- member function descriptors -- are
> > part of the type hierarchy.
> > - Most of the type hierarchy gets uniqued at parse time.
> > - As a result, these data are a poor indicator for (a).
> >
> > Even so, non-types are substantial.
> >
> > Related work
> > ============
> >
> > Teresa has some post-processing in-place/in-review to avoid importing
> > metadata unnecessarily, but IIUC: it won't address (a) and (b), only
> > (c) (maybe I'm wrong?); and it only helps -flto=thin, not other
> > lazy-loaders.
> >
> > That is D16440. It reduces the metadata imported into the dest module
(not sure whether that falls into (b) or just (c)).
> >
> > It could actually help full LTO as well since I also added support for
not linking in unneeded DISubprogram for full LTO at the same time as
ThinLTO in r256003. But right now the changes in the patch are guarded so
they only happen under ThinLTO since some of the other things we prune from
the imported DICompileUnit only applies to ThinLTO. I could restructure
that a bit to get the reduced retained types importing to occur for full
LTO as well.
> >
> >
> > I heard a rumour that Eric has a grand plan to factor away the type
> > hierarchy -- awesome if true -- but I think most of this is worthwhile
> > regardless.
> >
> > Proposal
> > ========
> >
> > Short version
> > -------------
> >
> > 1. Serialize metadata in Function blocks where possible.
> > 2. Reverse the `DISubprogram`/`DICompileUnit` link.
> > 3. Create a `METADATA_SUBPROGRAM_BLOCK`.
> >
> > Type-related work Eric will make unnecessary if he's fast:
> >
> > 4. Remove `DICompositeType`s from `retainedTypes:`, similar to (2).
> > 5. Create a `METADATA_COMPOSITE_TYPE_BLOCK`, similar to (3).
> >
> > Long version
> > ------------
> >
> > 1. If a piece of metadata is referenced from only a single `Function`,
> > serialize that metadata in the function's metadata block instead of
> > the global metadata block.
> >
> > This addresses problems (a) and (b), primarily targeting
> > `DILocation`s. It should pick up lots of other stuff, depending on
> > how much inlining has happened.
> >
> > (I have a draft of the writer side, still working on the reader.)
> >
> > 2. Reverse the `DISubprogram`/`DICompileUnit` link (David and I have
> > talked about this in the past in barely-related threads). The
> > direct effect is that subprograms that are not pointed at by any
> > code (!dbg attachments or @llvm.dbg.value intrinsics) get dropped.
> >
> > This addresses problem (c). If a consumer is only linking/loading
a
> > subset of a module's functions, this naturally filters subprograms
> > to the relevant ones. Also, with limited inlining (and assuming
> > (1)), it addresses problems (a) and (b), too.
> >
> > Adrian volunteered to implement this and is apparently almost ready
> > to post a patch (still working on testcase update script logic I
> > believe (probably other details, don't let me oversell it)).
> >
> > As noted in the review thread for my D16440, I'll need to adjust that
handling once this link reversal goes in.
> >
> >
> > 3. Create a special `METADATA_SUBPROGRAM_BLOCK` for each
`DISubprogram`
> > in the global metadata block. Store the relevant `DISubprogram`
and
> > all of the subprogram's `DILexicalBlock`s and `DILocalVariable`s.
> > The block can be lazy-loaded on an all-or-nothing basis.
> >
> > In combination with (2), this addresses (a) and (b) in cases that
> > (1) doesn't catch. A lazy-loading module will only load the
> > subprogram blocks that get referenced.
> >
> > I'm not sure I understand this part - if the debug info for each
subprogram can be divided into separate blocks, why can't it be moved into
the function's metadata block? I.e. what happens for debug metadata that is
referenced by multiple functions, which I thought was all that was going to
remain in the global metadata block? Oh - the DISubprogram may be
referenced in other places within the global metadata so cannot move into
the function metadata block. So debug metadata only reached from that
DISubprogram is included in its block, but any debug metadata referenced by
multiple DISubprograms would not be located within one of these blocks?
>
> The problem is that the same subprogram may be referenced from multiple
> functions. Consider:
> --
> void sink();
> __attribute__((always_inline)) static inline void foo() { sink(); }
> void caller1() { foo(); }
> void caller2() { foo(); }
> void unrelated() {}
> --
>
> The IR for caller1 and caller2 will both reference the subprogram for
> foo, something like:
> --
> define void @caller1() !dbg !2 {
> call void @sink(), !dbg !5
> ret void, !dbg !7
> }
> define void @caller1() !dbg !3 {
> call void @sink(), !dbg !8
> ret void, !dbg !10
> }
> define void @unrelated() !dbg !4 {
> ret void, !dbg !11
> }
>
> !1 = distinct !DISubprogram(name: "foo")
> !2 = distinct !DISubprogram(name: "caller1")
> !3 = distinct !DISubprogram(name: "caller2")
> !4 = distinct !DISubprogram(name: "unrelated")
>
> !5 = !DILocation(line: 2, scope: !1, inlinedAt: !6)
> !6 = distinct !DILocation(line: 3, scope: !2)
> !7 = !DILocation(line: 3, scope: !2)
>
> !8 = !DILocation(line: 2, scope: !1, inlinedAt: !9)
> !9 = distinct !DILocation(line: 4, scope: !3)
> !10 = !DILocation(line: 4, scope: !3)
>
> !11 = !DILocation(line: 5, scope: !4)
> --
> In this example, !1 (subprogram for foo) is referenced from both
> @caller1 and @caller2, so we need some common place to save it.
>
> We still want to lazy-load it (and the local variables and lexical
> blocks underneath it), so that if the loader only wants @unrelated
> !1 (and its vars/blocks) never gets read.
>
> Right, that's what I realized halfway through writing the above
paragraph. But my remaining question (the last sentence) is what happens to
other debug metadata nodes that are referenced perhaps indirectly by
multiple DISubprograms? Some would be composite types, I guess they would
be placed in the METADATA_COMPOSITE_TYPE_BLOCK you mention below. But I
believe there are going to be some other uniqued nodes that are not
composite types and are referenced via multiple DISubprogram or other
composite types?It looks roughly like most things are either: (i) DILocations;
(ii) DISubprogram (definitions), DILexicalBocks, and
DILocalVariables; or (iii) DICompositeType, DISubprogram
declarations, and other parts of DICompositeType. I was thinking
we'd see what was left at that point, and then if necessary, find a
way to structure the rest.It's possible we'll want to emit some uniqued nodes multiple times,
sacrificing bitcode size for lazy-loading performance. We already
do this for constants in the IR: if a constant is not referenced
globally, it's emitted in each Function that references it. We
could do the same with, e.g., DISubroutineType.
Thanks, that makes sense.
Teresa
Teresa: I had trouble make this work with the delayed metadata parsing
(the logic around "SeenModuleValuesRecord"). My WIP patch rips that
out rather unceremoniously.
(Attached the patch for reference, but it needs to be cleaned up, split
up, etc.; not ready for review (although comments always welcome)).
My understanding from Mehdi is that maybe ThinLTO isn't currently
relying on the delayed parsing. I.e., the original scheme was:
- cherry-pick functions one at a time (never reading metadata), then
- come back at the end to read the metadata all at once.
But the scheme has evolved to:
- calculate the desired functions from each module, then
- load the module and link all the functions in one go.
Is that accurate? If so, I don't see remaining benefit in delaying the
global metadata parsing, just an extra code path to maintain. Do you
agree?
function-local-metadata-v1.patch (26.9 KB)
>
> 1. If a piece of metadata is referenced from only a single `Function`,
> serialize that metadata in the function's metadata block instead of
> the global metadata block.
>
> This addresses problems (a) and (b), primarily targeting
> `DILocation`s. It should pick up lots of other stuff, depending on
> how much inlining has happened.
>
> (I have a draft of the writer side, still working on the reader.)Teresa: I had trouble make this work with the delayed metadata parsing
(the logic around "SeenModuleValuesRecord"). My WIP patch rips that
out rather unceremoniously.(Attached the patch for reference, but it needs to be cleaned up, split
up, etc.; not ready for review (although comments always welcome)).My understanding from Mehdi is that maybe ThinLTO isn't currently
relying on the delayed parsing. I.e., the original scheme was:
- cherry-pick functions one at a time (never reading metadata), then
- come back at the end to read the metadata all at once.
But the scheme has evolved to:
- calculate the desired functions from each module, then
- load the module and link all the functions in one go.
Right, that is what the FunctionImporter logic has changed to. I was
keeping that support alive on the concern that for very large modules the
overhead of keeping all the source modules materialized while importing
decisions are made would be too high. But since I added a full
reference/call graph to the summary and Mehdi has sent a patch to change
the importer to do summary-based importing, that concern should be moot.
Is that accurate? If so, I don't see remaining benefit in delaying the
global metadata parsing, just an extra code path to maintain. Do you
agree?
I think we can take this support out at this point now. Note however that
llvm-link (used for testing) is still doing function at a time importing
and therefore relying on this support. However it shouldn't be hard to
rework that to collect all the imports from a given module for batch
importing.
Let me take a stab at moving llvm-link over to using batch importing
hopefully today or tomorrow, then at least no tests will rely on the
post-pass metadata importing and I can pull that out independently.
Teresa
Great.
Thank you!
>
> 1. If a piece of metadata is referenced from only a single `Function`,
> serialize that metadata in the function's metadata block instead of
> the global metadata block.
>
> This addresses problems (a) and (b), primarily targeting
> `DILocation`s. It should pick up lots of other stuff, depending on
> how much inlining has happened.
>
> (I have a draft of the writer side, still working on the reader.)Teresa: I had trouble make this work with the delayed metadata parsing
(the logic around "SeenModuleValuesRecord"). My WIP patch rips that
out rather unceremoniously.(Attached the patch for reference, but it needs to be cleaned up, split
up, etc.; not ready for review (although comments always welcome)).My understanding from Mehdi is that maybe ThinLTO isn't currently
relying on the delayed parsing. I.e., the original scheme was:
- cherry-pick functions one at a time (never reading metadata), then
- come back at the end to read the metadata all at once.
But the scheme has evolved to:
- calculate the desired functions from each module, then
- load the module and link all the functions in one go.Right, that is what the FunctionImporter logic has changed to. I was
keeping that support alive on the concern that for very large modules the
overhead of keeping all the source modules materialized while importing
decisions are made would be too high. But since I added a full
reference/call graph to the summary and Mehdi has sent a patch to change
the importer to do summary-based importing, that concern should be moot.Is that accurate? If so, I don't see remaining benefit in delaying the
global metadata parsing, just an extra code path to maintain. Do you
agree?I think we can take this support out at this point now.
Great.
Note however that llvm-link (used for testing) is still doing function at
a time importing and therefore relying on this support. However it
shouldn't be hard to rework that to collect all the imports from a given
module for batch importing.Let me take a stab at moving llvm-link over to using batch importing
hopefully today or tomorrow, then at least no tests will rely on the
post-pass metadata importing and I can pull that out independently.Thank you!
Removed from llvm-link in r264326.
Teresa
>
> 1. If a piece of metadata is referenced from only a single `Function`,
> serialize that metadata in the function's metadata block instead of
> the global metadata block.
>
> This addresses problems (a) and (b), primarily targeting
> `DILocation`s. It should pick up lots of other stuff, depending on
> how much inlining has happened.
>
> (I have a draft of the writer side, still working on the reader.)Teresa: I had trouble make this work with the delayed metadata parsing
(the logic around "SeenModuleValuesRecord"). My WIP patch rips that
out rather unceremoniously.(Attached the patch for reference, but it needs to be cleaned up, split
up, etc.; not ready for review (although comments always welcome)).My understanding from Mehdi is that maybe ThinLTO isn't currently
relying on the delayed parsing. I.e., the original scheme was:
- cherry-pick functions one at a time (never reading metadata), then
- come back at the end to read the metadata all at once.
But the scheme has evolved to:
- calculate the desired functions from each module, then
- load the module and link all the functions in one go.Right, that is what the FunctionImporter logic has changed to. I was keeping that support alive on the concern that for very large modules the overhead of keeping all the source modules materialized while importing decisions are made would be too high. But since I added a full reference/call graph to the summary and Mehdi has sent a patch to change the importer to do summary-based importing, that concern should be moot.
Is that accurate? If so, I don't see remaining benefit in delaying the
global metadata parsing, just an extra code path to maintain. Do you
agree?I think we can take this support out at this point now.
Great.
Note however that llvm-link (used for testing) is still doing function at a time importing and therefore relying on this support. However it shouldn't be hard to rework that to collect all the imports from a given module for batch importing.
Let me take a stab at moving llvm-link over to using batch importing hopefully today or tomorrow, then at least no tests will rely on the post-pass metadata importing and I can pull that out independently.
Thank you!
Removed from llvm-link in r264326.
I removed the use of it in r264378.
>
>
>
>
>
>
>>
>>
>>
>> >
>> > 1. If a piece of metadata is referenced from only a single `Function`,
>> > serialize that metadata in the function's metadata block instead of
>> > the global metadata block.
>> >
>> > This addresses problems (a) and (b), primarily targeting
>> > `DILocation`s. It should pick up lots of other stuff, depending on
>> > how much inlining has happened.
>> >
>> > (I have a draft of the writer side, still working on the reader.)
>>
>> Teresa: I had trouble make this work with the delayed metadata parsing
>> (the logic around "SeenModuleValuesRecord"). My WIP patch rips that
>> out rather unceremoniously.
>>
>> (Attached the patch for reference, but it needs to be cleaned up, split
>> up, etc.; not ready for review (although comments always welcome)).
>>
>> My understanding from Mehdi is that maybe ThinLTO isn't currently
>> relying on the delayed parsing. I.e., the original scheme was:
>> - cherry-pick functions one at a time (never reading metadata), then
>> - come back at the end to read the metadata all at once.
>> But the scheme has evolved to:
>> - calculate the desired functions from each module, then
>> - load the module and link all the functions in one go.
>>
>> Right, that is what the FunctionImporter logic has changed to. I was
keeping that support alive on the concern that for very large modules the
overhead of keeping all the source modules materialized while importing
decisions are made would be too high. But since I added a full
reference/call graph to the summary and Mehdi has sent a patch to change
the importer to do summary-based importing, that concern should be moot.
>>
>> Is that accurate? If so, I don't see remaining benefit in delaying the
>> global metadata parsing, just an extra code path to maintain. Do you
>> agree?
>>
>>
>> I think we can take this support out at this point now.
>
> Great.
>
>> Note however that llvm-link (used for testing) is still doing function
at a time importing and therefore relying on this support. However it
shouldn't be hard to rework that to collect all the imports from a given
module for batch importing.
>>
>> Let me take a stab at moving llvm-link over to using batch importing
hopefully today or tomorrow, then at least no tests will rely on the
post-pass metadata importing and I can pull that out independently.
>
> Thank you!
>
> Removed from llvm-link in r264326.I removed the use of it in r264378.
Thanks. That is only part of the post-pass metadata linking, I'll remove
the rest.
Teresa
>
>
>
>
>
>
>>
>>
>>
>> >
>> > 1. If a piece of metadata is referenced from only a single
`Function`,
>> > serialize that metadata in the function's metadata block instead
of
>> > the global metadata block.
>> >
>> > This addresses problems (a) and (b), primarily targeting
>> > `DILocation`s. It should pick up lots of other stuff, depending
on
>> > how much inlining has happened.
>> >
>> > (I have a draft of the writer side, still working on the reader.)
>>
>> Teresa: I had trouble make this work with the delayed metadata parsing
>> (the logic around "SeenModuleValuesRecord"). My WIP patch rips that
>> out rather unceremoniously.
>>
>> (Attached the patch for reference, but it needs to be cleaned up, split
>> up, etc.; not ready for review (although comments always welcome)).
>>
>> My understanding from Mehdi is that maybe ThinLTO isn't currently
>> relying on the delayed parsing. I.e., the original scheme was:
>> - cherry-pick functions one at a time (never reading metadata), then
>> - come back at the end to read the metadata all at once.
>> But the scheme has evolved to:
>> - calculate the desired functions from each module, then
>> - load the module and link all the functions in one go.
>>
>> Right, that is what the FunctionImporter logic has changed to. I was
keeping that support alive on the concern that for very large modules the
overhead of keeping all the source modules materialized while importing
decisions are made would be too high. But since I added a full
reference/call graph to the summary and Mehdi has sent a patch to change
the importer to do summary-based importing, that concern should be moot.
>>
>> Is that accurate? If so, I don't see remaining benefit in delaying the
>> global metadata parsing, just an extra code path to maintain. Do you
>> agree?
>>
>>
>> I think we can take this support out at this point now.
>
> Great.
>
>> Note however that llvm-link (used for testing) is still doing function
at a time importing and therefore relying on this support. However it
shouldn't be hard to rework that to collect all the imports from a given
module for batch importing.
>>
>> Let me take a stab at moving llvm-link over to using batch importing
hopefully today or tomorrow, then at least no tests will rely on the
post-pass metadata importing and I can pull that out independently.
>
> Thank you!
>
> Removed from llvm-link in r264326.I removed the use of it in r264378.
Thanks. That is only part of the post-pass metadata linking, I'll remove
the rest.
Removed the rest in r264763.
Teresa
+pcc, who had some other ideas/patch out for improving memory usage of debug info
+Reid, who's responsible for the windows/CodeView/PDB debug info which is motivating some of the ideas about changes to type emissionSo I discussed this with Adrian and Mehdi at the social last Thursday and I'm getting set to finish the write up. I think it'll have some bearing on this proposal as I think it'll change how we want to take a look at the format of DISubprogram metadata a bit more.
(The interesting bit here is to make a clearer split between
DISubprogram declarations (part of the type hierarchY) and
DISubprogram definitions (part of the code/line table/variable
locations). I think that'll end up being mostly orthogonal to what
I'm trying to do.)
That said, most of it is orthogonal to the changes Duncan is talking about here. Just puts the pressure on to get the other proposal written up.
Which is now here:
http://lists.llvm.org/pipermail/llvm-dev/2016-March/097773.html
Baking into the IR more about types as units has pretty direct overlap with Reid/CodeView/etc - so, yeah, that'll takes ome discussion (but, as you say, it's not in your immediate plan anyway, so we can come back to that - but would be good for whoever gets there first to discuss it with the others)
After thinking for a few days, I don't think this will bake anything
new into the IR. If anything it removes a few special cases.
More at the bottom.
Motivation
Based on some analysis Mehdi ran (ping him for details), there are three
(related) compile-time bottlenecks we're seeing with `-flto=thin -g`:a) Reading the large number of Metadata bitcode records in the global
metadata block. I'm talking about raw `BitStreamer` calls here.b) Creating unnecessary `DI*` instances (that aren't relevant to code).
c) Emitting unnecessary `DI*` instances (that aren't relevant to code).
Here is my recollection of some peak memory stats on a small testcase
during thin-LTO, which should be a decent indicator of (b):- ~150MB: DILocation
- ~100MB: DISubprogram
- ~70MB: DILocalVariable
- ~50MB: (cumulative) DIType descendentsIt looks, suprisingly, like types are not the primary bottleneck.
(Probably wrong for (a), BTW. Caveats matter.)
There are caveats:
- `DISubprogram` declarations -- member function descriptors -- are
part of the type hierarchy.
- Most of the type hierarchy gets uniqued at parse time.
- As a result, these data are a poor indicator for (a).
((a) is the main bottleneck for compile-time of -flto=thin (since it's
quadratic in the number of files). (b) only affects memory. Also
important, but at least it scales linearly.)
Even so, non-types are substantial.
Proposal
Short version
-------------4. Remove `DICompositeType`s from `retainedTypes:`, similar to (2).
This is the part that's relevant to the new RFC Eric just posted.
Long version
-------------4. Implement my proposal to remove the `DICompositeType` name map from
`retainedTypes:`.http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20160125/327936.html
Similar to (2) above, this will naturally filter the types that get
linked in to the ones actually used by the code being linked.It should also allow the reader to skip records for types that have
already been loaded in the main module.
The essential things I want to accomplish with this part:
- Make `type:` operands less special: instead of referencing types
indirectly through MDString, point directly at the type node.
- Stop using `retainedTypes:` by default (only for -gfull, etc.).
- Avoid blowing up memory in -flto=full (which converting MDString
references back to pointers would do naively, through
re-introducing cycles). Note that this needs to be handled
somehow at BitcodeReader time.
After chatting with Eric, I don't think this conflicts at all with the
other RFC. Unifying the `type:` operands might actually help both.
One good point David mentioned last week was that we don't want to
teach the IR any more about types. Rather than inventing some new
context (as I suggested originally), I figure LTO plugins can just
pass a (StringRef -> DIType*) map to the BitcodeReader, and the Module
itself doesn't need to know anything about it.
I have no objections to any of this FWIW
-eric