[RFC] Less memory and greater maintainability for debug info IR

In r219010, I merged integer and string fields into a single header
field. By reducing the number of metadata operands used in debug info,
this saved 2.2GB on an `llvm-lto` bootstrap. I've done some profiling
of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and
I've concluded that they will be insufficient.

Instead, I'd like to implement a more aggressive plan, which as a
side-effect cleans up the much "loved" debug info IR assembly syntax.

At a high-level, the idea is to create distinct subclasses of `Value`
for each debug info concept, starting with line table entries and moving
on to the DIDescriptor hierarchy. By leveraging the use-list
infrastructure for metadata operands -- i.e., only using value handles
for non-metadata operands -- we'll improve memory usage and increase
RAUW speed.

My rough plan follows. I quote some numbers for memory savings below
based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto`
on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's
-save-temps option) that currently peaks at 15.3GB.

1. Introduce `MDUser`, which inherits from `User`, and whose `Use`s
    must all be metadata. The cost per operand is 1 pointer, vs. 4
    pointers in an `MDNode`.

2. Create `MDLineTable` as the first subclass of `MDUser`. Use normal
    fields (not `Value`s) for the line and column, and use `Use`
    operands for the metadata operands.

    On x86-64, this will save 104B / line table entry. Linking
    `llvm-lto` uses ~7M line-table entries, so this on its own saves
    ~700MB.

    Sketch of class definition:

        class MDLineTable : public MDUser {
          unsigned Line;
          unsigned Column;
        public:
          static MDLineTable *get(unsigned Line, unsigned Column,
                                  MDNode *Scope);
          static MDLineTable *getInlined(MDLineTable *Base, MDNode *Scope);
          static MDLineTable *getBase(MDLineTable *Inlined);

          unsigned getLine() const { return Line; }
          unsigned getColumn() const { return Column; }
          bool isInlined() const { return getNumOperands() == 2; }
          MDNode *getScope() const { return getOperand(0); }
          MDNode *getInlinedAt() const { return getOperand(1); }
        };

    Proposed assembly syntax:

        ; Not inlined.
        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9)

        ; Inlined.
        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9,
                                   inlinedAt: metadata !10)

        ; Column defaulted to 0.
        !7 = metadata !MDLineTable(line: 45, scope: metadata !9)

    (What colour should that bike shed be?)

3. (Optional) Rewrite `DebugLoc` lookup tables. My profiling shows
    that we have 3.5M entries in the `DebugLoc` side-vectors for 7M line
    table entries. The cost of these is ~180B each, for another
    ~600MB.

    If we integrate a side-table of `MDLineTable`s into its uniquing,
    the overhead is only ~12B / line table entry, or ~80MB. This saves
    520MB.

    This is somewhat perpendicular to redesigning the metadata format,
    but IMO it's worth doing as soon as it's possible.

4. Create `GenericDebugMDNode`, a transitional subclass of `MDUser`
    through an intermediate class `DebugMDNode` with an
    allocation-time-optional `CallbackVH` available for referencing
    non-metadata. Change `DIDescriptor` to wrap a `DebugMDNode` instead
    of an `MDNode`.

    This saves another ~960MB, for a running total of ~2GB.

    Proposed assembly syntax:

        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_compile_unit,
                                          fields: "0\00clang 3.6\00...",
                                          operands: { metadata !8, ... })

        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_variable,
                                          fields: "global_var\00...",
                                          operands: { metadata !8, ... },
                                          handle: i32* @global_var)

    This syntax pulls the tag out of the current header-string, calls
    the rest of the header "fields", and includes the metadata operands
    in "operands".

5. Incrementally create subclasses of `DebugMDNode`, such as
    `MDCompileUnit` and `MDSubprogram`. Sub-classed nodes replace the
    "fields" and "operands" catch-alls with explicit names for each
    operand.

    Proposed assembly syntax:

        !7 = metadata !MDSubprogram(line: 45, name: "foo", displayName: "foo",
                                    linkageName: "_Z3foov", file: metadata !8,
                                    function: i32 (i32)* @foo)

6. Remove the dead code for `GenericDebugMDNode`.

7. (Optional) Refactor `DebugMDNode` sub-classes to minimize RAUW
    traffic during bitcode serialization. Now that metadata types are
    known, we can write debug info out in an order that makes it cheap
    to read back in.

    Note that using `MDUser` will make RAUW much cheaper, since we're
    using the use-list infrastructure for most of them. If RAUW isn't
    showing up in a profile, I may skip this.
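
As a rough cross-check of the savings quoted in steps 2 to 4, the arithmetic can be written out as below. This is only a sketch: the function and constant names are illustrative (not LLVM APIs), the inputs are the approximate figures from the text, and reading the ~2GB running total as the sum of steps 2, 3 and 4 is an assumption.

```cpp
#include <cstdint>

// Decimal megabytes, matching the rounded figures quoted in the plan.
constexpr std::uint64_t MB = 1000 * 1000;

// Step 2: ~104 bytes saved per line table entry, ~7M entries.
constexpr std::uint64_t lineTableSavings() { return 104ull * 7000000ull; }

// Step 3: ~3.5M DebugLoc side-vector entries at ~180 bytes each today,
// versus ~12 bytes of overhead per line table entry afterwards.
constexpr std::uint64_t debugLocCostNow() { return 180ull * 3500000ull; }
constexpr std::uint64_t debugLocCostAfter() { return 12ull * 7000000ull; }

// Running total after step 4, using the rounded per-step savings
// (~700MB + ~520MB + ~960MB). Including the optional step 3 here is an
// assumption about how the ~2GB figure was reached.
constexpr std::uint64_t runningTotalMB() { return 700 + 520 + 960; }

static_assert(lineTableSavings() / MB == 728, "quoted as ~700MB");
static_assert(debugLocCostNow() / MB == 630, "quoted as ~600MB");
static_assert(debugLocCostAfter() / MB == 84, "quoted as ~80MB");
static_assert(runningTotalMB() == 2180, "quoted as ~2GB");
```

The unrounded products land slightly above the quoted figures, which is consistent with the "~" on each of them.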

Does this direction seem reasonable? Any major problems I've missed?

In r219010, I merged integer and string fields into a single header
field. By reducing the number of metadata operands used in debug info,
this saved 2.2GB on an `llvm-lto` bootstrap. I've done some profiling
of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and
I've concluded that they will be insufficient.

Could you explain what your end-goal here looked like and what data you
used to evaluate its insufficiency?

Just to be clear, what I was picturing was that, starting with your initial
improvement, we'd string-ify more data in the records but eventually we'd
start stringifying across records (eg: rolling a DW_TAG_structure_type's
members into the structure type itself, one big string). In the end we'd
just pull out the non-metadata references (like the llvm::Function* in the
DW_TAG_subroutine_type metadata) into a table kept separately from a
handful of big strings of debug info (I say a handful, as we'd keep the
types separate so they could be easily deduplicated).

Instead, I'd like to implement a more aggressive plan, which as a
side-effect cleans up the much "loved" debug info IR assembly syntax.

At a high-level, the idea is to create distinct subclasses of `Value`
for each debug info concept,

My concern with this is baking parts of our current debug info
representation into IR constructs seems rather heavyweight. If we need to
add first class IR constructs to cope with debug info I'd hope to find,
ideally, one, general purpose extension we can use for this (& possibly for
other things). But maybe the bar for adding first class IR constructs is
lower than I've imagined it to be.

starting with line table entries and moving
on to the DIDescriptor hierarchy. By leveraging the use-list
infrastructure for metadata operands -- i.e., only using value handles
for non-metadata operands -- we'll improve memory usage and increase
RAUW speed.

My rough plan follows. I quote some numbers for memory savings below
based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto`
on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's
-save-temps option) that currently peaks at 15.3GB.

1. Introduce `MDUser`, which inherits from `User`, and whose `Use`s
    must all be metadata. The cost per operand is 1 pointer, vs. 4
    pointers in an `MDNode`.

Perhaps a generic MD-only-node might be a sufficiently generically valuable
IR construct.

A similar alternative: A schematized metadata node. Much like DWARF, being
able to say "this node is of some type T, defined elsewhere in the module -
string, int, string, string, etc... ". Heck, this could even be just a
generic improvement to llvm IR, maybe? (the textual representation might
not need to change at all - IR Generation would just do much like DWARF
generation in LLVM does - create abbreviation/type descriptions on the fly
and share them rather than having every metadata node include its own
self-description)

2. Create `MDLineTable` as the first subclass of `MDUser`. Use normal
    fields (not `Value`s) for the line and column, and use `Use`
    operands for the metadata operands.

    On x86-64, this will save 104B / line table entry. Linking
    `llvm-lto` uses ~7M line-table entries, so this on its own saves
    ~700MB.

    Sketch of class definition:

        class MDLineTable : public MDUser {
          unsigned Line;
          unsigned Column;
        public:
          static MDLineTable *get(unsigned Line, unsigned Column,
                                  MDNode *Scope);
          static MDLineTable *getInlined(MDLineTable *Base, MDNode *Scope);
          static MDLineTable *getBase(MDLineTable *Inlined);

          unsigned getLine() const { return Line; }
          unsigned getColumn() const { return Column; }
          bool isInlined() const { return getNumOperands() == 2; }
          MDNode *getScope() const { return getOperand(0); }
          MDNode *getInlinedAt() const { return getOperand(1); }
        };

    Proposed assembly syntax:

        ; Not inlined.
        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9)

        ; Inlined.
        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9,
                                   inlinedAt: metadata !10)

        ; Column defaulted to 0.
        !7 = metadata !MDLineTable(line: 45, scope: metadata !9)

    (What colour should that bike shed be?)

3. (Optional) Rewrite `DebugLoc` lookup tables. My profiling shows
    that we have 3.5M entries in the `DebugLoc` side-vectors for 7M line
    table entries. The cost of these is ~180B each, for another
    ~600MB.

    If we integrate a side-table of `MDLineTable`s into its uniquing,
    the overhead is only ~12B / line table entry, or ~80MB. This saves
    520MB.

    This is somewhat perpendicular to redesigning the metadata format,
    but IMO it's worth doing as soon as it's possible.

4. Create `GenericDebugMDNode`, a transitional subclass of `MDUser`
    through an intermediate class `DebugMDNode` with an
    allocation-time-optional `CallbackVH` available for referencing
    non-metadata. Change `DIDescriptor` to wrap a `DebugMDNode` instead
    of an `MDNode`.

    This saves another ~960MB,

960 from what?

for a running total of ~2GB.

~2GB is the total of what? (you mention a lot of numbers in this post, but
it's not always clear what they're relative to/out of/subtracted from)

    Proposed assembly syntax:

        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_compile_unit,
                                          fields: "0\00clang 3.6\00...",
                                          operands: { metadata !8, ... })

        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_variable,
                                          fields: "global_var\00...",
                                          operands: { metadata !8, ... },
                                          handle: i32* @global_var)

    This syntax pulls the tag out of the current header-string, calls
    the rest of the header "fields", and includes the metadata operands
    in "operands".

5. Incrementally create subclasses of `DebugMDNode`, such as
    `MDCompileUnit` and `MDSubprogram`. Sub-classed nodes replace the
    "fields" and "operands" catch-alls with explicit names for each
    operand.

I wouldn't mind seeing how expensive it would be if these schema
descriptions were within the module itself - so we didn't have to bake them
into the IR spec, but could still share them between every usage within a
module.

I think making debug info more of a first-class IR citizen is probably the way to go. Right now debug info is completely unreadable and is downright opposed to the design goals of the IR as I understand them.

Our backwards compatibility policy should give you the flexibility you need to update the debug info representation as you go along: http://llvm.org/docs/DeveloperPolicy.html#id18

I think making debug info more of a first-class IR citizen is probably the
way to go. Right now debug info is completely unreadable and is downright
opposed to the design goals of the IR as I understand them.

I'm still not sure this would produce particularly more legible, let alone
writeable, debug info IR. It's possible, certainly, if the schema was baked
into IR reading and writing, that we could pretty print it with annotated
field names and allow writing the debug info with omitted fields (because
the parser would know that this was, say, a subprogram record, and be able
to reorder fields to the required schema or add default values for omitted
fields), but I'm not sure we'd get that far nor whether it would really tip
debug info to the point of writeability - it's still necessarily a format
that describes code, which tends towards being more ungainly than the code
itself. ("this thing is on line 42" rather than "thing" written on line 42)

I'd have to see examples & promises of where this would go/what value it
would add, but I'd still be fairly concerned about the ongoing costs.

Our backwards compatibility policy should give you the flexibility you
need to update the debug info representation as you go along:
http://llvm.org/docs/DeveloperPolicy.html#id18

It's a rather heavy burden to carry. Currently we have a much lighter cost
to changing the debug info schema (rev the version number - any debug info
with an older version number is dropped on sight).

In r219010, I merged integer and string fields into a single header
field. By reducing the number of metadata operands used in debug info,
this saved 2.2GB on an `llvm-lto` bootstrap. I've done some profiling
of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and
I've concluded that they will be insufficient.

Could you explain what your end-goal here looked like and what data you used to evaluate its insufficiency?

In the links of C++ programs I've looked at, most `Value`s are line
tables and local variables. E.g., for the `llvm-lto.lto.bc` case
I've used for memory numbers:

  - 23967800 Value
      - 16837368 MDNode
          - 7611669 DIDescriptor
              - 4373879 DW_TAG_arg_variable
              - 1341021 DW_TAG_subprogram
              - 554992 DW_TAG_auto_variable
              - 360390 DW_TAG_lexical_block
              - 354166 DW_TAG_subroutine_type
          - 7500000 line table entries
      - 5850877 User
      - 693869 MDString

IIUC, line tables and local variables need to be referenced directly
from the rest of the IR, so they can't be sunk into other nodes.

Relevant to your question, I didn't see a way to sufficiently decrease
the numbers of these (or the number of their operands).

Just to be clear, what I was picturing was that, starting with your initial improvement, we'd string-ify more data in the records but eventually we'd start stringifying across records (eg: rolling a DW_TAG_structure_type's members into the structure type itself, one big string). In the end we'd just pull out the non-metadata references (like the llvm::Function* in the DW_TAG_subroutine_type metadata) into a table kept separately from a handful of big strings of debug info (I say a handful, as we'd keep the types separate so they could be easily deduplicated).

I was thinking along the same lines. Unfortunately, there aren't
enough types left for that to make a big impact.

Unless you envisioned a completely different way of dealing with
`@llvm.dbg.value` and `!dbg` references?

Instead, I'd like to implement a more aggressive plan, which as a
side-effect cleans up the much "loved" debug info IR assembly syntax.

At a high-level, the idea is to create distinct subclasses of `Value`
for each debug info concept,

My concern with this is baking parts of our current debug info representation into IR constructs seems rather heavyweight. If we need to add first class IR constructs to cope with debug info I'd hope to find, ideally, one, general purpose extension we can use for this (& possibly for other things). But maybe the bar for adding first class IR constructs is lower than I've imagined it to be.

Since 75% of all `Value`s are debug info, representing them well
seems worthwhile to me.
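
For a rough sense of where that share comes from, the counts in the breakdown above can be divided out as follows. This is a sketch only: `percentOf` is a hypothetical helper, and treating every `MDNode` and `MDString` as debug info is an assumption.

```cpp
#include <cstdint>

// Counts copied from the llvm-lto.lto.bc breakdown above.
constexpr std::uint64_t Values = 23967800;
constexpr std::uint64_t MDNodes = 16837368;
constexpr std::uint64_t MDStrings = 693869;

// Integer percentage of Part within Whole, rounded down.
constexpr std::uint64_t percentOf(std::uint64_t Part, std::uint64_t Whole) {
  return Part * 100 / Whole;
}

// MDNodes alone are about 70% of all Values; adding MDStrings
// (assumed here to be debug info as well) gives about 73%.
static_assert(percentOf(MDNodes, Values) == 70, "MDNode share");
static_assert(percentOf(MDNodes + MDStrings, Values) == 73,
              "MDNode + MDString share");
```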

starting with line table entries and moving
on to the DIDescriptor hierarchy. By leveraging the use-list
infrastructure for metadata operands -- i.e., only using value handles
for non-metadata operands -- we'll improve memory usage and increase
RAUW speed.

My rough plan follows.

(Note the following sentence, which I think you missed.)

I quote some numbers for memory savings below
based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto`
on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's
-save-temps option) that currently peaks at 15.3GB.

1. Introduce `MDUser`, which inherits from `User`, and whose `Use`s
    must all be metadata. The cost per operand is 1 pointer, vs. 4
    pointers in an `MDNode`.

Perhaps a generic MD-only-node might be a sufficiently generically valuable IR construct.

A similar alternative: A schematized metadata node. Much like DWARF, being able to say "this node is of some type T, defined elsewhere in the module - string, int, string, string, etc... ". Heck, this could even be just a generic improvement to llvm IR, maybe? (the textual representation might not need to change at all - IR Generation would just do much like DWARF generation in LLVM does - create abbreviation/type descriptions on the fly and share them rather than having every metadata node include its own self-description)

"Being generic" seems like a defect to me, not a feature. If you need
to add support for every IR construct to the backend to emit DIEs, etc.,
then what's the benefit in being able to express arbitrary other things?

2. Create `MDLineTable` as the first subclass of `MDUser`. Use normal
    fields (not `Value`s) for the line and column, and use `Use`
    operands for the metadata operands.

    On x86-64, this will save 104B / line table entry. Linking
    `llvm-lto` uses ~7M line-table entries, so this on its own saves
    ~700MB.

    Sketch of class definition:

        class MDLineTable : public MDUser {
          unsigned Line;
          unsigned Column;
        public:
          static MDLineTable *get(unsigned Line, unsigned Column,
                                  MDNode *Scope);
          static MDLineTable *getInlined(MDLineTable *Base, MDNode *Scope);
          static MDLineTable *getBase(MDLineTable *Inlined);

          unsigned getLine() const { return Line; }
          unsigned getColumn() const { return Column; }
          bool isInlined() const { return getNumOperands() == 2; }
          MDNode *getScope() const { return getOperand(0); }
          MDNode *getInlinedAt() const { return getOperand(1); }
        };

    Proposed assembly syntax:

        ; Not inlined.
        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9)

        ; Inlined.
        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9,
                                   inlinedAt: metadata !10)

        ; Column defaulted to 0.
        !7 = metadata !MDLineTable(line: 45, scope: metadata !9)

    (What colour should that bike shed be?)

3. (Optional) Rewrite `DebugLoc` lookup tables. My profiling shows
    that we have 3.5M entries in the `DebugLoc` side-vectors for 7M line
    table entries. The cost of these is ~180B each, for another
    ~600MB.

    If we integrate a side-table of `MDLineTable`s into its uniquing,
    the overhead is only ~12B / line table entry, or ~80MB. This saves
    520MB.

    This is somewhat perpendicular to redesigning the metadata format,
    but IMO it's worth doing as soon as it's possible.

4. Create `GenericDebugMDNode`, a transitional subclass of `MDUser`
    through an intermediate class `DebugMDNode` with an
    allocation-time-optional `CallbackVH` available for referencing
    non-metadata. Change `DIDescriptor` to wrap a `DebugMDNode` instead
    of an `MDNode`.

    This saves another ~960MB,

960 from what?

This number references the sentence noted above.

for a running total of ~2GB.

~2GB is the total of what? (you mention a lot of numbers in this post, but it's not always clear what they're relative to/out of/subtracted from)

This number references the sentence noted above.

    Proposed assembly syntax:

        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_compile_unit,
                                          fields: "0\00clang 3.6\00...",
                                          operands: { metadata !8, ... })

        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_variable,
                                          fields: "global_var\00...",
                                          operands: { metadata !8, ... },
                                          handle: i32* @global_var)

    This syntax pulls the tag out of the current header-string, calls
    the rest of the header "fields", and includes the metadata operands
    in "operands".

5. Incrementally create subclasses of `DebugMDNode`, such as
    `MDCompileUnit` and `MDSubprogram`. Sub-classed nodes replace the
    "fields" and "operands" catch-alls with explicit names for each
    operand.

I wouldn't mind seeing how expensive it would be if these schema descriptions were within the module itself - so we didn't have to bake them into the IR spec, but could still share them between every usage within a module.

It's already baked into the IR spec, since the backend needs to
understand debug info to emit it. We might as well understand what
exactly we're representing by formalizing it.

This has been convenient while we pay off some of the technical debt in the area of debug info, but at some point, we need to do better than that. Sooner or later, we will need to support a stable debug format in the IR. This proposal seems like a step in the right direction.

Personally, I find this:

    !7 = metadata !MDLineTable(line: 45, column: 7, scope: !9)

significantly more readable (and writeable) than this:

    !7 = metadata !{i32 45, i32 7, metadata !9, null}

The gains are even greater for the DIDescriptors, which have more
operands and are more connected.

Or have I missed your point?

>> In r219010, I merged integer and string fields into a single header
>> field. By reducing the number of metadata operands used in debug info,
>> this saved 2.2GB on an `llvm-lto` bootstrap. I've done some profiling
>> of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and
>> I've concluded that they will be insufficient.
>>
> Could you explain what your end-goal here looked like and what data you
> used to evaluate its insufficiency?

In the links of C++ programs I've looked at, most `Value`s are line
tables and local variables. E.g., for the `llvm-lto.lto.bc` case
I've used for memory numbers:

  - 23967800 Value
      - 16837368 MDNode
          - 7611669 DIDescriptor
              - 4373879 DW_TAG_arg_variable
              - 1341021 DW_TAG_subprogram
              - 554992 DW_TAG_auto_variable
              - 360390 DW_TAG_lexical_block
              - 354166 DW_TAG_subroutine_type
          - 7500000 line table entries
      - 5850877 User
      - 693869 MDString

I would like to see the same thing, but where the numbers indicate total
memory used in each category, instead of the count of entries in each
category.

-- Sean Silva

For those interested, I’ve attached some pie charts based on Duncan’s data in one of the other posts; successive slides break down the usage increasingly finely. To my understanding, they represent the number of Value’s (and subclasses) allocated.

DebugInfoSize.pdf (106 KB)

For those interested, I've attached some pie charts based on Duncan's data
in one of the other posts; successive slides break down the usage
increasingly finely. To my understanding, they represent the number of
Value's (and subclasses) allocated.

In r219010, I merged integer and string fields into a single header
field. By reducing the number of metadata operands used in debug info,
this saved 2.2GB on an `llvm-lto` bootstrap. I've done some profiling
of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and
I've concluded that they will be insufficient.

Instead, I'd like to implement a more aggressive plan, which as a
side-effect cleans up the much "loved" debug info IR assembly syntax.

At a high-level, the idea is to create distinct subclasses of `Value`
for each debug info concept, starting with line table entries and moving
on to the DIDescriptor hierarchy. By leveraging the use-list
infrastructure for metadata operands -- i.e., only using value handles
for non-metadata operands -- we'll improve memory usage and increase
RAUW speed.

My rough plan follows. I quote some numbers for memory savings below
based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto`
on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's
-save-temps option) that currently peaks at 15.3GB.

Stupid question, but when I was working on LTO last Summer the primary
culprit for excessive memory use was due to us not being smart when linking
the IR together (Espindola would know more details). Do we still have that
problem? For starters, how does the memory usage of just llvm-link compare
to the memory usage of the actual LTO run? If the issue I was seeing last
Summer is still there, you should see that the invocation of llvm-link is
actually the most memory-intensive part of the LTO step, by far.

This is vague. Could you be more specific on where you saw all of the memory?

-eric

Stupid question, but when I was working on LTO last Summer the primary culprit for excessive memory use was due to us not being smart when linking the IR together (Espindola would know more details). Do we still have that problem? For starters, how does the memory usage of just llvm-link compare to the memory usage of the actual LTO run? If the issue I was seeing last Summer is still there, you should see that the invocation of llvm-link is actually the most memory-intensive part of the LTO step, by far.

To be clear, I'm running the command-line:

    $ llvm-lto -exported-symbol _main llvm-lto.lto.bc

Since this is a pre-linked bitcode file, we shouldn't be wasting much
memory from the linking stage.

Running ld64 directly gives a peak memory footprint of ~30GB for the
full link, so there's something else going on there that I'll be
digging into later.

2GB (out of 15.3GB i.e. ~13%) seems pretty pathetic savings when we have a single pie slice near 40% of the # of Value's allocated and another at 21%. Especially this being "step 4".

15.3GB is the peak memory of `llvm-lto`. This comes late in the
process, after DIEs have been created. I haven't looked in detail past
debug info metadata, but here's a sketch of what I imagine is in memory
at this point.

  - The IR, including uniquing side-tables.
  - Optimization and backend passes.
  - Parts of SelectionDAG that haven't been freed.
  - `MachineFunction`s and everything inside them.
  - Whatever state the `AsmPrinter`, etc., need.

I expect to look at a couple of other debug-info-related memory usage
areas once I've shrunk the metadata:

  - What's the total footprint of DIEs? This run has 4M of them, whose
    allocated footprint is ~1GB. I'm hoping that a deeper look will
    reveal an even larger attack surface.

  - How much do debug info intrinsics cost? They show up in at least
    three forms -- IR-level, SDNodes, and MachineInstrs -- and there
    can be a lot of them. How many? What's their footprint?

For now, I'm focusing on the problem I've already identified.

You need more data. Right now you have essentially one data point,

I looked at a number of internal C and C++ programs with -flto -g, and
dug deeply into llvm-lto.lto.bc because it's small enough that it's easy
to analyze (and its runtime profile was representative of the other C++
programs I was looking at).

I didn't look deeply at a broad spectrum, but memory usage and runtime
for building clang with -flto -g is something we care a fair bit about.

and it's not even clear what you measured really. If your goal is saving memory, I would expect at least a pie chart that breaks down LLVM's memory usage (not just # of allocations of different sorts; an approximation is fine, as long as you explain how you arrived at it and in what sense it approximates the true number).

I'm not sure there's value in diving deeply into everything at once.
I've identified one of the bottlenecks, so I'd like to improve it before
digging into the others.

Here's some visibility into where my numbers come from.

I got the 15.3GB from a profile of memory usage vs. time. Peak usage
comes late in the process, around when DIEs are being dealt with.

Metadata node counts stabilize much earlier in the process. The rest of
the numbers are based on counting `MDNodes` and their respective
`MDNodeOperands`, and multiplying by the cost of their operands. Here's
a dump from around the peak metadata node count:

    LineTables = 7500000[30000000], InlinedLineTables = 6756182, Directives = 7611669[42389128], Arrays = 570609[577447], Others = 1176556[5133065]
    Tag = 256, Count = 554992, Ops = 2531428, Name = DW_TAG_auto_variable
    Tag = 16647, Count = 988, Ops = 4940, Name = DW_TAG_GNU_template_parameter_pack
    Tag = 52, Count = 9933, Ops = 59598, Name = DW_TAG_variable
    Tag = 33, Count = 190, Ops = 190, Name = DW_TAG_subrange_type
    Tag = 59, Count = 1, Ops = 3, Name = DW_TAG_unspecified_type
    Tag = 40, Count = 24731, Ops = 24731, Name = DW_TAG_enumerator
    Tag = 21, Count = 354166, Ops = 2833328, Name = DW_TAG_subroutine_type
    Tag = 2, Count = 77999, Ops = 623992, Name = DW_TAG_class_type
    Tag = 47, Count = 27122, Ops = 108488, Name = DW_TAG_template_type_parameter
    Tag = 28, Count = 8491, Ops = 33964, Name = DW_TAG_inheritance
    Tag = 66, Count = 10930, Ops = 43720, Name = DW_TAG_rvalue_reference_type
    Tag = 16, Count = 54680, Ops = 218720, Name = DW_TAG_reference_type
    Tag = 23, Count = 624, Ops = 4992, Name = DW_TAG_union_type
    Tag = 4, Count = 5344, Ops = 42752, Name = DW_TAG_enumeration_type
    Tag = 11, Count = 360390, Ops = 1081170, Name = DW_TAG_lexical_block
    Tag = 258, Count = 1, Ops = 1, Name = DW_TAG_expression
    Tag = 13, Count = 73880, Ops = 299110, Name = DW_TAG_member
    Tag = 58, Count = 1387, Ops = 4161, Name = DW_TAG_imported_module
    Tag = 1, Count = 2747, Ops = 21976, Name = DW_TAG_array_type
    Tag = 46, Count = 1341021, Ops = 12069189, Name = DW_TAG_subprogram
    Tag = 257, Count = 4373879, Ops = 20785065, Name = DW_TAG_arg_variable
    Tag = 8, Count = 2246, Ops = 6738, Name = DW_TAG_imported_declaration
    Tag = 53, Count = 57, Ops = 228, Name = DW_TAG_volatile_type
    Tag = 15, Count = 55163, Ops = 220652, Name = DW_TAG_pointer_type
    Tag = 41, Count = 3382, Ops = 6764, Name = DW_TAG_file_type
    Tag = 22, Count = 158479, Ops = 633916, Name = DW_TAG_typedef
    Tag = 48, Count = 486, Ops = 2430, Name = DW_TAG_template_value_parameter
    Tag = 36, Count = 15, Ops = 45, Name = DW_TAG_base_type
    Tag = 17, Count = 1164, Ops = 8148, Name = DW_TAG_compile_unit
    Tag = 31, Count = 19, Ops = 95, Name = DW_TAG_ptr_to_member_type
    Tag = 57, Count = 2034, Ops = 6102, Name = DW_TAG_namespace
    Tag = 38, Count = 32133, Ops = 128532, Name = DW_TAG_const_type
    Tag = 19, Count = 72995, Ops = 583960, Name = DW_TAG_structure_type

(Note: the InlinedLineTables stat is included in LineTables stat.)

You can determine the rough memory footprint of each type of node by
multiplying the "Count" by `sizeof(MDNode)` (x86-64: 56B) and the "Ops"
by `sizeof(MDNodeOperand)` (x86-64: 32B).

Overall, there are 7.5M line table entries with 30M operands, so by this
method their footprint is ~1.3GB. There are 7.6M descriptors with 42.4M
operands, so their footprint is ~1.7GB.
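
As a minimal sketch of that arithmetic, the snippet below applies the two x86-64 sizes quoted above to the rounded totals. The names `footprintBytes`, `kLineTableBytes` and `kDescriptorBytes` are made up for illustration and are not part of LLVM.

```cpp
#include <cstdint>

// Sizes quoted in the text for x86-64.
constexpr std::uint64_t kMDNodeBytes = 56;        // sizeof(MDNode)
constexpr std::uint64_t kMDNodeOperandBytes = 32; // sizeof(MDNodeOperand)

// Rough footprint: one MDNode per node plus one MDNodeOperand per operand.
// (Hypothetical helper, not an LLVM function.)
constexpr std::uint64_t footprintBytes(std::uint64_t Count,
                                       std::uint64_t Ops) {
  return Count * kMDNodeBytes + Ops * kMDNodeOperandBytes;
}

// Rounded totals from the text: 7.5M line table entries with 30M operands,
// and 7.6M descriptors with 42.4M operands.
constexpr std::uint64_t kLineTableBytes = footprintBytes(7500000, 30000000);
constexpr std::uint64_t kDescriptorBytes = footprintBytes(7600000, 42400000);
```

These come to 1,380,000,000 and 1,782,400,000 bytes, or about 1.29 and 1.66 when divided by 2^30, which matches the ~1.3GB and ~1.7GB figures if "GB" is read as 2^30 bytes. The same helper can be applied to any single row of the table, e.g. `footprintBytes(1341021, 12069189)` for DW_TAG_subprogram.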

I dumped another stat periodically to tell me the peak size of the
side-tables for line table entries, which are split into "Scopes" (for
non-inlined) and "Inlined" (these counts are disjoint, unlike the
previous stats):

    Scopes = 203166 [203166], Inlined = 3500000 [3500000]

I assumed that both `DenseMap` and `std::vector` over-allocate by 50%
to estimate the current (and planned) costs for the side-tables.
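
The 50% assumption can be sketched as follows. The per-slot size is deliberately left as a parameter, since the text does not give one; `withOverAllocation` and `sideTableBytes` are hypothetical names used only for this illustration.

```cpp
#include <cstdint>

// Number of allocated slots if a container over-allocates by 50%.
// (Hypothetical helper; real DenseMap/std::vector growth policies differ.)
constexpr std::uint64_t withOverAllocation(std::uint64_t Entries) {
  return Entries + Entries / 2;
}

// Estimated bytes for a side-table. BytesPerSlot is an assumption the
// caller must supply; the text does not state it.
constexpr std::uint64_t sideTableBytes(std::uint64_t Entries,
                                       std::uint64_t BytesPerSlot) {
  return withOverAllocation(Entries) * BytesPerSlot;
}
```

With the peak counts above, this gives 304,749 slots for "Scopes" and 5,250,000 slots for "Inlined" before multiplying by whatever slot size applies.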

Another stat I dumped periodically was the breakdown between V(alues),
U(sers), C(onstants), M(etadata nodes), and (metadata) S(trings).
Here's a sample from nearby:

    V = 23967800 (40200000 - 16232200)
    U =  5850877 ( 7365503 -  1514626)
    C =   205491 (  279134 -    73643)
    M = 16837368 (31009291 - 14171923)
    S =   693869 (  693869 -        0)
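
One way to read each row is as "live = (created - destroyed)". That reading is an assumption on my part, but the arithmetic is consistent with it, as this small sketch shows (`ObjectTally` is a made-up type for illustration):

```cpp
#include <cstdint>

// Hypothetical tally for one class of object; assumes the parenthesized
// numbers in the dump are (created - destroyed).
struct ObjectTally {
  std::uint64_t Created;
  std::uint64_t Destroyed;
  constexpr std::uint64_t live() const { return Created - Destroyed; }
};

constexpr ObjectTally Values{40200000, 16232200};
constexpr ObjectTally Users{7365503, 1514626};
constexpr ObjectTally Constants{279134, 73643};
constexpr ObjectTally MetadataNodes{31009291, 14171923};
constexpr ObjectTally MetadataStrings{693869, 0};
```

Under that reading, metadata nodes account for roughly 70% of the live values in this sample (16837368 of 23967800).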

Lastly, I dumped a breakdown of the types of MDNodeOperands. This is
also a sample from nearby:

    MDOps = 77644750 (100%)
    Const = 14947077 ( 19%)
    Node  = 41749475 ( 53%)
    Str   =  9553581 ( 12%)
    Null  = 10976693 ( 14%)
    Other =   417924 (  0%)
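
The percentages in this dump appear to be truncated rather than rounded (for example, "Other" is about 0.5% but prints as 0%). A minimal sketch that reproduces them with integer division, using a hypothetical `percentOf` helper:

```cpp
#include <cstdint>

// Integer percentage; the division truncates toward zero.
// (Hypothetical helper used only to reproduce the dump above.)
constexpr std::uint64_t percentOf(std::uint64_t Part, std::uint64_t Total) {
  return Part * 100 / Total;
}

constexpr std::uint64_t kMDOps = 77644750;
constexpr std::uint64_t kConst = 14947077;
constexpr std::uint64_t kNode = 41749475;
constexpr std::uint64_t kStr = 9553581;
constexpr std::uint64_t kNull = 10976693;
constexpr std::uint64_t kOther = 417924;
```

The five categories sum exactly to `kMDOps`, while the truncated percentages add up to 98 rather than 100.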

While I didn't use this breakdown for my memory estimates, it was
interesting nevertheless. Note the following:

  - The number of constants is just under 15M. This dump came less than
    a second before the dump above, where we have 7.5M line table
    entries. Line table entries have 2 operands of `ConstantInt`. This
    lines up nicely.
    
    Note: this checked `isa<Constant>(Op) && !isa<GlobalValue>(Op)`.

  - There are a lot of null operands. By making subclasses for the
    various types of debug info IR, we can probably shed some of these
    altogether.

  - There are few "Other" operands. These are likely all `GlobalValue`
    references, and are the only operands that need to be referenced
    using value handles.

Stupid question, but when I was working on LTO last Summer the primary
culprit for excessive memory use was that we were not being smart when linking
the IR together (Espindola would know more details). Do we still have that
problem? For starters, how does the memory usage of just llvm-link compare
to the memory usage of the actual LTO run? If the issue I was seeing last
Summer is still there, you should see that the invocation of llvm-link is
actually the most memory-intensive part of the LTO step, by far.

This is vague. Could you be more specific on where you saw all of the memory?

I think Sean is referring to the old problem of nodes not being merged
because of cycles. It has been fixed by breaking the cycles by having
some of the edges be represented with stable mangled names.

The problem that Duncan is trying to solve is that the debug info is
still very large, even with the duplicate information removed.

Cheers,
Rafael

That's what I was thinking and why I was asking :)

> The problem that Duncan is trying to solve is that the debug info is
> still very large, even with the duplicate information removed.

Very much so. Duncan and I (and others) have been chatting about this for a bit.

-eric

A similar alternative: a schematized metadata node. Much like DWARF, this
would mean being able to say "this node is of some type T, defined
elsewhere in the module: string, int, string, string, etc.". Heck, this
could even be just a generic improvement to LLVM IR, maybe? The textual
representation might not need to change at all: IR generation would do
much what DWARF generation in LLVM does, creating abbreviation/type
descriptions on the fly and sharing them rather than having every
metadata node include its own self-description.

"Being generic" seems like a defect to me, not a feature. If you need
to add support for every IR construct to the backend to emit DIEs, etc.,
then what's the benefit in being able to express arbitrary other things?

I fully agree. In general I find it better to have an IR that can
represent only the features we need. The gains on readability and
verification from having a dedicated parser for the .ll representation
are also too awesome to ignore.

Cheers,
Rafael

> For those interested, I've attached some pie charts based on Duncan's
> data in one of the other posts; successive slides break down the usage
> increasingly finely. To my understanding, they represent the number of
> `Value`s (and subclasses) allocated.
>

This is vague. Could you be more specific on where you saw all of the
memory?

Running `llvm-link *.bc` would OOM a machine with 64GB of RAM (with -g;
without -g it completed with much less). The increase could easily be
watched in real time in the system "process monitor".

-- Sean Silva

This is likely what we've already discussed and was handled a long
while ago now.

-eric

I was reading the thread in sequential order (and replying without
finishing). derp.

-- Sean Silva

No worries, and hey, you might have had something else which we'd
definitely want to hear about :)

Heck, for that matter we know there are other things so numbers are awesome.

-eric

As all of these transforms are 1-to-1, can we still support the older metadata and convert it on the fly?

Alex

I'd prefer not to keep all of that code around to interpret both
versions without a very good reason.

-eric