RFC: Up front type information generation in clang and llvm

Hi All,

This is something that’s been talked about for some time and it’s probably time to propose it.

The “We” in this document is everyone on the cc line plus me.

Please go ahead and take a look.

Thanks!

-eric

Objective (and TL;DR)

(To be clear: Reid, Adrian, Duncan, Dave, and myself.)

Thanks for sharing this. Mostly seems like a reasonable plan to me. A few
comments below.

Hi All,

This is something that's been talked about for some time and it's probably
time to propose it.

The "We" in this document is everyone on the cc line plus me.

Please go ahead and take a look.

Thanks!

-eric

Objective (and TL;DR)

Migrate debug type information generation from the backends to the front
end.

This will enable:
1. Separation of concerns and maintainability: LLVM shouldn’t have to know
about C preprocessor macros, Obj-C properties, or extensive details about
debug information binary formats.
2. Performance: Skipping a serialization should speed up normal
compilations.
3. Memory usage: The DI metadata structures are smaller than they were,
but are still fairly large and pointer heavy.

Motivation

Currently, types in LLVM debug info are described by the DIType class
hierarchy. This hierarchy evolved organically from a more flexible
sea-of-nodes representation into what it is today: a large, only somewhat
format-neutral representation of debug types. Making it more format
neutral would only increase memory use, and for no benefit, since type
information is static (or nearly so). Debug formats already have a
memory-efficient serialization (their own binary format), so we should
support a front end emitting type information with sufficient
representation to allow the backend to emit debug information based on
the more normal IR features: functions, scopes, variables, etc.

Scope/Impact

This is going to involve large scale changes across both LLVM and clang.
This will also affect any out-of-tree front ends, however, we expect the
impact to be on the order of a large API change rather than needing massive
infrastructure changes.

Related work

This is related to the efforts to support CodeView in LLVM and clang as
well as efforts to reduce overall memory consumption when compiling with
debug information enabled; in particular efforts to prune LTO memory usage.

Concerns

We need a good story for transitioning all the debug info testcases in the
backend without giving up coverage and/or readability. David believes he
has a plan here.

Proposal

Short version
-----------------

1. Split the DIBuilder API into Types (+Macros, Imports, …) and Line Table.
2. Split the clang CGDebugInfo API into Types and Line Table to match.
3. Add a LLVM DWARF emission library similar to the existing CodeView one.
4. Migrate the Types API into a clang internal API taking clang AST
structures and use the LLVM binary emission libraries to produce type
information.
5. Remove the old binary emission out of LLVM.

Questions/Thoughts/Elaboration
-------------------------------------------

Splitting the DIBuilder API
~~~~~~~~~~~~~~~~~~~~
Will DISubprogram be part of both?
   * We should split it in two: Full declarations with type and a slimmed
down version with an abstract origin.

How will we reference types in the DWARF blob?
   * ODR types can be referenced by name
   * Non-odr types by full DWARF hash
   * Each type can be a pair(tuple) of identifier (DITypeRef today) and
blob.
   * For < DWARF4 we can emit each type as a unit, but not a DWARF Type
Unit and use references and module relocations for the offsets. (See below)

How will we handle references in DWARF2 or global relocations for non-type
template parameters?
   * We can use a “relocation” metadata as part of the format.
   * Representable as a tuple that has the DIType and the offset within
the DIBlob as where to write the final relocation/offset for the reference
at emission time.
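
A minimal sketch of the relocation tuple idea, written in Python for brevity. The names `Reloc`, `TypeRecord`, and `apply_relocs`, the 4-byte little-endian reference width, and the example identifiers are all assumptions for illustration; the proposal does not specify any of them.

```python
import struct
from dataclasses import dataclass, field


@dataclass
class Reloc:
    target_id: str  # identifier of the referenced type (hypothetical)
    offset: int     # byte offset inside the blob where the reference goes


@dataclass
class TypeRecord:
    type_id: str
    blob: bytes     # opaque, already-serialized type description
    relocs: list = field(default_factory=list)


def apply_relocs(record, final_offsets):
    """Write each referenced type's final offset into its slot in the blob."""
    out = bytearray(record.blob)
    for reloc in record.relocs:
        # Assumption: references are 4-byte little-endian offsets.
        struct.pack_into("<I", out, reloc.offset, final_offsets[reloc.target_id])
    return bytes(out)


# One tag byte followed by a 4-byte placeholder for the reference to "S".
ptr = TypeRecord("ptr_to_S", b"\x0f" + b"\x00" * 4, [Reloc("S", 1)])
patched = apply_relocs(ptr, {"S": 0x40})
```

The point of the sketch is that the emitter only needs the (type, offset) pairs to fix up a blob; it never has to decode the blob itself.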

Why break up the types at all?
   * To enable non-debug format aware linking and type uniquing for LTO
that won’t be huge in size. We break up the types so we don’t need to parse
debug information to link two modules together efficiently.
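
To illustrate why the split helps linking, here is a rough Python sketch of merging per-module type tables by identifier alone. The helper names are hypothetical, and `hashlib.md5` over the blob is only a stand-in for the "full DWARF hash" mentioned above, not the actual hashing scheme.

```python
import hashlib


def type_key(blob, odr_name=None):
    """ODR types are keyed by name; everything else by a content hash."""
    if odr_name is not None:
        return odr_name
    # Stand-in hash for illustration only.
    return hashlib.md5(blob).hexdigest()


def merge_type_tables(*tables):
    """Merge {key: blob} tables; the first definition of each key wins."""
    merged = {}
    for table in tables:
        for key, blob in table.items():
            merged.setdefault(key, blob)
    return merged


tu1 = {
    type_key(b"foo-blob", "_ZTS3Foo"): b"foo-blob",
    type_key(b"anon-blob"): b"anon-blob",
}
tu2 = {
    type_key(b"foo-blob", "_ZTS3Foo"): b"foo-blob",
    type_key(b"anon-blob"): b"anon-blob",
    type_key(b"bar-blob", "_ZTS3Bar"): b"bar-blob",
}
merged = merge_type_tables(tu1, tu2)
```

Note that the merge never looks inside a blob: uniquing is a dictionary lookup on the identifier, which is what keeps the linker independent of the debug format.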

How do you plan to handle abbreviations? You wouldn't necessarily be able
to embed them directly in the blob, as when doing LTO each compilation unit
would have its own set of abbreviations. I suppose you could do something
like treat them as a special sort of reference to an abbreviation table
entry, or maybe pre-allocate in the frontend (but would complicate
cross-frontend LTO) but curious what you have in mind.

Any other concerns there?

   * Debug information without type units might be slightly larger in this
scheme due to parents being duplicated (declarations and abstract origin,
not full parents). It may be possible to extend dsymutil/etc to merge all
siblings into a common parent. Open question for better ways to solve this.

When we were thinking about teaching the backend to produce blobs from IR
metadata we were thinking about cases where the debug info emitter would
discover special member functions during IR traversal. I guess since we're
moving all of that to the frontend we can just ask the frontend directly
which special members are needed on the class. That solves the problem for
a single translation unit. But what do you plan to do in the multiple
translation unit case where two TUs declare different special members on a
class? Would it be fine to just emit the two definitions and let the
debugger sort it out? I guess this is the type of thing that debuggers
normally deal with in the non-LTO case, so I suppose so?

Thanks Peter!

Thanks for reminding me, I knew I was forgetting something I’d talked about when writing all of this down. :)

Basically, to handle abbreviations you can treat them similarly to types by creating a blob with an index/hash/etc and then referencing that as part of the type tuple, e.g.:

$1 = { DIAbbrev: 0x1234, DIBlob: <blah> }
$2 = { DIType: <ID>, DIAbbrev: $1, DIBlob: <blah> }

and keep them uniqued during emission and remember to merge these as well during module merge time.
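
As a hedged illustration of "keep them uniqued during emission", the Python sketch below assigns one final abbreviation code per distinct abbreviation declaration. The class name and the byte strings standing in for declarations are made up for the example; the only DWARF fact relied on is that abbreviation code 0 is reserved for null entries, so codes start at 1.

```python
class AbbrevTable:
    """Hands out one final abbreviation code per distinct declaration."""

    def __init__(self):
        self._codes = {}

    def code_for(self, decl):
        # Code 0 marks a null entry in DWARF, so numbering starts at 1.
        return self._codes.setdefault(decl, len(self._codes) + 1)

    def __len__(self):
        return len(self._codes)


table = AbbrevTable()
enum_decl = b"enumeration_type:name,byte_size"  # placeholder declaration
member_decl = b"enumerator:name,const_value"    # placeholder declaration

# Two modules contributing overlapping abbreviations end up sharing codes.
codes = [table.code_for(d) for d in (enum_decl, member_decl, enum_decl)]
```

Merging modules then amounts to running every incoming declaration through the same table and rewriting the references to the final codes.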

Pretty much. This is one area where I have… disagreements with the DWARF committee and I don’t think there’s anything else we can do here. TBH right now I think we’d have issues with type units and special member functions since we’re using ODR-ness to unique.

-eric

Makes sense, but wouldn't you need multiple abbreviations for each DIType,
in order to represent DITypes formed of multiple DIEs (e.g. enums, records)?

Maybe something like this would work:

$1 = { DIAbbrev: 0x1234, DIBlob: DW_TAG_enumeration_type<blah> }
$2 = { DIAbbrev: 0x5678, DIBlob: DW_TAG_enumerator<blah> }
$3 = { DIType: <ID>, DIAbbrev: [(0, $1), (8, $2), (16, $2)], DIBlob: <8 bytes of DW_TAG_enumeration_type attrs><8 bytes of DW_TAG_enumerator attrs><8 bytes of DW_TAG_enumerator attrs><0> }

?

*nod* That (or something similar) will work.

-eric
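
The per-DIE abbreviation list suggested above could be consumed roughly as follows. This is a sketch under stated assumptions: `emit_type`, the string keys, and the filler attribute bytes are hypothetical, and the layout simply mirrors the example (an 8-byte attribute run per DIE, then a trailing 0 terminating the children). The abbreviation code written in front of each DIE is ULEB128-encoded, as DWARF requires.

```python
def uleb128(value):
    """Encode a non-negative integer as ULEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def emit_type(blob, abbrev_at, final_codes):
    """Interleave final abbreviation codes with the attribute bytes.

    abbrev_at is a sorted list of (offset, abbrev_key) pairs marking where
    each DIE's attributes start inside the blob.
    """
    out = bytearray()
    for i, (start, key) in enumerate(abbrev_at):
        end = abbrev_at[i + 1][0] if i + 1 < len(abbrev_at) else len(blob)
        out += uleb128(final_codes[key])
        out += blob[start:end]
    return bytes(out)


# Enumeration type with two enumerators; attribute bytes are filler.
blob = b"E" * 8 + b"a" * 8 + b"b" * 8 + b"\x00"
abbrev_at = [(0, "enum"), (8, "member"), (16, "member")]
stream = emit_type(blob, abbrev_at, {"enum": 1, "member": 2})
```

Because the blob carries no abbreviation codes of its own, the final codes can be chosen after all modules have been merged.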

Skipping a serialization and doing something clever about LTO uniquing sounds awesome. I’m guessing you achieve this by extracting types out of DI metadata and packaging them as lumps-o-DWARF that the back-end can then paste together? Reading between the lines a bit here.

Can you share data about how much “pure” types dominate the size of debug info? Or at least the current metadata scheme? (Channeling Sean Silva here: show me the data!) Does this hold for C as well as C++?

Not much discussion of data objects and code objects (other than concrete subprograms), is that because they basically aren’t changing? Still defined in the metadata and still managed/emitted by the back-end?

Please say something about types (which you’re thinking of as a front-end thing) defined within scopes (which it looks like you’re thinking of as a back-end thing). Not seeing how to get the scoping right.

Thanks!

–paulr

How will this affect other languages that generate debug info - not that you should care about those, I’m just curious - my Pascal compiler does not generate clang-style AST, and does not use clang at all. I currently have code that uses DIBuilder directly…

Skipping a serialization and doing something clever about LTO uniquing sounds awesome. I’m guessing you achieve this by extracting types out of DI metadata and packaging them as lumps-o-DWARF that the back-end can then paste together? Reading between the lines a bit here.

Pretty much, yes.

Can you share data about how much “pure” types dominate the size of debug info? Or at least the current metadata scheme? (Channeling Sean Silva here: show me the data!) Does this hold for C as well as C++?

They’re huge. It’s ridiculous. Take a look at the size of the metadata and then the size of the stuff we put in there versus dwarf.

And yes, it also trivially holds for C.

Not much discussion of data objects and code objects (other than concrete subprograms), is that because they basically aren’t changing? Still defined in the metadata and still managed/emitted by the back-end?

Yep. One way of looking at it is that this information is tied to things in the IR and so needs IR to represent it.

Please say something about types (which you’re thinking of as a front-end thing) defined within scopes (which it looks like you’re thinking of as a back-end thing). Not seeing how to get the scoping right.

The basic idea is that non-defining declarations hold the types and act as the abstract origin for the concrete function. Honestly, I wish they were type-unitable at the moment, but that might be something to look into. That’s the current plan, at least. This will make some debug info a little bit larger, but only for things like nested types where we need to throw in an extra declaration (i.e. the same sorts of places where type units make things larger).

At any rate, the first thing is to get the APIs split anyhow.

-eric

How will this affect other languages that generate debug info - not that you should care about those, I’m just curious - my Pascal compiler does not generate clang-style AST, and does not use clang at all. I currently have code that uses DIBuilder directly…

I don’t think that the code for generating DWARF types should move into Clang, but rather in a separate library that can be shared by multiple frontends. It can even keep most of the existing DIBuilder interface (but we may need to split DIBuilder in a types vs. everything else part).

– adrian

Probably don’t want to package the lumps into actual type units before the backend can figure out what’s going on; referencing a type unit is clumsy and inefficient in both space and time. Partial units are probably a better fit in many cases, especially for non-file-scope types. Also LTO would probably want to bias against type units, again that feels like a decision better made by the backend than the frontend.

Just a thought.

–paulr

How will this affect other languages that generate debug info - not that you should care about those, I’m just curious - my Pascal compiler does not generate clang-style AST, and does not use clang at all. I currently have code that uses DIBuilder directly…

I don’t think that the code for generating DWARF types should move into Clang, but rather in a separate library that can be shared by multiple frontends. It can even keep most of the existing DIBuilder interface (but we may need to split DIBuilder in a types vs. everything else part).

There will need to be an API split between front end and backend. We can attempt to keep a lot of the current DIBuilder interface, but it’s going to make sense to have a split in the front end as well that can directly call an emission library. Ideally we’ll make it look a lot more like dwarf than the current abstract interface, but we’ll see.

-eric

Hi All,

This is something that's been talked about for some time and it's probably time to propose it.

The "We" in this document is everyone on the cc line plus me.

Please go ahead and take a look.

Thanks!

-eric

Objective (and TL;DR)

Migrate debug type information generation from the backends to the front end.

This will enable:
1. Separation of concerns and maintainability: LLVM shouldn’t have to know about C preprocessor macros, Obj-C properties, or extensive details about debug information binary formats.

This is a bit of an overstatement: This proposal is only about the debug *type* information. The back end still needs to know about line tables, subprograms, etc., and in order to produce the Apple/DWARF5 accelerator tables it will even need to have a basic understanding of the type info.

2. Performance: Skipping a serialization should speed up normal compilations.
3. Memory usage: The DI metadata structures are smaller than they were, but are still fairly large and pointer heavy.

We should back up this claim with some numbers, but the idea is that the expected savings come from the “type units” being variable-length records with abbreviations not unlike LLVM bitcode. In contrast to LLVM metadata, however, there is also some additional overhead due to each “type unit” containing a redundant declcontext, the support for relocations, and potentially for supporting the accelerator tables.

Motivation

Currently, types in LLVM debug info are described by the DIType class hierarchy. This hierarchy evolved organically from a more flexible sea-of-nodes representation into what it is today: a large, only somewhat format-neutral representation of debug types. Making it more format neutral would only increase memory use, and for no benefit, since type information is static (or nearly so). Debug formats already have a memory-efficient serialization (their own binary format), so we should support a front end emitting type information with sufficient representation to allow the backend to emit debug information based on the more normal IR features: functions, scopes, variables, etc.

Scope/Impact

This is going to involve large scale changes across both LLVM and clang. This will also affect any out-of-tree front ends, however, we expect the impact to be on the order of a large API change rather than needing massive infrastructure changes.

Related work

This is related to the efforts to support CodeView in LLVM and clang as well as efforts to reduce overall memory consumption when compiling with debug information enabled; in particular efforts to prune LTO memory usage.

Concerns

We need a good story for transitioning all the debug info testcases in the backend without giving up coverage and/or readability. David believes he has a plan here.

David, can you elaborate on this?

Proposal

Short version
-----------------

1. Split the DIBuilder API into Types (+Macros, Imports, …) and Line Table.
2. Split the clang CGDebugInfo API into Types and Line Table to match.
3. Add a LLVM DWARF emission library similar to the existing CodeView one.
4. Migrate the Types API into a clang internal API taking clang AST structures and use the LLVM binary emission libraries to produce type information.
5. Remove the old binary emission out of LLVM.

Questions/Thoughts/Elaboration
-------------------------------------------

Splitting the DIBuilder API
~~~~~~~~~~~~~~~~~~~~
Will DISubprogram be part of both?
   * We should split it in two: Full declarations with type and a slimmed down version with an abstract origin.

How will we reference types in the DWARF blob?
   * ODR types can be referenced by name
   * Non-odr types by full DWARF hash
   * Each type can be a pair(tuple) of identifier (DITypeRef today) and blob.
   * For < DWARF4 we can emit each type as a unit, but not a DWARF Type Unit and use references and module relocations for the offsets. (See below)

How will we handle references in DWARF2 or global relocations for non-type template parameters?
   * We can use a “relocation” metadata as part of the format.
   * Representable as a tuple that has the DIType and the offset within the DIBlob as where to write the final relocation/offset for the reference at emission time.

Why break up the types at all?
   * To enable non-debug format aware linking and type uniquing for LTO that won’t be huge in size. We break up the types so we don’t need to parse debug information to link two modules together efficiently.

Any other concerns there?
   * Debug information without type units might be slightly larger in this scheme due to parents being duplicated (declarations and abstract origin, not full parents). It may be possible to extend dsymutil/etc to merge all siblings into a common parent. Open question for better ways to solve this.

How should we handle DWARF5/Apple Accelerator Tables?
   * Thoughts:
   * We can parse the dwarf in the back end and generate them.
   * We can emit in the front end for the base case of non-LTO (with help from the backend for relocation aspects).
   * We can use dsymutil on LTO debug information to generate them.

I realized that the last two bullet points would not work well with ThinLTO. One of its selling points is (fast) incremental rebuilds, and requiring dsymutil to make a second pass over all the object files is negating this benefit.

-- adrian

Hi Eric,

I can understand the need for improving the current design of debug info representation and emission in LLVM.

However, let’s not forget that the motivation was, and still is, to support CodeView debug info emission.

I am wondering if it is right to spend the huge effort needed to implement the proposal below while knowing these facts:

  1. It would be clearer how to improve the design once we have working CodeView support.

You said it yourself: we still do not know what challenges we will face while implementing this proposal.

  2. I understand that CodeView will need some extensions to the current DWARF debug info, like ‘this’ adjustment.

However, it is feasible to introduce CodeView wrapper data structures that can be created from the current DWARF debug info IR.

And this can be done in CodeGen (e.g. CodeViewDebug.cpp) while emitting the code/debug info.

Again, I understand that your proposal is trying to improve a lot of things, but it seems that we should first try to support CodeView debug info with the current debug info IR.

The advantages:

  1. It works. Even though you still have doubts about a few issues, I believe we can resolve them with minimal modification to the LLVM IR/Clang FE.

  2. It requires much less effort.

  3. It is much cleaner.

  4. We will better understand the requirements of CodeView, which can be used to improve the proposal below (before diving into implementing it).

I suggest that we start with:

  1. Define the CodeView wrapper data structure. (CodeViewDebugIR)

  2. Build the CodeView wrapper data structure based on the DWARF debug info IR. (CodeViewDebugBuilder)

  3. Emit the CodeView wrapper data structure into the COFF object file. (CodeViewDebugEmitter)

  4. Figure out what modifications/extensions need to be made to the DWARF debug info IR/Clang FE.

What do you think?

Thanks,

Amjad

Forgot to add llvm-dev mailing list.

Hi Aboud,

Hi Eric,
I can understand the need for improving the current design of debug info representation and emission in LLVM.
However, let’s not forget that the motivation was, and still is, to support CodeView debug info emission.

Well, that is *one* motivation.

I am wondering if it is right to spend the huge effort needed to implement the proposal below while knowing these facts:
1. It would be clearer how to improve the design once we have working CodeView support.
You said it yourself: we still do not know what challenges we will face while implementing this proposal.
2. I understand that CodeView will need some extensions to the current DWARF debug info, like ‘this’ adjustment.
However, it is feasible to introduce CodeView wrapper data structures that can be created from the current DWARF debug info IR.
And this can be done in CodeGen (e.g. CodeViewDebug.cpp) while emitting the code/debug info.

Again, I understand that your proposal is trying to improve a lot of things

Yes, and to give some different perspective: some of these "things" are a lot higher priority than CodeView (for other people/use cases of course), because DebugInfo cost is prohibitive for some use cases.

, but it seems that we should first try to support CodeView debug info with the current debug info IR.
The advantages:
1. It works. Even though you still have doubts about a few issues, I believe we can resolve them with minimal modification to the LLVM IR/Clang FE.
2. It requires much less effort.
3. It is much cleaner.

If it is "much cleaner" in the IR, I understand that you have insights about Eric's proposal being "less clean", independently of adding CodeView before or after this change. If so it's worth elaborating on this.

4. We will better understand the requirements of CodeView, which can be used to improve the proposal below (before diving into implementing it).

Don't forget the "Cons":

1) It is easier to perform large refactoring/changes to the debug info flow *before* complexifying the problem.
2) This is adding more stuff that will need to go through all these changes, wasting effort in the process.
3) It will limit forward progress for people who don't care about CodeView but want to move forward with restructuring DI deeply, like Eric's proposal is doing.

That is not to say that your points are not valid, but that it's not that clear cut either.

Skipping a serialization and doing something clever about LTO uniquing
sounds awesome. I'm guessing you achieve this by extracting types out of
DI metadata and packaging them as lumps-o-DWARF that the back-end can then
paste together? Reading between the lines a bit here.

Pretty much, yes.

Can you share data about how much "pure" types dominate the size of debug
info? Or at least the current metadata scheme? (Channeling Sean Silva
here: show me the data!) Does this hold for C as well as C++?

They're huge. It's ridiculous. Take a look at the size of the metadata and
then the size of the stuff we put in there versus dwarf.

Because numbers are nice to have, I modified Clang to generate every type
as 'int' (patch attached - I may've screwed some things up) & then compiled
llvm-tblgen's object files with -flto (I would've used all of clang, but I
don't have the lto plugin setup, so I couldn't get past tblgen)

Without debug info: 24 MB of bitcode files
With debug info: 77 MB
With debug info, but no types: 46 MB

so... 59% is pure type descriptions, for this particular test at least.
These are the pure ones, the same things we put in type units. I didn't
even remove the injected declarations, so if you compile example programs
with this you'll find that the DW_TAG_base_type for "int" has a child for
every member function declaration that's defined (even used inline
functions) in this translation unit. Clang would be a larger/more
representative sample.

I confirmed that both with and without types, there were the same number
(48542) of subprogram definitions and without types there were no instances
of DICompositeType (both of these were confirmed with xargs/llvm-dis/grep,
nothing fancy)

notypes.diff (3.11 KB)

To clarify, I mean 59% of the debug info, not of the total file size: (46-24)/(77-24) is about 41%, i.e. the debug info without types is 41% the size of the debug info with types. If that makes sense.
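
For anyone who wants to redo the arithmetic, here it is spelled out in Python. It takes 24 MB as the no-debug-info baseline and 77 MB as the full debug info build, which is what the formula above implies; the variable names are just for the example.

```python
# Bitcode sizes in MB from the llvm-tblgen experiment.
no_debug_mb = 24   # baseline, no debug info
full_debug_mb = 77  # debug info with types
no_types_mb = 46   # debug info with every type emitted as 'int'

debug_total = full_debug_mb - no_debug_mb        # debug info overall
debug_without_types = no_types_mb - no_debug_mb  # debug info minus types

remaining_share = debug_without_types / debug_total
type_share = 1 - remaining_share

print(f"debug info left without types: {remaining_share:.1%}")
print(f"share taken by type descriptions: {type_share:.1%}")
```

To one decimal place the shares come out at 41.5% and 58.5%, which the thread quotes as 41% and 59%.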

I guess you have a non-LTO build somewhere, so you should be able to build other tools by bypassing the llvm-tblgen build using:

cmake -DLLVM_TABLEGEN=path/to/llvm-tblgen …

To be clear: that was meant as FYI / good to know, I was not asking you for more data.