RFC: Up front type information generation in clang and llvm

Skipping a serialization and doing something clever about LTO uniquing sounds awesome. I’m guessing you achieve this by extracting types out of DI metadata and packaging them as lumps-o-DWARF that the back-end can then paste together? Reading between the lines a bit here.

Pretty much, yes.

Can you share data about how much “pure” types dominate the size of debug info? Or at least the current metadata scheme? (Channeling Sean Silva here: show me the data!) Does this hold for C as well as C++?

They’re huge. It’s ridiculous. Take a look at the size of the metadata and then the size of the stuff we put in there versus DWARF.

Because numbers are nice to have, I modified Clang to generate every type as ‘int’ (patch attached - I may’ve screwed some things up) & then compiled llvm-tblgen’s object files with -flto (I would’ve used all of clang, but I don’t have the lto plugin setup, so I couldn’t get past tblgen)

Without debug info: 77 MB of bitcode files
With debug info: 24 MB

Oh, and I got these ^ numbers jumbled up. 77 with, 24 without.
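
For reference, the corrected figures can be turned into a share and a ratio with simple arithmetic. This is only a restatement of the two sizes quoted above (77 MB with debug info, 24 MB without); it assumes nothing else about the experiment.

```python
# Sizes reported above for llvm-tblgen's bitcode files, in MB (corrected order).
with_debug_mb = 77
without_debug_mb = 24

# How much of the bitcode is attributable to debug info.
debug_info_mb = with_debug_mb - without_debug_mb
debug_share = debug_info_mb / with_debug_mb
size_ratio = with_debug_mb / without_debug_mb

print(f"debug info adds {debug_info_mb} MB")
print(f"share of bitcode that is debug info: {debug_share:.0%}")
print(f"size with debug info / size without: {size_ratio:.1f}x")
```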

Thanks - will keep that in mind!

Thanks a lot for the numbers! That certainly helps, even with a small sample; it was not at all clear to me how to get this data.

–paulr

Hi All,

This is something that's been talked about for some time and it's
probably time to propose it.

The "We" in this document is everyone on the cc line plus me.

Please go ahead and take a look.

Thanks!

-eric

Motivation

Currently, types in LLVM debug info are described by the DIType class
hierarchy. This hierarchy evolved organically from a more flexible
sea-of-nodes representation into what it is today - a large, only
somewhat format neutral representation of debug types. Making this more
format neutral will only increase the memory use - and for no reason as
type information is static (or nearly so). Debug formats already have a
memory-efficient serialization (their own binary format), so we should
support a front end emitting type information with sufficient
representation to allow the backend to emit debug information based on
the more normal IR features: functions, scopes, variables, etc.

Scope/Impact

This is going to involve large scale changes across both LLVM and clang.
This will also affect any out-of-tree front ends; however, we expect the
impact to be on the order of a large API change rather than needing
massive infrastructure changes.

How will you make it on the order of a large API change? At the moment we build bitcode ourselves and generate DIBuilder-equivalent structures. Wouldn't frontends need to do their own, well, DWARF and CodeView writing? Especially the ones that are tied to the C-only APIs.

Proposal

Short version
-----------------

1. Split the DIBuilder API into Types (+Macros, Imports, …) and Line Table.
2. Split the clang CGDebugInfo API into Types and Line Table to match.
3. Add a LLVM DWARF emission library similar to the existing CodeView one.
4. Migrate the Types API into a clang internal API taking clang AST
structures and use the LLVM binary emission libraries to produce type
information.
5. Remove the old binary emission out of LLVM.
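
As a rough illustration of the direction in steps 1 to 5, here is a minimal sketch in which the front end serializes each type once and the back end pastes the resulting bytes together without decoding them. Everything in it is hypothetical: `Module`, `frontend_emit_type`, `backend_emit`, and the byte strings are stand-ins for illustration, not DIBuilder, CGDebugInfo, or any real DWARF/CodeView encoding.

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    # Opaque, already-serialized type records keyed by a front-end-chosen name.
    type_blobs: dict = field(default_factory=dict)
    # The "normal" IR-level debug info: (variable name, type key) pairs.
    variables: list = field(default_factory=list)

def frontend_emit_type(module, key, encoded):
    """Front end: serialize the type once and hand the bytes to the module."""
    module.type_blobs.setdefault(key, encoded)
    return key

def backend_emit(module):
    """Back end: concatenate the blobs without decoding them, tracking offsets."""
    offsets, section = {}, bytearray()
    for key, blob in module.type_blobs.items():
        offsets[key] = len(section)
        section += blob
    # Variables end up pointing at offsets inside the emitted type section.
    var_refs = [(name, offsets[key]) for name, key in module.variables]
    return bytes(section), var_refs

m = Module()
foo = frontend_emit_type(m, "struct Foo", b"\x01FOO")  # made-up bytes
bar = frontend_emit_type(m, "struct Bar", b"\x02BAR")  # made-up bytes
m.variables += [("x", foo), ("y", bar)]
section, refs = backend_emit(m)
```

The point of the sketch is only that the back end needs the offsets, not the contents, of the type records.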

What about allowing multiple debug info formats at once? The current format could potentially
allow such an option in the future (I know it doesn't actually do it now). Will the new option
hardcode it to a single format?

How will you make it on the order of a large API change? At the moment
we build bitcode ourselves and generate DIBuilder-equivalent structures.
Wouldn’t frontends need to do their own, well, DWARF and CodeView
writing? Especially the ones that are tied to the C-only APIs.

There will be some backend support. The hope is that it’ll be a fairly direct translation from the existing APIs.

I make no claims about the C API as we don’t handle types at the C API level currently.

What about allowing multiple debug info formats at once? The current format
could potentially allow such an option in the future (I know it doesn’t
actually do it now). Will the new option hardcode it to a single format?

What option?

I don’t think this is much of a worry - at the very least it’s probably not much more difficult than doing it right now.

-eric

The Open Dylan compiler doesn't link with any LLVM libraries; its only interface with LLVM is through bitcode, using a bitcode writer that I wrote myself in Dylan. Frontends that write textual LLVM assembly are in the same situation.

The type information that the Open Dylan LLVM support generates within debug information is very simple, mostly amounting to void* (and function signatures containing varying numbers of void* arguments). It sometimes goes beyond this when foreign (C) function support is used within Dylan programs.

We would prefer if some level of support were maintained for generating least-common-denominator debug info (both DWARF and CodeView) from structured metadata. The potential performance improvements from implementing this proposal don't really apply to our compiler's use case, since the debug types for foreign C structs that are generated generally only appear in a single translation unit across the entire program. I'd prefer to avoid having to maintain code that deals with DWARF and CodeView directly.

-Peter S. Housel-

My general feeling is that this design represents a mid-point between our current metadata design, and a future design where frontends just emit type information and LTO links it in a format-aware way.

I don’t think it’s an imminent priority for anyone to do this for DWARF, so I worry that if we start building infrastructure for it, it will end up overengineered.

Also, people seem to agree that in the long term, we really need a format-aware linker, and maybe LTO should just use one. Supposedly Frédéric has patches to llvm-dsymutil to make one for DWARF, but he hasn’t found the time to upstream them.

Together, these reasons make me feel that we should limit the short-term scope to just CodeView, and add utilities to lib/Linker for performing basic tasks like type stream merging or type extraction, possibly with forward declaration of composite types.

In the future, when we do this work for DWARF, we can add a new DIType* stand-in similar to what you are describing.

The working patch that I have for just CodeView, all types as a single blob, is up here: http://reviews.llvm.org/D19236 While it doesn’t deal with type blobs or LTO type merging yet, I think it shows that there is surprisingly little need to bifurcate other parts of LLVM.

Thoughts?

Somehow I managed to respond without being explicit about the difference
between your design and mine: I'm saying we should just have one type blob
per TU. This will avoid the need for cross-blob references, but it will
necessitate format-aware type handling during LTO and LTO-like use-cases
(ThinLTO, llvm-extract, etc).

I don’t agree in general here because of:

a) maintainability - there isn’t one true path through things, and this now scatters more Windows knowledge through debug info and LTO

b) higher bar for implementing similar DWARF functionality - there’s nothing here that makes it at any point better for our general debug info support. Incrementally updating to an intermediate step is much easier and a lower bar than needing to implement everything up to and including a format-aware linker and support that through ThinLTO, the JIT, and full LTO.

c) if there’s no reason to do this for DWARF there’s no reason to do it for Windows. The existing proposal was a way to get you type emission in the front end so that you’d have to do less work. Ultimately though I don’t see a reason to do this if all of the platforms don’t look the same.

d) ThinLTO/ORC won’t support the debug info you have in your proposal right now without patches

e) You’re regressing LTO linking performance hugely for Windows with debug info until you write the patches that enable format-aware linking of CodeView information

I’m open to arguments on any of these points from anyone.

-eric

Also, people seem to agree that in the long term, we really need a format-aware linker, and maybe LTO should just use one. Supposedly Frédéric has patches to llvm-dsymutil to make one for DWARF, but he hasn’t found the time to upstream them.

There are pieces missing upstream — mostly the accelerator tables — and I’m really struggling to find time to upstream these. However, the DIE tree linking part of upstream llvm-dsymutil is complete. That’s not to say that it would be easy to use it as a generic DWARF linker. I tried to make it as agnostic to the platform as I could, but it was designed to be bit-for-bit compatible with the original dsymutil and that surely made it a lot less generic.

Do you envision the format-aware link taking place during the LTO link? This would seem pretty expensive to me (DWARF linking is not really cheap, as it’s not a format designed for this). I think it would make more sense to leave the type info in the object files and to somehow have the LTO link emit external references to it (ala module debugging). Then have the debug info link happen as an explicit step; this matches the Darwin model, but not the usual *nix model.

Fred

Hi,

I tend to agree with Eric, but since I’m too busy to compute data or to sign up for doing the work at this time, I won’t weigh in strongly on this.

Eric’s point b) is especially worrying to me: unless the work to design this “correctly” is unbearable, having work performed on CodeView that would make things harder to do for DWARF later is a red flag IMO.

I’d especially like to respond to:
“I don’t think it’s an imminent priority for anyone to do this for DWARF, so I worry that if we start building infrastructure for it, it will end up over engineered.”

LTO is impacted a lot by debug info memory size and CPU time. ThinLTO is impacted on an even larger scale. So it should be an “imminent priority” to address that, and we (well, Adrian and Duncan) worked a lot on the “non-type” part of it recently, improving the current state significantly. However, debug info is still a bottleneck and we aim to do better. The plan to move type emission into the front end is definitely appealing to me because of that.

Another aspect is that when building LLVM, we link 56 separate binaries (I’m not talking about archives) from largely overlapping sets of object files.
It means that any work that is performed during LTO/CodeGen on these files but could be moved to the compile phase is already almost a 56x speedup (and lowers peak memory during LTO).
Knowing that peak memory is reached during CodeGen and that DWARF emission is a large part of it, this is a major candidate to be moved to the compile phase.
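
To make the "almost 56x" reasoning concrete, here is a small cost model. It is a sketch under stated assumptions: only the number of binaries (56) comes from the message above, while the object counts and the unit cost are made-up placeholders chosen to represent "largely overlapping" inputs.

```python
def link_time_cost(objects_per_binary, cost_per_object):
    # Work done during LTO/CodeGen is repeated by every binary that links an object.
    return sum(objects_per_binary) * cost_per_object

def compile_time_cost(distinct_objects, cost_per_object):
    # Work done in the compile phase happens once per object file.
    return distinct_objects * cost_per_object

binaries = 56                          # from the message above
distinct_objects = 1000                # hypothetical number of object files
objects_per_binary = [900] * binaries  # hypothetical: each binary links 90% of them
cost = 1.0                             # arbitrary unit of work per object

speedup = link_time_cost(objects_per_binary, cost) / compile_time_cost(distinct_objects, cost)
print(f"work saved by moving to the compile phase: {speedup:.1f}x")
```

With complete overlap the ratio would be exactly the number of binaries; partial overlap is why the estimate is "almost" 56x.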

Hi Eric,

I’m coming back to this topic after discussing it briefly offline with Reid, and at length with Adrian, Duncan, and Fred.
I may have to take back some of my words from my previous email, especially as it is not clear how and why what Reid is proposing would hurt a future path for DWARF.

In particular, if my understanding is correct, the key point that differentiates what Reid is trying to do from what you envision is that he would emit a single type blob per Module. Following up on what Fred mentioned, i.e. "it would make more sense to leave the type info in the object files and to somehow have the LTO link emit external references to it (ala module debugging)", this seems quite LTO friendly and very efficient. I like the fact that you don’t pay the price of building a type hierarchy graph when you don’t need it, and I’m not sure why we should clobber the IR with the whole graph when it is not relevant (i.e. outside of debug-info linking).

On the other hand, it seems that what you’re proposing is basically “optimized” for “type units” (which are not supported on Darwin anyway) and the only advantage we could see is to have an easy way of type-uniquing directly in the IR.

Our conclusion was that for us, a single type blob with some kind of “smart reference” that can point inside the blob from the outside is the most efficient thing we can build upon. However, the cost/benefit of getting there is too high for us to prioritize working on this at this point.
(If I misrepresented anything, please Adrian/Duncan/Fred correct me)

Responses to Mehdi and Eric below.

I don't agree in general here because of:

a) maintainability - there isn't one true path through things, and this now
scatters more Windows knowledge through debug info and LTO

There was never going to be one true way to generate LLVM debug info
for both formats. We need some help from the frontend.

b) higher bar for implementing similar DWARF functionality - there's nothing
here that makes it at any point better for our general debug info support.
Incrementally updating to an intermediate step is much easier and a lower
bar than needing to implement everything up to and including a format-aware
linker and support that through ThinLTO, the JIT, and full LTO.

I claim that everything does not have to be format aware. All it has
to do is call out to a library which is format aware. We can come up
with reasonable high-level abstractions for operations that we'll want
to do on types, such as "extract this type and everything it
references".
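
A minimal sketch of what such a high-level operation could look like, assuming the format-aware library can expose types as a graph of "references" edges. The graph, the type names, and `extract_closure` are hypothetical; no existing LLVM API is implied.

```python
def extract_closure(type_graph, root):
    """Return root plus every type reachable from it, in discovery order."""
    seen, order, stack = set(), [], [root]
    while stack:
        t = stack.pop()
        if t in seen:
            continue  # already extracted; also guards against reference cycles
        seen.add(t)
        order.append(t)
        # Push referenced types in reverse so they are visited in listed order.
        stack.extend(reversed(type_graph.get(t, [])))
    return order

# Hypothetical type graph: each type maps to the types it references.
graph = {
    "Foo": ["int", "Bar*"],
    "Bar*": ["Bar"],
    "Bar": ["Foo", "float"],   # cycle back to Foo
    "Unrelated": ["char"],
}
closure = extract_closure(graph, "Foo")
```

The caller (LTO, ThinLTO, llvm-extract) only asks for the closure; how references are decoded stays inside the format-aware library.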

c) if there's no reason to do this for DWARF there's no reason to do it for
Windows. The existing proposal was a way to get you type emission in the
front end so that you'd have to do less work. Ultimately though I don't see
a reason to do this if all of the platforms don't look the same.

There are reasons to do this for DWARF, but they are not compelling
enough to do a total rewrite of our type information support.

d) ThinLTO/ORC won't support the debug info you have in your proposal right
now without patches

e) You're regressing LTO linking performance hugely for Windows with debug
info until you write the patches that enable format-aware linking of
CodeView information

The way I see it, there is no existing CodeView debug info
functionality to regress for any of ORC, LTO, or ThinLTO. Apparently
we don't see this the same way.

And I've already written the patch to do type merging:
http://reviews.llvm.org/D20122 Regular LTO can call this code, and
rewrite the DITypeIndex numbers with the map produced. While this may
not be directly applicable to ORC and ThinLTO, I don't expect that
supporting them will be much more work.
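
The general shape of "merge the type streams, then rewrite indices with the map produced" can be sketched as follows. This is not the code from D20122; the record layout `(kind, referenced_indices, payload)` is a simplified stand-in for CodeView type records, and the sketch assumes records only refer to earlier records in their own stream.

```python
def merge_type_streams(dest, src):
    """Append records of `src` that `dest` lacks; return the old->new index map."""
    index_of = {rec: i for i, rec in enumerate(dest)}
    remap = {}
    for old_index, (kind, refs, payload) in enumerate(src):
        # Rewrite references into the merged numbering before comparing records.
        rewritten = (kind, tuple(remap[r] for r in refs), payload)
        new_index = index_of.get(rewritten)
        if new_index is None:
            new_index = len(dest)
            dest.append(rewritten)
            index_of[rewritten] = new_index
        remap[old_index] = new_index
    return remap

# Two hypothetical translation units describing overlapping types.
tu1 = [("int", (), "int32"), ("ptr", (0,), ""), ("struct", (0, 1), "Foo")]
tu2 = [("float", (), "f32"), ("int", (), "int32"),
       ("ptr", (1,), ""), ("struct", (1, 2), "Foo")]

merged = []
map1 = merge_type_streams(merged, tu1)
map2 = merge_type_streams(merged, tu2)

# A variable in tu2 that referred to type index 3 is rewritten via the map.
rewritten_var_type = map2[3]
```

After the merge, `struct Foo` from both units shares one record, and every type reference in the second module's IR is updated by a lookup in `map2`.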

On the other hand, it seems that what you're proposing is basically
"optimized" for "type units" (which are not supported on Darwin anyway) and
the only advantage we could see is to have an easy way of type-uniquing
directly in the IR.

Splitting up the type information into opaque units lets you do
format-agnostic type uniquing, but it doesn't let you extract forward
declarations like ThinLTO wants to do.
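
A small sketch of that trade-off, assuming opaque units that describe the same type serialize to identical bytes (an assumption made for illustration, not something the proposal guarantees). The byte strings are made up.

```python
import hashlib

def unique_units(modules):
    """Unique opaque type units across modules by content hash, without decoding."""
    unique = {}
    for units in modules:
        for unit in units:
            unique.setdefault(hashlib.sha256(unit).hexdigest(), unit)
    return unique

# Hypothetical per-module lists of opaque, already-serialized type units.
module_a = [b"struct Foo {int x;}", b"struct Bar {Foo *f;}"]
module_b = [b"struct Bar {Foo *f;}", b"struct Baz {float y;}"]

uniqued = unique_units([module_a, module_b])
```

Uniquing needs nothing but the bytes. Producing a forward declaration of `Bar`, by contrast, would require decoding a unit, which is exactly what an opaque unit does not allow.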

Our conclusion was that for us, a single type blob with some kind of "smart
reference" that can point inside the blob from the outside is the most
efficient thing we can build upon. However, the cost/benefit of getting
there is too high for us to prioritize working on this at this point.
(If I misrepresented anything, please Adrian/Duncan/Fred correct me)

Yeah, this is kind of where I am. Having one blob per module is
probably the most efficient thing possible that I could do for
CodeView, but I estimate that the cost of also doing it for DWARF is
very high. We have a lot of dependencies on the existing
representation. We can try to generalize up-front emission
to DWARF, but I think if we don't pay the full cost, we will end up
with something half-baked for DWARF. I don't think I have the time to
do it justice.

Speaking of the idea of smart references that point out of the IR into
separate type info, my current approach (DITypeIndex) is very
CV-specific. However, I think if we allow one kind of smart reference,
we can add support for more, and they can be format-specific. As long
as we're OK making DITypeRefs opaque, adding new kinds of type refs is
cheap.
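
One way to picture opaque, format-specific type refs is a tagged reference whose payload only a matching resolver understands. This is a sketch under assumptions: `TypeRef`, the kind tags, and the resolver registry are hypothetical names, not DITypeIndex or DITypeRef as implemented anywhere.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TypeRef:
    kind: str   # hypothetical tag, e.g. "cv-index"
    value: int  # opaque to generic code; only the matching resolver reads it

RESOLVERS = {}

def resolver(kind):
    """Register a format-specific resolver for one kind of type ref."""
    def register(fn):
        RESOLVERS[kind] = fn
        return fn
    return register

@resolver("cv-index")
def resolve_cv_index(ref, type_stream):
    # CodeView-like: the value is an index into a list of type records.
    return type_stream[ref.value]

@resolver("blob-offset")
def resolve_blob_offset(ref, blob):
    # Another made-up kind: the value is a byte offset into a type blob.
    return blob[ref.value:ref.value + 4]

def resolve(ref, type_info):
    # Generic code dispatches on the tag and never interprets the payload.
    return RESOLVERS[ref.kind](ref, type_info)

cv_type = resolve(TypeRef("cv-index", 1), ["int", "float"])
blob_type = resolve(TypeRef("blob-offset", 2), b"abcdefgh")
```

Adding a new kind of ref is then one more registered resolver; passes that merely copy or compare refs are unaffected.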

From: cfe-dev [mailto:cfe-dev-bounces@lists.llvm.org] On Behalf Of Reid
Kleckner via cfe-dev
Sent: Wednesday, May 11, 2016 10:40 AM
To: Mehdi Amini <mehdi.amini@apple.com>
Cc: llvm-dev <llvm-dev@lists.llvm.org>; Clang Dev <cfe-dev@lists.llvm.org>
Subject: Re: [cfe-dev] [llvm-dev] RFC: Up front type information generation in
clang and llvm

There was never going to be one true way to generate LLVM debug info
for both formats. We need some help from the frontend.

I believe that Amjad Aboud has argued several times that there could be one true way to generate LLVM debug info such that both Windows and DWARF debug info could be generated from it. I know for a fact that within the Intel Compiler the FE generates a single debug info representation, which then gets translated into either MS PDB format or DWARF depending on the target platform.

Architecturally, that is very desirable. You really do not want every FE to have to know about, and generate, different debug info depending on whether it is targeting Windows or a DWARF-enabled target, do you?

If we go with the existing metadata representation, we will need to
extend it to be the union of DWARF and CodeView, and that will require
frontends to feed us more information specific to CodeView. In other
words, "we need help from the frontend." Depending on your
perspective, you could see this as spreading Windows knowledge across
the codebase.

I think extending the DI metadata is definitely workable. As you say,
it is obviously very useful for other frontends. I just feel that the
representation shift is needlessly inefficient and stands in our way
when we need to express things that it can't yet represent.

This is a bit blurry to me, as it seems somewhat orthogonal: the fact that there is an interface exposed to the frontends to emit debug info should be almost independent of where we actually emit the blob.
So yes, such an interface would require the frontends to expose the union of the information needed to emit DWARF and CodeView, but it does not imply that the metadata representation needs to be extended (i.e. behind such an interface you could get the current metadata for DWARF and the single blob for CodeView).
Did I miss something?
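
The separation being asked about could look roughly like this: one front-end-facing interface, with structured metadata behind it in one case and a single serialized blob in the other. All names here (`TypeSink`, `MetadataSink`, `BlobSink`, `describe_point`) are hypothetical, and the blob encoding is invented for illustration rather than being CodeView.

```python
from abc import ABC, abstractmethod

class TypeSink(ABC):
    """Hypothetical front-end-facing interface; not an existing LLVM API."""
    @abstractmethod
    def add_struct(self, name, fields):
        ...

class MetadataSink(TypeSink):
    """Keeps structured nodes, in the spirit of the current metadata (DWARF path)."""
    def __init__(self):
        self.nodes = []

    def add_struct(self, name, fields):
        self.nodes.append({"tag": "structure_type", "name": name, "fields": list(fields)})
        return len(self.nodes) - 1  # node index acts as the type reference

class BlobSink(TypeSink):
    """Serializes immediately into one opaque per-module blob (CodeView path)."""
    def __init__(self):
        self.blob = bytearray()

    def add_struct(self, name, fields):
        offset = len(self.blob)
        record = name + "{" + ",".join(f"{n}:{t}" for n, t in fields) + "}"
        self.blob += record.encode() + b"\0"  # made-up encoding
        return offset  # byte offset acts as the type reference

def describe_point(sink):
    # The front end makes the same call regardless of the sink behind it.
    return sink.add_struct("Point", [("x", "int"), ("y", "int")])

metadata_sink, blob_sink = MetadataSink(), BlobSink()
metadata_ref = describe_point(metadata_sink)
blob_ref = describe_point(blob_sink)
```

In this sketch the front end supplies the same information either way; only the sink decides whether it is kept as structured nodes or emitted as bytes up front.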