Plans for module debugging

I recently had a chat with Eric Christopher and David Blaikie to discuss
ideas for debug info for Clang modules and this is what we came up with.

Goals
-----

Clang modules [1] (and their siblings, C++ modules and precompiled header
files) are a method for improving compile time by making the serialized AST
for commonly used header files directly available to the compiler.

Currently, debug info is completely oblivious to this: when the developer
compiles a file that uses a type from a module, clang simply emits a copy
of the full definition of this type (some exceptions apply for C++) in
DWARF into the debug info section of the resulting object file. That's a
lot of copies.

The key idea is to emit DWARF for types defined in modules only once, and
to emit only references to these types in the individual compile units
that import the module. We are going to build on the split DWARF and type
unit facilities provided by DWARF for this. DWARF consumers can follow the
type references into the module's debug info section, much as they resolve
types in external type units today. Additionally, the format will allow
consumers that support clang modules natively (such as LLDB) to look up
types in the module directly, without having to go through the usual
translation from AST to DWARF and back to AST.

The primary benefit from doing all this is performance. This change is
expected to reduce the size of the debug info in object files significantly
by
- emitting only references to the full types and thus
- implicitly uniquing types that are defined in modules.
The smaller object files will result in faster compile times and faster
llvm::Module load times when doing LTO. The type uniquing will also result
in significantly smaller debug info for the finished executables,
especially for C and Objective-C, which do not support ODR-based type
uniquing. This comes at the price of longer initial module build times, as
debug info is emitted alongside the module.

Design
------

Clang modules are designed to be ephemeral build artifacts that live in a
shared module cache. Compiling a source file that imports `MyModule`
causes `MyModule.pcm` to be generated in the module cache directory,
containing the serialized AST of the declarations found in the header
files that comprise the module.

We will change the binary clang module format to become a container (ELF
or Mach-O, depending on the platform). Inside the container there will be
multiple sections: one containing the serialized AST, and others containing
DWARF5 split debug type information for all types defined in the module
that can be encoded in DWARF. By virtue of using type units, each type is
emitted into its own type unit, which can be identified via a unique type
signature. DWARF consumers can use the type signatures to look up type
definitions in the module debug info section. For module-aware consumers
(LLDB), we will add an index that maps type signatures directly to an
offset in the AST section.
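
To make that lookup concrete, here is a minimal sketch of the kind of
signature-to-offset index a module-aware consumer could search. This is
illustrative C++ with invented names; the actual index format is not fixed
by this proposal::

  #include <algorithm>
  #include <cstdint>
  #include <optional>
  #include <vector>

  // One entry of the hypothetical index at the top of the .ast section:
  // a type signature paired with the offset of the corresponding
  // declaration in the serialized AST.
  struct ASTIndexEntry {
    uint64_t Signature; // the same 8-byte signature used by DW_FORM_ref_sig8
    uint64_t ASTOffset; // offset into the .ast section
  };

  // Entries are assumed to be sorted by signature, so the consumer can
  // binary-search them without parsing any DWARF.
  std::optional<uint64_t>
  lookupASTOffset(const std::vector<ASTIndexEntry> &Index, uint64_t Sig) {
    auto It = std::lower_bound(
        Index.begin(), Index.end(), Sig,
        [](const ASTIndexEntry &E, uint64_t S) { return E.Signature < S; });
    if (It != Index.end() && It->Signature == Sig)
      return It->ASTOffset;
    return std::nullopt; // not in this module; fall back to regular DWARF
  }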

For an object file that was built using modules, we need to record the
fact that a module has been imported. To this end, we add a
DW_TAG_compile_unit into a COMDAT .debug_info.dwo section that references
the split DWARF inside the module. Similar to split DWARF objects, the
module will be identified by its filename and a checksum. The imported unit
also contains a couple of extra attributes holding all the information
necessary to recreate the module in case the module cache has been flushed.
Platforms that treat modules as an explicit build artifact do not have this
problem. In the .debug_info section, all types that are defined in the
module are referenced via their unique type signature using
DW_FORM_ref_sig8, just as they would be if they were types from a regular
DWARF type unit.

Example
-------

Let's say we have a module `MyModule` that defines a type `MyStruct`::
$ cat foo.c
#include <MyModule.h>
MyStruct x;

when compiling `foo.c` like this::
clang -fmodules -gmodules foo.c -c

clang produces `foo.o` and an ELF or Mach-O container for the module::
/path/to/module-cache/MyModule.pcm

In the module container, we have a section for the serialized AST and
split DWARF sections for the debug type info. The exact format is likely
still going to evolve a little, but this should give a rough idea::

MyModule.pcm:
  .debug_info.dwo:
    DW_TAG_compile_unit
      DW_AT_dwo_name ("/path/to/MyModule.pcm")
      DW_AT_dwo_id ([unique AST signature])

    DW_TAG_type_unit ([hash for MyStruct])
       DW_TAG_structure_type
          DW_AT_signature ([hash for MyStruct])
          DW_AT_name ("MyStruct")
          ...

  .debug_abbrev.dwo:
    // abbrevs referenced by .debug_info.dwo
  .debug_line.dwo:
    // filenames referenced by .debug_info.dwo
  .debug_str.dwo:
    // strings referenced by .debug_info.dwo

  .ast
    // Index at the top of the AST section sorted by hash value.
    [hash for MyStruct] -> [offset for MyStruct in this section]
    ...
    // Serialized AST follows
    ...

The debug info in foo.o will look like this::

.debug_info.dwo

(so if this goes in debug_info.dwo then it would be in foo.dwo, not
foo.o... but I had some further thoughts about this... )

So - imagining a future in which modules are real object files that get
linked into the final executable because they contain things like
definitions of linkonce_odr functions (so that any object file that has all
the linkonce_odr calls inlined doesn't have to carry around a (probably
duplicate) definition of the function) - then that object file could also
contain the skeleton CU unit (& associated line table, string table, etc)
for not only these functions, but for all the types, etc, as well.

In that world, we would have exactly fission, nothing new (no two-level
fission, where some static-data-only skeletons appear in the .dwo file and
the skeletons with non-static data (ie: with relocations, such as those
describing concrete function definitions or global variables) appear in the
.o file).

We can reach that same output today by adding these skeletons into the .o
file (in debug_info, not debug_info.dwo) and using comdat to unique them
during linking.

This option would be somewhat wasteful for now (& in the future for any
module that had /no/ concrete code that could be kept in the module - such
as would be the case in pure template libraries with no explicit
instantiation decl/defs, etc) because it would put module references in the
.o, but it would mean not having to teach tools new fission tricks
immediately.

Then, if we wanted to add an optimization of double-indirection fission
(having skeleton CUs in .dwo files that reference further .dwo files) we
could do that as a separate step on top.

It's just a thought - Maybe it's an unnecessary extra step and we should
just go for the double-indirection from the get-go, I'm not sure?

Opinions?


> We can reach that same output today by adding these skeletons into the .o
> file (in debug_info, not debug_info.dwo) and using comdat to unique them
> during linking.

Just to be sure, could you clarify what exactly would go into these skeletons? I’m a little worried that this may increase the size of the .o files quite a bit and thus eat into our performance gains.

> This option would be somewhat wasteful for now (& in the future for any
> module that had /no/ concrete code that could be kept in the module - such
> as would be the case in pure template libraries with no explicit
> instantiation decl/defs, etc) because it would put module references in
> the .o, but it would mean not having to teach tools new fission tricks
> immediately.

At least as far as LLDB is concerned, it currently doesn’t support fission at all, so we will have to start fresh there anyway.

> Then, if we wanted to add an optimization of double-indirection fission
> (having skeleton CUs in .dwo files that reference further .dwo files) we
> could do that as a separate step on top.
>
> It's just a thought - Maybe it's an unnecessary extra step and we should
> just go for the double-indirection from the get-go, I'm not sure?

Given our plans for a more efficient “bag of dwarf”+index format, which will need work on the consumer side anyway, I’m leaning more towards the latter, but I can see the attractiveness of having a format that works with existing dwarf consumers out-of-the-box. It looks like the pure fission format would make a better default for platforms that use, e.g., gdb as their default debugger.

-- adrian

>> We can reach that same output today by adding these skeletons into the
>> .o file (in debug_info, not debug_info.dwo) and using comdat to unique
>> them during linking.
>
> Just to be sure, could you clarify what exactly would go into these
> skeletons? I’m a little worried that this may increase the size of the .o
> files quite a bit and thus eat into our performance gains.

Just the basic fission stuff; in this case I think it would just be an
abbreviation (which we could also comdat fold) and a single CU:

DW_TAG_compile_unit [1]
  DW_AT_GNU_dwo_name [DW_FORM_strp] ( .debug_str[0x00000000] = "foo.dwo")
  DW_AT_comp_dir [DW_FORM_strp] ( .debug_str[0x00000008] =
"/tmp/dbginfo")
  DW_AT_GNU_dwo_id [DW_FORM_data8] (0xbc33964b89b37bb0)

There shouldn't be a need for a line table if the (non-dwo) CU has no
decl_file attributes (which it won't, because it won't have anything other
than the DW_TAG_compile_unit tag).

You could attach your fancy-modules tags (for reconstituting ASTs, etc) to
either this CU or to the 'full' CU in the .dwo file/module (if you attach
them here, then you save the DWARF consumer from having to read more DWARF
in the module/dwo entirely for now (when we have real code in modules
that's linked in, you won't be able to make that shortcut - since some of
the DWARF you'll still actually need, even if you go to the AST file for
most of the types, etc))

>> This option would be somewhat wasteful for now [...] because it would
>> put module references in the .o, but it would mean not having to teach
>> tools new fission tricks immediately.
>
> At least as far as LLDB is concerned, it currently doesn’t support fission
> at all, so we will have to start fresh there anyway.

Sure, but even for LLDB there might be some benefit in starting with fewer
moving parts - implementing existing Fission first, where we already have
known-good output from LLVM (in the sense that GDB can correctly consume
it) - but I do realize this is perhaps not a very strong
argument/motivation.

>> Then, if we wanted to add an optimization of double-indirection fission
>> (having skeleton CUs in .dwo files that reference further .dwo files) we
>> could do that as a separate step on top.
>>
>> It's just a thought - Maybe it's an unnecessary extra step and we should
>> just go for the double-indirection from the get-go, I'm not sure?
>
> Given our plans for a more efficient “bag of dwarf”+index format, which
> will need work on the consumer side anyway, I’m leaning more towards the
> latter, but I can see the attractiveness of having a format that works
> with existing dwarf consumers out-of-the-box. It looks like the pure
> fission format would make a better default for platforms that use, e.g.,
> gdb as their default debugger.

At least initially it'll be easier for GDB since it already works there. We
might want to take advantage of it in GDB too if there's a big savings to
be had (or if it'll be a long time before we're linking in object files
from modules - we could have the savings in the interim until we have to
move debug info for modules back into the linked executable anyway) because
there are many modules that don't have any actual code or globals to link
into the final executable.

- David

> Just to be sure, could you clarify what exactly would go into these
> skeletons? I’m a little worried that this may increase the size of the .o
> files quite a bit and thus eat into our performance gains.

FWIW skeletons are very small and completely dwarfed (get it?) by the size of the line table or anything else.

-eric

>> Just to be sure, could you clarify what exactly would go into these
>> skeletons? I’m a little worried that this may increase the size of the
>> .o files quite a bit and thus eat into our performance gains.
>
> FWIW skeletons are very small and completely dwarfed (get it?) by the
> size of the line table or anything else.

Yeah, just reserving some judgment until actual numbers/providing an 'out'
if necessary/hedging bets/etc. It was worth removing type unit skeletons
(which, between debug_types and .rela.debug_types, made up about 30% of
the debug info in .o files), and module skeletons would probably be one or
two orders of magnitude smaller (10-100 public types per module?), so
that's something like 4% (for 1:10) or less, probably much less.

Plans for module debugging

I recently had a chat with Eric Christopher and David Blaikie to discuss
ideas for debug info for Clang modules and this is what we came up with.

Goals
-----

Clang modules [1] (and their siblings, C++ modules and precompiled header
files) are a method for improving compile time by making the serialized AST
for commonly used header files directly available to the compiler.

Currently, debug info is completely oblivious to this: when the developer
compiles a file that uses a type from a module, clang simply emits a copy
of the full definition of this type (some exceptions apply for C++) in
DWARF into the debug info section of the resulting object file. That's a
lot of copies.

The key idea is to emit DWARF for types defined in modules only once, and
to emit only references to these types in the individual compile units
that import the module. We are going to build on the split DWARF and type
unit facilities provided by DWARF for this. DWARF consumers can follow the
type references into the module's debug info section, much as they resolve
types in external type units today. Additionally, the format will allow
consumers that support clang modules natively (such as LLDB) to look up
types in the module directly, without having to go through the usual
translation from AST to DWARF and back to AST.

The primary benefit from doing all this is performance. This change is
expected to reduce the size of the debug info in object files significantly
by
- emitting only references to the full types and thus
- implicitly uniquing types that are defined in modules.
The smaller object files will result in faster compile times and faster
llvm::Module load times when doing LTO. The type uniquing will also result
in significantly smaller debug info for the finished executables,
especially for C and Objective-C, which do not support ODR-based type
uniquing. This comes at the price of longer initial module build times, as
debug info is emitted alongside the module.

Design
------

Clang modules are designed to be ephemeral build artifacts that live in a
shared module cache. Compiling a source file that imports `MyModule`
causes `MyModule.pcm` to be generated in the module cache directory,
containing the serialized AST of the declarations found in the header
files that comprise the module.

We will change the binary clang module format to become a container (ELF
or Mach-O, depending on the platform). Inside the container there will be
multiple sections: one containing the serialized AST, and others containing
DWARF5 split debug type information for all types defined in the module
that can be encoded in DWARF. By virtue of using type units, each type is
emitted into its own type unit, which can be identified via a unique type
signature. DWARF consumers can use the type signatures to look up type
definitions in the module debug info section. For module-aware consumers
(LLDB), we will add an index that maps type signatures directly to an
offset in the AST section.

For an object file that was built using modules, we need to record the
fact that a module has been imported. To this end, we add a
DW_TAG_compile_unit into a COMDAT .debug_info.dwo section that references
the split DWARF inside the module. Similar to split DWARF objects, the
module will be identified by its filename and a checksum. The imported unit
also contains a couple of extra attributes holding all the information
necessary to recreate the module in case the module cache has been flushed.

How does the debugging experience work in this case? When do you trigger
the (possibly-lengthy) rebuild of the source in order to recreate the DWARF
for the module (is it possible to delay that until the information is
needed)? How much knowledge does the debugger have/need of Clang's modules
to do this? Are we just embedding an arbitrary command that can be run to
rebuild the .dwo if it's missing? And if so, how do we make that safe when
(say) root attaches a debugger to an arbitrary process?

Platforms that treat modules as an explicit build artifact do not have
this problem. In the .debug_info section, all types that are defined in
the module are referenced via their unique type signature using
DW_FORM_ref_sig8, just as they would be if they were types from a regular
DWARF type unit.

Example
-------

Let's say we have a module `MyModule` that defines a type `MyStruct`::
$ cat foo.c
#include <MyModule.h>
MyStruct x;

when compiling `foo.c` like this::
clang -fmodules -gmodules foo.c -c

clang produces `foo.o` and an ELF or Mach-O container for the module::
/path/to/module-cache/MyModule.pcm

In the module container, we have a section for the serialized AST and
split DWARF sections for the debug type info. The exact format is likely
still going to evolve a little, but this should give a rough idea::

MyModule.pcm:
  .debug_info.dwo:
    DW_TAG_compile_unit
      DW_AT_dwo_name ("/path/to/MyModule.pcm")
      DW_AT_dwo_id ([unique AST signature])

    DW_TAG_type_unit ([hash for MyStruct])
       DW_TAG_structure_type
          DW_AT_signature ([hash for MyStruct])
          DW_AT_name ("MyStruct")
          ...

  .debug_abbrev.dwo:
    // abbrevs referenced by .debug_info.dwo
  .debug_line.dwo:
    // filenames referenced by .debug_info.dwo
  .debug_str.dwo:
    // strings referenced by .debug_info.dwo

  .ast
    // Index at the top of the AST section sorted by hash value.
    [hash for MyStruct] -> [offset for MyStruct in this section]
    ...
    // Serialized AST follows
    ...

The debug info in foo.o will look like this::

.debug_info.dwo
   DW_TAG_compile_unit
      // For DWARF consumers
      DW_AT_dwo_name ("/path/to/module-cache/MyModule.pcm")
      DW_AT_dwo_id ([unique AST signature])

      // For LLDB / dsymutil so they can recreate the module
      DW_AT_name ("MyModule")
      DW_AT_LLVM_system_root "/"
      DW_AT_LLVM_preprocessor_defines "-DNDEBUG"
      DW_AT_LLVM_include_path "/path/to/MyModule.map"

.debug_info
   DW_TAG_compile_unit
     DW_TAG_variable
       DW_AT_name "x"
       DW_AT_type (DW_FORM_ref_sig8) ([hash for MyStruct])

Type signatures
---------------

We are going to deviate from the DWARF spec by using a more efficient
hashing function that uses the type's unique mangled name and the name of
the module as input.

Why do you need/want the name of the module here? Modules are not a
namespacing mechanism. How would you compute this name when the same type
is defined in multiple imported modules?

For languages that do not have mangled type names or an ODR,

The people working on C modules have expressed an intent to apply the ODR
there too, so it's not clear that Clang modules will support any such
language in the longer term.

we will use the unique identifiers produced by the clang indexer (USRs) as
input instead.
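
For illustration only, a signature computation of the shape being
discussed could look like the sketch below. FNV-1a is merely a stand-in
hash; the actual algorithm, and whether the module name should participate
at all, is exactly what is being discussed above::

  #include <cstdint>
  #include <string>

  // Illustrative 64-bit FNV-1a hash; NOT the hash the proposal will use.
  static uint64_t fnv1a64(uint64_t Hash, const std::string &S) {
    for (unsigned char C : S) {
      Hash ^= C;
      Hash *= 1099511628211ULL; // FNV prime
    }
    return Hash;
  }

  // Hypothetical signature for a type: mix the unique mangled name (or a
  // clang USR for languages without mangled names) with the module name.
  uint64_t typeSignature(const std::string &MangledNameOrUSR,
                         const std::string &ModuleName) {
    uint64_t Hash = 14695981039346656037ULL; // FNV offset basis
    Hash = fnv1a64(Hash, ModuleName);
    Hash = fnv1a64(Hash, MangledNameOrUSR);
    return Hash;
  }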

Extension: Replacing type units with a more efficient storage format
--------------------------------------------------------------------

As an extension to this proposal, we are thinking of replacing the type
units within the module debug info with a more efficient format: instead
of emitting each type into its own type unit (complete with its entire
DeclContext), it would be much more efficient to emit one large bag of
DWARF together with an index that maps hash values (type signatures) to
DIE offsets.

Next steps
----------

In order to implement this, the next steps would be as follows:
1. Change the clang module format to be an ELF/Mach-O container.
2. Teach clang to emit debug info for module types (e.g., by passing an
empty compile unit with retained types to LLVM) into the module container.
3a. Add a -gmodules switch to clang that triggers the emission of type
signatures for types coming from a module.

Can you clarify what this flag would do? Does this turn on adding DWARF to
the .pcm file? Does it turn off generating DWARF for imported modules in
the current IR module? Both?

I assume this means that the default remains that we build debug
information for modules as if we didn't have modules (that is, put complete
DWARF with the object code). Do you think that's the right long-term
default? I think it's possibly not.

How does this interact with explicit module builds? Can I use a module
built without -g in a compile that uses -g? And if I do, do I get complete
debug information, or debug info just for the parts that aren't in the
module? Does -gmodules let me choose between these?

3b. Implement type-signature-based lookup in llvm-dsymutil and lldb.

The module debugging scenario is primarily aimed at providing a better/faster edit-compile-debug cycle. In this scenario, the module would most likely still be in the cache. In a case where the binary was built so long ago that the module cache has since been flushed, it is generally more likely that the user also used a DWARF linking step (such as dsymutil on Darwin, and maybe dwz on Linux?) because they did a release/archive build, which would just copy the DWARF out of the module and store it alongside the binary. For this reason I’m not very concerned about the time necessary for rebuilding the module. But this is all very platform-specific, and different platforms may need different defaults.
Delaying the module DWARF output until needed (maybe even by the debugger!) is an interesting idea. We should definitely measure how expensive it is to emit DWARF for an entire module’s worth of types to see if this is worthwhile.

I think it is reasonable to assume that a consumer that can make use of clang modules also knows how to rebuild clang modules, which is why the example only contained the name of the module, sysroot, include path, and defines; not an arbitrary command. On platforms where the debugger does not understand clang modules, the whole problem can be dodged by treating the modules as explicit build artifacts.

Great point! I’m mostly concerned about non-ODR languages …

… and this may be the answer to the question!

+Doug: do Objective-C modules have an ODR?

It would emit references to types from imported modules instead of the types themselves.
Since the module cache is shared, we could — depending on just how expensive this is — turn on DWARF generation for .pcm files by default. I’d like to measure this first, though.

I think you’re absolutely right about the long term. In the short term, it may be better to have compatibility by default, but I don’t know what the official LLVM policy on new features is, if there is one.

Personally I would expect old-style (full copy of the types) debug information if I build against a module that does not have embedded debug information.

thanks,
adrian

Thanks for the below; I think these are good answers, at least for now.

> The module debugging scenario is primarily aimed at providing a
> better/faster edit-compile-debug cycle. In this scenario, the module
> would most likely still be in the cache. [...] For this reason I’m not
> very concerned about the time necessary for rebuilding the module. But
> this is all very platform-specific, and different platforms may need
> different defaults.
>
> I think it is reasonable to assume that a consumer that can make use of
> clang modules also knows how to rebuild clang modules, which is why the
> example only contained the name of the module, sysroot, include path, and
> defines; not an arbitrary command. On platforms where the debugger does
> not understand clang modules, the whole problem can be dodged by treating
> the modules as explicit build artifacts.

I think you're essentially saying "you can't reliably and transparently use
implicit modules + module DWARF + gdb" (at least, not if your module might
get rebuilt or the cache might get purged before you debug). That seems a
little unsatisfying, but if we make it easy to copy the DWARF out of the
module cache and into the binary / a .dwo file in your build area at link
time, I think it's not too bad. Can we add such functionality to the link
step somehow?

One case I'm worried about is that the user does a build, then debugs their
program a bit, then reboots their machine (which happens to wipe out their
/tmp and their module cache), and then they can't debug any more.

> The module debugging scenario is primarily aimed at providing a
> better/faster edit-compile-debug cycle. In this scenario, the module
> would most likely still be in the cache. [...] For this reason I’m not
> very concerned about the time necessary for rebuilding the module.

This description is in terms of building a module that has gone missing, but just to be clear: a modules-aware debugger probably also needs to rebuild modules that have gone out of date, such as when one of their headers is modified.

> I think it is reasonable to assume that a consumer that can make use of
> clang modules also knows how to rebuild clang modules, which is why the
> example only contained the name of the module, sysroot, include path, and
> defines; not an arbitrary command.

You are probably already aware, but you will need a bunch more information (language options, target options, header search options) to rebuild a module.

> This description is in terms of building a module that has gone missing,
> but just to be clear: a modules-aware debugger probably also needs to
> rebuild modules that have gone out of date, such as when one of their
> headers is modified.

In the case where the module is out of date, the debugger should probably fall back to the DWARF types, because it cannot guarantee that the modifications to the header files did not change the types it wants to look up.

> You are probably already aware, but you will need a bunch more
> information (language options, target options, header search options) to
> rebuild a module.

Thanks, language options and target options were absent from the list previously!

– adrian

> In the case where the module is out of date, the debugger should probably
> fall back to the DWARF types, because it cannot guarantee that the
> modifications to the header files did not change the types it wants to
> look up.

Are you also worried about this when the debugger builds a module that has gone missing? At that point there is nothing to fall back to, and rebuilding the module could produce incorrect information.

> This description is in terms of building a module that has gone missing,
> but just to be clear: a modules-aware debugger probably also needs to
> rebuild modules that have gone out of date, such as when one of their
> headers is modified.

I see this as largely similar to the “source is newer than the program” warning that you get from some debuggers. No particular preference on what should be redone, but if it’s particularly problematic the debugger could error here.

-eric

> In the case where the module is out of date, the debugger should probably
> fall back to the DWARF types, because it cannot guarantee that the
> modifications to the header files did not change the types it wants to
> look up.

Sorry, I just realized that this doesn’t make any sense if the DWARF is stored in the module. The behavior should be:

  1. If the module is missing, recreate the module.
  2. If the module signature does not match the signature in the .o file, either print a large warning that types from that module may be bogus, or categorically refuse to use them.

For long-term debugging, users are expected to use a DWARF linker (dsymutil, dwz), which archives all types in a future-proof format (DWARF).
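
A minimal sketch of that policy (illustrative C++ only; the helper types
and names are hypothetical, not existing LLDB API) could be::

  #include <cstdint>

  // Hypothetical, simplified view of what the debugger knows.
  struct ModuleState {
    bool PresentInCache;  // is the .pcm still in the module cache?
    uint64_t Signature;   // signature of the cached (or rebuilt) .pcm
  };

  enum class Action { UseModuleTypes, RebuildThenRecheck, WarnOrRefuse };

  // Step 1: a missing module is rebuilt first; step 2: if the signature
  // does not match the one recorded in the .o file, the types may be
  // bogus, so warn loudly or refuse to use them.
  Action decide(const ModuleState &M, uint64_t SignatureInObjectFile) {
    if (!M.PresentInCache)
      return Action::RebuildThenRecheck;
    if (M.Signature != SignatureInObjectFile)
      return Action::WarnOrRefuse;
    return Action::UseModuleTypes;
  }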

– adrian

> Sorry, I just realized that this doesn’t make any sense if the DWARF is
> stored in the module. The behavior should be:
>
>   1. If the module is missing, recreate the module.
>   2. If the module signature does not match the signature in the .o file,
>      either print a large warning that types from that module may be
>      bogus, or categorically refuse to use them.

Maybe this is described elsewhere, but what is the “signature” being used here? Assuming it depends on the detailed contents of the serialized AST: currently ASTWriter output is nondeterministic and things like the ID#s for identifiers, types, etc. will change every time you build the module; until that gets fixed, we would always hit case (2).

> For long-term debugging, users are expected to use a DWARF linker
> (dsymutil, dwz), which archives all types in a future-proof format
> (DWARF).

This is how I’m envisioning it, yes.

-eric

> Maybe this is described elsewhere, but what is the “signature” being used
> here? Assuming it depends on the detailed contents of the serialized AST:
> currently ASTWriter output is nondeterministic and things like the ID#s
> for identifiers, types, etc. will change every time you build the module;
> until that gets fixed, we would always hit case (2).

I was actually hoping that we could rely on deterministic output from clang. If it is infeasible to make ASTWriter output deterministic, we can fall back to something like the DWARF dwo_id signature here.

– adrian

> Sorry, I just realized that this doesn’t make any sense if the DWARF is
> stored in the module. The behavior should be:
>
>   1. If the module is missing, recreate the module.
>   2. If the module signature does not match the signature in the .o file,
>      either print a large warning that types from that module may be
>      bogus, or categorically refuse to use them.

FWIW, GDB currently does the latter for mismatched fission files (if the
signatures don't match it prints an error and doesn't use them).

Side note/thought: We could imagine a semi-module-aware debugger that knew
enough to rerun the compiler, but not enough to parse clang ASTs (I
imagine it wouldn't be /that/ hard to teach GDB this) - and then it would
just load the file as normal and actually hit the same fission signature
validation path it already has (and refuse to load mismatched debug info).

Just so I'm clear - in any case (fully Clang-AST-loading-aware-debugger, or
the semi-aware case I described above) the only reason to rebuild the debug
info is for cases where the source hasn't changed, but the module cache has
been flushed for some reason (such as a system restart)?

- David

> Maybe this is described elsewhere, but what is the “signature” being used
> here? Assuming it depends on the detailed contents of the serialized AST:
> currently ASTWriter output is nondeterministic and things like the ID#s
> for identifiers, types, etc. will change every time you build the module;
> until that gets fixed, we would always hit case (2).

Perhaps it’s a reason to be more rigorous about the ID you put in the module? :)

That said, yes - if you rebuild the module and the number changes then you can’t rely upon the contents. If we can get a number that assures us that the contents haven’t changed then it’ll make sense to use that instead. Algorithm is up to you on how to determine this.

-eric

That sounds reasonable.

Yes. According to the Modules section of the Clang 3.6 documentation, by default modules live for 7 days, and they live in $TMPDIR.