[RFC] Placing profile name data, and coverage data, outside of object files

Problem
-------

Instrumentation for PGO and frontend-based coverage places a large amount of
data in object files, even though the majority of this data is not needed at
run-time. All the data is needlessly duplicated while generating archives, and
again while linking. PGO name data is written out into raw profiles by
instrumented programs, slowing down the training and code coverage workflows.

Here are some numbers from a coverage + RA build of ToT clang:

  * Size of the build directory: 4.3 GB

  * Wall time needed to run "clang -help" with an SSD: 0.5 seconds

  * Size of the clang binary: 725.24 MB

  * Space wasted on duplicate name/coverage data (*.o + *.a): 923.49 MB
    - Size contributed by __llvm_covmap sections: 1.02 GB
      \_ Just within clang: 340.48 MB

    - Size contributed by __llvm_prf_names sections: 327.46 MB
      \_ Just within clang: 106.76 MB

    => Space wasted within the clang binary: 447.24 MB

Running an instrumented clang binary triggers a 143MB raw profile write which
is slow even with an SSD. This problem is particularly bad for frontend-based
coverage because it generates a lot of extra name data: however, the situation
can also be improved for PGO instrumentation.

I want to point out that this is a problem with FE instrumentation when
coverage is turned on. Without coverage, the name section size will be
significantly smaller.

With IR PGO, the name section is even smaller. For instance, the
IR-instrumented clang binary is 122.3MB, while the name section is only
2.3MB, so the space wasted is < 2%.

Proposal
--------

Place PGO name data and coverage data outside of object files. This would
eliminate data duplication in *.a/*.o files, shrink binaries, shrink raw
profiles, and speed up instrumented programs.

This sounds fine as long as the behavior (for names) is controlled by an
option. For IR PGO, name size is *not* an issue, so keeping the name data
in the binary and dumping it with the profile data has a usability advantage
-- the profile data is self-contained. Having coverage trigger the behavior
difference is one possible choice.

As for coverage mapping data, splitting it out by default seems to be a
more desirable behavior. The data embedded in the binary is not even used
by the profile runtime (of course, the runtime can choose to dump it so that
llvm-cov does not need to look for the executable binary). The sole purpose
of emitting it with the object file is to treat the executable/object as the
container for the mapping data. The usability of llvm-cov won't be reduced
by the proposed change.

In more detail:

1. The frontends get a new `-fprofile-metadata-dir=<path>` option. This lets
users specify where llvm will store profile metadata. If the metadata starts
to take up too much space, there's just one directory to clean.

Why not leverage the -fcoverage-mapping option -- i.e. add a new flavor of
this option that accepts the metadata path: -fcoverage-mapping=<path>? If the
path is not specified, the data will be emitted with the object files.

2. The frontends continue emitting PGO name data and coverage data in the
same
llvm::Module. So does LLVM's IR-based PGO implementation. No change here.

3. If the InstrProf lowering pass sees that a metadata directory is available,
it constructs a new module, copies the name/coverage data into it, hashes the
module, and attempts to write that module to:

  <metadata-dir>/<module-hash>.bc (the metadata module)

If this write operation fails, it scraps the new module: it keeps all the
metadata in the original module, and there are no changes from the current
process. I.e., with this proposal we preserve backwards compatibility.
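
For concreteness, here is a minimal sketch (in no way an actual patch) of what
this step could look like inside the lowering pass, assuming the pre-encoded
coverage and name blobs are already in hand; the helper name and exact layout
are invented for illustration:

  // Hypothetical sketch only: emit the name/coverage blobs into a side module
  // named after a hash of its contents, and fall back to in-object emission
  // if the write fails, as described above.
  #include "llvm/ADT/SmallString.h"
  #include "llvm/ADT/Twine.h"
  #include "llvm/Bitcode/BitcodeWriter.h"
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/GlobalVariable.h"
  #include "llvm/IR/Module.h"
  #include "llvm/Support/FileSystem.h"
  #include "llvm/Support/MD5.h"
  #include "llvm/Support/Path.h"
  #include "llvm/Support/raw_ostream.h"
  using namespace llvm;

  static bool emitMetadataModule(LLVMContext &Ctx, StringRef MetadataDir,
                                 StringRef CovMapBlob, StringRef NamesBlob,
                                 std::string &OutPath) {
    Module M("profile-metadata", Ctx);
    auto AddBlob = [&](StringRef Blob, StringRef Name, StringRef Section) {
      auto *Init = ConstantDataArray::getString(Ctx, Blob, /*AddNull=*/false);
      auto *GV = new GlobalVariable(M, Init->getType(), /*isConstant=*/true,
                                    GlobalValue::PrivateLinkage, Init, Name);
      GV->setSection(Section);
    };
    AddBlob(CovMapBlob, "__llvm_coverage_mapping", "__llvm_covmap");
    AddBlob(NamesBlob, "__llvm_prf_nm", "__llvm_prf_names");

    // Name the file after a content hash so different TUs never collide.
    MD5 Hash;
    Hash.update(CovMapBlob);
    Hash.update(NamesBlob);
    MD5::MD5Result Result;
    Hash.final(Result);
    SmallString<32> HashStr = Result.digest();
    SmallString<128> Path(MetadataDir);
    sys::path::append(Path, Twine(HashStr) + ".bc");
    OutPath = Path.str().str();

    std::error_code EC;
    raw_fd_ostream OS(OutPath, EC, sys::fs::OF_None);
    if (EC)
      return false; // Caller keeps the data in the original module instead.
    WriteBitcodeToFile(M, OS);
    return true;
  }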

Or simply emit the raw data as separate coverage notes files (like GCC's .gcno).

4. Once the metadata module is written, the name/coverage data are entirely
stripped out of the original module. They are replaced by a path to the
metadata module:

  @__llvm_profiling_metadata = "<metadata-dir>/<module-hash>.bc",
                               section "__llvm_prf_link"

This allows incremental builds to work properly, which is an important use
case for code coverage users. When an object is rebuilt, it gets a fresh link
to a fresh profiling metadata file. Although stale files can accumulate in the
metadata directory, the stale files cannot ever be used.

Why is this needed for incremental builds? The file emitted is simply a
build artifact, not an input to the build.

In an IDE like Xcode, since there's just one target binary per scheme, it's
possible to clean the metadata directory by removing the modules which aren't
referenced by the target binary.
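
To make the cleanup idea concrete, a hedged sketch (a hypothetical helper, not
part of the proposal): given the set of metadata paths still referenced by the
target binary, say gathered from its __llvm_prf_link entries, delete every
other *.bc file left in the directory:

  // Illustrative only: prune metadata modules that are no longer referenced
  // by the target binary. Best-effort; failures are simply ignored here.
  #include "llvm/ADT/StringSet.h"
  #include "llvm/Support/FileSystem.h"
  using namespace llvm;

  static void pruneMetadataDir(StringRef Dir, const StringSet<> &Referenced) {
    std::error_code EC;
    for (sys::fs::directory_iterator It(Dir, EC), End; It != End && !EC;
         It.increment(EC)) {
      StringRef Path = It->path();
      if (Path.endswith(".bc") && !Referenced.count(Path))
        (void)sys::fs::remove(Path);
    }
  }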

5. The raw profile format is updated so that links to metadata files are
written out in each profile. This makes it possible for all existing
llvm-profdata and llvm-cov commands to work, seamlessly.

It may not be as smooth as you hope: the directory containing the build
artifact may not be accessible when the llvm-profdata tool is run. This is
especially true for distributed build systems -- without telling the build
system, the metadata won't even be copied back to the user.

Since the user explicitly asks for emitting the data into a directory, it
won't be a usability regression to require the user to specify the path to
locate the metadata -- this is especially true for llvm-cov, which requires
the user to specify the binary path anyway.

This requirement can simplify the implementation even more, as there seems
to be no need to write any link data into the binary.

The indexed profile format will *not* be updated: i.e., it will contain a
full symbol table, and no links. This simplifies the coverage mapping
reader, because a full symbol table is guaranteed to exist before any
function records are parsed. It also reduces the amount of coding, and makes
it easier to preserve backwards compatibility :).

6. The raw profile reader will learn how to read links, open up the metadata
modules it finds links to, and collect name data from those modules.

See above, I think it is better to explicitly pass the directory to the
reader.

7. The coverage reader will learn how to read the __llvm_prf_link section,
open up metadata modules, and lazily read coverage mapping data.
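
As a rough sketch of how the readers in steps 6 and 7 might locate the
external module -- assuming the __llvm_prf_link section really does just hold
a path string, and with a made-up helper name -- something like the following
could work:

  // Hypothetical sketch: find the path stored in the proposed __llvm_prf_link
  // section of an instrumented binary. A real reader would then load that
  // bitcode file (e.g. with parseBitcodeFile) and pull the name/coverage
  // globals out of it. Section-name spelling can vary by object format.
  #include "llvm/Object/ObjectFile.h"
  #include "llvm/Support/Error.h"
  using namespace llvm;
  using namespace llvm::object;

  static Expected<std::string> findMetadataLink(StringRef BinaryPath) {
    Expected<OwningBinary<Binary>> BinOrErr = createBinary(BinaryPath);
    if (!BinOrErr)
      return BinOrErr.takeError();
    auto *Obj = dyn_cast<ObjectFile>(BinOrErr->getBinary());
    if (!Obj)
      return createStringError(std::errc::invalid_argument,
                               "not an object file");
    for (const SectionRef &Sec : Obj->sections()) {
      Expected<StringRef> Name = Sec.getName();
      if (!Name)
        return Name.takeError();
      if (!Name->contains("llvm_prf_link"))
        continue;
      Expected<StringRef> Contents = Sec.getContents();
      if (!Contents)
        return Contents.takeError();
      return Contents->rtrim('\0').str(); // the stored metadata module path
    }
    return createStringError(std::errc::no_such_file_or_directory,
                             "no __llvm_prf_link section found");
  }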

Alternate Solutions
-------------------

1. Instead of copying name data into an external metadata module, just copy
the coverage mapping data.

I've actually prototyped this. This might be a good way to split up patches,
although I don't see why we wouldn't want to tackle the name data problem
eventually.

I think this can be a good first step.

2. Instead of emitting links to external metadata modules, modify llvm-cov
and llvm-profdata so that they require a path to the metadata directory.

I second this.

The issue with this is that it's way too easy to read stale metadata. It's
also less user-friendly, which hurts adoption.

I don't think it will be less user-friendly. See reasons mentioned above.

3. Use something other than llvm bitcode for the metadata module format.

Since we're mostly writing large binary blobs (compressed name data or
pre-encoded source range mapping info), using bitcode shouldn't be too slow,
and we're not likely to get better compression with a different format.

Bitcode is also convenient, and is nice for backwards compatibility.

Or a simpler wrapper format. Some data is probably needed to justify the
decision.
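
For illustration, such a "simpler wrapper" could be as small as the following
sketch; the magic string and layout here are invented, not a proposed format:

  // A deliberately trivial container: a magic tag, a version, and two
  // length-prefixed blobs, showing how little structure would be needed.
  #include "llvm/Support/raw_ostream.h"
  #include <cstdint>
  using namespace llvm;

  static void writeWrapper(raw_ostream &OS, StringRef CovMap, StringRef Names) {
    auto WriteU64 = [&OS](uint64_t V) {
      // Fixed little-endian encoding so the reader is host-independent.
      for (int I = 0; I < 8; ++I)
        OS << static_cast<char>((V >> (8 * I)) & 0xff);
    };
    OS << "LPRM";              // magic (invented)
    WriteU64(1);               // format version
    WriteU64(CovMap.size());
    WriteU64(Names.size());
    OS << CovMap << Names;
  }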

David

Problem
-------

Instrumentation for PGO and frontend-based coverage places a large amount of
data in object files, even though the majority of this data is not needed at
run-time. All the data is needlessly duplicated while generating archives, and
again while linking. PGO name data is written out into raw profiles by
instrumented programs, slowing down the training and code coverage workflows.

Here are some numbers from a coverage + RA build of ToT clang:

  * Size of the build directory: 4.3 GB

  * Wall time needed to run "clang -help" with an SSD: 0.5 seconds

  * Size of the clang binary: 725.24 MB

  * Space wasted on duplicate name/coverage data (*.o + *.a): 923.49 MB
    - Size contributed by __llvm_covmap sections: 1.02 GB
      \_ Just within clang: 340.48 MB

We live with this duplication for debug info. In some sense, if the
overhead is small compared to debug info, should we even bother (i.e., we
assume that users accommodate debug builds, so that is a reasonable bound
on the tolerable build directory size). (I don't know the numbers; this
seems pretty large so maybe it is significant compared to debug info; just
saying that looking at absolute numbers is misleading here; numbers
compared to debug info are a closer measure to the user's perceptions)

In fact, one overall architectural observation I have is that the most
complicated part of all this is simply establishing the workflow to plumb
together data emitted per-TU to a tool that needs that information to do
some post-processing step on the results of running the binary. That sounds
a lot like the role of debug info. In fact, having a debugger open a core
file is precisely equivalent to what llvm-profdata needs to do in this
regard AFAICT.

So it would be best if possible to piggyback on all the effort that has
gone into plumbing that data to make debug info work. For example, I know
that on Darwin there's a fair amount of system-level integration to make
split dwarf "just work" while keeping debug info out of final binaries.

If there is a not-too-hacky way to piggyback on debug info, that's likely
to be a really slick solution. For example, debug info could in principle
(if it doesn't already) contain information about the name of each counter
in the counter array, so in principle it would be a complete enough
description to identify each counter.

I'm not very familiar with DWARF, but I'm imagining something like
reserving an LLVM vendor-specific DWARF opcode/attribute/whatever and then
stick a blob of data in there. Presumably we have code somewhere in LLDB
that is "here's a binary, find debug info for it", and in principle we
could factor out that code and lift it into an LLVM library
(libFindDebugInfo) that llvm-profdata could use.

    - Size contributed by __llvm_prf_names sections: 327.46 MB
      \_ Just within clang: 106.76 MB

    => Space wasted within the clang binary: 447.24 MB

Running an instrumented clang binary triggers a 143MB raw profile write which
is slow even with an SSD. This problem is particularly bad for frontend-based
coverage because it generates a lot of extra name data: however, the situation
can also be improved for PGO instrumentation.

Proposal
--------

Place PGO name data and coverage data outside of object files. This would
eliminate data duplication in *.a/*.o files, shrink binaries, shrink raw
profiles, and speed up instrumented programs.

In more detail:

1. The frontends get a new `-fprofile-metadata-dir=<path>` option. This lets
users specify where llvm will store profile metadata. If the metadata starts
to take up too much space, there's just one directory to clean.

2. The frontends continue emitting PGO name data and coverage data in the
same
llvm::Module. So does LLVM's IR-based PGO implementation. No change here.

3. If the InstrProf lowering pass sees that a metadata directory is available,
it constructs a new module, copies the name/coverage data into it, hashes the
module, and attempts to write that module to:

  <metadata-dir>/<module-hash>.bc (the metadata module)

If this write operation fails, it scraps the new module: it keeps all the
metadata in the original module, and there are no changes from the current
process. I.e with this proposal we preserve backwards compatibility.

Based on my experience with Clang's implicit modules, I'm *extremely* wary
of anything that might cause the compiler to emit a file that the build
system cannot guess the name of. In fact, having the compiler emit a file
that is not explicitly listed on the command line is basically just as bad
in practice (in terms of the feasibility of informing the build system about
it).

As a simple example, ninja simply cannot represent a dependency of this
type, so if you delete a <metadata-dir>/<module-hash>.bc it won't know
things need to be rebuilt (and it won't know how to clean it, etc.).

So I would really strongly recommend against doing this.

Again, these problems of system integration (in particular build system
integration) are nasty, and if you can bypass this and piggyback on debug
info then everything will "just work" because the folks that care about
making sure that debugging "just works" already did the work for you.
It might be more work in the short term to do the debug info approach (if
it is feasible at all), but I can tell you based on the experience with
implicit modules (and I'm sure you have some experience of your own) that
there's just going to be a neverending tail of hitches and ways that things
don't work (or work poorly) due to not having the build system / overall
system integration right, so it will be worth it in the long run.

-- Sean Silva

Problem
-------

Instrumentation for PGO and frontend-based coverage places a large amount of
data in object files, even though the majority of this data is not needed at
run-time. All the data is needlessly duplicated while generating archives, and
again while linking. PGO name data is written out into raw profiles by
instrumented programs, slowing down the training and code coverage workflows.

Here are some numbers from a coverage + RA build of ToT clang:

  * Size of the build directory: 4.3 GB

  * Wall time needed to run "clang -help" with an SSD: 0.5 seconds

  * Size of the clang binary: 725.24 MB

  * Space wasted on duplicate name/coverage data (*.o + *.a): 923.49 MB
    - Size contributed by __llvm_covmap sections: 1.02 GB
      \_ Just within clang: 340.48 MB

We live with this duplication for debug info. In some sense, if the
overhead is small compared to debug info, should we even bother (i.e., we
assume that users accommodate debug builds, so that is a reasonable bound
on the tolerable build directory size). (I don't know the numbers; this
seems pretty large so maybe it is significant compared to debug info; just
saying that looking at absolute numbers is misleading here; numbers
compared to debug info are a closer measure to the user's perceptions)

In fact, one overall architectural observation I have is that the most
complicated part of all this is simply establishing the workflow to plumb
together data emitted per-TU to a tool that needs that information to do
some post-processing step on the results of running the binary. That sounds
a lot like the role of debug info. In fact, having a debugger open a core
file is precisely equivalent to what llvm-profdata needs to do in this
regard AFAICT.

In fact, it's so equivalent that you could in principle read the actual
counter values directly out of a core file. A core file could literally be
used as a raw profile.

E.g. you could in principle open the core in the debugger and then do:

p __profd_foo
p __profd_bar
...

(and walking vprof nodes would be more complicated but doable)

I'm not necessarily advocating this literally be done; just showing that
"everything you need is there".

Note also that the debug info approach has another nice advantage in that
it allows minimizing the runtime memory overhead for the program image to
the absolute minimum, which is important for embedded applications. Debug
info naturally stays out of the program image and so this problem is
automatically solved.

-- Sean Silva

Problem
-------

Instrumentation for PGO and frontend-based coverage places a large amount of
data in object files, even though the majority of this data is not needed at
run-time. All the data is needlessly duplicated while generating archives, and
again while linking. PGO name data is written out into raw profiles by
instrumented programs, slowing down the training and code coverage workflows.

Here are some numbers from a coverage + RA build of ToT clang:

  * Size of the build directory: 4.3 GB

  * Wall time needed to run "clang -help" with an SSD: 0.5 seconds

  * Size of the clang binary: 725.24 MB

  * Space wasted on duplicate name/coverage data (*.o + *.a): 923.49 MB
    - Size contributed by __llvm_covmap sections: 1.02 GB
      \_ Just within clang: 340.48 MB

We live with this duplication for debug info. In some sense, if the
overhead is small compared to debug info, should we even bother (i.e., we
assume that users accommodate debug builds, so that is a reasonable bound
on the tolerable build directory size). (I don't know the numbers; this
seems pretty large so maybe it is significant compared to debug info; just
saying that looking at absolute numbers is misleading here; numbers
compared to debug info are a closer measure to the user's perceptions)

In fact, one overall architectural observation I have is that the most
complicated part of all this is simply establishing the workflow to plumb
together data emitted per-TU to a tool that needs that information to do
some post-processing step on the results of running the binary. That sounds
a lot like the role of debug info. In fact, having a debugger open a core
file is precisely equivalent to what llvm-profdata needs to do in this
regard AFAICT.

In fact, it's so equivalent that you could in principle read the actual
counter values directly out of a core file. A core file could literally be
used as a raw profile.

E.g. you could in principle open the core in the debugger and then do:

p __profd_foo
p __profd_bar
...

Sorry, should be __profc I think (or whatever the counters are called)

-- Sean Silva

Problem
-------

Instrumentation for PGO and frontend-based coverage places a large amount of
data in object files, even though the majority of this data is not needed at
run-time. All the data is needlessly duplicated while generating archives, and
again while linking. PGO name data is written out into raw profiles by
instrumented programs, slowing down the training and code coverage workflows.

Here are some numbers from a coverage + RA build of ToT clang:

  * Size of the build directory: 4.3 GB

  * Wall time needed to run "clang -help" with an SSD: 0.5 seconds

  * Size of the clang binary: 725.24 MB

  * Space wasted on duplicate name/coverage data (*.o + *.a): 923.49 MB
    - Size contributed by __llvm_covmap sections: 1.02 GB
      \_ Just within clang: 340.48 MB

We live with this duplication for debug info. In some sense, if the
overhead is small compared to debug info, should we even bother (i.e., we
assume that users accommodate debug builds, so that is a reasonable bound
on the tolerable build directory size). (I don't know the numbers; this
seems pretty large so maybe it is significant compared to debug info; just
saying that looking at absolute numbers is misleading here; numbers
compared to debug info are a closer measure to the user's perceptions)

In fact, one overall architectural observation I have is that the most
complicated part of all this is simply establishing the workflow to plumb
together data emitted per-TU to a tool that needs that information to do
some post-processing step on the results of running the binary. That sounds
a lot like the role of debug info. In fact, having a debugger open a core
file is precisely equivalent to what llvm-profdata needs to do in this
regard AFAICT.

In fact, it's so equivalent that you could in principle read the actual
counter values directly out of a core file. A core file could literally be
used as a raw profile.

E.g. you could in principle open the core in the debugger and then do:

p __profd_foo
p __profd_bar
...

(and walking vprof nodes would be more complicated but doable)

I'm not necessarily advocating this literally be done; just showing that
"everything you need is there".

A core file can be significantly larger than raw profile data, and is
usually truncated unless the core size limit is set. The in-process profile
merging performance will be really bad.

Note also that the debug info approach has another nice advantage in that
it allows minimizing the runtime memory overhead for the program image to
the absolute minimum, which is important for embedded applications. Debug
info naturally stays out of the program image and so this problem is
automatically solved.

Note that instrumented binaries (not even considering coverage mapping) are
usually built without debug information. Mixing them can lead to a
significant object size increase, which can cause linker failures for large
applications.

David

Problem
-------

Instrumentation for PGO and frontend-based coverage places a large amount of
data in object files, even though the majority of this data is not needed at
run-time. All the data is needlessly duplicated while generating archives, and
again while linking. PGO name data is written out into raw profiles by
instrumented programs, slowing down the training and code coverage workflows.

Here are some numbers from a coverage + RA build of ToT clang:

  * Size of the build directory: 4.3 GB

  * Wall time needed to run "clang -help" with an SSD: 0.5 seconds

  * Size of the clang binary: 725.24 MB

  * Space wasted on duplicate name/coverage data (*.o + *.a): 923.49 MB
    - Size contributed by __llvm_covmap sections: 1.02 GB
      \_ Just within clang: 340.48 MB

We live with this duplication for debug info. In some sense, if the
overhead is small compared to debug info, should we even bother (i.e., we
assume that users accommodate debug builds, so that is a reasonable bound
on the tolerable build directory size). (I don't know the numbers; this
seems pretty large so maybe it is significant compared to debug info; just
saying that looking at absolute numbers is misleading here; numbers
compared to debug info are a closer measure to the user's perceptions)

In fact, one overall architectural observation I have is that the most
complicated part of all this is simply establishing the workflow to plumb
together data emitted per-TU to a tool that needs that information to do
some post-processing step on the results of running the binary. That sounds
a lot like the role of debug info. In fact, having a debugger open a core
file is precisely equivalent to what llvm-profdata needs to do in this
regard AFAICT.

In fact, it's so equivalent that you could in principle read the actual
counter values directly out of a core file. A core file could literally be
used as a raw profile.

E.g. you could in principle open the core in the debugger and then do:

p __profd_foo
p __profd_bar
...

(and walking vprof nodes would be more complicated but doable)

I'm not necessarily advocating this literally be done; just showing that
"everything you need is there".

A core file can be significantly larger than raw profile data, and is
usually truncated unless the core size limit is set. The in-process profile
merging performance will be really bad.

Hence "in principle". Not in practice as you rightly point out.

Note also that the debug info approach has another nice advantage in that
it allows minimizing the runtime memory overhead for the program image to
the absolute minimum, which is important for embedded applications. Debug
info naturally stays out of the program image and so this problem is
automatically solved.

Note that instrumented binaries (not even considering coverage mapping) are
usually built without debug information. Mixing them can lead to a
significant object size increase, which can cause linker failures for large
applications.

I was imagining piggybacking on the debug info sections themselves (just
storing a blob of instrumentation data in there) as a way to naturally
transport the data. Not necessarily having any relationship between
instrumentation data and debug info. So with instrumentation enabled and
debug info disabled, we would still emit a debug section but it wouldn't
contain real debug info; it would just contain a blob of data that
instrumentation would otherwise put elsewhere.

Also, the piggyback-on-debug-info approach (or any approach trying to
externalize metadata, including Vedant's original proposal) will require
extra inputs to llvm-profdata, which may lead to workflow complexity.

We will of course want to retain the current approach where the raw
profiles are standalone as it will be preferable for initial porting
efforts and also likely the right choice in many scenarios due to
simplicity (as you point out, with IRPGO the overhead is less, so
externalizing the names isn't as much of an issue).

-- Sean Silva

Problem
-------

Instrumentation for PGO and frontend-based coverage places a large amount of
data in object files, even though the majority of this data is not needed at
run-time. All the data is needlessly duplicated while generating archives, and
again while linking. PGO name data is written out into raw profiles by
instrumented programs, slowing down the training and code coverage workflows.

Here are some numbers from a coverage + RA build of ToT clang:

  * Size of the build directory: 4.3 GB

  * Wall time needed to run "clang -help" with an SSD: 0.5 seconds

  * Size of the clang binary: 725.24 MB

  * Space wasted on duplicate name/coverage data (*.o + *.a): 923.49 MB
    - Size contributed by __llvm_covmap sections: 1.02 GB
      \_ Just within clang: 340.48 MB

We live with this duplication for debug info. In some sense, if the
overhead is small compared to debug info, should we even bother (i.e., we
assume that users accommodate debug builds, so that is a reasonable bound
on the tolerable build directory size). (I don't know the numbers; this
seems pretty large so maybe it is significant compared to debug info; just
saying that looking at absolute numbers is misleading here; numbers
compared to debug info are a closer measure to the user's perceptions)

From a build directory point of view, I agree. However, when deploying on an
embedded device with “limited” space/memory, you can strip the debug info
and keep it locally because it isn't needed on the device for running
(or remote-debugging). Is that the case with the profile info?

__llvm_prf_names and __llvm_prf_data can’t be stripped at the moment: in-process profile writing code copies them into the profile file. Changing that is part of this proposal (but it could be fixed with a narrower change). -Eli

With the recent change of profile dumping (merge mode), the IO issue with name write should no longer be a problem (it is written only once).

For coverage mapping data, another possible solution is to introduce a post-link tool that strips and compresses the coverage mapping data from the final binary and copies it to a different file. This step can be done manually by the user, or by the compiler driver when coverage mapping is on. The name data can be copied too, but that requires a slight llvm-profdata workflow change under a flag.
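
A hedged sketch of what such a post-link step could look like (a hypothetical
helper that only copies the section out; compressing the output and actually
stripping the section from the binary, e.g. with objcopy/strip, would be
layered on top):

  // Illustrative only: copy the raw __llvm_covmap payload out of a linked
  // binary into a side file.
  #include "llvm/Object/ObjectFile.h"
  #include "llvm/Support/Error.h"
  #include "llvm/Support/FileSystem.h"
  #include "llvm/Support/raw_ostream.h"
  using namespace llvm;
  using namespace llvm::object;

  static Error extractCovMap(StringRef BinaryPath, StringRef OutFile) {
    Expected<OwningBinary<Binary>> BinOrErr = createBinary(BinaryPath);
    if (!BinOrErr)
      return BinOrErr.takeError();
    auto *Obj = dyn_cast<ObjectFile>(BinOrErr->getBinary());
    if (!Obj)
      return createStringError(std::errc::invalid_argument,
                               "not an object file");
    for (const SectionRef &Sec : Obj->sections()) {
      Expected<StringRef> Name = Sec.getName();
      if (!Name)
        return Name.takeError();
      if (!Name->contains("llvm_covmap")) // spelling varies by object format
        continue;
      Expected<StringRef> Data = Sec.getContents();
      if (!Data)
        return Data.takeError();
      std::error_code EC;
      raw_fd_ostream OS(OutFile, EC, sys::fs::OF_None);
      if (EC)
        return errorCodeToError(EC);
      OS << *Data;
      return Error::success();
    }
    return createStringError(std::errc::no_such_file_or_directory,
                             "no coverage mapping section found");
  }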

David

Problem

Instrumentation for PGO and frontend-based coverage places a large amount of
data in object files, even though the majority of this data is not needed at
run-time. All the data is needlessly duplicated while generating archives, and
again while linking. PGO name data is written out into raw profiles by
instrumented programs, slowing down the training and code coverage workflows.

Here are some numbers from a coverage + RA build of ToT clang:

  * Size of the build directory: 4.3 GB

  * Wall time needed to run "clang -help" with an SSD: 0.5 seconds

  * Size of the clang binary: 725.24 MB

  * Space wasted on duplicate name/coverage data (*.o + *.a): 923.49 MB
    - Size contributed by __llvm_covmap sections: 1.02 GB
      \_ Just within clang: 340.48 MB

    - Size contributed by __llvm_prf_names sections: 327.46 MB
      \_ Just within clang: 106.76 MB

    => Space wasted within the clang binary: 447.24 MB

Running an instrumented clang binary triggers a 143MB raw profile write which
is slow even with an SSD. This problem is particularly bad for frontend-based
coverage because it generates a lot of extra name data: however, the situation
can also be improved for PGO instrumentation.

I want to point out that this is a problem with FE instrumentation with coverage turned on. Without coverage turned on, the name section size will be significantly smaller.

Yes, it’s 15MB, or about 7 times smaller.

With IR PGO, the name section size is even smaller. For instance, the IR instrumented clang size is 122.3MB, while the name section size is only 2.3MB, the space wasted is < 2%.

It’s also worth pointing out that with r306561, name data is only written out by the runtime a constant number of times per program, and not on every program invocation. That’s a big win.

The other side to this is that there are valid use cases that the online profile merging mode doesn’t support, e.g. generating separate sets of profiles by training on different inputs, or generating separate coverage reports for each test case. In these cases, having the option to not write out name data is a win.

Proposal

Place PGO name data and coverage data outside of object files. This would
eliminate data duplication in .a/.o files, shrink binaries, shrink raw
profiles, and speed up instrumented programs.

This sounds fine as long as the behavior (for name) is controlled under an option. For IR PGO, name size is not an issue, so keeping the name data in binary and dumped with profile data has advantage in terms of usability – the profile data is self contained. Turning on coverage trigger the behavior difference is one possible choice.

The splitting behavior would be opt-in.

As for coverage mapping data, splitting it out by default seems to be a more desirable behavior.

This can’t be a default behavior. The user/build system/IDE would need to specify a metadata file/directory.

The data embedded in the binary is not even used by the profile runtime (of course the runtime can choose to dump it so that the llvm-cov data does not need to look for the executable binary). The sole purpose of emitting it with the object file is to treat the executable/object as the mapping data container. The usability of llvm-cov won’t reduce with the proposed change.

In more detail:

  1. The frontends get a new -fprofile-metadata-dir=<path> option. This lets
    users specify where llvm will store profile metadata. If the metadata starts to
    take up too much space, there’s just one directory to clean.

Why not leverage the -fcoverage-mapping option – i.e. add a new flavor of this option that accepts the metadata path: -fcoverage-mapping=<path>? If the path is not specified, the data will be emitted with object files.

This would limit the ability to store name data outside of a binary to FE-style coverage users. I recognize that large name sections aren’t always as problematic when using IR/FE PGO, but this seems like an unnecessary restriction.

  2. The frontends continue emitting PGO name data and coverage data in the same
    llvm::Module. So does LLVM’s IR-based PGO implementation. No change here.

  3. If the InstrProf lowering pass sees that a metadata directory is available,
    it constructs a new module, copies the name/coverage data into it, hashes the
    module, and attempts to write that module to:

<metadata-dir>/<module-hash>.bc (the metadata module)

If this write operation fails, it scraps the new module: it keeps all the
metadata in the original module, and there are no changes from the current
process. I.e with this proposal we preserve backwards compatibility.

Or simply emit the raw file as coverage notes files (gcno).

After reading through the comments, I think it would be better to have the build system specify where the external data goes, and to have just one external file formed post-link.

  4. Once the metadata module is written, the name/coverage data are entirely
    stripped out of the original module. They are replaced by a path to the
    metadata module:

@__llvm_profiling_metadata = "<metadata-dir>/<module-hash>.bc",
                             section "__llvm_prf_link"

This allows incremental builds to work properly, which is an important use case
for code coverage users. When an object is rebuilt, it gets a fresh link to a
fresh profiling metadata file. Although stale files can accumulate in the
metadata directory, the stale files cannot ever be used.

Why is this needed for incremental build? The file emitted is simply a build artifact, not an input to the build.

If llvm-cov just has a path to a directory, it can only load all of the data in the directory. But the aggregate data would not be self-consistent:

$ ninja foo
<Rename/delete/edit a file.>
$ ninja foo

This isn’t a problem if there is only one external metadata file (that the build system knows about).

In an IDE like Xcode, since there’s just one target binary per scheme, it’s
possible to clean the metadata directory by removing the modules which aren’t
referenced by the target binary.

  5. The raw profile format is updated so that links to metadata files are written
    out in each profile. This makes it possible for all existing llvm-profdata and
    llvm-cov commands to work, seamlessly.

It may not be as smooth as you hope: the directory containing the build artifact may not be accessible when the llvm-profdata tool is run. This is especially true for distributed build systems – without telling the build system, the metadata won’t even be copied back to the user.

This is another reason the build system should be aware of any metadata stored outside of the object file.

Since user explicitly asks for emitting the data into a directory, it won’t be a usability regression to require the user to specify the path to locate the meta data – this is especially true for llvm-cov which requires user to specify the binary path anyway.

This requirement can simplify the implementation even more as there seems no need to write any link data in the binary.

This is from a later email, but I’d like to follow up to this comment here:

For coverage mapping data, another possible solution is to introduce a post-link tool that strips and compresses the coverage mapping data from the final binary and copies it to a different file. This step can be manually done by the user or by the compiler driver when coverage mapping is on. The name data can be copied too, but it requires slight llvm-profdata work flow change under a flag.

I’ve already alluded to this: this sounds like a simpler plan. Kinda like dsymutil + strip.

I’m currently traveling (sorry for the delayed responses), and will send out a revised proposal in a week or so.

thanks,
vedant

Problem

Instrumentation for PGO and frontend-based coverage places a large amount of
data in object files, even though the majority of this data is not needed at
run-time. All the data is needlessly duplicated while generating archives, and
again while linking. PGO name data is written out into raw profiles by
instrumented programs, slowing down the training and code coverage workflows.

Here are some numbers from a coverage + RA build of ToT clang:

  * Size of the build directory: 4.3 GB

  * Wall time needed to run "clang -help" with an SSD: 0.5 seconds

  * Size of the clang binary: 725.24 MB

  * Space wasted on duplicate name/coverage data (*.o + *.a): 923.49 MB
    - Size contributed by __llvm_covmap sections: 1.02 GB
      \_ Just within clang: 340.48 MB

We live with this duplication for debug info. In some sense, if the overhead is small compared to debug info, should we even bother (i.e., we assume that users accommodate debug builds, so that is a reasonable bound on the tolerable build directory size). (I don’t know the numbers; this seems pretty large so maybe it is significant compared to debug info; just saying that looking at absolute numbers is misleading here; numbers compared to debug info are a closer measure to the user’s perceptions)

The size of a RelWithDebInfo build directory for the same checkout is 9 GB (I’m still just building clang, this time without instrumentation). We (more or less) get away with this because the debug info isn’t copied into the final binary [1]. We’re not getting away with this with coverage. E.g. we usually store bot artifacts for a while, but we had to shut this functionality off almost immediately for our coverage bots because the uploads were horrific.

In fact, one overall architectural observation I have is that the most complicated part of all this is simply establishing the workflow to plumb together data emitted per-TU to a tool that needs that information to do some post-processing step on the results of running the binary. That sounds a lot like the role of debug info. In fact, having a debugger open a core file is precisely equivalent to what llvm-profdata needs to do in this regard AFAICT.

So it would be best if possible to piggyback on all the effort that has gone into plumbing that data to make debug info work. For example, I know that on Darwin there’s a fair amount of system-level integration to make split dwarf “just work” while keeping debug info out of final binaries.

If there is a not-too-hacky way to piggyback on debug info, that’s likely to be a really slick solution. For example, debug info could in principle (if it doesn’t already) contain information about the name of each counter in the counter array, so in principle it would be a complete enough description to identify each counter.

We don’t emit debug info for this currently. Is there a reason to?

I’m not very familiar with DWARF, but I’m imagining something like reserving an LLVM vendor-specific DWARF opcode/attribute/whatever and then stick a blob of data in there. Presumably we have code somewhere in LLDB that is “here’s a binary, find debug info for it”, and in principle we could factor out that code and lift it into an LLVM library (libFindDebugInfo) that llvm-profdata could use.

This could work for the coverage/name data. There are some really nice pieces of Darwin integration (e.g search-with-Spotlight, findDsymForUUID). I’ll look into this.

    - Size contributed by __llvm_prf_names sections: 327.46 MB
      \_ Just within clang: 106.76 MB

    => Space wasted within the clang binary: 447.24 MB

Running an instrumented clang binary triggers a 143MB raw profile write which
is slow even with an SSD. This problem is particularly bad for frontend-based
coverage because it generates a lot of extra name data: however, the situation
can also be improved for PGO instrumentation.

Proposal

Place PGO name data and coverage data outside of object files. This would
eliminate data duplication in .a/.o files, shrink binaries, shrink raw
profiles, and speed up instrumented programs.

In more detail:

  1. The frontends get a new -fprofile-metadata-dir=<path> option. This lets
    users specify where llvm will store profile metadata. If the metadata starts to
    take up too much space, there’s just one directory to clean.

  2. The frontends continue emitting PGO name data and coverage data in the same
    llvm::Module. So does LLVM’s IR-based PGO implementation. No change here.

  3. If the InstrProf lowering pass sees that a metadata directory is available,
    it constructs a new module, copies the name/coverage data into it, hashes the
    module, and attempts to write that module to:

<metadata-dir>/<module-hash>.bc (the metadata module)

If this write operation fails, it scraps the new module: it keeps all the
metadata in the original module, and there are no changes from the current
process. I.e with this proposal we preserve backwards compatibility.

Based on my experience with Clang’s implicit modules, I’m extremely wary of anything that might cause the compiler to emit a file that the build system cannot guess the name of. In fact, having the compiler emit a file that is not explicitly listed on the command line is basically just as bad in practice (in terms of the feasibility of informing the build system about it).

As a simple example, ninja simply cannot represent a dependency of this type, so if you delete a <metadata-dir>/<module-hash>.bc it won’t know things need to be rebuilt (and it won’t know how to clean it, etc.).

So I would really strongly recommend against doing this.

Again, these problems of system integration (in particular build system integration) are nasty, and if you can bypass this and piggyback on debug info then everything will “just work” because the folks that care about making sure that debugging “just works” already did the work for you.
It might be more work in the short term to do the debug info approach (if it is feasible at all), but I can tell you based on the experience with implicit modules (and I’m sure you have some experience of your own) that there’s just going to be a neverending tail of hitches and ways that things don’t work (or work poorly) due to not having the build system / overall system integration right, so it will be worth it in the long run.

Thanks, this makes a lot of sense. The build system should keep track of where to externalize profile metadata (regardless of whether or not it piggybacks on debug info). In addition to the advantages you’ve listed, this would make testing easier.

vedant

[1] ld64:
2561 if ( strcmp(sect->segname(), "__DWARF") == 0 ) {
2562 // note that .o file has dwarf
2563 _file->_debugInfoKind = ld::relocatable::File::kDebugInfoDwarf;
2564 // save off iteresting dwarf sections

2571 else if ( strcmp(sect->sectname(), "__debug_str") == 0 )
2572 _file->_dwarfDebugStringSect = sect;
2573 // linker does not propagate dwarf sections to output file
2574 continue;

Problem
-------

Instrumentation for PGO and frontend-based coverage places a large amount of
data in object files, even though the majority of this data is not needed at
run-time. All the data is needlessly duplicated while generating archives, and
again while linking. PGO name data is written out into raw profiles by
instrumented programs, slowing down the training and code coverage workflows.

Here are some numbers from a coverage + RA build of ToT clang:

  * Size of the build directory: 4.3 GB

  * Wall time needed to run "clang -help" with an SSD: 0.5 seconds

  * Size of the clang binary: 725.24 MB

  * Space wasted on duplicate name/coverage data (*.o + *.a): 923.49 MB
    - Size contributed by __llvm_covmap sections: 1.02 GB
      \_ Just within clang: 340.48 MB

We live with this duplication for debug info. In some sense, if the
overhead is small compared to debug info, should we even bother (i.e., we
assume that users accommodate debug builds, so that is a reasonable bound
on the tolerable build directory size). (I don't know the numbers; this
seems pretty large so maybe it is significant compared to debug info; just
saying that looking at absolute numbers is misleading here; numbers
compared to debug info are a closer measure to the user's perceptions)

The size of a RelWithDebInfo build directory for the same checkout is 9 GB
(I'm still just building clang, this time without instrumentation).

So it sounds like the 4.3GB build directory you quoted in the OP is
substantially less, so your comment below doesn't make sense. Or was 4.3GB
just the build directory needed for building nothing but the clang binary
and its dependencies? Can you get an apples-to-apples number? (It sounds
like it must have been much more, if the coverage bots had to be turned off,
but it would be useful to get some breakdown and an apples-to-apples number.)

We (more or less) get away with this because the debug info isn't copied
into the final binary [1]. We're not getting away with this with coverage.
E.g we usually store bot artifacts for a while, but we had to shut this
functionality off almost immediately for our coverage bots because the
uploads were horrific.

In fact, one overall architectural observation I have is that the most
complicated part of all this is simply establishing the workflow to plumb
together data emitted per-TU to a tool that needs that information to do
some post-processing step on the results of running the binary. That sounds
a lot like the role of debug info. In fact, having a debugger open a core
file is precisely equivalent to what llvm-profdata needs to do in this
regard AFAICT.

So it would be best if possible to piggyback on all the effort that has
gone into plumbing that data to make debug info work. For example, I know
that on Darwin there's a fair amount of system-level integration to make
split dwarf "just work" while keeping debug info out of final binaries.

If there is a not-too-hacky way to piggyback on debug info, that's likely
to be a really slick solution. For example, debug info could in principle
(if it doesn't already) contain information about the name of each counter
in the counter array, so in principle it would be a complete enough
description to identify each counter.

We don't emit debug info for this currently. Is there a reason to?

Probably not. My suspicion is that the most feasible solution would be one
where we just store a blob of opaque coverage data in the debug info
section. In theory (but probably not in practice) we could lower the
coverage mapping data to some form of debug info (that's what it really is,
after all; it's basically a very precise sort of debug info that is allowed
to impede optimizations to remain precise), but the effort needed to
harmonize the needs of actual debug info with "debug info lowered from
coverage data" would probably be too messy, if it was possible at all.
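
To make the "opaque blob in the debug info" idea concrete on Mach-O -- purely
as an illustration grounded in the ld64 excerpt Vedant quoted, not a vetted
approach -- the lowering could simply place the coverage global in the __DWARF
segment:

  // Sketch: sections in the __DWARF segment are not propagated into the
  // linked image by ld64 (per the quoted ld64 code), so the blob would travel
  // with the debug info rather than the binary. Whether dsymutil and
  // debuggers tolerate an unrecognized __DWARF section is unverified.
  #include "llvm/IR/GlobalVariable.h"
  using namespace llvm;

  static void moveCovMapIntoDwarfSegment(GlobalVariable &CovMapGV) {
    CovMapGV.setSection("__DWARF,__llvm_covmap");
  }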

-- Sean Silva

Could someone summarize the % size costs in objects and executables, release versus unoptimized, and debug vs. no-debug builds? (Maybe that’s too much of a hassle, but I thought it might provide some clarity about the tradeoffs, pain points, etc.)

Also, added dberris here, since if I recall correctly, the XRay work has some similar aspects - where certain mapping structures are kept in the binary and consulted when interpreting XRay traces. In that case it may also be useful to avoid putting those structures into the final binary in some cases for the same sort of size tradeoff reasons.

& then even more worth looking at a generalized solution for these sort of things.

- Dave

Could someone summarize the % size costs in object and executables, release versus unoptimized and debug V no-debug builds? (maybe that’s too much of a hassle, but thought it might provide some clarity about the tradeoffs, pain points, etc)

Also, added dberris here, since if I recall correctly, the XRay work has some similar aspects - where certain mapping structures are kept in the binary and consulted when interpreting XRay traces. In that case it may also be useful to avoid putting those structures into the final binary in some cases for the same sort of size tradeoff reasons.

Yes, in XRay we depend at runtime on being able to access an in-memory array of a certain format/alignment.

It just so happens that we also need to be able to find which functions are instrumented in a particular binary for tooling/interpretation/analysis purposes “offline”.

While emitting the instrumentation map as part of the binary is certainly something convenient for the runtime so as not to require “external input” to find the places in the binary that should be patched, I don’t see it as an actual deal-breaker if the runtime would have a fall-back mechanism for finding the instrumentation map externally. That might introduce a few issues when there’s a mismatch between the instrumentation map’s addresses/offsets and the binary being instrumented. With the nature of XRay, this is very dangerous because a well-crafted instrumentation map can certainly lead to potential abuse. We might need to be clever about using signatures or special markers in the binary and the instrumentation maps to match those up properly.

The tooling (llvm-xray) can already deal with a detached (crudely done, objcopy of the xray_instr_map section or a YAML representation of the same) instrumentation map. However the runtime implementation currently doesn’t. Like mentioned above, there are a few issues to work out in that regard (having a strong “unique” identifier for a binary and the instrumentation map, something potentially involving some robustly computed crypto hash at compile-time, etc.). But it’s certainly a non-trivial problem especially if we want to make it portable across object file formats (I only barely know how ELF works) and platforms (Windows, UNIX, etc.).

& then even more worth looking at a generalized solution for these sort of things.

+1 to a generalised solution. I’d be very happy to be involved in that discussion too.

- Dave

– Dean

This [debug info not being propagated into the final binary] is the case on
Darwin, but not on Linux, I believe (without debug fission, which is still
quite “rare” I believe).