Add support for in-process profile merging in profile-runtime

One of the main missing features in the Clang/LLVM profile runtime is support for online/in-process profile merging. Profile data collected for different workloads of the same executable binary has to be merged later by the offline post-processing tool. This limitation makes it hard to handle cases where the instrumented binary needs to be run with a large number of small workloads, possibly in parallel. For instance, to do PGO for clang, we may choose to build a large project with the instrumented Clang binary. This is problematic because:

  1. to avoid profiles from different runs overwriting one another, the %p substitution needs to be specified on the command line or in an environment variable so that each process dumps profile data into its own file, named using its pid. This creates a huge demand on disk storage. For instance, clang’s raw profile size is typically 80M – if the instrumented clang is used to build a medium to large project (such as clang itself), the profile data can easily use up hundreds of gigabytes of local storage.
  2. pids can also be recycled, which means that some of the profile data may be overwritten without anyone noticing.

The way to solve this problem is to allow profile data to be merged in process. I have a prototype implementation and plan to send it out for review soon after some cleanups. By default, profile merging is off; it can be turned on with a user option or via an environment variable. The following summarizes the issues involved in adding this feature:

  1. the target platform needs to have file locking support;
  2. there needs to be an efficient way to identify the profile data and associate it with the binary, using a binary/profdata signature;
  3. currently, without merging, profile data from shared libraries (including dlopen/dlclose ones) is concatenated into the primary profile file. This complicates matters, as the merger needs to find the matching shared libraries and also avoid unnecessary data movement/copying;
  4. value profile data is variable in length, even for the same binary.
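
As a rough illustration of the mechanism (a minimal sketch, not the actual patch -- the counter-array layout and helper name below are made-up simplifications of the real raw-profile format), the locked in-place raw-to-raw merge could look like:

```c
/* Hypothetical sketch of lock-protected in-process profile merging.
 * The real raw profile has a header, per-function data, counter and name
 * sections; here the file is simplified to a bare array of counters. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/file.h>
#include <unistd.h>

#define NUM_COUNTERS 1024                /* assumption: fixed counter count */
static uint64_t Counters[NUM_COUNTERS];  /* this process's counters */

int dump_with_merge(const char *Path) {
  int Fd = open(Path, O_RDWR | O_CREAT, 0644);
  if (Fd < 0)
    return -1;
  if (flock(Fd, LOCK_EX) != 0) {         /* issue 1: needs file locking */
    close(Fd);
    return -1;
  }
  /* A real implementation would first check a binary/profdata signature
   * (issue 2) so data from a mismatched binary is never merged. */
  uint64_t Old[NUM_COUNTERS] = {0};
  if (read(Fd, Old, sizeof(Old)) > 0)    /* previous run's counters, if any */
    for (unsigned I = 0; I < NUM_COUNTERS; ++I)
      Counters[I] += Old[I];             /* raw-to-raw merge: plain addition */
  lseek(Fd, 0, SEEK_SET);
  int Ok = write(Fd, Counters, sizeof(Counters)) == (ssize_t)sizeof(Counters);
  flock(Fd, LOCK_UN);
  close(Fd);
  return Ok ? 0 : -1;
}
```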

All of the above issues are resolved, and a clang self-build with the instrumented binary passes (with both -j1 and high parallelism).

If you have any concerns, please let me know.

thanks,

David

I have thought about this issue too, in the context of games. We may want to turn profiling on only for certain frames (essentially, this is many small profile runs).

However, I have not seen it demonstrated that this kind of refined data collection will actually improve PGO results in practice.
The evidence I do have, though, is that IIRC Apple has found that almost all of the benefit of PGO for the Clang binary can be gotten with a handful of training runs of Clang. Are your findings different?

Also, in general, I am very wary of file locking. This can cause huge amounts of slowdown for a build and has potential portability problems. I don’t see it as a substantially better solution than wrapping clang in a script that runs clang and then just calls llvm-profdata to do the merging. Running llvm-profdata is cheap compared to doing locking in a highly parallel situation like a build.

– Sean Silva

From: "Sean Silva via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Xinliang David Li" <davidxl@google.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Saturday, February 27, 2016 8:50:05 PM
Subject: Re: [llvm-dev] Add support for in-process profile merging in profile-runtime

I have thought about this issue too, in the context of games. We may
want to turn profiling only for certain frames (essentially, this is
many small profile runs).

However, I have not seen it demonstrated that this kind of refined
data collection will actually improve PGO results in practice.
The evidence I do have though is that IIRC Apple have found that
almost all of the benefits of PGO for the Clang binary can be gotten
with a handful of training runs of Clang. Are your findings
different?

Also, in general, I am very wary of file locking.

As am I (especially since it often does not operate correctly, or is very slow, on distributed file systems). Why don't you just read in an existing file to pre-populate the counters section when it exists at startup?

-Hal
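
For comparison, Hal's suggestion would amount to something like the following sketch (same made-up simplified layout as above); as David points out below, it is not safe when multiple processes dump concurrently:

```c
/* Sketch of pre-populating counters from an existing raw profile at startup.
 * Same simplification as before: the file is treated as a bare counter array. */
#include <stdint.h>
#include <stdio.h>

#define NUM_COUNTERS 1024
static uint64_t Counters[NUM_COUNTERS];  /* stands in for the counter section */

void prepopulate_counters(const char *Path) {
  FILE *F = fopen(Path, "rb");
  if (!F)
    return;                              /* no previous profile: start from zero */
  uint64_t Old[NUM_COUNTERS] = {0};
  if (fread(Old, sizeof(uint64_t), NUM_COUNTERS, F) == NUM_COUNTERS)
    for (unsigned I = 0; I < NUM_COUNTERS; ++I)
      Counters[I] += Old[I];             /* resume from the previous run's counts */
  fclose(F);
  /* Unsynchronized: if another process rewrites Path between this read and
   * our own final dump, one run's data is lost -- David's objection below. */
}
```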

Xinliang David Li via llvm-dev <llvm-dev@lists.llvm.org> writes:

One of the main missing features in Clang/LLVM profile runtime is the lack of
support for online/in-process profile merging support. Profile data collected
for different workloads for the same executable binary need to be collected
and merged later by the offline post-processing tool. This limitation makes
it hard to handle cases where the instrumented binary needs to be run with
large number of small workloads, possibly in parallel. For instance, to do
PGO for clang, we may choose to build a large project with the instrumented
Clang binary. This is because

1) to avoid profile from different runs from overriding others, %p
   substitution needs to be specified in either the command line or an
   environment variable so that different process can dump profile data
   into its own file named using pid.

... or you can specify a more specific name that describes what's under
test, instead of %p.

   This will create huge requirement on the disk storage. For
   instance, clang's raw profile size is typically 80M -- if the
   instrumented clang is used to build a medium to large size project
   (such as clang itself), profile data can easily use up hundreds of
   Gig bytes of local storage.

This argument is kind of confusing. It says that one profile is
typically 80M, then claims that this uses 100s of GB of data. From
these statements, that only makes sense if you run something like
1000 profiling runs without merging the data in between. Is that what
you're talking about, or did I miss something?

2) pid can also be recycled. This means that some of the profile data may be
   overridden without being noticed.

The way to solve this problem is to allow profile data to be merged in
process.

I'm not convinced. Can you provide some more concrete examples of where
the out of process merging model fails? This was a *very deliberate*
design decision in how clang's profiling works, and most of the
subsequent decisions have been based on this initial one. Changing it
has far reaching effects.

I have a prototype implementation and plan to send it out for review
soon after some clean ups. By default, the profiling merging is off and it can
be turned on with an user option or via an environment variable. The following
summarizes the issues involved in adding this feature:
1. the target platform needs to have file locking support
2. there needs an efficient way to identify the profile data and associate it
   with the binary using binary/profdata signature;
3. Currently without merging, profile data from shared libraries
   (including dlopen/dlcose ones) are concatenated into the primary
   profile file. This can complicate matters, as the merger also needs to
   find the matching shared libs, and the merger also needs to avoid
   unnecessary data movement/copy;
4. value profile data is variable in length even for the same binary.

If we actually want this, we should reconsider the design of having a
raw vs processed profiling format. The raw profile format is
specifically designed to be fast to write out and not to consider
merging profiles at all. This feature would make it nearly as
complicated as the processed format and lose all of the advantages of
making them different.

I have thought about this issue too, in the context of games. We may want
to turn profiling only for certain frames (essentially, this is many small
profile runs).

However, I have not seen it demonstrated that this kind of refined data
collection will actually improve PGO results in practice.
The evidence I do have though is that IIRC Apple have found that almost
all of the benefits of PGO for the Clang binary can be gotten with a
handful of training runs of Clang. Are your findings different?

We have a very wide customer base, so we cannot claim that one use model is
sufficient for all users. For instance, we have users using fine-grained
profile dumping control (programmatically), as you described above. There are
also other possible use cases, such as dumping profiles for different
periodic phases into files associated with those phases. Later, the different
phases' profile data can be merged with different weights.

Also, in general, I am very wary of file locking. This can cause huge
amounts of slowdown for a build and has potential portability problems.

I don't see much slowdown with a clang build that uses the instrumented clang
as the build compiler. With file locking and profile merging enabled, the
build time on my local machine looks like:

real 18m22.737s
user 293m18.924s
sys 9m55.532s

If profile merging/locking is disabled (i.e., the profile dumpers are allowed
to clobber/write over each other), the real time is about 14m.

I don't see it as a substantially better solution than wrapping clang in a
script that runs clang and then just calls llvm-profdata to do the merging.
Running llvm-profdata is cheap compared to doing locking in a highly
parallel situation like a build.

That would require synchronization for merging too.

From Justin's email, it looks like there is a key point I have not made clear:
the on-line profile merge is a very simple raw-profile-to-raw-profile merge,
which is super fast. The end result of the profile run is still in raw format.
The raw-to-indexed merge is still needed -- but instead of merging thousands of
raw profiles, which can be very slow, with this model only one raw profile
input is needed.

thanks,

David

> From: "Sean Silva via llvm-dev" <llvm-dev@lists.llvm.org>
> To: "Xinliang David Li" <davidxl@google.com>
> Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
> Sent: Saturday, February 27, 2016 8:50:05 PM
> Subject: Re: [llvm-dev] Add support for in-process profile merging in
profile-runtime
>
>
>
> I have thought about this issue too, in the context of games. We may
> want to turn profiling only for certain frames (essentially, this is
> many small profile runs).
>
>
> However, I have not seen it demonstrated that this kind of refined
> data collection will actually improve PGO results in practice.
> The evidence I do have though is that IIRC Apple have found that
> almost all of the benefits of PGO for the Clang binary can be gotten
> with a handful of training runs of Clang. Are your findings
> different?
>
>
> Also, in general, I am very wary of file locking.

As am I (especially since it often does not operate correctly, or is very
slow, on distributed file systems).

Dumping thousands of copies of profiles can be more problematic IMO.

Why don't you just read in an existing file to pre-populate the counters
section when it exists at startup?

No, this won't work for cases where multiple processes are dumping profiles
concurrently.

David


Justin, looks like there is some misunderstanding in my email. I want to
clarify it here first:

1) I am not proposing changing the default profile dumping model as used
today. The online merging is totally optional;
2) the on-line profile merging is not doing conversion from raw to index
format. It does very simple raw-to-raw merging using existing runtime APIs.
3) the change to existing profile runtime code is just a few lines. All
the new functionality is isolated in one new file. It will become clear
when the patch is sent out later.

My inline replies below:

Xinliang David Li via llvm-dev <llvm-dev@lists.llvm.org> writes:
> One of the main missing features in Clang/LLVM profile runtime is the
lack of
> support for online/in-process profile merging support. Profile data
collected
> for different workloads for the same executable binary need to be
collected
> and merged later by the offline post-processing tool. This limitation
makes
> it hard to handle cases where the instrumented binary needs to be run
with
> large number of small workloads, possibly in parallel. For instance,
to do
> PGO for clang, we may choose to build a large project with the
instrumented
> Clang binary. This is because
>
> 1) to avoid profile from different runs from overriding others, %p
> substitution needs to be specified in either the command line or an
> environment variable so that different process can dump profile data
> into its own file named using pid.

... or you can specify a more specific name that describes what's under
test, instead of %p.

yes -- but the problem still exists -- each training process will need its
own copy of raw profile.

> This will create huge requirement on the disk storage. For
> instance, clang's raw profile size is typically 80M -- if the
> instrumented clang is used to build a medium to large size project
> (such as clang itself), profile data can easily use up hundreds of
> Gig bytes of local storage.

This argument is kind of confusing. It says that one profile is
typicially 80M, then claims that this uses 100s of GB of data. From
these statements that only makes sense I suppose that's true if you run
1000 profiling runs without merging the data in between. Is that what
you're talking about, or did I miss something?

Yes. For instance, first build a clang with
-fprofile-instr-generate=prof.data.%p, and use this instrumented clang to
build another large project such as clang itself. The second build will
produce tons of profile data.

> 2) pid can also be recycled. This means that some of the profile data
may be
> overridden without being noticed.
>
> The way to solve this problem is to allow profile data to be merged in
> process.

I'm not convinced. Can you provide some more concrete examples of where
the out of process merging model fails? This was a *very deliberate*
design decision in how clang's profiling works, and most of the
subsequent decisions have been based on this initial one. Changing it
has far reaching effects.

I am not proposing changing the out-of-process merging -- it is still
needed. What I meant is that, in a scenario where the instrumented binaries
are run multiple times (using their existing running harness), there is
no good/automatic way of making sure different processes' profile data won't
have name conflicts.

Could the profile file be named from a hash of the profile data themselves?

That will solve the name conflict problem -- but the larger resource
requirement is still there :)

Even with that I still like the direction you're heading to. To avoid
contention on the "file locking", could there be a pool of output files? I
don't know how to map a new process to one output file without a lock
somewhere though.

We can certainly introduce something like that when lock contention becomes
an issue (which is a great idea by the way, and quite straightforward to
implement). From my initial experiment, it does not seem to be a problem yet.

thanks,

David

+ 1 to Sean's suggestion of using a wrapper script to call profdata merge.

David, does that work for your use case?

Some inline comments ---


Could the profile file be named from a hash of the profile data themselves?

IIUC David's plan is to repeatedly use the same on-disk raw-profile.

Even with that I still like the direction you're heading to. To avoid contention on the "file locking", could there be a pool of output files? I don't know how to map a new process to one output file without a lock somewhere though.

--
Mehdi

Take clang's self-build (using the instrumented clang as the build compiler for profile bootstrapping) as an example. Ideally this should all be done transparently -- i.e., set the instrumented compiler as the build compiler, run ninja or make, and things will just work -- but with the current default profile dumping mode, it can fail in many different ways:
1) Just run ninja/make -- all clang processes dump profile data into the same file concurrently -- the result is a corrupted profile -- FAIL
2) run ninja with LLVM_PROFILE_FILE=....%p
   2.1) failure mode #1 --> really slow build due to large IO, or running out of disk space
   2.2) failure mode #2 --> pid recycling leading to profile file name conflicts -- profiles get overwritten and we lose data

Supposing 2) above finally succeeds, the user still has to merge thousands of raw profiles into an indexed profile.

With the proposed on-line profile merging, you just use the instrumented clang, and one merged raw profile is automagically produced in the end. The raw-to-indexed merge is also much faster.

The online merge feature has a huge advantage when integrating the instrumented binary with existing build systems or load-testing harnesses -- it is almost seamless.

> I have a prototype implementation and plan to send it out for review
> soon after some clean ups. By default, the profiling merging is off and it can
> be turned on with an user option or via an environment variable. The following
> summarizes the issues involved in adding this feature:
> 1. the target platform needs to have file locking support
> 2. there needs an efficient way to identify the profile data and associate it
> with the binary using binary/profdata signature;
> 3. Currently without merging, profile data from shared libraries
> (including dlopen/dlcose ones) are concatenated into the primary
> profile file. This can complicate matters, as the merger also needs to
> find the matching shared libs, and the merger also needs to avoid
> unnecessary data movement/copy;
> 4. value profile data is variable in length even for the same binary.

If we actually want this, we should reconsider the design of having a
raw vs processed profiling format. The raw profile format is
specifically designed to be fast to write out and not to consider
merging profiles at all. This feature would make it nearly as
complicated as the processed format and lose all of the advantages of
making them different.

See above -- all the nice raw profile dumping mechanism is still kept -- there won't be a change of that.

thanks,

David

> All the above issues are resolved and clang self build with instrumented
> binary passes (with both j1 and high parallelism).
>
> If you have any concerns, please let me know.


However, I have not seen it demonstrated that this kind of refined data collection will actually improve PGO results in practice.

That's an important question to ask. But, if users do not want profiles from certain runs, shouldn't they either (1) pass LLVM_PROFILE_FILE=/dev/null, or (2) not instrument the binary?

Also, in general, I am very wary of file locking. This can cause huge amounts of slowdown for a build and has potential portability problems.

Yeah, these things aren't great, but as long as they are opt-in I think it'd be reasonable.

Also to be fair, writing out tons of raw profiles to separate files is a lot slower than overwriting the same raw profile.

I don't see it as a substantially better solution than wrapping clang in a script that runs clang and then just calls llvm-profdata to do the merging. Running llvm-profdata is cheap compared to doing locking in a highly parallel situation like a build.

^ + 1.

vedant


Could the profile file be named from a hash of the profile data themselves?

Even with that I still like the direction you're heading to. To avoid
contention on the "file locking", could there be a pool of output files? I
don't know how to map a new process to one output file without a lock
somewhere though.

We could add something like %42H to the format string, which expands to a
hash of the file contents mod 42. That would allow controlling how many
output files there are.

Still, I think that having a %H (or whatever) which expands to a unique ID
is a good primitive to have in the format string and at least addresses the
issue of clobbering.
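
As a sketch of what such an expansion could compute (the %42H / %H syntax is only the strawman above, and the hash function here is just an example), the pool-file selection boils down to hashing the contents and taking the result mod the pool size:

```c
/* Hypothetical helper: pick one of PoolSize output files from a hash of the
 * profile contents, so at most PoolSize files ever exist and identical
 * contents always land in the same file. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t hash_bytes(const uint8_t *Data, size_t Len) {
  uint64_t H = 14695981039346656037ULL;  /* FNV-1a, purely as an example hash */
  for (size_t I = 0; I < Len; ++I) {
    H ^= Data[I];
    H *= 1099511628211ULL;
  }
  return H;
}

void pool_file_name(char *Out, size_t OutLen, const uint8_t *Profile,
                    size_t ProfileLen, unsigned PoolSize) {
  uint64_t Slot = hash_bytes(Profile, ProfileLen) % PoolSize;
  snprintf(Out, OutLen, "default.profraw.%llu", (unsigned long long)Slot);
}
```

A pid-based variant (David later suggests getpid() % PoolSize) only changes how Slot is computed.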

-- Sean Silva


From Justin's email, it looks like there is a key point I have not made
clear: the on-line profile merge is a very simple raw profile to raw
profile merging which is super fast. The end result of the profile run is
still in raw format. The raw to indexed merging is still needed -- but
instead of merging thousands of raw profiles which can be very slow, with
this model, only one raw profile input is needed.

I think that __llvm_profile_merge_buffers in the runtime would be a useful
primitive if it can be implemented simply (or
__llvm_profile_load_counters_from_buffer, perhaps). If you could post a
patch for that part as a first incremental step that would be a good
starting point for concrete discussion.

In combination with the buffer API and reset_counters this is all that is
needed for very fine-grained counter capture.
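
To make the idea concrete, here is a minimal sketch of the counter half of such a buffer merge (hypothetical name and layout, not an existing compiler-rt API):

```c
/* Sketch of a buffer-to-buffer counter merge, in the spirit of the
 * __llvm_profile_merge_buffers idea above. A real API would also have to
 * validate headers/signatures and handle the name and value-profile sections. */
#include <stddef.h>
#include <stdint.h>

int merge_counter_buffers(uint64_t *Dst, const uint64_t *Src,
                          size_t NumCounters) {
  for (size_t I = 0; I < NumCounters; ++I)
    Dst[I] += Src[I];                    /* counters merge by plain addition */
  return 0;
}
```

Combined with reset_counters, fine-grained capture then becomes: copy the live counters into a scratch buffer, merge that into an accumulation buffer, and reset.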

-- Sean Silva


This doesn't sound quite as scary with this explanation. I'll wait for
the patch.

I have implemented the profile pool idea from Mehdi, and collected performance data related to profile merging and file locking. The following is the experiment setup:

  1. the machine has 32 logical cores (Intel Sandy Bridge, 64G memory)
  2. the workload is a clang self-build (~3.3K files to be built), and the instrumented binary is Clang
  3. ninja parallelism is -j32

File systems tested (on Linux):

  1. a local file system on a SSD drive

  2. tmpfs

  3. a local file system on a hard disk

  4. an internal distributed file system

Configurations tested:

  1. all processes dump to the same profile data file without locking (this configuration of course produces useless profile data in the end, but it serves as the performance baseline)
  2. profile merging enabled, with pool sizes 1, 2, 3, 4, 5, 10, and 32
  3. using LLVM_PROFILE_FILE=…_%p so that each process dumps its own copy of the profile data (resulting in ~3.2K profile data files in the end). This configuration was only tested on some file systems due to size/quota constraints.

Here is a very high-level summary of the experiment results. The longer the write latency, the more file-locking contention there is (which is not surprising). In some cases file locking has close to zero overhead, while on file systems with high write latencies it can affect performance negatively. In such cases, using a small pool of profile files completely recovers the performance. The required pool size is capped at a small value (which depends on many factors: write latency, the rate at which instrumented processes retire, IO throughput/network bandwidth, etc.).

  1. SSD

The performance is almost identical across ALL the test configurations. The real time needed to complete the full self build is ~13m10s. There is no visible file contention with file locking enabled even with pool size == 1.

  2. tmpfs

only tested with the following configs
a) shared profile with no merge
b) with merge (pool == 1), with merge (pool == 2)

Not surprisingly, the result is similar to the SSD case – the build consistently finished in a little more than 13m.

  3. HDD

With this configuration, file locking starts to show some impact – writes are slow enough to introduce contention.

a) Shared profile without merging: ~13m10s
b) with merging
b.1) pool size == 1: ~18m20s
b.2) pool size == 2: ~16m30s
b.3) pool size == 3: ~15m55s
b.4) pool size == 4: ~16m20s
b.5) pool size == 5: ~16m42s
c) >3000 profile file without merging (%p) : ~16m50s

Increasing the size of the merge pool increases dumping parallelism – performance improves initially, but above 4 it starts to degrade gradually. The HDD IO throughput is saturated at that point, and increasing parallelism does not help any more.

In short, with profile merging we only need to dump 3 profile files to achieve the same build performance as dumping >3000 files (the current default behavior).

  4. An internal file system using network-attached storage

In such a file system, file writes have relatively long latency compared with local file systems. The backend storage server does dynamic load balancing, so it can achieve very high IO throughput with high parallelism (on both the FE/client side and the backend).

a) Single profile without profile merging : ~60m
b) Profile merging enabled:
b.1) pool size == 1: ~80m
b.2) pool size == 2: ~47m
b.3) pool size == 3: ~43m
b.4) pool size == 4: ~40m40s
b.5) pool size == 5: ~38m50s
b.6) pool size == 10: ~36m48s
b.7) pool size == 32: ~36m24s
c) >3000 profile file without profile merging (%p): ~35m24s

b.6), b.7) and c) have the best performance among all.

Unlike in the HDD case, a) has poor performance here – due to low parallelism in the storage backend.

With increasing file-dumping parallelism, performance flattens out once the pool size is >= 10. At that point the client (ninja+clang) system has reached its peak and becomes the new performance bottleneck.

Again, with profile merging we only need 10 profile data files to achieve the same performance as the default behavior, which requires >3000 files to be dumped.

thanks,

David

+ 1 to Sean's suggestion of using a wrapper script to call profdata merge.

David, does that work for your use case?

Every user who needs this use model would be forced to implement this solution
themselves, which does not scale. The proposed feature is relatively simple to
do.

Some inline comments ---

> However, I have not seen it demonstrated that this kind of refined data
collection will actually improve PGO results in practice.

That's an important question to ask. But, if users do not want profiles
from certain runs, shouldn't they either (1) pass
LLVM_PROFILE_FILE=/dev/null, or (2) not instrument the binary?

Having ways to skip profiling for certain runs is probably orthogonal to this
feature. Yes, using LLVM_PROFILE_FILE=/dev/null is one way to do it.

> Also, in general, I am very wary of file locking. This can cause huge
amounts of slowdown for a build and has potential portability problems.

Yeah, these things aren't great, but as long as they are opt-in I think
it'd be reasonable.

Also to be fair, writing out tons of raw profiles to separate files is a
lot slower than overwriting the same raw profile.

> I don't see it as a substantially better solution than wrapping clang in
a script that runs clang and then just calls llvm-profdata to do the
merging. Running llvm-profdata is cheap compared to doing locking in a
highly parallel situation like a build.

^ + 1.

Doing the merging with llvm-profdata requires locking for the indexed data
file too, and there are several issues:
1) we don't want to teach llvm-profdata to do locking; or
2) it currently seems impossible to pass file-lock ownership from the parent
wrapper to llvm-profdata;
3) llvm-profdata is a host tool which may not run on the target system;
4) compared with raw merging/dumping, profile conversion is slower, which has
a larger negative impact on a parallel build.

thanks,

David

Hi David,

This is wonderful data and demonstrates the viability of this feature. I think this has alleviated the concerns regarding file locking.

As far as the implementation of the feature goes, I think we will probably want the following incremental steps:
a) implement the core merging logic and add to the buffer API a primitive for merging two buffers
b) implement the file-system glue to extend this to the filesystem APIs (write_file etc.)
c) implement a profile filename format string which generates a random number mod a specified amount (strawman: LLVM_PROFILE_FILE=default.profraw.%7u, which generates a unique number mod 7; in general it is %<N>u)

b) depends on a), but c) can be done in parallel with both.

Does this seem feasible?

– Sean Silva

Sounds reasonable. My design of c) is different in many ways (e.g., using getpid()%PoolSize), but we can defer that discussion to code review.

thanks,

David


Doing merging using llvm-profdata requires locking for the indexed data file too. There are many issues:
1) we don't want to teach llvm-profdata to do locking, or
2) currently seems not possible to pass the file lock ownership from parent wrapper to llvm-profdata

Is this step needed to make sure another process doesn't overwrite a temporary raw profile before it's merged into the accumulated profile?

3) llvm-profdata is a host tool which may not run on the target system

IMO this is the best reason to extend compiler-rt to write out accumulated raw profiles.

4) compared with raw merging/dumping, profile conversion is slower which had more negative impact on parallel build.

Good point.

vedant

> Doing merging using llvm-profdata requires locking for the indexed data
file too. There are many issues:
> 1) we don't want to teach llvm-profdata to do locking, or
> 2) currently seems not possible to pass the file lock ownership from
parent wrapper to llvm-profdata

Is this step needed to make sure another process doesn't overwrite a
temporary raw profile before it's merged into the accumulated profile?

The instrumented binary's process produces its temporary raw profile using
tricks like %p, so it won't collide with other processes (such temp files are
deleted by the wrapper after merging). It is the indexed profile produced by
llvm-profdata that needs synchronization in this scheme.

David

sounds reasonable. My design of c) is different in many ways (e.g, using
getpid()%PoolSize), but we can delay discussion of that in code review.

I like that (e.g. support %7p in addition to %p).

-- Sean Silva