multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Hi,

The current design of -fprofile-instr-generate has the same fundamental flaw
as the old gcc’s gcov instrumentation: it has contention on counters.
A trivial synthetic test case was described here:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html

For the problem to appear we need to have a hot function that is simultaneously executed
by multiple threads -- then we will have high contention on the racy profile counters.

Such a situation is not necessarily very frequent, but when it happens
-fprofile-instr-generate becomes barely usable due to a huge slowdown (5x-10x).

An example is the multi-threaded vp9 video encoder.

git clone https://chromium.googlesource.com/webm/libvpx

cd libvpx/

F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" CXX="clang++ $F" LD="clang++ $F" ./configure

make -j32

get sample video from https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m

time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m

When running single-threaded, -fprofile-instr-generate adds a reasonable ~15% overhead
(8.5 vs 10 seconds).
When running with 8 threads, it has 7x overhead (3.5 seconds vs 26 seconds).

I am not saying that this flaw is a showstopper, but with the continued move
towards multithreading it will hurt more and more users of coverage and PGO.
AFAICT, most of our PGO users simply cannot run their software in single-threaded mode,
and some of them surely have hot functions running in all threads at once.

At the very least we should document this problem, but better still, we should try to fix it.

Some ideas:

  • per-thread counters. Solves the problem at a huge cost in RAM per thread.
  • 8-bit per-thread counters, dumping into central counters on overflow.
  • per-cpu counters (not portable, requires a very modern kernel with lots of patches).
  • sharded counters: each counter represented as N counters sitting in different cache lines. Every thread accesses the counter with index TID%N. Solves the problem partially, better with larger values of N, but then again it costs RAM.
  • reduce contention on hot counters by not incrementing them if they are big enough:
    { if (counter < 65536) counter++; } This reduces the accuracy though. Is that bad for PGO?
  • self-cooling logarithmic counters: if ((fast_random() % (1 << counter)) == 0) counter++;
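To make the last two ideas concrete, here is a minimal C sketch. The `fast_random()` xorshift PRNG, the 65536 bound, and the `< 31` shift guard are illustrative assumptions, not a proposed implementation:

```c
#include <stdint.h>

/* Hypothetical stand-in for fast_random(): a tiny xorshift PRNG. */
static uint32_t fast_random(void) {
    static uint32_t state = 2463534242u;
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

/* Idea: stop touching a counter once it is "big enough", removing
 * contention on exactly the counters that are hottest. */
static void bounded_increment(uint64_t *counter) {
    if (*counter < 65536)
        (*counter)++;
}

/* Idea: self-cooling logarithmic counter -- the probability of an update
 * halves each time the counter grows, so hot counters go quiet quickly.
 * (The < 31 guard avoids undefined behavior in the shift; a real design
 * would saturate the counter.) */
static void log_increment(uint32_t *counter) {
    if (*counter < 31 && (fast_random() % (1u << *counter)) == 0)
        (*counter)++;
}
```

In both cases the stored value is only an approximation of the true execution count; the open question is whether PGO can tolerate that.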

Other thoughts?

–kcc

If accuracy is not critical, incrementing the counters without any guards might be good enough.
Hot areas will still be hot and cold areas will not be affected.

Yaron

> If accuracy is not critical, incrementing the counters without any guards
> might be good enough.

No. Contention on the counters leads to 5x-10x slowdown. This is never
good enough.

--kcc

> Hot areas will still be hot and cold areas will not be affected.

How about per-thread if the counter is hot enough?

Jon

Err. How do you know if the counter is hot w/o first profiling the app?

if (counter_is_local) {
  counter++;
  if (counter > threshold)
    counter_is_local = false;
} else {
  increment_thread_local_counter();
}

> if (counter_is_local) {
>   counter++;
>   if (counter > threshold)
>     counter_is_local = false;
> } else {
>   increment_thread_local_counter();
> }

This might be another choice, but unless you implement
increment_thread_local_counter() in some clever way
it will still require the same RAM overhead as just keeping the counters in
TLS

    if (counter_is_local) {
      counter++;
      if (counter > threshold)
        counter_is_local = false;
    } else {
      increment_thread_local_counter();
    }

This might be another choice, but unless you implement
increment_thread_local_counter() in some clever way
it will still require the same RAM overhead as just keeping the counters in TLS.

Right, so you wouldn't want to have every thread have its own copy of the whole list of counters, but I think you may be able to get away with keeping smaller per-thread pools, and bump-ptr allocating within those pools.
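A sketch of that hybrid scheme in C. `HOT_THRESHOLD` and the single `thread_local_slot` are made up for illustration; a real implementation would bump-pointer allocate one slot per promoted counter from a small per-thread pool:

```c
#include <stdbool.h>
#include <stdint.h>

#define HOT_THRESHOLD 4096   /* assumed promotion point */

typedef struct {
    uint64_t value;   /* the shared counter */
    bool shared;      /* true until the counter proves hot */
} counter_t;

/* Stand-in for one bump-pointer-allocated per-thread slot. */
static _Thread_local uint64_t thread_local_slot;

static void profile_increment(counter_t *c) {
    if (c->shared) {
        c->value++;                 /* (racy) update of the shared counter */
        if (c->value > HOT_THRESHOLD)
            c->shared = false;      /* promote: future updates go local */
    } else {
        thread_local_slot++;        /* contention-free thread-local bump */
    }
}
```

The per-thread slots would be folded back into the shared counters at thread exit or at dump time.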

> Hi,
>
> The current design of -fprofile-instr-generate has the same fundamental flaw
> as the old gcc's gcov instrumentation: it has contention on counters.
> A trivial synthetic test case was described here:
> http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html
>
> For the problem to appear we need to have a hot function that is
> simultaneously executed by multiple threads -- then we will have high
> contention on the racy profile counters.
>
> Such situation is not necessary very frequent, but when it happens
> -fprofile-instr-generate becomes barely usable due to huge slowdown (5x-10x)

We have seen even larger slowdowns, but it is uncommon, and I have not heard
many complaints about it.

> An example is the multi-threaded vp9 video encoder.
>
> git clone https://chromium.googlesource.com/webm/libvpx
> cd libvpx/
> F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" CXX="clang++ $F" LD="clang++ $F" ./configure
> make -j32
> # get sample video from https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m
> time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m
>
> When running single-threaded, -fprofile-instr-generate adds reasonable ~15% overhead
> (8.5 vs 10 seconds)
> When running with 8 threads, it has 7x overhead (3.5 seconds vs 26 seconds).

> I am not saying that this flaw is a showstopper, but with the continued move
> towards multithreading it will be hurting more and more users of coverage and PGO.
> AFAICT, most of our PGO users simply can not run their software in
> single-threaded mode, and some of them surely have hot functions running in
> all threads at once.
>
> At the very least we should document this problem, but better try fixing it.

Users can also select a smaller (but still representative) training data set
to work around the problem.

> Some ideas:
>
> - per-thread counters. Solves the problem at huge cost in RAM per-thread

It is not practical, especially for TLS counters -- it creates huge
pressure on stack memory.

> - 8-bit per-thread counters, dumping into central counters on overflow.

The overflow will happen very quickly with an 8-bit counter.

> - per-cpu counters (not portable, requires very modern kernel with lots of
>   patches)
> - sharded counters: each counter represented as N counters sitting in
>   different cache lines. Every thread accesses the counter with index TID%N.
>   Solves the problem partially, better with larger values of N, but then
>   again it costs RAM.

Interesting idea. This may work well.

> - reduce contention on hot counters by not incrementing them if they are
>   big enough:
>   { if (counter < 65536) counter++; } This reduces the accuracy though.
>   Is that bad for PGO?

Yes, it will be bad.

> - self-cooling logarithmic counters:
>   if ((fast_random() % (1 << counter)) == 0) counter++;

Another idea is to use stack-local counters per function -- synced up with
the global counters on entry and exit. The problem with it is that for deeply
recursive calls, the stack pressure can be too high.

In Google GCC, we implemented another technique which proved to be very
effective -- it is called FDO sampling. Basically, counters are only updated
every N samples.
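A rough sketch of the sampling idea. `SAMPLE_PERIOD` and the scale-up step are assumptions for illustration; the actual Google GCC implementation may differ:

```c
#include <stdint.h>

#define SAMPLE_PERIOD 100   /* assumed sampling period N */

/* Per-thread gate: the shared counter is touched only once every
 * SAMPLE_PERIOD executions, cutting shared-memory traffic (and hence
 * contention) by roughly that factor. */
static _Thread_local uint32_t sample_gate = SAMPLE_PERIOD;

static void sampled_increment(uint64_t *counter) {
    if (--sample_gate == 0) {
        sample_gate = SAMPLE_PERIOD;
        *counter += SAMPLE_PERIOD;  /* scale up to approximate the true count */
    }
}
```

The gate itself lives in TLS, so the common path is a thread-local decrement and branch with no cache-line ping-pong.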

David

I think they'd need to be synced with global counters before function
calls as well, since any function call can call "exit()".

Right -- but it might be better to handle this in other ways. For instance,
a stack of counters for each frame is maintained; at exit, they are flushed
in a batch. Or simply ignore it in the case of program exit.

David

It seems to me like we’re going to have a hard time getting good multithreaded performance without significant impact on the single-threaded behavior. We might need to add an option to choose between those. There’s a lot of room for improvement in the performance with the current instrumentation, so maybe we can find a way to make things incrementally better in a way that helps both, but avoiding the multithreaded cache conflicts seems like it’s going to be expensive in other ways.

Yes -- an option. It would be unwise to penalize single-threaded apps unconditionally.

David

I don't really agree.

First, multithreaded applications are going to be the majority soon, even
if they aren't already. We should design for them and support them well by
default. If, once we have that, we find single-threaded performance
dramatically suffers, then maybe we should add a flag. But it doesn't make
sense to do this before we even have data.

If someone wants to revise the instrumentation in a way that works better for multithreaded code, that’s great. Before the change is committed, we should have performance data comparing it to the current code. If there is no regression, then fine. If it significantly hurts single-threaded performance, then we will need a flag.

If you want to default Darwin into slow multithreaded instrumentation to
make single threaded instrumentation faster, go for it. That is not the
correct default for Clang as a project or the Linux port though.

Anyways, all of this is somewhat moot. We really need the *architecture* of
PGO instrumentation to not be multiple times slower for multithreaded
applications. I think this is a critical limitation of the current design.

Having thought a bit about the best strategy to solve this, I think we
should use a tradeoff of memory to reduce contention. I don't really like
any of the other options as much, if we can get that one to work. Here is
my specific suggestion:

> - per-cpu counters (not portable, requires very modern kernel with lots of
>   patches)
> - sharded counters: each counter represented as N counters sitting in
>   different cache lines. Every thread accesses the counter with index TID%N.
>   Solves the problem partially, better with larger values of N, but then
>   again it costs RAM.

I think we should combine these somewhat.

At an abstract level, I think we should design the profiling to support up
to N shards of counters.

I think we should have a dynamic number of shards if possible. The goal
here is that if you never need multiple shards (single threaded) you pay
essentially zero cost. I would have a global number of shards that changes
rarely, and re-compute it on entry to each function with something along
the lines of:

if (thread-ID != main's thread-ID && shard_count == 1) {
  shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
  // If shard_count changed with this, we can also call a library routine
  // here that does the work of allocating the actual extra shards.
}

MAX is a fixed cap so even on systems with 100s of cores we don't do
something silly. NUMBER_OF_THREADS, if supported on the OS, can limit the
shards when we only have a small number of threads in the program.
NUMBER_OF_CORES, if supported on the OS, can limit the shards. If we don't
have the number of threads, we just use the number of cores. If we don't
have the number of cores, we can just guess 8 (or something).

Then, we can gracefully fall back on the following strategies to pick an
index into the shards:

- Per-core non-atomic counter updates (if we support them) using
restartable sequences
- Use a core-ID as the index into the shards to statistically minimize the
contention, and do the increments atomically so it remains correct even if
the core-ID load happens before a core migration and the counter increment
occurs afterward
- Use (thread-ID % number of cores) if we don't have support for getting a
core-ID from the target OS. This will still have a reasonable distribution
I suspect, even if not perfect.

Finally, I wouldn't merge on shutdown if possible. I would just write
larger raw profile data for multithreaded runs, and let the profdata tool
merge them.
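A minimal C11 sketch of the sharding-plus-merge shape. `NSHARDS` is a fixed stand-in for the dynamic shard_count, and `tid` stands in for whatever thread-ID or core-ID the fallback strategies above provide:

```c
#include <stdatomic.h>
#include <stdint.h>

#define NSHARDS 8          /* stand-in for the dynamic shard_count */
#define CACHE_LINE 64

/* Each logical counter becomes NSHARDS physical counters, one per cache
 * line, so threads mapped to different shards never ping-pong a line. */
typedef struct {
    _Alignas(CACHE_LINE) atomic_uint_fast64_t v;
} shard_t;

typedef struct {
    shard_t shard[NSHARDS];
} sharded_counter_t;

static void sharded_increment(sharded_counter_t *c, unsigned tid) {
    /* Atomic, so the count stays correct even if the thread migrated
     * between reading its ID and doing the increment. */
    atomic_fetch_add_explicit(&c->shard[tid % NSHARDS].v, 1,
                              memory_order_relaxed);
}

/* The merge -- summing shards back into one logical count -- can happen
 * offline in the profdata tool rather than at program shutdown. */
static uint64_t sharded_read(sharded_counter_t *c) {
    uint64_t total = 0;
    for (int i = 0; i < NSHARDS; i++)
        total += atomic_load_explicit(&c->shard[i].v, memory_order_relaxed);
    return total;
}
```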

If this is still too much memory, then I would suggest doing the above, but
doing it independently for each function so that only those functions
actually called via multithreaded code end up sharding their counters.

I think this would be reasonably straightforward to implement, not
significantly grow the cost of single-threaded instrumentation, and largely
mitigate the contention on the counters. It can benefit from advanced hooks
into the OS when those are available, but seems pretty easy to implement on
roughly any OS with reasonable tradeoffs. The only really hard requirement
is the ability to get a thread-id, but I think that is fairly reasonable
(C++ even makes this essentially mandatory).

Thoughts?

Chandler Carruth <chandlerc@google.com> writes:

> Having thought a bit about the best strategy to solve this, I think we should
> use a tradeoff of memory to reduce contention. I don't really like any of the
> other options as much, if we can get that one to work. Here is my specific
> suggestion:
>
>     - per-cpu counters (not portable, requires very modern kernel with lots of
>     patches)
>     - sharded counters: each counter represented as N counters sitting in
>     different cache lines. Every thread accesses the counter with index TID%N.
>     Solves the problem partially, better with larger values of N, but then
>     again it costs RAM.

> I think we should combine these somewhat.
>
> At an abstract level, I think we should design the profiling to support up to
> N shards of counters.
>
> I think we should have a dynamic number of shards if possible. The goal here
> is that if you never need multiple shards (single threaded) you pay
> essentially zero cost. I would have a global number of shards that changes
> rarely, and re-compute it on entry to each function with something along the
> lines of:
>
> if (thread-ID != main's thread-ID && shard_count == 1) {
>   shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
>   // If shard_count changed with this, we can also call a library routine
>   // here that does the work of allocating the actual extra shards.
> }

Is it possible to hook on something more clever than function entry?
Choosing to do this based on thread creation or something like that
could make this extra check very cheap.

> MAX is a fixed cap so even on systems with 100s of cores we don't do something
> silly. NUMBER_OF_THREADS, if supported on the OS, can limit the shards when we
> only have a small number of threads in the program. NUMBER_OF_CORES, if
> supported on the OS, can limit the shards. If we don't have the number of
> threads, we just use the number of cores. If we don't have the number of
> cores, we can just guess 8 (or something).

I like the general idea of this approach. The obvious TLS approach
worried me in that it explodes memory usage by much more than you'll
generally need.

> Then, we can gracefully fall back on the following strategies to pick an index
> into the shards:
>
> - Per-core non-atomic counter updates (if we support them) using restartable
> sequences
> - Use a core-ID as the index into the shards to statistically minimize the
> contention, and do the increments atomically so it remains correct even if the
> core-ID load happens before a core migration and the counter increment occurs
> afterward
> - Use (thread-ID % number of cores) if we don't have support for getting a
> core-ID from the target OS. This will still have a reasonable distribution I
> suspect, even if not perfect.
>
> Finally, I wouldn't merge on shutdown if possible. I would just write larger
> raw profile data for multithreaded runs, and let the profdata tool merge them.

Agreed. Doing the merging offline is a clear win. The space overhead is
short lived enough that it shouldn't be a big issue.

> If this is still too much memory, then I would suggest doing the above, but
> doing it independently for each function so that only those functions actually
> called via multithreaded code end up sharding their counters.

I'd worry that doing this per-function might be too fine of granularity,
but there are certainly use cases where large chunks of the code are
only exercised by one (of many) threads, so something along these lines
could be worthwhile.

Good thinking, but why do you think runtime selection of the shard count is
better than compile-time selection? For single-threaded apps, the shard count
is always 1, so why pay the penalty of checking the thread id each time a
function is entered?

For multi-threaded apps, I would expect MAX to be smaller than NUM_OF_CORES
to avoid excessive memory consumption, so you always end up with N == MAX.
If MAX is larger than NUM_OF_CORES, for large MT apps the number of threads
tends to be larger than NUM_OF_CORES, so it also ends up with N == MAX. For
rare cases, the shard count may switch between MAX and NUM_OF_CORES, but
then you also pay the penalty to reallocate/memcpy the counter arrays each
time it changes.

Making N a non-compile-time constant also makes the indexing more expensive.
Of course we can ignore thread migration and do CSE on it.

David

Chandler Carruth <chandlerc@google.com> writes:

> > if (thread-ID != main's thread-ID &&
> >     shard_count < std::min(MAX, NUMBER_OF_CORES)) {
> >   shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
> >   // If shard_count changed with this, we can also call a library routine
> >   // here that does the work of allocating the actual extra shards.
> > }

Note, I had a bug here. The first time I had this only grow the shard count
once, and later thought it might be nice to grow this incrementally so that
applications that only need 2 or 3 threads, don't pay for more. I fixed the
guard here, but naturally this is still hand-wavy. =D

> Is it possible to hook on something more clever than function entry?
> Choosing to do this based on thread creation or something like that
> could make this extra check very cheap.

Yea, *if* you can do it on thread creation, that would be awesome, but we
don't really have a good way to model that right now.

> > MAX is a fixed cap so even on systems with 100s of cores we don't do
> > something silly. NUMBER_OF_THREADS, if supported on the OS, can limit the
> > shards when we only have a small number of threads in the program.
> > NUMBER_OF_CORES, if supported on the OS, can limit the shards. If we don't
> > have the number of threads, we just use the number of cores. If we don't
> > have the number of cores, we can just guess 8 (or something).

> I like the general idea of this approach. The obvious TLS approach
> worried me in that it explodes memory usage by much more than you'll
> generally need.

> > Then, we can gracefully fall back on the following strategies to pick an
> > index into the shards:
> >
> > - Per-core non-atomic counter updates (if we support them) using
> > restartable sequences
> > - Use a core-ID as the index into the shards to statistically minimize the
> > contention, and do the increments atomically so it remains correct even if
> > the core-ID load happens before a core migration and the counter increment
> > occurs afterward
> > - Use (thread-ID % number of cores) if we don't have support for getting a
> > core-ID from the target OS. This will still have a reasonable distribution
> > I suspect, even if not perfect.
> >
> > Finally, I wouldn't merge on shutdown if possible. I would just write
> > larger raw profile data for multithreaded runs, and let the profdata tool
> > merge them.

> Agreed. Doing the merging offline is a clear win. The space overhead is
> short lived enough that it shouldn't be a big issue.

> > If this is still too much memory, then I would suggest doing the above,
> > but doing it independently for each function so that only those functions
> > actually called via multithreaded code end up sharding their counters.

> I'd worry that doing this per-function might be too fine of granularity,
> but there are certainly use cases where large chunks of the code are
> only exercised by one (of many) threads, so something along these lines
> could be worthwhile.

Yea. If we need to do it in a fine-grained way (i.e., not whole-program
sharding), I think my favorite granularity would be
per-post-inliner-function. The other thing to do is to set up
function-local variables at that phase that are used to locate the counters
cheaply within the function body.

One way that might work would be to use essentially synthetic intrinsics to
introduce the *markers* for instrumentation (and the function names,
frontend hashes etc, all the stuff that needs language-level semantics)
into the code, but have LLVM transform these into the actual
instrumentation code after inlining and optimization. Then it can also
inject the necessary code into the more coarse grained boundaries.