RFC: Pass to prune redundant profiling instrumentation

Hi,

I'd like to add a new pass to LLVM which removes redundant profile counter
updates. The goal is to speed up code coverage testing and profile generation
for PGO.

I'm sending this email out to describe my approach, share some early results,
and gather feedback.

Problem Overview

A profile counter is redundant if it's incremented in exactly the same basic
blocks as some other profile counter. Consider the following module:

   local void f1() {
     instrprof_increment(profc_f1);
   }

   void f2() {
     instrprof_increment(profc_f2);
     f1();
   }

Once the inliner runs and deletes f1, we're left with:

   void f2() {
     instrprof_increment(profc_f2);
     instrprof_increment(profc_f1);
   }

Now we can say profc_f1 is redundant (or, an alias for profc_f2).

I've noticed that frontend-based instrumentation can generate many redundant
profile counters. This degrades performance and increases code size. We can
address the problem by getting rid of redundant counter updates. The trick is
to make sure we get back the same profiles.

Proposed Solution

I propose a pruning pass which takes the following steps:

1. Delete functions with local linkage and only one use, if that use is in
    a profile data record.

    These functions are left around by the inliner (it doesn't know that
    they're safe to delete). Deleting them reduces code size and simplifies
    subsequent analysis of profile counters.

I think that this is something global-DCE should be taught about (if it does not know already).

2. Determine which profile counters are essential.

What is an "essential counter"?

3. Erase all updates to redundant profile counters.

When you say "erase", you need to actually replace the multiple counter increments with *a new* counter, right? I.e. when replacing:

     instrprof_increment(profc_f2);
     instrprof_increment(profc_f1);

you need to emit:

     instrprof_increment(profc_f2f1_fused);

4. Emit the aliases into a new section in the binary.

    Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes
    in compiler-rt are required to walk the alias section and fill in the
    correct execution counts at program exit time.
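For illustration, here is a rough sketch of what that runtime-side walk could
look like (this is not existing compiler-rt code; the record layout and
function name are made up, and the [Begin, End) range stands in for whatever
section bounds the runtime would use):

   #include <cstdint>

   // Hypothetical record layout mirroring the { Dst, Src } tuples above.
   struct CounterAlias {
     uint64_t *Dst; // counter whose updates were pruned
     uint64_t *Src; // counter it aliases
   };

   // At program exit, walk the alias records and copy each source count into
   // its pruned destination, so the profile written out is unchanged.
   void fixupAliasedCounters(const CounterAlias *Begin, const CounterAlias *End) {
     for (const CounterAlias *A = Begin; A != End; ++A)
       *A->Dst = *A->Src;
   }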

This pass needs to be run after the inliner in order to be effective.

The complexity of this pass is O(N*M), where N is the number of profile
counters, and M is the average number of updates per counter.
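For what it's worth, a rough sketch of the redundancy check this implies
(purely illustrative -- integer IDs stand in for the counter globals and basic
blocks, and this is not the proposed pass's actual code):

   #include <map>
   #include <set>

   using CounterID = unsigned;
   using BlockID = unsigned;

   // Input: for each counter, the set of basic blocks that increment it.
   // Output: Dst -> Src alias pairs; under this reading, the counters that do
   // not appear in the result are the "essential" ones that keep their updates.
   std::map<CounterID, CounterID>
   findAliases(const std::map<CounterID, std::set<BlockID>> &UpdateSites) {
     std::map<std::set<BlockID>, CounterID> Representative;
     std::map<CounterID, CounterID> Aliases;
     for (const auto &KV : UpdateSites) {
       auto It = Representative.find(KV.second);
       if (It == Representative.end())
         Representative.emplace(KV.second, KV.first); // essential counter
       else
         Aliases.emplace(KV.first, It->second);       // redundant counter
     }
     return Aliases;
   }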

What you're describing here is not clear to me. But I may have been totally misunderstanding the whole concept :)

> I'd like to add a new pass to LLVM which removes redundant profile counter
> updates. The goal is to speed up code coverage testing and profile
> generation for PGO.

We may want to have a focused discussion about this goal, rather than a
particular suggestion. There are a lot of things we can do. Off the top of
my head, some are:

1. add some sort of alias annotation (such as an independent TBAA root
node) to all the counter increment memory instructions to tell the
optimizer they don't interfere with the usual loads and stores.

2. communicate to the optimizer that counters can be registerized. In a
loop like:
for (int i = 0; i < N; i++) {
  if (foo())
    bar();
  else
    baz();
}
we perform O(N) counter increments (i.e. load, increment, store) last I
checked. However, if the counters are in registers, then we only perform
O(1) memory operations. This can dramatically reduce the pressure on the
CPU's load/store units and also relieve cross-core cache line ping-pong
when two cores are executing the same code. Note that the latter benefit is
attained even if we ultimately end up spilling the counters due to
increased register pressure.
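To make point 2 concrete, here is a hand-written sketch of the registerized
form (the counter names, and the assumption that foo/bar/baz cannot touch
these counters, are mine -- which is exactly the mod-ref question discussed
later in this thread):

   #include <cstdint>

   extern bool foo();
   extern void bar();
   extern void baz();

   // Hypothetical counters for the two arms of the branch.
   uint64_t profc_then, profc_else;

   void run(int N) {
     // Accumulate in locals (registers), flush once after the loop: O(1)
     // memory operations on the counters instead of O(N). Only valid if
     // foo/bar/baz cannot modify these counters.
     uint64_t ThenLocal = 0, ElseLocal = 0;
     for (int i = 0; i < N; i++) {
       if (foo()) {
         ++ThenLocal;
         bar();
       } else {
         ++ElseLocal;
         baz();
       }
     }
     profc_then += ThenLocal;
     profc_else += ElseLocal;
   }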

I actually don't know what is preventing the usual optimization pipeline
from getting 2 right.

> Proposed Solution
>
> I propose a pruning pass which takes the following steps: [...]
>
> The complexity of this pass is O(N*M), where N is the number of profile
> counters, and M is the average number of updates per counter. In practice
> it is a bit faster, since we can skip the analysis of counters which are
> discovered to be redundant early on in the process.

I think a conceptually simpler design is something like:

for each CFG edge:
    record which FE counters have ended up associated with it
remove FE counters
run IR instrumentation pass
emit a side table mapping IR instr counters to FE counters (more generally:
how to reconstruct FE counters from the IR counters)

The problem is simply reduced to the IR instrumentation pass.
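One possible shape for that side table, purely as an illustration (the names
and layout here are my own invention, not an existing format): each FE counter
maps to a set of IR counters whose sum reconstructs it.

   #include <cstdint>
   #include <vector>

   // Hypothetical mapping record: FE counter 'FEIndex' is reconstructed as
   // the sum of the IR counters listed in 'IRIndices'.
   struct FECounterRemap {
     uint64_t FEIndex;                 // index into the FE counter array
     std::vector<uint64_t> IRIndices;  // IR counters whose sum gives it
   };

   uint64_t reconstructFECount(const FECounterRemap &M,
                               const std::vector<uint64_t> &IRCounters) {
     uint64_t Sum = 0;
     for (uint64_t Idx : M.IRIndices)
       Sum += IRCounters[Idx];
     return Sum;
   }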

> Early Results
>
> The pruning pass results in a 25% speed improvement in the example program
> above (where f2 is called in a loop 10^8 times).
>
> Here is a slightly less contrived example:
>
>   #include <vector>
>   #include <algorithm>
>   #include <cstdlib>
>
>   static void escape(void *p) {
>     asm volatile("" : : "g"(p) : "memory");
>   }
>
>   int main(int argc, char **argv) {
>     std::vector<int> V(atoi(argv[1]));
>     escape(reinterpret_cast<void *>(V.data()));
>     std::sort(V.begin(), V.end());
>
>     return V[0];
>   }
>
> I get the following results on my desktop (10^8 elements, 5 runs each):
>
>   O3: 0.262s
>   O3 + PGOInstr: 0.663s
>   O3 + PGOInstr + Pruning: 0.606s (8.6% performance win, 672 aliases)
>   O3 + CoverageInstr: 0.690s
>   O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688 aliases)

> Next Steps?
>
> Is the performance of instrumented code something we think we need to fix?

With frontend instrumentation it definitely is.

> What's an acceptable compile-time overhead for running this pruning pass?

Instrumented builds are "special" anyway so a fair slowdown is probably
acceptable. I.e. this doesn't affect regular developer compilation times.
As a baseline, maybe compare compile time of `-fprofile-instr-generate -O2`
vs. `-O2`. If `-fprofile-instr-generate -O2` is 10% slower than the
control, then that indicates that a 5% extra slowdown is probably
reasonable for a substantial reduction in profiling overhead (which can
result in a qualitative improvement to the actual usability of the
instrumented program).

-- Sean Silva

>
> These functions are left around by the inliner (it doesn't know that
> they're safe to delete). Deleting them reduces code size and simplifies
> subsequent analysis of profile counters.

> I think that this is something global-DCE should be taught about (if it
> does not know already).

What attribute do you think we could use to do that?

> 4. Emit the aliases into a new section in the binary.
>
>    Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes
>    in compiler-rt are required to walk the alias section and fill in the
>    correct execution counts at program exit time.
>
> This pass needs to be run after the inliner in order to be effective.
>
> The complexity of this pass is O(N*M), where N is the number of profile
> counters, and M is the average number of updates per counter.

> What you're describing here is not clear to me. But I may have been
> totally misunderstanding the whole concept :)

Also, describing the complexity like this is strange.
It should be just O(N) where N is the number of instructions.

Vedant, are you concerned about instrumentation run performance for PGO or for coverage testing? For coverage testing, is coverage information enough, or is count information also needed? Depending on the answers to these questions, the solution may be very different.

If the answer is PGO, then a much better long-term solution is to migrate to IR based instrumentation. Not only does IR based instrumentation place counter updates 'optimally', but the early optimization (including pre-inlining of tiny funcs) is also very effective in reducing instr overhead, with the added benefit of better profile quality due to context sensitivity. For more details and performance numbers, see Rong's RFC about late instrumentation.

If the answer is PGO, but for various reasons the FE based instrumentation has to be used, then I think there is another, more effective way previously suggested by Sean: basically, you can skip single-BB functions completely during instrumentation. There are various reasons why profile data for a single-BB function is not useful:

  1. they are usually small and beneficial to be inlined regardless of profile data
  2. the BB of the inline instance can get profile data from the caller context
  3. the profile data for the out of line single BB func is also useless

Rong has some data showing the effectiveness of this method -- not as good as the pre-optimization approach, but IIRC also very good.

If the answer is for coverage but the actual count value does not really matter, then a more effective way of reducing overhead is to teach the instrumentation lowering pass to lower the counter update into a simple store:

counter_1 = 1;

Such stores from the inline instances of the same func can be easily eliminated. It will also help multithreaded programs a lot when there is heavy counter contention.

Another benefit is that the size of the counter can be effectively reduced to 1 byte instead of 8 bytes.
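As a rough source-level illustration of the difference (the counter names are
placeholders, not real profiling symbols):

   #include <cstdint>

   // Counting lowering: a load/add/store of an 8-byte counter.
   uint64_t profc_region;
   void region_executed_counting() { profc_region += 1; }

   // Coverage-only lowering: an unconditional 1-byte flag store. Duplicate
   // stores from inline instances of the same function are trivially
   // redundant, and there is no load to contend on across threads.
   uint8_t profc_region_covered;
   void region_executed_coverage() { profc_region_covered = 1; }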

The tricky situation is when you want coverage data but the count per line is also important -- but I wonder whether having a special pass just to handle this scenario is worth the effort.

Also a side note, I think longer term we should unify the three instrumentation mechanisms into one: FE based, IR based, and the old gcda instrumentation. IR based is the most efficient implementation -- when combined with the gcda profiling runtime, it can be used to replace the current gcda profiling, which is not efficient. It is also possible to use IR based instr for coverage mapping (with the coverage map format mostly unchanged), but the main challenge is passing source info, etc.

thanks,

David

> I actually don't know what is preventing the usual optimization pipeline
> from getting 2 right.

Call Mod-ref. We need to teach the optimizer that the counter owned by the
current function (if the function is proved to be non-recursive in some
way) cannot be modified by any other calls.

David

> Call Mod-ref. We need to teach the optimizer that the counter owned by the
> current function (if the function is proved to be non-recursive in some way)
> cannot be modified by any other calls.

I don't think that's a sufficient explanation. Consider the following
example:

Sean:~/tmp % cat testprofile.cpp
int foo(int n) {
  unsigned Ret = 42;
  for (int i = 0; i < n; i++) {
    if (i % 100) {
      Ret += 789;
    } else {
      Ret *= (283 + Ret);
    }
  }
  return Ret;
}

Sean:~/tmp % ~/pg/release/bin/clang++ -o - -fprofile-instr-generate
testprofile.cpp -S -emit-llvm -O2 >foo.ll
Sean:~/tmp % ~/pg/release/bin/opt -view-cfg foo.ll

(screenshot attached: "Screen Shot 2016-03-10 at 10.12.52 PM.png", showing the CFG of the instrumented function)

-- Sean Silva

I saw your example has a function call in the loop, thus the comment about
Modref.

For this example, the reason seems to be that we don't yet do speculative PRE,
which usually requires profile information. Note that the update is
conditionally done in a branch. Also note that the counter update of block %8
is fully optimized away.

David

> For this example, the reason seems to be that we don't yet do speculative
> PRE, which usually requires profile information. Note that the update is
> conditionally done in a branch. Also note that the counter update of block
> %8 is fully optimized away.

(forgive my ignorance of compiler theory)

Should we generally expect the optimizer to get this right at some point?
(i.e. is it a bug in the optimizer?) Or do you think that a special pass is
needed to get the desired effect for counters for normal use of
instrumentation?
It seems like counter updates will inherently happen only conditionally,
and so this kind of situation will be common.

Counters seem like they have special semantics that allow more
flexibility. For example, the call mod-ref problem is actually not a
problem, due to commutativity. We know that the callee will at most
increment the counter, so as long as the load, increment, store is not
interrupted by a call (which is easy to avoid), then we can freely move it
across calls. (Actually, in theory a call into the profile lib could read
the counters, but in practice I expect this not to be an issue.)

-- Sean Silva

> Should we generally expect the optimizer to get this right at some point?
> (i.e. is it a bug in the optimizer?) Or do you think that a special pass is
> needed to get the desired effect for counters for normal use of
> instrumentation?

This is a limitation of the compiler. It is not specific to counter
variables -- the same applies to user-defined globals. There might be a bug
tracking this.

> We know that the callee will at most increment the counter, so as long as
> the load, increment, store is not interrupted by a call (which is easy to
> avoid), then we can freely move it across calls.

Good observation, but it can cause problems with the no-access-by-call
assumption -- for instance, register promotion of a counter variable across a
call when the counter is actually updated by a call in the loop body. At the
end of the loop, the updates from the calls will be lost.
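A tiny illustration of that hazard (hypothetical names; the callee increments
the same counter that the caller has promoted to a register):

   #include <cstdint>

   uint64_t profc_shared;

   void callee() { profc_shared += 1; }

   void caller(int n) {
     uint64_t Tmp = profc_shared;  // counter promoted to a register
     for (int i = 0; i < n; i++) {
       Tmp += 1;
       callee();                   // callee's increments of profc_shared...
     }
     profc_shared = Tmp;           // ...are overwritten here (lost updates)
   }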

David

Xinliang David Li via llvm-dev <llvm-dev@lists.llvm.org> writes:

> If the answer is for coverage but the actual count value does not really
> matter, then a more effective way of reducing overhead is to teach the
> instrumentation lowering pass to lower the counter update into a simple
> store:
>
>   counter_1 = 1;

This won't work. The FE based instrumentation has implicit counters for
various AST constructs that can only be worked out by doing math with
the explicit counters. If they don't have actual counts, this doesn't
work.
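A schematic example of the point above (not Clang's exact counter assignment):
with one explicit counter on the enclosing 'if' statement and one on the
'then' branch, the 'else' count only falls out by subtraction, which is
meaningless if the counters are lowered to 0/1 flags.

   #include <cstdint>

   // Explicit counters: one for the whole 'if' statement, one for the 'then'
   // branch. The 'else' count is implicit and must be derived by subtraction.
   uint64_t elseCount(uint64_t profc_stmt, uint64_t profc_then) {
     return profc_stmt - profc_then; // not recoverable from 0/1 flags
   }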

> The tricky situation is when you want coverage data but the count per line
> is also important -- but I wonder whether having a special pass just to
> handle this scenario is worth the effort.

Isn't "coverage with count-per-line" the whole point of a coverage
feature?

It's also convenient to gather data for PGO and coverage in the same
run, rather than having to build instrumented twice and run the same
tests/training runs twice. Performance of the FE based instrumentation
is pretty important for productivity reasons.

> Also a side note, I think longer term we should unify the three
> instrumentation mechanisms into one: FE based, IR based, and the old gcda
> instrumentation. [...] It is also possible to use IR based instr for
> coverage mapping (with the coverage map format mostly unchanged), but the
> main challenge is passing source info, etc.

I don't disagree, but I think that getting the same quality of coverage
data from the IR based profiles as we do from the FE ones is a fairly
large undertaking.

> This won't work. The FE based instrumentation has implicit counters for
> various AST constructs that can only be worked out by doing math with
> the explicit counters. If they don't have actual counts, this doesn't
> work.

This depends on how the FE picks regions to instrument. The key part is that
the FE should not generate regions with 'minus' counter expressions; only
'add' is allowed.

> Isn't "coverage with count-per-line" the whole point of a coverage
> feature?

The point of coverage testing is to see if there are any lines of the code
not 'executed' at runtime -- in that sense, the count does not really matter.
Whether a line executes 1000 times or only 1 time is irrelevant. However, it
makes a difference if a line has a '0' count.

Note that the asan-based coverage tool does not do counting either.

> I don't disagree, but I think that getting the same quality of coverage
> data from the IR based profiles as we do from the FE ones is a fairly
> large undertaking.

Right -- I am thinking about this, and there might be good ways to do it
(e.g., let the FE prepare some of the map data with the source info filled in,
and let the backend fill in the counter expressions).

thanks,

David

Sean Silva via llvm-dev <llvm-dev@lists.llvm.org> writes:

> 1. add some sort of alias annotation (such as an independent TBAA root
> node) to all the counter increment memory instructions to tell the
> optimizer they don't interfere with the usual loads and stores.
>
> 2. communicate to the optimizer that counters can be registerized. [...]
>
> I actually don't know what is preventing the usual optimization pipeline
> from getting 2 right.

Yes, we definitely want to teach optimizations to be clever about
instrumentation counters. This kind of thing should help both the FE and
IR based instrumentation.

I think it's somewhat orthogonal to Vedant's approach here though -- even
if we can hoist increments out of loops or do the work in registers some of
the time, collapsing counters that are known to be identical into one memory
location should still bring a fair bit of value.

I wonder about that.

I'm interested in the coverage of a test suite, but I may not use it as a
training set for PGO: why would I train on torture tests that exercise corner
cases of the code base (which I want to be clearly "cold" in the PGO profile)?

There have been a lot of responses. I'll try to summarize the thread and respond
to some of the questions/feedback.

Summary

1. We should teach GlobalDCE to strip out instrumented functions which the
   inliner cannot delete.

2. Sean suggests adding metadata to loads/stores of counters to convey that
   they do not alias normal program data.

   I'm not familiar with AA, but off-hand this seems like a good idea.

3. Sean also suggests improving the optimizer to registerize counters when
   possible.

   Ditto, seems like a good idea.

4. Sean has an alternate proposal for counter pruning which works by remapping
   FE-generated counters to counters generated by the IR instrumentation pass.
   David likes this proposal.

I did not express a preference about this translation scheme, but proposed
using IR based instrumentation longer term :)

   I have some questions about this but am willing to try it. IMO, the pruning
   pass I've suggested could complement this nicely -- I don't view them as
   mutually exclusive (though I could be wrong).

5. There seems to be consensus over the need to improve performance of
   FE-instrumented programs.

6. David is concerned that the proposed pruning pass is too narrow in scope,
   and would like to see results on a larger benchmark. I'll try to get some
   numbers.

7. David mentioned some optimizations that are possible if we make coverage
   information "yes or no" (per line). Justin pointed out that this doesn't
   work for a few of Apple's use cases.

What Justin said, IIUC, is that with the current way implicit counts are
generated, it is not doable. This does not seem to be related to Apple's
use cases.

David

Vedant Kumar <vsk@apple.com> writes:

There have been a lot of responses. I'll try to summarize the thread
and respond to some of the questions/feedback.

...

FE to IR Counter Remapping

I have a question about this plan:

for each CFG edge:
    record which FE counters have ended up associated with it
remove FE counters
run IR instrumentation pass
emit a side table mapping IR instr counters to FE counters

Currently, -instrprof happens early in the pipeline. IIUC this is done to
allow the optimizer to work with load+add+stores, instead of profile update
intrinsics.

It would be an interesting experiment to see what it would look like to
teach optimizations about the instrprof intrinsics and lower them much
later. I suspect knowing that these aren't just stores to random memory
would enable us to make good decisions in various places.

Of course, this might end up spreading too much special case knowledge
through various optimizations and not be worth it.

> It would be an interesting experiment to see what it would look like to
> teach optimizations about the instrprof intrinsics and lower them much
> later. I suspect knowing that these aren't just stores to random memory
> would enable us to make good decisions in various places.

Do you think we could get good enough results by attaching !invariant.load or
AA metadata to lowered profile counter updates?

> Of course, this might end up spreading too much special case knowledge
> through various optimizations and not be worth it.

Say we introduce a counter remapping pass like the one Sean suggested. It
should be run before -instrprof so that we don't waste time lowering a bunch
of instrprof_increment intrinsics which we'll have to throw away later.

But that means that the CFGs that the counter remapping pass operates
on won't reflect changes made by the inliner (or any other
optimizations which alter the CFG), right?

ISTM the pruning pass I've proposed is useful whether we're doing FE-based
instrumentation _or_ late instrumentation. Since it operates on loads+stores
directly, it can clean up redundant counter increments at any point in the
pipeline (after -instrprof).

I'd like to add an interesting data point to back this up. Revisiting the
std::sort example, here's what I get with -fprofile-instrument=llvm (again
using 10^8 array elements, and averaging over 5 runs):

O3: 0.262s
O3 + LLVMInstr: 0.705s
O3 + LLVMInstr + Pruning: 0.644s (47 counter alias mappings created)

So, it *is* possible to see real performance improvements by running a
pruning pass after late IR-based instrumentation.

I still think we need more numbers before moving forward, and will work on
that.

vedant

> I'd like to add an interesting data point to back this up. Revisiting the
> std::sort example, here's what I get with -fprofile-instrument=llvm (again
> using 10^8 array elements, and averaging over 5 runs):
>
>   O3: 0.262s
>   O3 + LLVMInstr: 0.705s
>   O3 + LLVMInstr + Pruning: 0.644s (47 counter alias mappings created)

There is an llvm-pipeline change for llvm instr pending. Once that is in,
the benefit shown here will probably disappear.

David