(RFC) Encoding code duplication factor in discriminator

Motivation:
Many optimizations duplicate code. E.g., the loop unroller duplicates the loop body, GVN duplicates computation, etc. The duplicated code shares the same debug info with the original code. For SamplePGO, the debug info is used to represent the profile, so code duplication will affect profile accuracy. Take loop unrolling as an example:

#1 foo();
#2 for (i = 0; i < N; i++) {
#3   bar();
#4 }

If N is 8 at runtime, a reasonable profile would look like:

#1: 10
#3: 80

But if the compiler unrolls the loop by a factor of 4, the callsite to bar() is duplicated 4 times and the profile will look like:

#1: 10
#3: 20

The sample count for #3 is 20 because each of the 4 call sites to bar() is sampled 20 times, and they all share the same debug location (#3), so 20 is attributed to #3. (When one debug location is mapped to multiple instructions, the maximum sample count among those instructions is used as the debug location's sample count.)
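
For illustration, here is a rough C++ sketch of that attribution rule (a hypothetical helper, not the actual profile-reader code):

#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Key instruction samples by (line, discriminator); the count attributed to a
// debug location is the max over all instructions that share it, not the sum.
using DebugLoc = std::pair<unsigned, unsigned>; // (line, discriminator)

std::map<DebugLoc, uint64_t>
attributeSamples(const std::vector<std::pair<DebugLoc, uint64_t>> &InstSamples) {
  std::map<DebugLoc, uint64_t> Counts;
  for (const auto &S : InstSamples)
    Counts[S.first] = std::max(Counts[S.first], S.second);
  return Counts;
}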

When this profile is loaded back into the compiler, it will conclude that the loop trip count is 2 instead of 8.

Proposal:
When the compiler duplicates code, it encodes the duplication info in the debug info. Since the duplication is not interesting to the debugger, I propose to encode it as part of the discriminator.

There are 2 types of code duplication:

  1. Duplicated code that is guaranteed to have the same execution count (e.g. loop unroll and loop vectorize). We record the duplication factor; for the above example, "4" is recorded in the discriminator.
  2. Duplicated code that may have different execution counts (e.g. loop peel and GVN). For the same debug location, a unique number is assigned to each copy and encoded in the discriminator.

Assume that the discriminator is a uint32. The traditional discriminator is less than 256, so let's take 8 bits for it. For the duplication factor (type 1 duplication), we assume the maximum unroll_factor * vectorize_factor is less than 256, so 8 bits for it. For the unique number (type 2 duplication), we assume code is duplicated at most 32 times, so 5 bits for it. Overall, we still have 11 free bits left in the discriminator encoding.
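
A minimal sketch of one possible packing with those field widths (the layout and helper names here are illustrative only, not an existing LLVM API):

#include <cstdint>

// Bits 0-7: traditional (base) discriminator; bits 8-15: duplication factor;
// bits 16-20: per-copy unique number (clone id); bits 21-31: unused.
uint32_t packDiscriminator(uint32_t Base, uint32_t DupFactor, uint32_t CloneId) {
  return (Base & 0xFFu) | ((DupFactor & 0xFFu) << 8) | ((CloneId & 0x1Fu) << 16);
}

uint32_t baseDiscriminator(uint32_t D) { return D & 0xFFu; }
uint32_t duplicationFactor(uint32_t D) { return (D >> 8) & 0xFFu; }
uint32_t cloneId(uint32_t D)           { return (D >> 16) & 0x1Fu; }

Note that under this layout the unrolled calls in the example below would carry 0x400 (duplication factor 4 in bits 8-15) rather than 0x40; Paul's reply later in the thread makes the same point.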

Taking the original source as an example, after loop unrolling and peeling the code may look like:

for (i = 0; i < (N & ~3); i += 4) {
  foo(); // discriminator: 0x40
  foo(); // discriminator: 0x40
  foo(); // discriminator: 0x40
  foo(); // discriminator: 0x40
}
if (i++ < N) {
  foo(); // discriminator: 0x100
  if (i++ < N) {
    foo(); // discriminator: 0x200
    if (i++ < N) {
      foo(); // discriminator: 0x300
    }
  }
}

The cost of this change would be an increase in debug_line size because (1) we are recording more discriminators and (2) discriminators become larger and will take more bytes to encode as ULEB128.
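
To make the ULEB128 cost concrete, here is a small illustrative helper computing the encoded size (7 payload bits per byte):

#include <cstdint>

// Values up to 0x7F take 1 byte, up to 0x3FFF take 2 bytes, and so on.
unsigned uleb128Size(uint64_t Value) {
  unsigned Bytes = 0;
  do {
    Value >>= 7;
    ++Bytes;
  } while (Value != 0);
  return Bytes;
}

// e.g. uleb128Size(0x40) == 1, uleb128Size(0x400) == 2, uleb128Size(0x10000) == 3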

The benefit is that the SamplePGO profile can accurately represent the code's execution frequency, and we do not need to introduce new building blocks into the debug info.

Comments?

Thanks,
Dehao

Is there prior art for this sort of thing (in GCC, for example)? I take it this isn't the first time this has come up as a problem for profile accuracy, so it would be helpful to know about prior solutions, and if we're not doing whatever was done before, what it is about our situation that's different, or why it hasn't been a problem before, etc.

Do you have an estimate of the debug_line size increase? I guess it will be small.

David

You are right, this problem also exists in GCC, and it was not solved there; we just left it open. Now that we are tuning LLVM aggressively, I think it's a good time to raise this issue and work out a thorough solution, so that we have a good chance of getting better performance than GCC AutoFDO.

Dehao

The impact on debug_line is actually not small. I have only implemented part 1 (encoding the duplication factor) for loop unrolling and loop vectorization. The debug_line size overhead for "-O2 -g1" binaries of the SPEC CPU C/C++ benchmarks:

433.milc       | 23.59%
444.namd       |  6.25%
447.dealII     |  8.43%
450.soplex     |  2.41%
453.povray     |  5.40%
470.lbm        |  0.00%
482.sphinx3    |  7.10%
400.perlbench  |  2.77%
401.bzip2      |  9.62%
403.gcc        |  2.67%
429.mcf        |  9.54%
445.gobmk      |  7.40%
456.hmmer      |  9.79%
458.sjeng      |  9.98%
462.libquantum | 10.90%
464.h264ref    | 30.21%
471.omnetpp    |  0.52%
473.astar      |  5.67%
483.xalancbmk  |  1.46%
mean           |  7.86%

Dehao

The large percentages are from the tiny benchmarks. If you look at omnetpp (0.52%) and xalancbmk (1.46%), the increase is small. To get a more representative average increase, you can sum the total debug_line size before and after and compute the percentage from that.

David

If we use the sum instead of the geomean, the difference is 5.59%.

Yes, most of the size growth is from small applications, with one exception: h264ref, which has a significant number of loops to unroll/vectorize.

Dehao

It looks like the example doesn’t use the encoding described in the text?

> Assume that the discriminator is a uint32. The traditional discriminator is
> less than 256, so let's take 8 bits for it. For the duplication factor
> (type 1 duplication), we assume the maximum unroll_factor * vectorize_factor
> is less than 256, so 8 bits for it. For the unique number (type 2
> duplication), we assume code is duplicated at most 32 times, so 5 bits for
> it. Overall, we still have 11 free bits left in the discriminator encoding.
>
> Taking the original source as an example, after loop unrolling and peeling
> the code may look like:
>
> for (i = 0; i < (N & ~3); i += 4) {
>   foo(); // discriminator: 0x40
>   foo(); // discriminator: 0x40
>   foo(); // discriminator: 0x40
>   foo(); // discriminator: 0x40
> }
> if (i++ < N) {
>   foo(); // discriminator: 0x100
>   if (i++ < N) {
>     foo(); // discriminator: 0x200
>     if (i++ < N) {
>       foo(); // discriminator: 0x300
>     }
>   }
> }

If we allocate 8 bits to “traditional” discriminators, then 0x40 falls into that range, so I’d think the calls to foo() inside the loop should be using 0x400 to encode the unroll factor. Note this requires 2 bytes for ULEB128 instead of 1.

And if we allocate another 8 bits to the unroll factor, then the trailing calls should use 0x10000, 0x20000, 0x30000. These will require 3 bytes for ULEB128 instead of 2.

I think it would be fine to allocate only 4 bits to “traditional” discriminators (as you need them to be unique across the same source location, but not across different source locations, and 16 independent basic blocks for the same source location seems like plenty to me; but I haven’t looked at a lot of cases with discriminators). This would allow you to use 0x40 to encode the unroll factor in this example. If you still want to allocate 8 bits for unroll factor, then the trailing calls would use 0x1000, 0x2000, 0x3000 so as long as you had no more than 3 trailing calls you can still encode the discriminator in a 2-byte ULEB128.

–paulr

Hi Dehao,

This is definitely an important problem, thanks for writing this up!

There is a related problem that I think we can address at the same time: When we multiversion code, for example when we use runtime checks to enable the creation of a vectorized loop while retaining the scalar loop, and then we collect profiling data, we should be able to recover the relative running time of the different versions of the loop. In the future, I suspect we'll end up with transformations that produce even more versions of loops (Intel's compiler, for example, has a useful pragma that allows the user to request loop specializations for a specified set of trip counts).

I'd like to have a scheme where the source location + discriminator can be mapped to information about the relevant loop version so that our profiling tools can display that information usefully to the user, and so that our optimizations can make useful choices (i.e. don't bother vectorizing a loop when the scalar loop is run a lot but the vectorized version almost never runs).

In short, I think that differentiating these different regions using the discriminator seems like a natural choice, but "And we do not need to introduce new building blocks to debug info" might not save us all that much in the long run. To keep information on what regions correspond to what optimizations, we may need to do that. That's not a bad thing, and I'd rather we solve this in a way that is extensible. Plus, this might make it easier to use fewer bits, thus helping the overall impact on the size of the debug sections.

Thanks again,
Hal

> It looks like the example doesn't use the encoding described in the text?
>
> If we allocate 8 bits to "traditional" discriminators, then 0x40 falls
> into that range, so I'd think the calls to foo() inside the loop should be
> using 0x400 to encode the unroll factor.

You are right, thanks for pointing this out.

> Note this requires 2 bytes for ULEB128 instead of 1.
>
> And if we allocate another 8 bits to the unroll factor, then the trailing
> calls should use 0x10000, 0x20000, 0x30000. These will require 3 bytes for
> ULEB128 instead of 2.
>
> I think it would be fine to allocate only 4 bits to "traditional"
> discriminators (as you need them to be unique across the same source
> location, but not across different source locations, and 16 independent
> basic blocks for the same source location seems like plenty to me; but I
> haven't looked at a lot of cases with discriminators). This would allow
> you to use 0x40 to encode the unroll factor in this example. If you still
> want to allocate 8 bits for unroll factor, then the trailing calls would
> use 0x1000, 0x2000, 0x3000 so as long as you had no more than 3 trailing
> calls you can still encode the discriminator in a 2-byte ULEB128.

I did some evaluation on the SPEC CPU 2006 C/C++ benchmarks. Among all the sources built, there are 268K non-zero discriminators emitted. The largest discriminator is 779 (coming from 471.omnetpp, which has a significant amount of EH code). The 99th percentile is 18; the 85th percentile is 3. For the duplication factor, the largest is 320, the 99th percentile is 40, and the 85th percentile is 4.

If we want to encode this into the discriminator, I would propose encoding something like:

high bits ----------> low bits
EEEEEEEECCCCCFFDDD CCCFFFFFFDDDDD

D: normal discriminator
F: duplication factor
C: code cloning
E: empty bits

so that the lower 14 bits cover the 99th percentile.

Or something like:

high bits ----------> low bits
EEEEEEEECCCCCFFDDD CFFFDDD CCFFFDD

so that the lower 7 bits cover the 85th percentile and the lower 14 bits cover the 99th percentile.

Dehao

> There is a related problem that I think we can address at the same time:
> When we multiversion code, for example when we use runtime checks to enable
> the creation of a vectorized loop while retaining the scalar loop, and then
> we collect profiling data, we should be able to recover the relative
> running time of the different versions of the loop. In the future, I
> suspect we'll end up with transformations that produce even more versions
> of loops (Intel's compiler, for example, has a useful pragma that allows
> the user to request loop specializations for a specified set of trip
> counts).
>
> I'd like to have a scheme where the source location + discriminator can be
> mapped to information about the relevant loop version so that our profiling
> tools can display that information usefully to the user, and so that our
> optimizations can make useful choices (i.e. don't bother vectorizing a loop
> when the scalar loop is run a lot but the vectorized version almost never
> runs).

That's definitely a valid and important use case, and it is important to sample PGO too. That's why I proposed to have "duplicated code that may have different execution counts" recorded. Will that suffice to get the info you want? (I.e., for every version of the multi-versioned loop, you will have a distinct discriminator associated with all the code it expands.)

> In short, I think that differentiating these different regions using the
> discriminator seems like a natural choice, but "And we do not need to
> introduce new building blocks to debug info" might not save us all that
> much in the long run. To keep information on what regions correspond to
> what optimizations, we may need to do that. That's not a bad thing, and I'd
> rather we solve this in a way that is extensible. Plus, this might make it
> easier to use fewer bits, thus helping the overall impact on the size of
> the debug sections.

I agree that if we want to extend this in the future, we need to have separate DWARF bits other than the discriminator. For the current use case, the discriminator seems to be good enough, and if we encode efficiently, it will be better than introducing a new field: e.g., we can encode all the info in a 1-byte ULEB128 85%~90% of the time, but for a new field we would need at least 2 bytes if both discriminator and cloning info exist for an instruction.

Dehao

> The largest discriminator is 779 (coming from 471.omnetpp, which has a
> significant amount of EH code).

779 distinct blocks coming from a single source location? That's astounding.

> Or something like:
>
> high bits ----------> low bits
> EEEEEEEECCCCCFFDDD CFFFDDD CCFFFDD
>
> So the lower 7 bits should be able to cover the 85th percentile and the
> lower 14 bits should be able to cover the 99th percentile.

Having a scheme for compact representation for the vast majority of cases is great, and will really help keep the size of the section under control. Did you have a plan for the degenerate cases where one of these elements (D/F/C) exceeds the specified capacity? You already have one, because 779 > 8 bits.

Thanks,

–paulr

> 779 distinct blocks coming from a single source location? That's
> astounding.

Agree, but that's a rare case. All the code in the landing pads of one super large function, LargeNet::doBuildInside, is mapped to the same line (the end of the function), so we end up assigning a new discriminator to each one of them.

> Having a scheme for compact representation for the vast majority of cases
> is great, and will really help keep the size of the section under control.
> Did you have a plan for the degenerate cases where one of these elements
> (D/F/C) exceeds the specified capacity? You already have one, because
> 779 > 8 bits.

That's right, we still have 8 empty bits; maybe we can extend D to 10 bits. We can also set a cap on D, F, and C.

Dehao

> That's definitely a valid and important use case, and it is important to
> sample PGO too. That's why I proposed to have "duplicated code that may
> have different execution counts" recorded. Will that suffice to get the
> info you want? (I.e., for every version of the multi-versioned loop, you
> will have a distinct discriminator associated with all the code it
> expands.)

I don't know. Can you explain how the process will work? By the time the code/metadata arrives at, say, the loop vectorizer, how can we tell whether the vectorized version we might now create will be executed (based on having profiling data from a run where the compiler might have previously made a similar choice)?

> I agree that if we want to extend this in the future, we need to have
> separate DWARF bits other than the discriminator. For the current use case,
> the discriminator seems to be good enough, and if we encode efficiently, it
> will be better than introducing a new field: e.g., we can encode all the
> info in a 1-byte ULEB128 85%~90% of the time, but for a new field we would
> need at least 2 bytes if both discriminator and cloning info exist for an
> instruction.

Is this because you need at least one more byte for the debug-info field?

In general, I don't really care where we stuff the bits so long as we can get the necessary information back out. For a vectorized loop, for example, we should be able to figure out which counts go to the vectorized loop vs. the scalar loop. I don't, however, want to end up with something that is super-non-extensible (e.g. we have only a few bits, so the vectorizer can get one and the unroller can get one, but loop distribution is out of luck). Maybe we need an 'extension' bit saying that there is more information encoded elsewhere?

Thanks again,
Hal

> I don't know. Can you explain how the process will work? By the time the
> code/metadata arrives at, say, the loop vectorizer, how can we tell whether
> the vectorized version we might now create will be executed (based on
> having profiling data from a run where the compiler might have previously
> made a similar choice)?

For example, for the following loop:

for (i = 0; i < N; i++)
  stmt;

After many loop optimizations it could become something like:

if (vectorized1)
  for (i = 0;.....)
     stmt; //uid: 1
else if (vectorized2)
  for (i = 0;.....)
    stmt; //uid: 2
else if (unrolled)
  for (i = 0; i < (N & ~0x3); i += 4)
    stmt; stmt; stmt; stmt; //uid: 3
  for (; i < N; i++)
    stmt; //uid: 4
else
  for (; i < N; i++)
    stmt; //uid: 5

So basically, each clone of the loop will have a uid, which you can use to
identify which version of the loop the instruction was coming from.
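
As a rough sketch (hypothetical data structures, not an existing interface), a profiling tool could then pick out which clone of a source line actually ran:

#include <cstdint>
#include <map>
#include <utility>

// Map (source line, clone uid) -> sample count, as decoded from discriminators.
using LineCloneKey = std::pair<unsigned, unsigned>;

unsigned hottestClone(const std::map<LineCloneKey, uint64_t> &Counts, unsigned Line) {
  unsigned BestUid = 0;
  uint64_t BestCount = 0;
  for (const auto &Entry : Counts)
    if (Entry.first.first == Line && Entry.second > BestCount) {
      BestCount = Entry.second;
      BestUid = Entry.first.second;
    }
  return BestUid; // e.g. uid 1 => the vectorized1 version dominated
}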

> Is this because you need at least one more byte for the debug-info field?
>
> In general, I don't really care where we stuff the bits so long as we can
> get the necessary information back out. For a vectorized loop, for example,
> we should be able to figure out which counts go to the vectorized loop vs.
> the scalar loop. I don't, however, want to end up with something that is
> super-non-extensible (e.g. we have only a few bits, so the vectorizer can
> get one and the unroller can get one, but loop distribution is out of
> luck). Maybe we need an 'extension' bit saying that there is more
> information encoded elsewhere?

As illustrated in the above example, it is not like "vectorization has a distinct bit". All the different optimizations make clones of code, which will be labeled by uids represented by N (e.g. 8) bits. In this way, the space is capped by the number of clones all optimizations have made, instead of the number of optimizations that have been applied, and it will be capped at 2^N-1. The downside of using a uid is that you will not know whether a clone is coming from vectorization or unroll or loop distribution.

Thanks,
Dehao

> As illustrated in the above example, it is not like "vectorization has a
> distinct bit". All the different optimizations make clones of code, which
> will be labeled by uids represented by N (e.g. 8) bits. In this way, the
> space is capped by the number of clones all optimizations have made,
> instead of the number of optimizations that have been applied, and it will
> be capped at 2^N-1. The downside of using a uid is that you will not know
> whether a clone is coming from vectorization or unroll or loop
> distribution.

Okay, but that kind of semantic mapping is important. How should we encode/recover that information? To be clear, I'm not saying that we need to implement that up front, but there needs to be a clear path to an implementation, because I don't want to have two disjoint schemes.

Thanks again,
Hal

> Okay, but that kind of semantic mapping is important. How should we
> encode/recover that information? To be clear, I'm not saying that we need
> to implement that up front, but there needs to be a clear path to an
> implementation, because I don't want to have two disjoint schemes.

You mean that you want to know which optimization created the clone? How would you use that info? It looks to me like this would expose compiler implementation details in the debug info.

This is still doable: assuming we have 15 interesting optimizations to track, we can use 4 bits to encode the optimization type that created the clone. But this becomes nasty if a clone is created by more than one optimization. In that case, the discriminator may not be a good fit for this purpose.

Thanks,
Dehao

> > Okay, but that kind of semantic mapping is important. How should we
> > encode/recover that information? To be clear, I'm not saying that we need
> > to implement that up front, but there needs to be a clear path to an
> > implementation, because I don't want to have two disjoint schemes.
>
> You mean that you want to know which optimization created the clone? How
> would you use that info? It looks to me like this would expose compiler
> implementation details in the debug info.
>
> This is still doable: assuming we have 15 interesting optimizations to
> track, we can use 4 bits to encode the optimization type that created the
> clone. But this becomes nasty if a clone is created by more than one
> optimization. In that case, the discriminator may not be a good fit for
> this purpose.

My understanding was that the encoding scheme would allow the profiling analysis to correctly map execution data back to the original source construct, while preserving the property that each distinct basic block would have its own discriminator value. That is, the execution data would be attributed back to the original source construct, not whatever each individual optimization had done to it, and the data for the original source construct would correctly reflect the execution (e.g. profiling says you got 82 hits on the original loop, rather than reporting 20 hits on the unrolled-by-4 loop plus 1 each on 2 of the trailing copies).
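
For example, a hypothetical combining step along those lines (illustrative only) would multiply each clone's count by its duplication factor and sum across clones:

#include <cstdint>
#include <vector>

struct CloneSample {
  uint64_t Count;     // samples attributed to this clone
  uint32_t DupFactor; // duplication factor decoded from its discriminator
};

// The unrolled-by-4 body sampled 20 times contributes 20 * 4 = 80; two
// executed trailing copies contribute 1 each, giving 82 for the original loop.
uint64_t combineForSourceLine(const std::vector<CloneSample> &Clones) {
  uint64_t Total = 0;
  for (const CloneSample &C : Clones)
    Total += C.Count * C.DupFactor;
  return Total;
}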

It sounds like Hal is thinking that the per-discriminator execution info would be preserved down to the point where an individual optimization could look at the profile for each piece, and make decisions on that basis.

I’m not clear how that would be possible, as the optimization would have to first do the transform (or predict how it would do the transform) in order to see which individual-discriminator counts mapped to which actual blocks, and then make some kind of decision about whether to do the transform differently based on that information. Then, if the optimization did choose to do the transform differently, then that leaves the IR in a state where the individual discriminators cannot map back to it. (Say you unroll by 2 instead of 4; then you have only 1 trailing copy, not 3, and a discriminator that maps to the second trailing copy now maps to nothing. The individual-discriminator data becomes useless.)

Am I expressing this well enough to show that what Hal is looking for is not feasible?

–paulr

> It sounds like Hal is thinking that the per-discriminator execution info
> would be preserved down to the point where an individual optimization could
> look at the profile for each piece, and make decisions on that basis.
>
> Am I expressing this well enough to show that what Hal is looking for is
> not feasible?

If Hal's proposal is for SamplePGO purposes, let me clarify some design principles of SamplePGO.

The profile for sample PGO uses the source location as the key to map execution counts back to the IR. This design is based on the principle that we do not want the profile to be tightly coupled with the compiler IR; instead, the profile is simply an attribute of the source code. We have benefited a lot from this design: the profile can easily be reused across different source versions and compiler versions, or even across compilers.

That being said, the design that encodes more info into the discriminator does not mean that we will change the profile. The encoded info in the discriminator will be handled by the create_llvm_prof tool, which combines counts from the different clones of the same source code and generates the combined profile data. The output profile will not have any cloning/duplication bits at all. So for the initial example profile I provided, the output profile will be:

That being said, the design to encode more info into discriminator does not mean that we will change the profile. The encoded info in discriminator will be handled by the create_llvm_prof tool, which combines counts from different clones of the same source code and generate the combined profile data. The output profile will not have any cloning/dupliaction bits at all. So for the initial example profile I provided, the output profile will be: