(RFC) Encoding code duplication factor in discriminator


From: "Dehao Chen" <dehao@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Paul Robinson" <paul.robinson@sony.com>, "Xinliang David Li"
<davidxl@google.com>, "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Tuesday, November 1, 2016 4:14:43 PM
Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
discriminator

damn... my english is not readable at all when I try to write fast...
trying to make some clarification below, hopefully can make it more
readable...

> Oops... pressed the wrong button and sent out early...

> > If Hal's proposal is for SamplePGO purposes, let me clarify some
> > design principles of SamplePGO.

> > The profile for sample PGO uses source location as the key to map
> > execution counts back to the IR. This design is based on the
> > principle that we do not want the profile to be tightly coupled
> > with the compiler IR. Instead, the profile is simply an attribute
> > of the source code. We have benefited a lot from this design: the
> > profile can easily be reused across different source versions and
> > compiler versions, or even compilers.

> > That being said, the design to encode more info into the
> > discriminator does not mean that we will change the profile. The
> > info encoded in the discriminator will be handled by the
> > create_llvm_prof tool, which combines counts from different clones
> > of the same source code and generates the combined profile data.
> > The output profile will not have any cloning/duplication bits at
> > all. So for the initial example profile I provided, the output
> > profile will be:

> #1: 10
> #3: 80

> Not:

> #1: 10
> #3.0x400: 70
> #3.0x10400: 5
> #3.0x20400: 3
> #3.0x30400: 2
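
If I'm reading this right, the merging step in create_llvm_prof amounts to something like the sketch below. The bit layout is my assumption from the examples in this thread (low bits: base discriminator; bits 8-15: duplication factor; bits 16 and up: copy id), and note that in this example the per-clone counts sum directly to the merged value (70 + 5 + 3 + 2 == 80):

#include <cstdint>
#include <map>
#include <utility>

// (line, raw discriminator) -> sample count, as read from the binary.
using LineDisc = std::pair<unsigned, uint32_t>;

// Fold per-clone counts back into a source-keyed profile by stripping
// the assumed cloning/duplication bits and summing what remains.
std::map<LineDisc, uint64_t>
mergeClones(const std::map<LineDisc, uint64_t> &Raw) {
  std::map<LineDisc, uint64_t> Merged;
  for (const auto &Entry : Raw) {
    unsigned Line = Entry.first.first;
    uint32_t BaseDisc = Entry.first.second & 0xff; // drop clone/dup bits
    Merged[{Line, BaseDisc}] += Entry.second; // line 3: 70+5+3+2 == 80
  }
  return Merged;
}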

> The goal of the proposed change is to make the profile more
> accurately represent attributes of the source.

> The non-goal of the proposed change is to provide more context in
> the profile to represent the behavior of the program in different
> contexts.

Okay, but why? The information will be present in the profiles, why not encode it in the metadata and let the optimizer decide when to sum it up? We should even provide some utility functions that make this transparent for passes that don't care about the distinction.

> If we are pursuing a more context-sensitive profile, that would be a
> very different design in which we would need to read in profiles
> multiple times and have the profile tightly coupled with the
> compiler IR/pass manager. That is doable, but it probably better
> suits instrumentation-based PGO's domain. Comments?

I don't understand why you say we'd need to read the profile multiple times. Can you please explain? I also don't think it needs to be that tightly coupled; we just need each pass that generates multiple copies of things to have some way to generate a unique id for what it's doing. It's all per source location, so I don't even think we need any kind of complicated hashing scheme.

Thanks again,
Hal

From: "Hal Finkel via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Dehao Chen" <dehao@google.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Xinliang David Li"
<davidxl@google.com>
Sent: Tuesday, November 1, 2016 4:26:17 PM
Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
discriminator

> From: "Dehao Chen" <dehao@google.com>

> To: "Hal Finkel" <hfinkel@anl.gov>

> Cc: "Paul Robinson" <paul.robinson@sony.com>, "Xinliang David Li"
> <davidxl@google.com>, "llvm-dev" <llvm-dev@lists.llvm.org>

> Sent: Tuesday, November 1, 2016 4:14:43 PM

> Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator

> damn... my english is not readable at all when I try to write
> fast...
> trying to make some clarification below, hopefully can make it more
> readable...

> > Oops... pressed the wrong button and sent out early...
>

>

> > > If Hal's proposal is for SamplePGO purpose, let me clarify some
> > > design principles of SamplePGO.
> >
>

> > > The profile for sample pgo uses source location as the key to
> > > map
> > > the
> > > execution count back to IR. This design is based on the
> > > principle
> > > that we do not want the profile to be tightly couple with
> > > compiler
> > > IR. Instead, profile is simple an attribute of the source code.
> > > We
> > > have been benefited a lot from this design that the profile can
> > > easily be reused across different source versions and compiler
> > > versions, or even compilers.
> >
>

> > > That being said, the design to encode more info into
> > > discriminator
> > > does not mean that we will change the profile. The encoded info
> > > in
> > > discriminator will be handled by the create_llvm_prof tool,
> > > which
> > > combines counts from different clones of the same source code
> > > and
> > > generate the combined profile data. The output profile will not
> > > have
> > > any cloning/dupliaction bits at all. So for the initial example
> > > profile I provided, the output profile will be:
> >
>

> > #1: 10
>

> > #3: 80
>

> > Not:
>

> > #1: 10
>

> > #3.0x400: 70
>

> > #3.0x10400: 5
>

> > #3.0x20400: 3
>

> > #3.0x30400: 2
>

Also, how does this work for vectorization? For vectorization, you're going to multiply by the duplication factor first? The issue obviously is that each vector instruction counts for VF times as many scalar instructions, and so to get equivalent scalar counts, you need to multiply by VF. I had assumed this is what you were saying when I read the initial e-mail, but if you're also using the same duplication scheme for unrolling, then I think we need some way to differentiate.

-Hal

From: "Hal Finkel via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Dehao Chen" <dehao@google.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Xinliang David Li"
<davidxl@google.com>
Sent: Tuesday, November 1, 2016 4:26:17 PM
Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
discriminator

> Also, how does this work for vectorization? For vectorization,
> you're going to multiply by the duplication factor first? The issue
> obviously is that each vector instruction counts for VF times as
> many scalar instructions, and so to get equivalent scalar counts,
> you need to multiply by VF. I had assumed this is what you were
> saying when I read the initial e-mail, but if you're also using the
> same duplication scheme for unrolling, then I think we need some way
> to differentiate.

In my original proposal, I only record VF*UF if the clone is both
unrolled and vectorized. I did not distinguish between unrolling and
vectorization because the unroll/vectorize decision only depends on
the trip count, which can be derived from the duplication factor. If
more profile info can be used for better unroll/vectorize decision
making, we can definitely add it.


> > The non-goal of the proposed change is to provide more context in
> > the profile to represent the behavior of the program in different
> > contexts.

> Okay, but why? The information will be present in the profiles, why
> not encode it in the metadata and let the optimizer decide when to
> sum it up? We should even provide some utility functions that make
> this transparent for passes that don't care about the distinction.

I see, you are proposing encoding these profiles in the metadata. I
agree that this can work, but from an implementation perspective, it
could lead to great complexity.

The sample PGO pass first reads in the source->count map, uses it to
build a basic_block->count map, and then uses heuristics to propagate
counts and derive branch probabilities. So we have no metadata to
store the raw counts; only the branch probabilities are stored after
the sample PGO pass.

> I don't understand why you say we'd need to read the profile
> multiple times. Can you please explain? I also don't think it needs
> to be that tightly coupled; we just need each pass that generates
> multiple copies of things to have some way to generate a unique id
> for what it's doing. It's all per source location, so I don't even
> think we need any kind of complicated hashing scheme.

As explained above, we only have branch probabilities. Yes, we can
record more instruction/basic-block count context info in additional
metadata. But when you use it in the optimizer, I suppose you will
need to convert it to branch probabilities too?

Let me take an example to demonstrate my understanding of your use of
context-sensitive profile.

original code:
for ()
  stmt1;

optimized code:
if (cond1)
  for()
    stmt1; //100 samples
else if (cond2)
  for()
    stmt1; // 0 samples
else
  for()
    stmt1; // 50 samples

The loop was multi-versioned on cond1 and cond2. With my proposal, the
profile for stmt1 will be combined as 150 samples. If we use this
profile to optimize the original source, after annotation the code
becomes:

for()
  stmt1 // 150 samples

Later, when it comes to the multi-version optimization, it will
transform the code to:

if (cond1)
  for()
    stmt1; //50 samples
else if (cond2)
  for()
    stmt1; // 50 samples
else
  for()
    stmt1; // 50 samples

The 150 samples are evenly distributed to 3 clones as we do not have
context info.

With your proposal, after annotation, the original code becomes:

for()
  stmt1 // 150 samples (M1:100, M2:0, M3:50)

Then during multi-version optimization, the optimizer reads the
metadata, finds that M2 is never taken, and transforms the code to:

if (cond1)
  for()
    stmt1; // 100 samples
else
  for()
    stmt1; // 50 samples

There are two major ways the extra metadata recorded for each copy
can help:

1. It can demonstrate that cond2 is always false and there is no need
to create that version.
2. It can correctly attribute samples to each cloned copy (i.e., 100
and 50, respectively, in this case).

I think the most important benefit is #1. For #2, my guess is scaled count
is already good enough.

Here comes the problem: cond1 and cond2 are both generated by the
compiler, so how do we map the metadata back to them? Using a uid or
special encoding seems fragile, because when an upstream optimization
changes, the compiler may choose to put cond2 in front of cond1, and
the uid will change accordingly. How would we handle cases like this?

Dehao

From: "Dehao Chen" <dehao@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Xinliang David Li"
<davidxl@google.com>
Sent: Tuesday, November 1, 2016 6:41:29 PM
Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
discriminator

> > From: "Hal Finkel via llvm-dev" < llvm-dev@lists.llvm.org >
>

> > To: "Dehao Chen" < dehao@google.com >
>

> > Cc: "llvm-dev" < llvm-dev@lists.llvm.org >, "Xinliang David Li" <
> > davidxl@google.com >
>

> > Sent: Tuesday, November 1, 2016 4:26:17 PM
>

> > Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> > discriminator
>

> > > From: "Dehao Chen" < dehao@google.com >
> >
>

> > > To: "Hal Finkel" < hfinkel@anl.gov >
> >
>

> > > Cc: "Paul Robinson" < paul.robinson@sony.com >, "Xinliang David
> > > Li"
> > > <
> > > davidxl@google.com >, "llvm-dev" < llvm-dev@lists.llvm.org >
> >
>

> > > Sent: Tuesday, November 1, 2016 4:14:43 PM
> >
>

> > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor
> > > in
> > > discriminator
> >
>

> > > damn... my english is not readable at all when I try to write
> > > fast...
> > > trying to make some clarification below, hopefully can make it
> > > more
> > > readable...
> >
>

> >
>

> > > > Oops... pressed the wrong button and sent out early...
> > >
> >
>

> > >
> >
>

> > > > > If Hal's proposal is for SamplePGO purpose, let me clarify
> > > > > some
> > > > > design principles of SamplePGO.
> > > >
> > >
> >
>

> > > > > The profile for sample pgo uses source location as the key
> > > > > to
> > > > > map
> > > > > the
> > > > > execution count back to IR. This design is based on the
> > > > > principle
> > > > > that we do not want the profile to be tightly couple with
> > > > > compiler
> > > > > IR. Instead, profile is simple an attribute of the source
> > > > > code.
> > > > > We
> > > > > have been benefited a lot from this design that the profile
> > > > > can
> > > > > easily be reused across different source versions and
> > > > > compiler
> > > > > versions, or even compilers.
> > > >
> > >
> >
>

> > > > > That being said, the design to encode more info into
> > > > > discriminator
> > > > > does not mean that we will change the profile. The encoded
> > > > > info
> > > > > in
> > > > > discriminator will be handled by the create_llvm_prof tool,
> > > > > which
> > > > > combines counts from different clones of the same source
> > > > > code
> > > > > and
> > > > > generate the combined profile data. The output profile will
> > > > > not
> > > > > have
> > > > > any cloning/dupliaction bits at all. So for the initial
> > > > > example
> > > > > profile I provided, the output profile will be:
> > > >
> > >
> >
>

> > > > #1: 10
> > >
> >
>

> > > > #3: 80
> > >
> >
>

> > > > Not:
> > >
> >
>

> > > > #1: 10
> > >
> >
>

> > > > #3.0x400: 70
> > >
> >
>

> > > > #3.0x10400: 5
> > >
> >
>

> > > > #3.0x20400: 3
> > >
> >
>

> > > > #3.0x30400: 2
> > >
> >
>

> Also, how does this work for vectorization? For vectorization,
> you're
> going to multiply by the duplication factor first? The issue
> obviously is that each vector instruction counts for VF times as
> many scalar instructions, and so to get equivalent scalar counts,
> you need to multiply by VF. I had assumed this is what you were
> saying when I read the initial e-mail, but if you're also using the
> same duplication scheme for unrolling, then I think we need some
> way
> to differentiate.

In my original proposal, I only record VF*UF if the clone is both
unrolled and vectorized. I did not distinguish between unroll and
vectorization because I the unroll/vectorize decision only depends
on the trip count, which can be derived from the duplication factor.
If more profile info can be used for better unroll/vectorize
decision making, we can definitely add it.

I'm still missing something here. In your proposed system, does the tool collecting the profiling information *interpret* the duplication factor encoded in the discriminators at all? If it does, it seems like this should be specific to vectorization. For unrolling, you still have the same number of instructions, so you just need to make the different instructions from the different loop iterations carry different discriminator values (so that they'll sum together, instead of using the maximum, as you explained in your original proposal).


> I see, you are proposing encoding these profiles in the metadata. I
> agree that this can work, but from an implementation perspective, it
> could lead to great complexity.
>
> The sample PGO pass first reads in the source->count map, uses it to
> build a basic_block->count map, and then uses heuristics to
> propagate counts and derive branch probabilities. So we have no
> metadata to store the raw counts; only the branch probabilities are
> stored after the sample PGO pass.

I think that, as you indicate below, we'd need to change this to {source, disc}->count map --> {BB, disc}->count map --> {branch, disc}->prob map. The idea being that the total probability for the branch is the sum of the probabilities associated with it for each discriminator value. Most users will just sum them up, but some passes will want the breakdowns.
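
As a sketch of the sort of utility functions I have in mind (the names and types here are hypothetical, not an existing LLVM API):

#include <cstdint>
#include <map>

// Hypothetical key: a source line plus the discriminator sub-key a
// duplicating pass assigned to its clone.
struct ClonedLoc {
  unsigned Line;
  unsigned Disc;
  bool operator<(const ClonedLoc &O) const {
    return Line != O.Line ? Line < O.Line : Disc < O.Disc;
  }
};

using CountMap = std::map<ClonedLoc, uint64_t>;

// Passes that don't care about the per-clone breakdown just sum all
// discriminator sub-keys for a line; passes that do care iterate them.
uint64_t getTotalCount(const CountMap &M, unsigned Line) {
  uint64_t Total = 0;
  for (auto It = M.lower_bound({Line, 0});
       It != M.end() && It->first.Line == Line; ++It)
    Total += It->second;
  return Total;
}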

> There are two major ways the extra metadata recorded for each copy
> can help:
>
> 1. It can demonstrate that cond2 is always false and there is no
> need to create that version.
> 2. It can correctly attribute samples to each cloned copy (i.e., 100
> and 50, respectively, in this case).
>
> I think the most important benefit is #1. For #2, my guess is scaled
> count is already good enough.

By what are you scaling the count?

> Here comes the problem: cond1 and cond2 are both generated by the
> compiler, so how do we map the metadata back to them? Using a uid or
> special encoding seems fragile, because when an upstream
> optimization changes, the compiler may choose to put cond2 in front
> of cond1, and the uid will change accordingly. How would we handle
> cases like this?

Exactly for this reason, I don't think we can rely on the ordering to generate the identifiers; they need to have semantic meaning. I agree that this is tricky in general. However, for the cases I have in mind (unrolling, vectorization, etc.), these passes don't generate arbitrary conditionals, but rather specific conditionals related to runtime overlap checks, trip-count thresholds, and the like, and I think it should be easy to encode a discriminator value that can be unambiguously inverted.
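
As a sketch of what I mean by identifiers with semantic meaning (the kinds, names, and bit positions below are hypothetical, not a settled encoding):

#include <cstdint>

// Hypothetical version kinds a pass might materialize. Because each
// kind has a fixed code, the mapping survives reordering of the
// emitted conditionals.
enum class VersionKind : uint32_t {
  Default = 0,        // the unspecialized fallback version
  NoAliasChecked = 1, // guarded by a runtime overlap check
  ShortTrip = 2,      // guarded by a trip-count threshold
};

constexpr unsigned VersionShift = 16; // above the base discriminator

constexpr uint32_t encodeVersion(uint32_t BaseDisc, VersionKind K) {
  return BaseDisc | (static_cast<uint32_t>(K) << VersionShift);
}

constexpr VersionKind decodeVersion(uint32_t Disc) {
  return static_cast<VersionKind>(Disc >> VersionShift);
}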

Thanks again,
Hal

From: "Dehao Chen" <dehao@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Xinliang David Li"
<davidxl@google.com>
Sent: Tuesday, November 1, 2016 8:24:30 PM
Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
discriminator

> I'm still missing something here. In your proposed system, does the
> tool collecting the profiling information *interpret* the
> duplication factor encoded in the discriminators at all?


Yes it does.

> If it does, it seems like this should be specific to vectorization.
> For unrolling, you still have the same number of instructions, so
> you just need to make the different instructions from the different
> loop iterations carry different discriminator values (so that
> they'll sum together, instead of using the maximum, as you explained
> in your original proposal).

Let's take a look at the original loop unroll example. My proposed
discriminator assignment is:

for (i = 0; i < (N & ~3); i += 4) {
  foo(); // discriminator: 0x400
  foo(); // discriminator: 0x400
  foo(); // discriminator: 0x400
  foo(); // discriminator: 0x400
}
if (i++ < N) {
  foo(); // discriminator: 0x10000
  if (i++ < N) {
    foo(); // discriminator: 0x20000
    if (i++ < N) {
      foo(); // discriminator: 0x30000
    }
  }
}

IIRC, your proposed discriminator assignment is:

for (i = 0; i < (N & ~3); i += 4) {
  foo(); // discriminator: 0x10000
  foo(); // discriminator: 0x20000
  foo(); // discriminator: 0x30000
  foo(); // discriminator: 0x40000
}
if (i++ < N) {
  foo(); // discriminator: 0x50000
  if (i++ < N) {
    foo(); // discriminator: 0x60000
    if (i++ < N) {
      foo(); // discriminator: 0x70000
    }
  }
}

I think my method can save discriminator space because all clones
share the same discriminator. Consider that if a loop is unrolled 100
times, then with your algorithm we may end up with 100 different
discriminators, which seems bad for debug info size.


> I think that, as you indicate below, we'd need to change this to
> {source, disc}->count map --> {BB, disc}->count map --> {branch,
> disc}->prob map. The idea being that the total probability for the
> branch is the sum of the probabilities associated with it for each
> discriminator value. Most users will just sum them up, but some
> passes will want the breakdowns.

Agreed. Note that this would add some complexity to BFI/BPI.

> By what are you scaling the count?

It's arbitrary; most likely, each branch will be treated as 50% taken.

> Exactly for this reason, I don't think we can rely on the ordering
> to generate the identifiers; they need to have semantic meaning. I
> agree that this is tricky in general. However, for the cases I have
> in mind (unrolling, vectorization, etc.), these passes don't
> generate arbitrary conditionals, but rather specific conditionals
> related to runtime overlap checks, trip-count thresholds, and the
> like, and I think it should be easy to encode a discriminator value
> that can be unambiguously inverted.

I'm still not quite clear how to do this; I need to think it through
more thoroughly. But one thing I want to note is that we want to
encode the context info only when it will be useful for the
optimization. For vectorization/unrolling, I think an accurate trip
count is already good enough. So do we want to record its context?

Thanks,
Dehao

From: "Dehao Chen" <dehao@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Xinliang David Li"
<davidxl@google.com>
Sent: Tuesday, November 1, 2016 8:24:30 PM
Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
discriminator

> > From: "Dehao Chen" < dehao@google.com >
>

> > To: "Hal Finkel" < hfinkel@anl.gov >
>

> > Cc: "llvm-dev" < llvm-dev@lists.llvm.org >, "Xinliang David Li" <
> > davidxl@google.com >
>

> > Sent: Tuesday, November 1, 2016 6:41:29 PM
>

> > Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> > discriminator
>

>

> > > > From: "Hal Finkel via llvm-dev" < llvm-dev@lists.llvm.org >
> > >
> >
>

> > > > To: "Dehao Chen" < dehao@google.com >
> > >
> >
>

> > > > Cc: "llvm-dev" < llvm-dev@lists.llvm.org >, "Xinliang David
> > > > Li"
> > > > <
> > > > davidxl@google.com >
> > >
> >
>

> > > > Sent: Tuesday, November 1, 2016 4:26:17 PM
> > >
> >
>

> > > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication
> > > > factor
> > > > in
> > > > discriminator
> > >
> >
>

> > > > > From: "Dehao Chen" < dehao@google.com >
> > > >
> > >
> >
>

> > > > > To: "Hal Finkel" < hfinkel@anl.gov >
> > > >
> > >
> >
>

> > > > > Cc: "Paul Robinson" < paul.robinson@sony.com >, "Xinliang
> > > > > David
> > > > > Li"
> > > > > <
> > > > > davidxl@google.com >, "llvm-dev" < llvm-dev@lists.llvm.org
> > > > > >
> > > >
> > >
> >
>

> > > > > Sent: Tuesday, November 1, 2016 4:14:43 PM
> > > >
> > >
> >
>

> > > > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication
> > > > > factor
> > > > > in
> > > > > discriminator
> > > >
> > >
> >
>

> > > > > damn... my english is not readable at all when I try to
> > > > > write
> > > > > fast...
> > > > > trying to make some clarification below, hopefully can make
> > > > > it
> > > > > more
> > > > > readable...
> > > >
> > >
> >
>

> > > >
> > >
> >
>

> > > > > > Oops... pressed the wrong button and sent out early...
> > > > >
> > > >
> > >
> >
>

> > > > >
> > > >
> > >
> >
>

> > > > > > > If Hal's proposal is for SamplePGO purpose, let me
> > > > > > > clarify
> > > > > > > some
> > > > > > > design principles of SamplePGO.
> > > > > >
> > > > >
> > > >
> > >
> >
>

> > > > > > > The profile for sample pgo uses source location as the
> > > > > > > key
> > > > > > > to
> > > > > > > map
> > > > > > > the
> > > > > > > execution count back to IR. This design is based on the
> > > > > > > principle
> > > > > > > that we do not want the profile to be tightly couple
> > > > > > > with
> > > > > > > compiler
> > > > > > > IR. Instead, profile is simple an attribute of the
> > > > > > > source
> > > > > > > code.
> > > > > > > We
> > > > > > > have been benefited a lot from this design that the
> > > > > > > profile
> > > > > > > can
> > > > > > > easily be reused across different source versions and
> > > > > > > compiler
> > > > > > > versions, or even compilers.
> > > > > >
> > > > >
> > > >
> > >
> >
>

> > > > > > > That being said, the design to encode more info into
> > > > > > > discriminator
> > > > > > > does not mean that we will change the profile. The
> > > > > > > encoded
> > > > > > > info
> > > > > > > in
> > > > > > > discriminator will be handled by the create_llvm_prof
> > > > > > > tool,
> > > > > > > which
> > > > > > > combines counts from different clones of the same
> > > > > > > source
> > > > > > > code
> > > > > > > and
> > > > > > > generate the combined profile data. The output profile
> > > > > > > will
> > > > > > > not
> > > > > > > have
> > > > > > > any cloning/dupliaction bits at all. So for the initial
> > > > > > > example
> > > > > > > profile I provided, the output profile will be:
> > > > > >
> > > > >
> > > >
> > >
> >
>

> > > > > > #1: 10
> > > > >
> > > >
> > >
> >
>

> > > > > > #3: 80
> > > > >
> > > >
> > >
> >
>

> > > > > > Not:
> > > > >
> > > >
> > >
> >
>

> > > > > > #1: 10
> > > > >
> > > >
> > >
> >
>

> > > > > > #3.0x400: 70
> > > > >
> > > >
> > >
> >
>

> > > > > > #3.0x10400: 5
> > > > >
> > > >
> > >
> >
>

> > > > > > #3.0x20400: 3
> > > > >
> > > >
> > >
> >
>

> > > > > > #3.0x30400: 2
> > > > >
> > > >
> > >
> >
>

> > > Also, how does this work for vectorization? For vectorization,
> > > you're
> > > going to multiply by the duplication factor first? The issue
> > > obviously is that each vector instruction counts for VF times
> > > as
> > > many scalar instructions, and so to get equivalent scalar
> > > counts,
> > > you need to multiply by VF. I had assumed this is what you were
> > > saying when I read the initial e-mail, but if you're also using
> > > the
> > > same duplication scheme for unrolling, then I think we need
> > > some
> > > way
> > > to differentiate.
> >
>

> > In my original proposal, I only record VF*UF if the clone is both
> > unrolled and vectorized. I did not distinguish between unroll and
> > vectorization because I the unroll/vectorize decision only
> > depends
> > on the trip count, which can be derived from the duplication
> > factor.
> > If more profile info can be used for better unroll/vectorize
> > decision making, we can definitely add it.
>

> I'm still missing something here. In your proposed system, does the
> tool collecting the profiling information *interpret* the
> duplication factor encoded in the descriminators at all?

Yes it does.

Okay. I'm still missing something. Can you please write out an example, using your notation, for a vectorized loop?

> IIRC, your proposed discriminator assignment is:

To be clear, I don't think that I've proposed any assignment scheme at all. Regardless, I was asking this question purely in the context of your proposal.

Thanks again,
Hal


> Okay. I'm still missing something. Can you please write out an
> example, using your notation, for a vectorized loop?

Sure. Original code:

for (int i = 0; i < N; i++)
  a[i] = b[i];

Transformed code:

for (int i = 0; i < N / 4; i++)
  vectorized_assign_4(&a[i * 4], &b[i * 4]); // discriminator 0x400, sample count 20
for (int i = N & ~3; i < N; i++)
  a[i] = b[i]; // discriminator 0x10000, sample count 10

The derived sample count for the original a[i] = b[i] is: 20 * 4 + 10 = 90.
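
To make the arithmetic concrete, a minimal sketch of that derivation (assuming the duplication factor lives in bits 8-15 of the discriminator, matching the 0x400 above; the exact layout is illustrative):

#include <cstdint>

// Extract the assumed duplication-factor field; zero means the field
// was never set, i.e. a single copy.
constexpr uint64_t dupFactor(uint32_t Disc) {
  return ((Disc >> 8) & 0xff) ? ((Disc >> 8) & 0xff) : 1;
}

constexpr uint64_t derivedCount(uint64_t Samples, uint32_t Disc) {
  return Samples * dupFactor(Disc);
}

// Vectorized body: 20 samples at 0x400 (factor 4) -> 80.
// Scalar epilogue: 10 samples at 0x10000 (no factor) -> 10.
static_assert(derivedCount(20, 0x400) + derivedCount(10, 0x10000) == 90,
              "matches 20 * 4 + 10 == 90 above");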

> To be clear, I don't think that I've proposed any assignment scheme
> at all. Regardless, I was asking this question purely in the context
> of your proposal.

Oh, I used that example to show my understanding of your reply: "For
unrolling, you still have the same number of instructions, and so you
just need to make the different instructions from the different loop
iterations carry different discriminator values (so that they'll sum
together, instead of using the maximum, as you explained in your
original proposal)." Please let me know if my understanding is
incorrect.

Thanks,
Dehao

I discussed this with Hal, Adrian, and Paul offline at the LLVM dev meeting today.

  • Trip count alone is not enough for vectorization: there are runtime checks that might evaluate false, and that can be reflected in the profile, which we may want to preserve.
  • Simply recording these context profiles may cause problems for iterative sample PGO. I.e., when you find that a loop's vectorized version is not executed (due to a runtime check), you will choose not to vectorize, which is optimal. But when you collect a profile from this binary (optimized with sample PGO, without vectorizing the loop), the loop is not vectorized, so we no longer have the context to demonstrate that the loop should not be vectorized. It will end up being vectorized again, introducing performance instability.

To summarize, more context info may improve performance for one iteration, but the improvement may not be stable across iterations. If we aim at performance stability (which is one of the major goals of this RFC), the profile should only reflect attributes of the source, not compiler transformations.

But there are sample PGO users who do not care about iterative sample PGO performance; for them, as Hal suggested, we should invent a more extensible way to preserve profile context. Apparently the discriminator is not an extensible choice. So how about we just use the discriminator to store attributes of the source (i.e., trip count), and later design new, extensible ways in DWARF to represent more context info?

Adrian also suggested that we may need to consider having a flag or a separate debugging mode to indicate whether we want to emit discriminators, to prevent debug_line size increases.

Dehao

(+Adrian & Paul since they’re mentioned here, +Diego since he might have opinions as he implemented the discriminator stuff originally)

my 2c (that was mentioned here & that I idly discussed with Diego a few weeks ago in no great detail) is also that maybe a separate mode for PGO might be what we want/need, because it's not what sanitizers or simple backtrace tools need & we're slowly growing the size of debug info for profiling needs that we don't need elsewhere. (might be worth doing a full measurement here as we add new features: what's the growth of all the profile-related features, from (and including) discriminators up, and is it still reasonable to keep them with the rest of the debug info use cases, or do they need a separate flag)

But I’m not too fussed/don’t feel terribly strongly, really.

In many cases, the line-table fussing to improve autoFDO/sample-PGO would also likely help the debugging experience for optimized code, certainly in cases where line attribution is inherently ambiguous. In those cases, I have no problem with Just Doing It.

Something likely to pad the line table to benefit profiling without similarly benefiting debugging… that's probably worth inventing a -gprofile or some such. I suspect Dehao's more elaborate ideas about discriminators fall into that category (although IIRC discriminators were invented mainly to benefit profiling in the first place, so maybe having discriminators at all would be something to put under a -gprofile kind of option).

--paulr

> In many cases, the line-table fussing to improve autoFDO/sample-PGO
> would also likely help the debugging experience for optimized code,
> certainly in cases where line attribution is inherently ambiguous.
> In those cases, I have no problem with Just Doing It.

Agreed, certainly.

> that's probably worth inventing a -gprofile or some such. I suspect
> Dehao's more elaborate ideas about discriminators fall into that
> category (although IIRC discriminators were invented mainly to
> benefit profiling in the first place, so maybe having discriminators
> at all would be something to put under a -gprofile kind of option).

Yep

- Dave

-gprofile should have:

  1. emit the linkage name in all subprograms
  2. emit discriminators with the duplication factor and copy factor encoded
  3. use-unknown-locations=true

Here comes the problem: is it an intermediate -g mode between -g1 and -g2? If yes, this means that -g2 will also need to include all three items above. If not, then it should not be called -gprofile, but rather a flag like -femit-debug-for-profiling, so that we can use it with "-g2 -femit-debug-for-profiling" or "-gmlt -femit-debug-for-profiling".

Dehao

(resending with history snipped to prevent needing list moderation to post due to message exceeding size limit)

For our users we fully expect their profile/debug builds to be the same thing (that is, almost all debugging is on fully optimized code, where fully optimized also means part of an iterative profiling workflow). Having extra debug information that only benefits the profiling part seems reasonable. What would be unacceptable is if we ever had to ask our users to trade off debug info that was best for profiling against debugging quality, or vice versa. The "-femit-debug-for-profiling" flag seems to fit this model more clearly, being a strict superset rather than a tuning option.

-Greg

yep, +1, seems like a -f flag to me. (you’d want to be able to combine it with any other level of debugging - no need to have to choose between profiling or debugging)

(Agreed for the record).

-eric

Thanks all for the comments.

I've moved forward with the "add a new -f flag" solution in https://reviews.llvm.org/D25435.

Please comment on that patch about flag naming etc.

Thanks,
Dehao