RFC: PGO Late instrumentation for LLVM

> Date: Tue, 1 Sep 2015 14:21:16 -0700
> From: Rong Xu via llvm-dev <llvm-dev@lists.llvm.org>
> Cc: llvm-dev <llvm-dev@lists.llvm.org>, David Li <davidxl@google.com>
> Subject: Re: [llvm-dev] RFC: PGO Late instrumentation for LLVM
>>>> *(2) Performance impact of context sensitivity*
>>>> LLVM does not use the profile information fully in the back-end
>>>> optimizations, for instance, inlining does not fully use the profile
>>>> counts -- it only marks hot/cold function attribute based on function
>>>> entry counts. To evaluate the impact of profile context sensitivity,
>>>> GCC is used in the experiment. Note that GCC PGO improves clang
>>>> performance a lot more than clang PGO.
>>>>
>>>> First we summarize the methodology used in the experiment:
>>>> 0) build clang with GCC O2 without early inlining and measure clang's
>>>> performance. GCC early inlining (einline) is similar to pre-inline
>>>> used by late instrumentation.
>>>> 1) build clang with GCC O2 with early inlining and measure performance.
>>>> The performance difference of 1) and 0) is denoted as E which measures
>>>> the contribution of early inlining.
>>>> 2) build clang with GCC O2 + PGO without early inlining.
>>>> 3) build clang with GCC O2 + PGO with early inlining.
>>>> The performance difference of 3) and 2) is denoted as EC. It
>>>> constitutes roughly two parts a) early inlining contribution b) context
>>>> sensitive profiling enabled with early inlining.
>>>> The contribution of context sensitive profiling can be estimated by
>>>> EC - E above.
>>>> -------------------------------------------------------------------------------
>>>> Config                        wall_time_for_use  speedup_vs_(0)  speedup_vs_(1)
>>>> (0) base w/o einline          84.946             1.000           0.934
>>>> (1) base O2                   79.310             1.071           1.000
>>>> (2) profile-arcs w/o einline  63.518             1.337           1.249
>>>> (3) profile-arcs              48.364             1.756           1.640
>>>> We see the following:
>>>> 1) GCC PGO with early inlining improves clang performance by 64.0%
>>>> (v.s. base O2 w/ early inline).
>>>> 2) GCC PGO w/o early inlining improves clang performance by 33.7%
>>>> (v.s. base O2 w/o early inline).
>>>> 3) Early inlining performance contribution is about 7.1%.
>>>> 4) Profile context sensitivity contribution is estimated to be 22.2%
>>>> (i.e. 64.0% - 33.7% - 7.1%), which is pretty significant.
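
The two speedup columns in the table above follow directly from the wall times. A minimal sketch that recomputes them, assuming speedup is defined as baseline time divided by measured time (the helper names `speedup` and `speedup_table` are made up for illustration and do not come from the thread):

```python
# Wall-clock times in seconds, copied from the quoted table.
WALL_TIME = {
    "(0) base w/o einline": 84.946,
    "(1) base O2": 79.310,
    "(2) profile-arcs w/o einline": 63.518,
    "(3) profile-arcs": 48.364,
}


def speedup(baseline: float, measured: float) -> float:
    """Speedup of `measured` relative to `baseline` (above 1.0 means faster)."""
    return baseline / measured


def speedup_table(times: dict) -> dict:
    """Map each config to (speedup vs. config 0, speedup vs. config 1)."""
    base0 = times["(0) base w/o einline"]
    base1 = times["(1) base O2"]
    return {
        name: (round(speedup(base0, t), 3), round(speedup(base1, t), 3))
        for name, t in times.items()
    }


table = speedup_table(WALL_TIME)
for name, (vs0, vs1) in table.items():
    print(f"{name:30s} {vs0:.3f} {vs1:.3f}")
```

Under that assumption the rounded values agree with the table: the 64.0% figure is config (3) against config (1), and the 33.7% figure is config (2) against config (0).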
>> Rong,
>> Sorry for the late response. Just wanted to clarify my understanding of
>> data in (2) Performance impact of context sensitivity.
>>
>> On clang as an application:
>> 3) Early inlining contribution is about 7.1%,
>
> This is the effect of pre-inlining without profile guidance.
>
>> 2) PGO w/o early inlining contribution is about 33.7%,
>> 4) so the additional combined effect of 2 and 3 is about 22.2%,
>> correct?
>
> Not combined effect -- but remaining effect (by excluding 2 and 3)
>
>> In other words, just avoiding inlining small/simple callees and updating
>> their profile counts in the call graph by the main inliner - all through
>> the use of early inlining - improves clang performance by 22.2%.
>
> Not sure what you mean here. 22% is the estimate of the effect of CS
> profile due to clones of profile counters during instrumentation (through
> pre-inlining). Profile update with inlining always exists, including in 2).

If we compare times for:
(2) profile-arcs w/o einline - 63.518 secs, v.s.
(3) profile-arcs - 48.364 secs,
we get about 31.3% improvement due to early inline with PGO.

If we compare times for:
(0) base w/o einline - 84.946 secs, v.s.
(1) base O2 - 79.310 secs,
we get about 7.1% improvement due to early inline without PGO.

What can we attribute the difference of 24.2% (31.3 - 7.1) to?
31.3% is the total contribution of early inline with PGO.
Is 24.2% the context-sensitivity part of it, meaning that the profile
counts in the call graph are more precise during the inlining process,
inlining decisions are better, etc.?

Ivan

yes -- that is it.

David
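
The decomposition discussed in this exchange can be checked with a few lines of arithmetic. This is only a sketch: `improvement_pct` is a hypothetical helper, and it assumes "improvement" means the ratio of the slower to the faster wall time minus one, which is consistent with the 31.3% and 7.1% figures quoted above.

```python
def improvement_pct(slow_secs: float, fast_secs: float) -> float:
    """Percentage improvement going from the slow time to the fast time."""
    return (slow_secs / fast_secs - 1.0) * 100.0


# E: early inlining without PGO, config (0) vs. config (1).
E = improvement_pct(84.946, 79.310)

# EC: early inlining with PGO, config (2) vs. config (3).
EC = improvement_pct(63.518, 48.364)

# Estimated context-sensitivity contribution, EC - E.
cs_estimate = EC - E

# The other estimate in the thread, evaluated literally as written.
literal_subtraction = 64.0 - 33.7 - 7.1

print(f"E  = {E:.1f}%")
print(f"EC = {EC:.1f}%")
print(f"EC - E = {cs_estimate:.1f}%")
print(f"64.0 - 33.7 - 7.1 = {literal_subtraction:.1f}")
```

With these inputs, E rounds to 7.1, EC to 31.3, and EC - E to 24.2, matching the numbers in the question. Evaluated literally, 64.0 - 33.7 - 7.1 comes to 23.2 rather than the quoted 22.2, which may reflect rounding or a typo in the original message. That expression also mixes percentages measured against different baselines, so it is not the same quantity as EC - E.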