RFC: PGO Late instrumentation for LLVM

Date: Tue, 1 Sep 2015 14:21:16 -0700
From: Rong Xu via llvm-dev <llvm-dev@lists.llvm.org>
Cc: llvm-dev <llvm-dev@lists.llvm.org>, David Li <davidxl@google.com>

*(2) Performance impact of context sensitivity*
LLVM does not use the profile information fully in the back-end
optimizations. For instance, inlining does not fully use the profile counts
-- it only marks hot/cold function attributes based on function entry
counts. To evaluate the impact of profile context sensitivity, GCC is used
in the experiment. Note that GCC PGO improves clang performance a lot more
than clang PGO.

First we summarize the methodology used in the experiment:
0) build clang with GCC O2 without early inlining and measure clang's
   performance. GCC early inlining (einline) is similar to the pre-inline
   used by late instrumentation.
1) build clang with GCC O2 with early inlining and measure performance.
   The performance difference of 1) and 0) is denoted as E, which measures
   the contribution of early inlining.
2) build clang with GCC O2 + PGO without early inlining.
3) build clang with GCC O2 + PGO with early inlining.
   The performance difference of 3) and 2) is denoted as EC. It consists of
   roughly two parts: a) early inlining contribution, b) context sensitive
   profiling enabled with early inlining.

The contribution of context sensitive profiling can be estimated by EC - E
above.
-------------------------------------------------------------------------------
Config                         wall_time_for_use  speedup_vs_(0)  speedup_vs_(1)
(0) base w/o einline           84.946             1.000           0.934
(1) base O2                    79.310             1.071           1.000
(2) profile-arcs w/o einline   63.518             1.337           1.249
(3) profile-arcs               48.364             1.756           1.640

We see the following:
1) GCC PGO with early inlining improves clang performance by 64.0% (vs.
   base O2 w/ early inline).
2) GCC PGO w/o early inlining improves clang performance by 33.7% (vs.
   base O2 w/o early inline).
3) Early inlining performance contribution is about 7.1%.
4) Profile context sensitivity contribution is estimated to be 22.2%
   (i.e. 64.0% - 33.7% - 7.1%), which is pretty significant.
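
For readers who want to re-derive the percentages, here is a minimal sketch that recomputes the speedup columns and the quantities E and EC from the four wall times in the table. The function and variable names are made up for illustration and are not part of the original experiment. Evaluated as written, the subtraction in item 4 (64.0 - 33.7 - 7.1) comes to 23.2 rather than 22.2, and EC - E computed from the wall times comes to about 24.2; the sketch prints these values without trying to settle which estimate was intended.

```python
# Wall times copied from the table in the post (units as reported there).
WALL_TIME = {
    0: 84.946,  # (0) base w/o einline
    1: 79.310,  # (1) base O2
    2: 63.518,  # (2) profile-arcs w/o einline
    3: 48.364,  # (3) profile-arcs
}


def speedup(baseline, config):
    """Speedup of `config` over `baseline`: ratio of their wall times."""
    return WALL_TIME[baseline] / WALL_TIME[config]


def gain_pct(baseline, config):
    """The same speedup expressed as a percentage improvement."""
    return (speedup(baseline, config) - 1.0) * 100.0


E = gain_pct(0, 1)                # early inlining alone, no PGO
EC = gain_pct(2, 3)               # early inlining on top of PGO
pgo_no_einline = gain_pct(0, 2)   # item 2 in the list above
pgo_einline = gain_pct(1, 3)      # item 1 in the list above

# The subtraction exactly as written in item 4.
literal_item4 = 64.0 - 33.7 - 7.1

print(f"E              = {E:.1f}%")
print(f"EC             = {EC:.1f}%")
print(f"EC - E         = {EC - E:.1f}%")
print(f"PGO w/o einl.  = {pgo_no_einline:.1f}%")
print(f"PGO w/ einline = {pgo_einline:.1f}%")
print(f"64.0-33.7-7.1  = {literal_item4:.1f}")
```

Note that the 64.0% and 33.7% figures use different baselines, (1) and (0) respectively, which is one reason the different ways of combining them do not agree exactly.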

Rong,
Sorry for the late response. Just wanted to clarify my understanding of
data in (2) Performance impact of context sensitivity.

On clang as an application:
3) Early inlining contribution is about 7.1%,
2) PGO w/o early inlining contribution is about 33.7%,

4) so the additional combined effect of 2 and 3 is about 22.2%, correct?
In other words, just avoiding inlining small/simple callees and updating
their profile counts in the call graph by the main inliner - all through
the use of early inlining - improves clang performance by 22.2%.

Thanks,
Ivan

Subject: Re: [llvm-dev] RFC: PGO Late instrumentation for LLVM


> Rong,
> Sorry for the late response. Just wanted to clarify my understanding of
> data in (2) Performance impact of context sensitivity.
>
> On clang as an application:
> 3) Early inlining contribution is about 7.1%,

This is the effect of pre-inlining without profile guidance.

> 2) PGO w/o early inlining contribution is about 33.7%,
>
> 4) so the additional combined effect of 2 and 3 is about 22.2%, correct?

Not combined effect -- but remaining effect (by excluding 2 and 3)

> In other words, just avoiding inlining small/simple callees and updating
> their profile counts in the call graph by the main inliner - all through
> the use of early inlining - improves clang performance by 22.2%.

Not sure what you mean here. 22% is the estimate of the effect of CS
profile due to clones of profile counters during instrumentation (through
pre-inlining). Profile update with inlining always exists, including in 2).

David
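
The point about "clones of profile counters" can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how GCC or LLVM implement instrumentation: the helper `callee`, the two call loops, and the `site` key are all hypothetical. With a single instrumented copy of a small helper, both callers bump the same branch counters; if the helper is inlined into each caller before instrumentation, each inlined copy gets its own counters, so the profile is separated per calling context.

```python
from collections import Counter


def callee(x, counters, site):
    """A small helper with one branch; `site` selects which counter set to bump."""
    if x > 0:
        counters[(site, "then")] += 1
    else:
        counters[(site, "else")] += 1


def run(context_sensitive):
    """Simulate two callers with opposite branch behavior in the helper."""
    counters = Counter()
    # Caller A always passes positive values.
    for _ in range(100):
        callee(1, counters, "caller_a" if context_sensitive else "shared")
    # Caller B always passes negative values.
    for _ in range(100):
        callee(-1, counters, "caller_b" if context_sensitive else "shared")
    return counters


# One instrumented copy: the branch looks like an even 100/100 split.
merged = run(context_sensitive=False)
# One counter set per inlined copy: each context is fully biased one way.
per_site = run(context_sensitive=True)

print(dict(merged))
print(dict(per_site))
```

In the merged profile the branch appears unbiased, while the per-site profile shows that it is fully predictable within each caller, which is the kind of information later optimizations can exploit.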