Contribute a new precise pointer analysis to LLVM

Hi All,

This is Lian Li from Oracle Labs in Brisbane Australia.

We have developed a precise and highly efficient pointer analysis
framework on top of LLVM. The approach is flow-, context-, and
field-sensitive; details are described in the two papers below:

"Boosting the performance of flow-sensitive points-to analysis using
value flow" (in ESEC-FSE 2011), and
"Precise and scalable context-sensitive pointer analysis via value
flow graph" (in ISMM 2013).

The analysis was initially developed for the purpose of bug checking,
and is now extended to support compiler optimizations. We have tested
it with existing compiler optimizations in LLVM, and have seen
promising results.

We are now considering making this analysis available to the LLVM
community, and contributing resources for future maintenance needs,
provided that there is enough interest. We think that a new precise
pointer analysis in LLVM could enable new optimization and analysis
techniques to be developed in LLVM.

Is anyone interested in seeing a new precise pointer analysis in LLVM?

Regards,
Lian Li

Hi Lian,

I am certainly interested in seeing this; do you have performance numbers (compile time)? Also, can you share more information about the promising optimization results you mentioned?

Thanks,
Hal

Hi Hal,

Thanks for your interest.

We tested it with the following existing compiler optimizations in
LLVM on the SPECINT2006 benchmarks:
-dse (dead store elimination),
-gvn (global value numbering),
-licm (loop invariant code motion),
-bb-vectorize (basic block vectorization),
-memcpyopt (memcpy optimization),
-sink (code sinking),
-loop-idiom (loop idiom recognition),
-argpromotion (argument promotion).

As far as I know, this list includes all optimizations in LLVM that
require alias information. With our pointer analysis, for the three
optimizations -dse, -gvn, and -licm, we have observed significant
improvements in the number of optimizations performed. Taking licm as
an example: with our analysis, the number of instructions hoisted out
of loops more than doubled for some benchmarks. We have not seen big
changes in terms of runtime performance, though.

We measure compile time by first linking all modules of the same
benchmark together, then running the analysis on the linked module. It
finishes in less than 1 minute on applications such as httpd and
sendmail, both with more than 100,000 LOC. For some benchmarks it is
time-consuming: the analysis time for GCC is close to 45 minutes. But
if you run the analysis over individual files (as in a normal
compilation process), there is no observable overhead in terms of
compile time.

Regards,
Lian

This sounds very interesting. Even if it isn’t fast enough to be used with (for example) clang -O2 or -O3, having such a precise analysis would be a great baseline for doing other alias analysis work.

Are you interested in contributing this to LLVM itself, or just asking if people would be interested to see the code in some other form?

-Chris

Thanks, Chris.

We are interested in contributing it to LLVM itself. Our manager
agrees to commit resources for maintenance needs if it is accepted by
the community.

Regards,
Lian

I notice you guys formulate your CFL reachability problem as a
balanced parentheses problem.
What algorithm do you use to solve it?

Are you aware of recent work that comes up with linear time and n log
n time algorithms to solve this class of problems:
http://www.cse.cuhk.edu.hk/lyu/_media/paper/pldi2013.pdf

In particular, the time bound from the paper:

"However, if we need the precise pointer information for all
variables, CFL-based
pointer analysis is generally more expensive as it has O(L^3 N^3)
complexity [29], where L is the size of the grammar and N is the
size of the graph."

should now be reduced to O(n + m log m) time, assuming you are
using a standard balanced parentheses formulation.

Hi Daniel,

I want to clarify that our analysis is not based on CFL-reachability.
We apply CFL-reachability to match context information, where the
exit from a function to a call site must match the entry from the
corresponding call site. The problem is a simple balanced parentheses
problem in CFL-reachability, and it can be computed efficiently.

The paper you mentioned is a very nice paper that proposes an
efficient algorithm to solve the Dyck-CFL-reachability problem on
bidirectional graphs. However, precise pointer analysis cannot be
formulated as a Dyck-CFL-reachability problem. The alias analysis
presented in that paper is an over-approximation. Hence the time bound
from our paper still applies.

Regards,
Lian

Hi Daniel,

I want to clarify that our analysis is not based on CFL-reachability.
We apply CFL-reachability to match context information, where the
exit from a function to a call site must match the entry from the
corresponding call site.

Yes, sorry, I pulled the wrong quote, it was late.

The problem is a simple balanced parentheses problem in
CFL-reachability, and it can be computed efficiently.

Yes, it can; however, it was not clear from the paper that you were. :)
You cite a fairly old paper next to that which essentially gives a
>O(1) query time bound for this, but your problem fits exactly into
the bidirectional graph problem, unless I'm missing something.

The paper you mentioned is a very nice paper that proposes an
efficient algorithm to solve the Dyck-CFL-reachability problem on
bidirectional graphs.

Right, and your matching call sites can be formulated as one.

However, precise pointer analysis cannot be formulated as a
Dyck-CFL-reachability problem.

FWIW: This remains to be seen, and it certainly also depends on your
definition of "precise". I agree you cannot currently formulate a
precise Andersen-style field-sensitive pointer analysis this way. As
for whether it is impossible to get within a very, very small
percentage, ...


Thanks, Chris.

We are interested in contributing it to LLVM itself. Our manager
agrees to commit resources for maintenance needs if it is accepted by
the community.

This is great. Please make sure Oracle legal sign off on explicitly granting LLVM the use of the patents associated with the work.

On the compile time front, can you do a comparison with the LLVM default and your new AA? You can just build the LLVM test suite.

Thanks,

Evan

Hi Daniel,

I want to clarify that our analysis is not based on CFL-reachability.
We apply CFL-reachability to match context information, where the
exit from a function to a call site must match the entry from the
corresponding call site.

Yes, sorry, I pulled the wrong quote, it was late.

The problem is a simple balanced parentheses problem in
CFL-reachability, and it can be computed efficiently.

Yes, it can; however, it was not clear from the paper that you were. :)
You cite a fairly old paper next to that which essentially gives a
>O(1) query time bound for this, but your problem fits exactly into
the bidirectional graph problem, unless I'm missing something.

We use the traditional dynamic programming approach to compute value
flows with matching call sites, similar to the implementation of an
IFDS framework. The worst-case complexity sounds high, but it is not
expensive in practice with effective memoisation techniques. Actually,
in our approach, context matching is not the most expensive part. The
new CFL-reachability algorithm may be able to further improve
performance, but we haven't tried that yet.

The paper you mentioned is a very nice paper that proposes an
efficient algorithm to solve the Dyck-CFL-reachability problem on
bidirectional graphs.

Right, and your matching call sites can be formulated as one.

However, precise pointer analysis cannot be
formulated as a Dyck-CFL-reachability problem.

FWIW: This remains to be seen, and it certainly also depends on your
definition of "precise". I agree you cannot currently formulate a
precise Andersen-style field-sensitive pointer analysis this way. As
for whether it is impossible to get within a very, very small
percentage, ...

The analysis presented in the paper approximates pointer dereferences,
which are the tricky and complex part of pointer analysis. It reminds
me of the one-level approach proposed by Manuvir Das in 2000
("Unification-based Pointer Analysis with Directional Assignments"),
which applies a similar approximation to pointer dereferences. How to
formulate an Andersen-style analysis as Dyck-CFL-reachability remains
open, although you can argue that it is not important for pointer
analysis to resolve pointer dereferences precisely.

Regards,
Lian

Thanks Evan, we are working with Oracle legal now on IP and
license-related issues. We will run the experiment with the LLVM test
suite and get back to you with the numbers.

Regards,
Lian

Hi Evan,

We did an experiment using the LLVM test suite: we compare the
overhead of using our analysis to the LLVM default, both with -O2
option.

The overall overhead of compiling the whole test suite using our
analysis is 36.5%. The biggest overhead is observed in
"SingleSource/Benchmarks/Misc/flops-5", where we are 5 times slower:
0.07s (with our analysis) compared to 0.0133s (default). This may not
be accurate, as the compilation times in both cases are quite small.

Note that the compilation time is reported by running our analysis
over individual files; the compile-time overhead will be larger if we
link all individual compilation units together and run the analysis
over the whole application.

Regards,
Lian

Hi Evan,

We did an experiment using the LLVM test suite: we compare the
overhead of using our analysis to the LLVM default, both with -O2
option.

It might also be interesting to try with -O3; I don't know if we have any significant vectorizable loops in the test suite with a large number of arrays, but if we do, this kind of analysis should help.

The overall overhead of compiling the whole test suite using our
analysis is 36.5%.

Maybe a good candidate for -fexpensive-optimizations? (or -O4?)

The biggest overhead is observed in
"SingleSource/Benchmarks/Misc/flops-5", where we are 5 times slower:
0.07s (with our analysis) compared to 0.0133s (default). This may not
be accurate, as the compilation times in both cases are quite small.

In my experience, compile-time measurements from a normal test-suite configuration less than 0.1s are too noisy to use.

Note that the compilation time is reported by running our analysis
over individual files; the compile-time overhead will be larger if we
link all individual compilation units together and run the analysis
over the whole application.

Yes, LTO tests would be interesting too.

Thanks again,
Hal

> Hi Evan,
>
> We did an experiment using the LLVM test suite: we compare the
> overhead of using our analysis to the LLVM default, both with -O2
> option.

It might also be interesting to try with -O3; I don't know if we have any
significant vectorizable loops in the test suite with a large number of
arrays, but if we do, this kind of analysis should help.

We perform loop vectorization at -O2 now?

>
> The overall overhead of compiling the whole test suite using our
> analysis is 36.5%.

Maybe a good candidate for -fexpensive-optimizations? (or -O4?)

It would also be helpful to know what the execution performance impact is
to provide context.

> Hi Evan,
>
> We did an experiment using the LLVM test suite: we compare the
> overhead of using our analysis to the LLVM default, both with -O2
> option.

It might also be interesting to try with -O3; I don't know if we have
any significant vectorizable loops in the test suite with a large
number of arrays, but if we do, this kind of analysis should help.

We perform loop vectorization at -O2 now?

You're right, we do. I had forgotten about that. The code in Driver/Tools.cpp now reads:

/// \brief Vectorize at all optimization levels greater than 1 except for -Oz.
static bool shouldEnableVectorizerAtOLevel(const ArgList &Args) {

(and we SLP vectorize at -O1 too)

-Hal

It might also be interesting to try with -O3; I don't know if we have any
significant vectorizable loops in the test suite with a large number of
arrays, but if we do, this kind of analysis should help.

We perform loop vectorization at -O2 now?

We did an experiment with -O3, and the compilation overhead is about 32%.

The compilation time is reported by running the analysis once over the
linked bc file. It turns out that for the LLVM test suite, running the
analysis over the linked bc file doesn't introduce much extra overhead.

Note that we run the analysis only once in the experiment; the
analysis is not rerun after it is invalidated by an optimization pass.

>
> The overall overhead of compiling the whole test suite using our
> analysis is 36.5%.

Maybe a good candidate for -fexpensive-optimizations? (or -O4?)

It would also be helpful to know what the execution performance impact is to
provide context.

According to our experiments with SPEC, we haven't observed a
significant impact on execution performance. We run a list of 8
optimizations that require alias analysis (including -licm, -gvn,
-bb-vectorize, ...), and for each optimization we rerun our analysis
so that its result can be used by the optimization pass. The biggest
performance improvement we observed is close to 3%, but for some
benchmarks we also see a slowdown of 2%. The overall performance
improvement is less than 0.5%.

Regards,
Lian

Would it be possible to measure performance improvement for a large application or two that compiles with -fno-strict-aliasing?

I would be very interested in a maintained precise pointer analysis
with decent scalability!

My research group routinely does various kinds of analysis that would directly
benefit from a reliable pointer analysis. While scalability is indeed
important,
correctness (especially on edge cases that occur in 'real' code) is the
biggest failure point of the precise analyses we've used (and built) to date.

On that, what sort of work has been done to ensure the results of your
analysis are correct?

LLVM's optimizations are a good litmus test for whether the analysis gets
basic things right, but last time I checked they only query for simple
relations. This means they don't tend to be good checks for ensuring
correctness, something that's critical but also difficult to check
for something as complicated as a precise pointer analysis on
something as non-trivial as LLVM IR.

As an aside, this *might* mean LLVM is not tailored to maximally benefit
from a more precise analysis: it would be a waste of engineering
effort and compile time to have optimization passes that make queries
about properties our current analyses will never be able to prove.
Just a thought.

Regarding correctness, our group has had great success teasing out
analysis bugs with instrumentation passes that add code to dynamically
assert the properties claimed by the analysis or by its clients.
Call-graph checking is a great example: it can involve some rather
tricky situations while being cheap to check dynamically. More
powerful checks are also useful, but have a higher runtime cost.

My co-worker Joshua (on this thread) implemented these as
part of one of our projects, and we've previously discussed
spending some time to make it available to the community.
Please let either of us know if this sounds useful :).

~Will