Google Summer of Code Idea

This email is written on the premise that LLVM will be involved in
GSOC again this year. I noted that the wiki's open projects page [0]
has several possible projects that seem suitable for a Summer of Code
project. I am writing to this list in the hope of getting some
feedback on the specifics of a GSOC application, and also to find out
whether any potential mentors are interested in these ideas.

The Idea:

I would be looking to implement a context-sensitive alias analysis.
Whilst I accept that this isn't the most efficient category of alias
analysis, I think an analysis that trades off computational efficiency
for a more precise computation of aliasing would be a beneficial
addition to the LLVM project. In terms of which algorithm to actually
implement, I am considering a BDD-based algorithm; possible candidates
include:

- Whaley & Lam [2], which simply clones points on the call graph to
generate different contexts and then performs a context-insensitive
analysis; it uses BDDs to reduce memory consumption.
- Zhu [3], which utilises a BDD-based approach in a custom
context-sensitive, flow-insensitive algorithm.

Were I to use a BDD-based approach, I would undoubtedly rely on an
existing implementation, for example BuDDy [4], rather than
implementing my own. The opinions of any experienced LLVM hackers on
algorithm choice would be helpful. I am reasonably flexible about
choosing a different algorithm; at the moment I am personally
undecided. I am interested to hear any feedback on these ideas,
specifically, but in no particular order, with regard to the
following points:

1. Is this too ambitious for a Google Summer of Code project?
2. Would implementing any of these algorithms be tricky due to some
peculiarity of LLVM?
3. Do existing optimisations and DFAs that depend on aliasing
information interface with the alias analysis in a context sensitive
manner, or do they require rewriting?
4. Is anyone willing to mentor this project?

Me:

I am a PhD student at the University of Warwick, England, where my
current research is focused on specifying compiler optimisations in a
domain-specific programming language and formally verifying the
soundness of those optimisations. As part of this project I am in the
process of implementing a tool for optimising Java bytecode by
compiling optimising phases from these specifications. During a
final-year project I also co-wrote a compiler that parallelised Java
source code. I am also well versed in the traditional compiler design
literature, including lattice-based data-flow analysis.

Any feedback on the aforementioned ideas would be welcome.

Regards,

  Richard Warburton

[0] http://llvm.org/OpenProjects.html
[1] http://llvm.org/docs/AliasAnalysis.html
[2] Whaley & Lam, "Cloning-Based Context-Sensitive Pointer Alias
Analysis Using Binary Decision Diagrams", Programming Language Design
and Implementation (PLDI), 2004
[3] Jianwen Zhu, "Symbolic Pointer Analysis", International Conference
on Computer-Aided Design (ICCAD), 2002
[4] BuDDy BDD library (SourceForge)

This email is written on the premise that LLVM will be involved in
GSOC again this year.

I hope so too!

I would be looking to implement a context-sensitive alias analysis.
Whilst I accept that this isn't the most efficient category of alias
analysis, I think an analysis that trades off computational efficiency
for a more precise computation of aliasing would be a beneficial
addition to the LLVM project. In terms of which algorithm to actually
implement, I am considering a BDD-based algorithm; possible candidates
include:

Ok. I think the most important thing to keep in mind, if you want this to be useful for LLVM, is for it to be sound in the presence of incomplete programs. I think it would be very interesting to have a BDD-based analysis in LLVM; it would be useful for performance comparisons and many other things, even if it isn't turned on by default. However, it must be sound.

Also, LLVM benefits quite a bit from mod/ref info for functions. I don't know if you've thought about it at all, but it is an important problem. If you're interested, my thesis describes these issues in detail.
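
For illustration, a minimal sketch of why mod/ref info matters; this is
only an illustrative C++ example of mine, and 'logValue' is a made-up
function, not anything in LLVM:

    #include <cstdio>

    // If mod/ref analysis can prove that logValue only reads memory,
    // the second load of *p below is redundant and can be removed;
    // without that information the call must be assumed to clobber *p.
    void logValue(const int *p) { std::printf("%d\n", *p); }

    int example(int *p) {
      int a = *p;
      logValue(p);   // read-only call: does not modify *p
      int b = *p;    // provably equal to 'a' given the mod/ref info
      return a + b;
    }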

1. Is this too ambitious for a Google Summer of Code project?

It depends on your familiarity with the domain. If you haven't worked in the area of alias analysis (and applications) it probably is. There are lots of smaller subprojects that would be useful for LLVM, though.

2. Would implementing any of these algorithms be tricky due to some
peculiarity of LLVM?

No, I don't think so. LLVM has hosted a lot of AA work already.

3. Do existing optimisations and DFAs that depend on aliasing
information interface with the alias analysis in a context sensitive
manner, or do they require rewriting?

They can benefit from context sensitive mod/ref info.

4. Is anyone willing to mentor this project?

I don't know :) Maybe someone from Vikram's research group would be willing to mentor you. If you are interested in pursuing this, I'd suggest making a proposal, and we can figure out mentorship when the time is closer.

-Chris

Ok. I think the most important thing to keep in mind, if you want
this to be useful for LLVM, is for it to be sound in the presence of
incomplete programs. I think it would be very interesting to have a
BDD-based analysis in LLVM; it would be useful for performance
comparisons and many other things, even if it isn't turned on by
default. However, it must be sound.

Both of my suggested algorithms have been implemented for Java, which,
by virtue of reflection, means one cannot statically determine which
class certain objects refer to. In this instance I believe they
simply assume the most conservative case (i.e. may-use/may-def at
that point could be anything). I suspect that this approach could be
equally applied to an unknown library.
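
As a rough sketch of that conservative fallback (all of the types below
are invented for illustration; none of them are LLVM API):

    // A call whose callee body is unavailable (an external library, or
    // a reflective target in the Java setting) is assumed to read and
    // write any memory.
    enum class ModRefResult { NoModRef, Ref, Mod, ModRef };

    struct CallSummary {
      bool calleeBodyAvailable;
      ModRefResult analysedSummary; // valid only if the body was analysed
    };

    ModRefResult modRefForCall(const CallSummary &call) {
      if (!call.calleeBodyAvailable)
        return ModRefResult::ModRef;  // may-use / may-def anything
      return call.analysedSummary;    // precise summary from the analysis
    }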

Also, LLVM benefits quite a bit from mod/ref info for functions. I
don't know if you've thought about it at all, but it is an important
problem. If you're interested, my thesis describes these issues in
detail.

I'll peruse this. Are there any other relevant, LLVM-specific texts
that are appropriate for this and not linked from the documentation
page [0]?

> 1. Is this too ambitious for a Google Summer of Code project?

It depends on your familiarity with the domain. If you haven't worked
in the area of alias analysis (and applications) it probably is.
There are lots of smaller subprojects that would be useful for LLVM,
though.

So that I can gauge what options there are, could you suggest some
examples of these subprojects?

regards,

  Richard Warburton

[0] http://llvm.cs.uiuc.edu/docs/

Also, LLVM benefits quite a bit from mod/ref info for functions. I
don't know if you've thought about it at all, but it is an important
problem. If you're interested, my thesis describes these issues in
detail.

I'll peruse this. Are there any other relevant, LLVM-specific texts
that are appropriate for this and not linked from the documentation
page [0]?

Not that I know of.

1. Is this too ambitious for a Google Summer of Code project?

It depends on your familiarity with the domain. If you haven't worked
in the area of alias analysis (and applications) it probably is.
There are lots of smaller subprojects that would be useful for LLVM,
though.

So that I can gauge what options there are, could you suggest some
examples of these subprojects?

There are lots of mini projects revolving around the use of better alias analysis:

1. The alias analysis API supports the getModRefBehavior method, which allows the implementation to give a detailed analysis of the functions. For example, we could implement full knowledge of printf/scanf side effects, which would be useful (PR1604).

2. We need some way to reason about errno. Consider a loop like this:

     for (...)
       x += sqrt(loopinvariant);

    We'd like to transform this into:

     t = sqrt(loopinvariant);
     for (...)
       x += t;

    This transformation is safe, because the value of errno isn't otherwise changed in the loop and the exit value of errno from the loop is the same. We currently can't do this, because sqrt clobbers errno, so it isn't "readonly" or "readnone" and we don't have a good way to model this.

   The hard part of this project is figuring out how to describe errno in the optimizer: each libc #defines errno to something different, it seems. Maybe the solution is to have a __builtin_errno_addr() or something and change sys headers to use it.

3. An easy project is to add the 'nocapture' attribute to the LLVM IR (PR2055) and have passes infer and propagate it around. Its presence can significantly improve local alias analysis at very low cost.

4. The globals mod/ref pass basically does really simple and cheap bottom-up context sensitive alias analysis. Keeping it simple and cheap is really important, but there are simple things we could do to better capture the effects of functions that access pointer arguments. This can be really important for C++ methods, which spend lots of time accessing pointers off 'this'.

5. There are lots of ways to optimize out and improve the handling of memcpy/memset (PR452).

6. It would be excellent to turn loops containing scalar stores into memset/memcpy calls (a small sketch of the idea follows this list). This dramatically speeds up programs like 'viterbi' in the testsuite.

7. There may still be some ideas in PR1373 left.
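
To make item 6 concrete, here is a minimal before/after sketch in plain
C++; the function names are my own invention, not anything from the
testsuite:

    #include <cstring>

    // A loop that stores the same byte into every element...
    void zeroLoop(char *buf, unsigned n) {
      for (unsigned i = 0; i != n; ++i)
        buf[i] = 0;                 // n individual scalar stores
    }

    // ...can be rewritten as a single memset, which the code generator
    // can expand into a well-tuned block operation.
    void zeroMemset(char *buf, unsigned n) {
      std::memset(buf, 0, n);
    }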

-Chris

I just updated the Open Projects page with several new things, including those in the email I just sent out.

-Chris

This email is written on the premise that LLVM will be involved in
GSOC again this year. I noted that the wiki's open projects page [0]
has several possible projects that seem suitable for a Summer of Code
project. I am writing to this list in the hope of getting some
feedback on the specifics of a GSOC application, and also to find out
whether any potential mentors are interested in these ideas.

The Idea:

Were I to use a BDD-based approach, I would undoubtedly rely on an
existing implementation, for example BuDDy [4], rather than
implementing my own. The opinions of any experienced LLVM hackers on
algorithm choice would be helpful. I am reasonably flexible about
choosing a different algorithm; at the moment I am personally
undecided. I am interested to hear any feedback on these ideas,
specifically, but in no particular order, with regard to the
following points:

1. Is this too ambitious for a Google Summer of Code project?
2. Would implementing any of these algorithms be tricky due to some
peculiarity of LLVM?

No.
I've got a BDD-based version of Andersen's around already (uncommitted).
It's slower in time (by a factor of 8 or so), but very slightly (1-2
megabytes of memory) more memory efficient in some cases. Nowadays,
the bitmap version is actually more memory efficient than the BDD
version, even if you turn off BDD caching, etc.

The thing you have to understand is that these context sensitive BDD
algorithms work okay on Java, but fall down pretty badly on C/C++,
because it is a much bigger problem. Whaley's is just brute force,
and Zhu's is pretty darn complex.

Both are amenable to the same style of offline variable optimizations
that Hardekopf et al. describe (Ben and a friend have been helping out
with the implementation of Andersen's in LLVM), and that may actually
make them usable for real-world programs.

3. Do existing optimisations and DFAs that depend on aliasing
information interface with the alias analysis in a context sensitive
manner, or do they require rewriting?

It should work okay.

4. Is anyone willing to mentor this project?

I would.

I think you will be surprised how badly these algorithms work on even
moderate-sized C++ programs :(

No.
I've got a BDD-based version of Andersen's around already (uncommitted).
It's slower in time (by a factor of 8 or so), but very slightly (1-2
megabytes of memory) more memory efficient in some cases. Nowadays,
the bitmap version is actually more memory efficient than the BDD
version, even if you turn off BDD caching, etc.

The thing you have to understand is that these context sensitive BDD
algorithms work okay on Java, but fall down pretty badly on C/C++,
because it is a much bigger problem. Whaley's is just brute force,
and Zhu's is pretty darn complex.

I was particularly interested to see whether Whaley's algorithm really
works, given its simplicity and the claims made in the papers.

Both are amenable to the same style of offline variable optimizations
that Hardekopf et al. describe (Ben and a friend have been helping out
with the implementation of Andersen's in LLVM), and that may actually
make them usable for real-world programs.

Are you referring to both the PLDI and the SAS papers?

> 4. Is anyone willing to mentor this project?
I would.

Thanks ;)

I think you will be surprised how badly these algorithms work on even
moderate-sized C++ programs :(

I'm slightly confused here as to whether you mean in terms of
precision of analysis, or compute time taken by running the algorithm.

  Richard

> No.
> I've got a BDD-based version of Andersen's around already (uncommitted).
> It's slower in time (by a factor of 8 or so), but very slightly (1-2
> megabytes of memory) more memory efficient in some cases. Nowadays,
> the bitmap version is actually more memory efficient than the BDD
> version, even if you turn off BDD caching, etc.
>
> The thing you have to understand is that these context sensitive BDD
> algorithms work okay on Java, but fall down pretty badly on C/C++,
> because it is a much bigger problem. Whaley's is just brute force,
> and Zhu's is pretty darn complex.

I was particularly interested to see whether Whaley's algorithm really
works, given its simplicity and the claims made in the papers.

Certainly, it works, and is not hard to implement.

All you need to do is to add a context number to each constraint, and
then you can clone them at every call and increment the context number
fairly easily.
Storing constraint graphs in BDDs is also trivial (Ben's source has a
few solvers that do this, IIRC).
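
A minimal sketch of that bookkeeping (the types below are inventions of
mine for illustration, not LLVM or bddbddb data structures):

    #include <vector>

    // Each inclusion constraint (as in Andersen's analysis) carries a
    // context number; a call clones the callee's constraints under a
    // fresh context.
    enum class Kind { AddressOf, Copy, Load, Store };

    struct Constraint {
      Kind kind;
      unsigned dest;     // constraint-variable ids
      unsigned src;
      unsigned context;  // which calling context this copy belongs to
    };

    std::vector<Constraint> cloneForCall(const std::vector<Constraint> &callee,
                                         unsigned &nextContext) {
      std::vector<Constraint> cloned;
      const unsigned ctx = nextContext++;
      for (const Constraint &c : callee)
        cloned.push_back({c.kind, c.dest, c.src, ctx});
      return cloned;
    }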

> Both are amenable to the same style of offline variable optimizations
> that Hardekopf et al. describe (Ben and a friend have been helping out
> with the implementation of Andersen's in LLVM), and that may actually
> make them usable for real-world programs.

Are you referring to both the PLDI and the SAS papers?

Yes, as well as some other optimizations that have not been presented.

> > 4. Is anyone willing to mentor this project?
> I would.

Thanks ;)

> I think you will be surprised how badly these algorithms work on even
> moderate-sized C++ programs :(

I'm slightly confused here as to whether you mean in terms of
precision of analysis, or compute time taken by running the algorithm.

Compute time.
None of these algorithms are useful at all in the real world if it
takes 3 years to compute the answer.

The precision of them is, of course, great, assuming you use a sane heap model.

--Dan

I have implemented a tool based on the Phoenix compiler framework at
MSR, which can dump relations as BDDs from C/C++ code and feed them
into John Whaley's bddbddb for further analysis. As Dan said,
currently the performance is not good enough for C/C++ code (due to
more complex relations, BDD variable orders, etc.). I am still trying
several optimizations. John also told me that it took quite a lot of
effort to get the Java version to scale.

I am interested to see an implementation based on LLVM.

Xi

My guess is that if you try doing variable substitution style
optimizations, you will get very very serious speedups.

Usually, out of 80-100k constraint variables, you can unify 60k of them.
In my testing, it reduces the number of constraint variables by 80+%
in non-context sensitive analysis.
bddbddb is already going to be rather good at solving the query
relations, so to get faster, you have to start reducing the size of
the problem itself.
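
A rough illustration of what that unification amounts to (a plain
union-find over constraint-variable ids; this is only the idea, not the
algorithm from the Hardekopf et al. papers):

    #include <numeric>
    #include <vector>

    // Constraint variables shown to be pointer-equivalent by the
    // offline phase are unified, so the solver sees far fewer variables.
    struct UnionFind {
      std::vector<unsigned> parent;
      explicit UnionFind(unsigned n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0u);
      }
      unsigned find(unsigned x) {
        while (parent[x] != x)
          x = parent[x] = parent[parent[x]];  // path halving
        return x;
      }
      void unify(unsigned a, unsigned b) { parent[find(a)] = find(b); }
    };

Once two variables are known to always have identical points-to sets,
unify them and rewrite every constraint in terms of find() of its
variables; the solver (BDD or bitmap) then works on a much smaller
problem.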

FWIW, Ben Hardekopf asked me to post this for him on the subject:

Regarding Whaley and Lam's BDD-based approach to context-sensitive
analysis, there is a paper by the same research group (Avots et al in
ICSE'05) that adapts their technique for C. The results are not very
encouraging -- their system doesn't scale nearly as well for C as it did
for Java. The primary difference is that unlike Java, in C top-level
variables can have their address taken and be passed between functions,
which requires the analysis to maintain 2 contexts for each points-to
relation: one for the pointer and one for the pointee.

The other problem with their method is that only the top-level variables
are treated context-sensitively; everything in the heap (e.g. for Java,
all object fields) is context-insensitive. I spoke with John Whaley about
this and he said they've tried to use a context-sensitive heap model, but
they couldn't get it to scale at all.
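
As a minimal illustration of the distinction (a made-up struct of mine,
not from any of the cited systems): a context-insensitive heap model
names an abstract object by its allocation site alone, whereas a
context-sensitive one names it by (allocation site, calling context),
so the same malloc/new reached from different contexts yields distinct
abstract objects.

    struct AbstractHeapObject {
      unsigned allocSite;  // which allocation instruction in the program
      unsigned context;    // 0 (ignored) in a context-insensitive model
      bool operator==(const AbstractHeapObject &o) const {
        return allocSite == o.allocSite && context == o.context;
      }
    };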

My personal feeling (backed up by results from some other research groups,
e.g. Chen et al in ISPAN'02 and Nystrom et al in PASTE'04) is that a
context-sensitive heap model is critical for getting the maximum benefit
from a context-sensitive pointer analysis, and I don't know that BDDs, at
least as currently used, are the right approach for this.

I was impressed by a completely different approach, described by Nystrom
et al in SAS'04 and Nystrom's PhD thesis, for doing context-sensitive
pointer analysis for C with a context-sensitive heap model. Like Whaley
and Lam, and unlike Lattner and Adve's Data Structure Analysis, their
technique is inclusion-based (i.e. a context-sensitive version of
Andersen's), but unlike Whaley and Lam it also uses a context-sensitive
heap model, and it's targeted at C. They managed to scale the analysis to
a couple hundred thousand lines of code. I actually have several ideas on
how to improve their analysis and make it more scalable, but I haven't had
time to implement them -- mainly because I haven't wanted to re-implement
their whole analysis in LLVM. I would definitely be interested in any
effort to do this, and I think it would have a lot of benefit. In fact, I
supervised another student who implemented a simplified version of this
analysis for a class project; while the code itself probably isn't
re-usable, the student made a number of good observations on the analysis
that I would be happy to share with anyone who is interested.

Thanks,

Ben

Regarding Whaley and Lam's BDD-based approach to context-sensitive
analysis ...

<snip>

Thanks for posting this. In that case, it's probably a good idea if I
don't go down this route to context sensitivity.

I was impressed by a completely different approach, described by Nystrom
et al in SAS'04 and Nystrom's PhD thesis, for doing context-sensitive
pointer analysis for C with a context-sensitive heap model. Like Whaley
and Lam, and unlike Lattner and Adve's Data Structure Analysis, their
technique is inclusion-based (i.e. a context-sensitive version of
Andersen's), but unlike Whaley and Lam it also uses a context-sensitive
heap model, and it's targeted at C. They managed to scale the analysis to
a couple hundred thousand lines of code. I actually have several ideas on
how to improve their analysis and make it more scalable, but I haven't had
time to implement them -- mainly because I haven't wanted to re-implement
their whole analysis in LLVM. I would definitely be interested in any
effort to do this, and I think it would have a lot of benefit. In fact, I
supervised another student who implemented a simplified version of this
analysis for a class project; while the code itself probably isn't
re-usable, the student made a number of good observations on the analysis
that I would be happy to share with anyone who is interested.

I'll have a read through Nystrom's approach. Are there publicly
available implementations of this algorithm anywhere else? If he is
already in this region in terms of scalability, then improvements
might make a context-sensitive analysis actually practical. Either
way, if the claims of this paper hold true then it seems like a viable
option for implementation. Would it be possible to get Ben
Hardekopf's suggested improvements and his student's ideas posted to
the list?

Regards,

  Richard