[GSoC 2016] Better Alias Analysis By Default - Mid Term Summary

Dear LLVM Community,

This is a brief summary of what I've done so far for CFL-AA, and what I plan to do next.

tl;dr: CFL-AA is getting saner. Low-hanging fruits on its improvement have almost been picked up. I can either make CFL-AA more precise (with certain performance cost), or teach other passes to capitalize on CFL-AA better as the next step. Comments and suggestions are more than welcomed.

Before my project starts, I asked on the mailing list for testing CFL-AA. According to Geoff Berry (http://lists.llvm.org/pipermail/llvm-dev/2016-May/099900.html),
- Enabling CFL-AA in the complilation pipeline did not cause any correctness issues.
- Enabling CFL-AA in the complilation pipeline did not affect performance of the compiled program in any significant way.

In other words, CFL-AA was very conservative at that moment, to the point that it did not offer much beyond what BasicAA can do. Fortunately, over the last month the situation has been improved. Here is a list of precision enhancement I made for CFL-AA:

- [r271322] Remove aliasing relations between GEP pointers and GEP indices. Before this patch, CFL-AA will claim that a and b may-alias each other in the following codes:

int a[10], b[10];
a[N] = ...;
b[N] = ...;

This seemingly insane behavior was actually there to safeguard certain cases where people do crazy stuffs with pointer arithmetics. Later we figured out that those crazy behaviors were in fact UB and therefore we could drop support for it.

- [r271421] Teach CFL-AA to understand function attributes. CFL-AA used to treat opaque function calls in a very conservative manner. However, we know that library calls such as malloc(), free() and strcmp() are marked with special attributes we can use to help resolve aliasing queries. With this patch, we can safely rely on CFL-AA to say "noalias" for a and b in the following code snippet:

char* a = (char*) malloc(len);
char* b = (char*) malloc(len);
... // popluate string a and b here
if (strcmp(a, b) == 0)

- [r272040] Improve precision for inttoptr/ptrtoint. If CFL-AA found this snippet:

int a, b, *p = &a, *q = &b;
int c= (int)p + (int)q;

It used to report may-alias for p and q, just because both of them are casted to integers. This was later relaxed in a way that pointers are only treated conservatively when you cast it back from integers.

- [r272335, r272690, r273219, r273229] Improvement on inter-procedural analysis. This is the one single feature of CFL-AA that enables it to really offer something that BasicAA doesn't. The work is still ongoing, but the central idea here is that we can make CFL-AA look past function calls and yield almost the same result as if the function call was inlined. So if we have

int *p = ..., *q = /* something other than *p */;
foo(&p, &q);
... // use p and q here

As long as CFL-AA can see the body of foo(), and as long as foo() doesn't do anything that makes p alias q, CFL-AA won't count p and q as may-alias pairs when analyzing the codes after the call to foo(). I don't think this is currently possible with BasicAA.

Looking forward, there are certainly many other ways to further improve CFL-AA's precision (e.g. adding field sensitivity, and making the analysis inclusion-based rather than equality-based). However, I feel like I've almost pushed the analysis to a point where in exchange for more precision we may inevitably start to pay extra cost and observe some noticeable performance drops. If the performance problem turns out to be not that bad, I could just keep going where I'm headed.

But if it becomes a problem, there is also a plan B: Notice that what I've said is all about alias queries. In LLVM, clients of alias analysis also care about other things like mod-ref info, in which area CFL-AA hasn't improved that much. Also, code snippets I listed before to justify my patches look pathetically simple and very unrealistic. What I really want to look into is some real-world client passes running on real-world benchmarks, and see if any precision gets lost during the interaction between the client pass and the alias analysis pass.

If you have any comments or suggestions, please let me know. Also, if you have any concrete examples in mind that existing AAs (including CFL-AA) can't do but you them to be handled properly, please tell me! I am more than happy to make everyone's code a little bit faster :slight_smile:

What exactly is the semantic justification here? I'm asking because
there are a number of crucial use cases for aliasing global arrayish
variables behind the compiler like linker sets. We had quite some fun in
NetBSD with newer GCC introducing breakage by exploiting UB in fun ways.
It would be nice to avoid breaking valid applications in system/embedded



The use-case we dropped support for, specifically, was:

void foo() {
int a;
int *ap = &a;
int *ap2 = (int*)((char*)NULL + (uintptr_t)&a);
// now ap and ap2 may alias

Our justification is that LLVM (specifically, BasicAA) already explicitly doesn’t support this pattern, so we shouldn’t have to support it in CFLAA.

Dear Joerg,

I also want to add that except for the interprocedural analysis part, whatever changes I made to CFL-AA I always tried avoid being semantically more aggressive than BasicAA. So if certain UB gets exploited in CFL-AA now, BasicAA would probably do the same as well.