Alias Analysis Semantics

Hello all,

I’ve read the documentation on alias analysis, and I think I understand it literally, but I just want to be sure, because it seems a bit strange.

As it says on this web page,

The MayAlias response is used whenever the two pointers might refer to the same object.

The PartialAlias response is used when the two memory objects are known to be overlapping in some way, but do not start at the same address.

The MustAlias response may only be returned if the two memory objects are guaranteed to always start at exactly the same location. A MustAlias response implies that the pointers compare equal.

Reading this, it seems the “MustAlias” result is exceptionally strong.

Consider this loop

std::vector A(100);

int* x,y;

for(int i=0; i<100; i++) {

x=&A[i];

y=&A[i];

*y=*x;
}

Here it seems obvious that the load and store on the last line must “alias” in some sense, but the load and store in fact have different values between different iterations of the loop, so if we interpret “always start at exactly the same location” literally, that would mean “mustalias” is false

Is MustAlias considering only the most recent execution? Or is it only considering within the same iteration of a loop? Does MustAlias use the same aliasing model as the other results?

Thank you for clearing this up for me.

Jeremy

Hello all,

I've read the documentation on alias analysis, and I think I understand it
literally, but I just want to be sure, because it seems a bit strange.

As it says on this web page,

The MayAlias response is used whenever the two pointers might refer to the
same object.

The PartialAlias response is used when the two memory objects are known to
be overlapping in some way, but do not start at the same address.

The MustAlias response may only be returned if the two memory objects are
guaranteed to always start at exactly the same location. A MustAlias
response implies that the pointers compare equal.

Reading this, it seems the "MustAlias" result is exceptionally strong.

Yes.

Consider this loop

std::vector<int> A(100);
int* x,y;

for(int i=0; i<100; i++) {
   x=&A[i];
   y=&A[i];
   *y=*x;
}

Here it seems obvious that the load and store on the last line must "alias"
in some sense, but the load and store in fact have different values between
different iterations of the loop, so if we interpret "always start at
exactly the same location" literally, that would mean "mustalias" is false

I think one of us reads that sentence "wrong" :slight_smile:
The two memory objects here are "x" and "y". They are guaranteed to
always start at the same location, and are pointer equal.
Thus, they are MustAlias.
I don't think "guaranteed to always start at the same location" meant
"each individual pointer is guaranteed to always have the same value
over the lifetime of the program". If so, must-alias would be a
proxy for loop-invariant + a bunch of other stuff, and you'd be able
to simply do replacement of values if they mustalias. But you can't.

In LLVM, a MustAlias b does not imply that the values of a and b
individually do not change, but that "a and b" are pointer equal
always.

What I describe is how, AFAICT, it is implemented in all of the alias passes.
For example, SCEVAA, which actually computes the recurrences for these
loop expressions, has this:
  // This is ScalarEvolutionAliasAnalysis. Get the SCEVs!
   const SCEV *AS = SE->getSCEV(const_cast<Value *>(LocA.Ptr));
   const SCEV *BS = SE->getSCEV(const_cast<Value *>(LocB.Ptr));

   // If they evaluate to the same expression, it's a MustAlias.
   if (AS == BS) return MustAlias;

IE if the two recurrences for the pointers are equal, it's a mustalias.

This should return mustalias for x and y above.

Is MustAlias considering only the most recent execution?

At least as defined, MustAlias should be an invariant that holds over
the lifetime of the program.
This is not a flow-sensitive or context-sensitive result.

Or is it only
considering within the same iteration of a loop? Does MustAlias use the
same aliasing model as the other results?

What do you mean by aliasing model?

Thanks Daniel!

I think you’ve cleared up some of my misconceptions, but I still am a bit confused about some of the corner cases.

Suppose we had something like this

std::vector A(100);

int* x,y;

x=&A[0];

for(int i=0; i<100; i++) {
y=&A[i];
*y=*x;
x=&A[i+1];
}

Would the load and store instructions be MustAlias? I think what it boils down to is the distinction between static instructions and dynamic instructions and how they are handled by alias analysis.

I originally interpreted MustAlias as saying /all/ dynamic instructions A must alias with /all/ dynamic instructions B. However, you’re saying that this is not the case. So it seems to me that instead we are only checking if certain dynamic instructions alias. It’s not clear to me how the dynamic instructions are matched up. For example, in the above code, you could match up the instructions by execution count, and find that they never alias. Or you could match up each dynamic load with the most recent dynamic store, in which case they May/Must alias (depending on how you look at it). This is what I mean by aliasing model.

I’m also a bit unclear how this case would be handled:

std::vector A(100);

int* x,y;

x=&A[0];

for(int i=0; i<100; i++) {
A[i]=5;
}

int z=A[2];

MayAlias? MustAlias? NeverAlias? It depends on which dynamic instructions the semantics refer to.

Thank you,
Jeremy

Thanks Daniel!

I think you've cleared up some of my misconceptions, but I still am a bit
confused about some of the corner cases.

Suppose we had something like this

std::vector<int> A(100);
int* x,y;
x=&A[0];

for(int i=0; i<100; i++) {
   y=&A[i];
   *y=*x;
   x=&A[i+1];
}

Would the load and store instructions be MustAlias?

Yes. They always have the same pointer value at the point of *y = *x.
Though note, in SSA form, you would likely end up with two or three
x's (x_1 = &A[0], x_2 = phi(x_1, x_3), x_3 = &A[i+1]), and only
mustalias(y, x_2) would be true.

I think what it boils
down to is the distinction between static instructions and dynamic
instructions and how they are handled by alias analysis.

I originally interpreted MustAlias as saying /all/ dynamic instructions A
must alias with /all/ dynamic instructions B. However, you're saying that
this is not the case.

Right. It's a query about specific loads and stores.

So it seems to me that instead we are only checking if
certain dynamic instructions alias.

It's not clear to me how the dynamic
instructions are matched up.

The question is what happens at the point of the load/store. The
queries you perform in LLVM are about a specific load and a specific
store.

At the point of *y = *x, y MUSTALIAS x.
There are other points this is not true, but that's not relevant to
*this store* and *this load*.

There is no generic query mechanism to ask "does this relationship
hold over all points of the program" (and if there was, it would
return false :P)

I'm also a bit unclear how this case would be handled:

std::vector<int> A(100);
int* x,y;
x=&A[0];

for(int i=0; i<100; i++) {
   A[i]=5;
}
int z=A[2];

MayAlias? MustAlias? NeverAlias? It depends on which dynamic instructions
the semantics refer to.

These are all MayAlias.

Hey Daniel,

Thanks again for the help. I’m still a bit confused about the interface to the alias analysis. It seems like we are talking about different interfaces. Has it changed from what the documentation says? As far as I can tell, the documentation takes a specific Value*, and no information about which dynamic execution it is talking about.

When you say “Right. It’s a query about specific loads and stores.” it sounds like you are saying that it does consider dynamic executions, and that you can make queries about specific dynamic executions. I can’t seem to find this option anywhere in the API…

Perhaps someone else could clear the confusion, because I feel like there is a gap of communication.

For example, in this code what would the result be for the first store and the first load?

#include
#include
int main(int c, char** argv) {
int* A=(int*)malloc(100*sizeof(int));

if(c>1) {
for(int i=0; i<5; i++) {
A[i]=5;
}
} else {
A[1]=5;
}
printf("%d\n",A[4]);
free(A);
}

MustAlias? Because whenever they are executed, they do alias. Or MayAlias?

Jeremy

Hey Daniel,

Thanks again for the help. I'm still a bit confused about the interface to
the alias analysis. It seems like we are talking about different
interfaces.

Sorry, yes, i was thinking of a different interface.

Has it changed from what the documentation says? As far as I
can tell, the documentation takes a specific Value*, and no information
about which dynamic execution it is talking about.

The actual standard low level interface is:

virtual AliasResult alias(const Location &LocA, const Location &LocB);
(There is a value wrapper for this)

I think part of the confusion here is that you keep redefining
pointers and expecting the results of queries to change.

Outside of globals, LLVM is in SSA form. Thus, these are queries will
end up being about specific pointers that will only be defined once.
In all of your examples, you have asked about local variables. These
local variables will be put in SSA. Every time you redefine the
pointer value, it will generate a new definition that is separate from
the old ones.
If you transform your examples above into globals, you may get
different answers.

When you say "Right. It's a query about specific loads and stores." it
sounds like you are saying that it does consider dynamic executions, and
that you can make queries about specific dynamic executions. I can't seem
to find this option anywhere in the API..

Yes, sorry, I was thinking of the MemoryDependence interface, which
does give you that information.
AliasAnalysis does not.

Perhaps someone else could clear the confusion, because I feel like there is
a gap of communication.

For example, in this code what would the result be for the first store and
the first load?

#include <cstdio>
#include <cstdlib>
int main(int c, char** argv) {
int* A=(int*)malloc(100*sizeof(int));

if(c>1) {
  for(int i=0; i<5; i++) {
    A[i]=5;
  }
} else {
  A[1]=5;
}
printf("%d\n",A[4]);
free(A);
}

MustAlias? Because whenever they are executed, they do alias. Or MayAlias?

Since you seem to be having trouble going all the way down the rabbit
hole here, let me take a different approach.

In this example, A[i] is not an LLVM location or value *.
You will never see A[i] in the LLVM IR to query about.
You will see two pointers as this:
   %i.0 = phi i32 [ 0, %4 ], [ %11, %10 ]
   %8 = sext i32 %i.0 to i64
   %9 = getelementptr inbounds i32* %2, i64 %8
   store i32 5, i32* %9, align 4

and
%16 = getelementptr inbounds i32* %2, i64 4
   %17 = load i32* %16, align 4

The query will be alias(%9, %16)

In this case, it cannot prove it is MustAlias, but it is actually MustAlias.
It will, however, return MayAlias because it is not powerful enough to prove it.

If you want to know what llvm *does* in any situation, i would use:
opt -scev-aa -tbaa -basicaa -aa-eval -print-no-aliases
-print-may-aliases -print-must-aliases

However, note that this will not give you "what could it answer" only
"what does it answer", which is very different.

Sigh, too early in the morning, pasted the right output, wrote the
answer while looking at the wrong pass output.
These two pointers will always be MayAlias.
LLVM later actually deletes the loop, and transforms the above into a
memset + a load from memset, and ends up with two mustalias pointers
in that case.

Let me take your previous example in an attempt to hopefully explain why “dynamic instructions” do not matter to the answer:
You asked about:

std::vector A(100);

int* x,y;
x=&A[0];

for(int i=0; i<100; i++) {
y=&A[i];
*y=*x;
x=&A[i+1];
}

This is not what it looks like in LLVM.

In LLVM, it looks like this:

std::vector A(100);

int* x,y;

x_1=GEP A, 0, 0

for(int i=0; i<100; i++) {

i_2 = phi (0, i_3)
x_2 = phi(x_1, x_3)
y_1 = GEP A, 0, i_2
temp = load x_2
store y_1, temp
temp2 = add i_2, 1
x_3 = GEP A, 0, temp2
i_3 = add i_2, 1
}

As you can see, every time you redefine the value of the pointer x to a new value, it gets a new name.

alias(x_1, y_1) = mayalias
alias(x_2, y_1) = mustalias
alias(x_3, y_1) = mayalias

The only interesting question, IMHO, is whether alias(x_3,y_1) is MayAlias or NoAlias.
x_2 clearly has the same value as y_1 at all times throughout the program.

Hi Daniel,

Sorry for taking so long to respond. I spoke with a colleague more familiar with llvm who thought he could clear up my confusion, but we both came out of the conversation confused. I will try my best to explain the ambiguity.

In an DAG, alias queries would be completely unambiguous. Every instruction would only be executed once, and every SSA value really would have a single static assignment. Any instruction would have a single value per execution, and when you ask alias(x,y) it would be a question of whether those single memory addresses overlapped.

However, once backedges are introduced, and instruction can be executed multiple times. Every time the execution passes over a backedge, any instructions that are executed again, may have completely different values. Now, when you ask alias(x,y), it’s ambiguous which pairs of addresses that query considers.

In some cases, it seems clear. For example, in the loops we have been talking about, it seems like you are saying that it only compares the values in the same iteration, but does not say anything about aliasing between iterations. These are perfectly reasonable semantics for loops, but it does not define what the alias queries mean in general. For example, consider the following CFG:

​If we ask a query alias(x,y) about instructions in basic blocks B and C respectively, what will those answers mean? There is always a backedge between their executions so are they considered to be different iterations and therefore they never alias?

I get that there may be some way of understanding the alias queries in which “dynamic instructions” do not matter to the answer, but whatever that way is, it must also be possible to use that way to explain things in terms of the “dynamic instructions” that the CPU ends up executing. And at least for me that would resolve any confusion I have about the AliasAnalysis.

Thank you again for your time,

Jeremy

From: "Jeremy Salwen" <jeremysalwen@gmail.com>
To: "Daniel Berlin" <dberlin@dberlin.org>
Cc: "llvmdev" <llvmdev@cs.uiuc.edu>
Sent: Wednesday, August 20, 2014 7:58:46 PM
Subject: Re: [LLVMdev] Alias Analysis Semantics

Hi Daniel,

Sorry for taking so long to respond. I spoke with a colleague more
familiar with llvm who thought he could clear up my confusion, but
we both came out of the conversation confused. I will try my best to
explain the ambiguity.

In an DAG, alias queries would be completely unambiguous. Every
instruction would only be executed once, and every SSA value really
would have a single static assignment. Any instruction would have a
single value per execution, and when you ask alias(x,y) it would be
a question of whether those single memory addresses overlapped.

However, once backedges are introduced, and instruction can be
executed multiple times. Every time the execution passes over a
backedge, any instructions that are executed again, may have
completely different values. Now, when you ask alias(x,y), it's
ambiguous which pairs of addresses that query considers.

In some cases, it seems clear. For example, in the loops we have been
talking about, it seems like you are saying that it only compares
the values in the same iteration, but does not say anything about
aliasing between iterations. These are perfectly reasonable
semantics for loops, but it does not define what the alias queries
mean in general. For example, consider the following CFG:

If we ask a query alias(x,y) about instructions in basic blocks B and
C respectively, what will those answers mean? There is always a
backedge between their executions so are they considered to be
different iterations and therefore they never alias?

Perhaps Danny will have a better answer, but I think about it this way: AliasAnalysis only answers queries about variable relationships at points along the dynamic execution path of the program. For a query to be well formed, there much be some point (some instruction in the LLVM IR) at which both values being queried are simultaneously live. In your example, there is no block dominated by both B and C, and so no point in the program where it would be legal to refer to both x and y. As a result, the query itself is meaningless. Do you have some transformation or analysis in mind that would issue such a query?

-Hal

Hi Hal,

Thank you for your email, that makes a lot of sense to me. I am working on some tools to use memory profiling to speculatively replace memory loads and stores with value forwarding in hardware implementations. I’d like to compare the profiled data to static alias analysis, so it would be super useful if there was a way to answer the questions about aliasing across backedges that AliasAnalysis considers meaningless. Perhaps the MemoryDependence would allow this?

Thank you,

Jeremy

From: "Jeremy Salwen" <jeremysalwen@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvmdev" <llvmdev@cs.uiuc.edu>, "Daniel Berlin" <dberlin@dberlin.org>
Sent: Thursday, August 21, 2014 12:07:03 AM
Subject: Re: [LLVMdev] Alias Analysis Semantics

Hi Hal,

Thank you for your email, that makes a lot of sense to me. I am
working on some tools to use memory profiling to speculatively
replace memory loads and stores with value forwarding in hardware
implementations. I'd like to compare the profiled data to static
alias analysis, so it would be super useful if there was a way to
answer the questions about aliasing across backedges that
AliasAnalysis considers meaningless. Perhaps the MemoryDependence
would allow this?

MemoryDependenceAnalysis might do what you want (in the sense that it will do PHI translation and, IIRC, could tell you that along some execution path some store to x in B might be read by some load of y in C, for example). There is also a DependenceAnalysis analysis for loops that will give you inter-iteration dependence information. It might be easier to help you if you could provide some small example of the transformation you'd like to perform.

-Hal

I think I see the fundamental issue you are looking at. It is mentioned implicitly in the discussions, but not called out. In your CFG example there is no backedge (head always dominates tail), only retreat edges. So your graph is irreducible. For such graphs “simultaneous live” means that there be can be two dynamic execution paths where the variables (memory locations, objects etc) are simultaneously live, but they may not be live at the same time on all execution paths. Since LLVM uses SSA, all variables (memory locations, objects, … ) are strictly defined, so there cannot be irreducible dependence graphs. I believe this assumption is built into the alias queries. So to catch cases like in your scenario I think you need to base your queries on classical dataflow analysis.

-Gerolf