PROPOSAL: struct-access-path aware TBAA

Based on discussions with John McCall

We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.

Fundamental rules from C11

What does this mean? "The type of z extends the type of x, and the type of z is a super-type of y"?

-Krzysztof

Based on discussions with John McCall

We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.

Fundamental rules from C11
--------------------------
An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
1. a type compatible with the effective type of the object,
2. a qualified version of a type compatible with the effective type of the object,
3. a type that is the signed or unsigned type corresponding to the effective type of the object,
4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
6. a character type.

Example
-------
  struct A {
    int x;
    int y;
  };
  struct B {
    A a;
    int z;
  };
  struct C {
    B b1;
    B b2;
    int *p;
  };

Type DAG:
  int <- A::x <- A
  int <- A::y <- A <- B::a <- B <- C::b1 <- C
  int <----------------- B::z <- B <- C::b2 <- C
  any pointer <--------------------- C::stuck_out_tongue: <- C

The type DAG has two types of TBAA nodes:
1> the existing scalar nodes
2> the struct nodes (this is different from the current tbaa.struct)
   A struct node has a unique name plus a list of pairs (field name, field type).
   For example, struct node for "C" should look like
   !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
   where !3 is the struct node for "B", !2 is pointer type.

Given a field access
  struct B *bp = ...;
  bp->a.x = 5;
we annotate it as B::a.x.

In the case of multiple structures containing substructures, how are
you differentiating?

IE given

struct A {
struct B b;
}
struct C {
struct B b;
}

How do you know the above struct B *bp =...; is B::b from C and not B::b from A?

(I agree you can know in the case of direct aggregates, but I argue
you have no way to know in the case of pointer arguments without
interprocedural analysis)
It gets worse the more levels you have.

ie if you add
struct B {
struct E e;
}

and have struct E *e = ...
how do you know it's the B::e contained in struct C, or the B::e
contained in struct A?

Again, i agree you can do both scalar and direct aggregates, but not
aggregates and scalars through pointers.

Implementing the Hierarchy
--------------------------
We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
Each tag can be a sequence of nodes in the type DAG.
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]

This can get quite deep quite quickly.
Are there actually cases where knowing the outermost aggregate type +
{byte offset, size} of field access does not give you the exact same
disambiguation capability?

I ask because in GCC, we determined the answer was "no", since
pointers require special analysis anyway.

  > x |
         extends
  > z |
      super-type
           > y |

What does this mean? "The type of z extends the type of x, and the type of z is a super-type of y"?

Extends x into a subobject z, and z is a super-type of y.

-Manman

Based on discussions with John McCall

We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.

Fundamental rules from C11
--------------------------
An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
1. a type compatible with the effective type of the object,
2. a qualified version of a type compatible with the effective type of the object,
3. a type that is the signed or unsigned type corresponding to the effective type of the object,
4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
6. a character type.

Example
-------
struct A {
   int x;
   int y;
};
struct B {
   A a;
   int z;
};
struct C {
   B b1;
   B b2;
   int *p;
};

Type DAG:
int <- A::x <- A
int <- A::y <- A <- B::a <- B <- C::b1 <- C
int <----------------- B::z <- B <- C::b2 <- C
any pointer <--------------------- C::stuck_out_tongue: <- C

The type DAG has two types of TBAA nodes:
1> the existing scalar nodes
2> the struct nodes (this is different from the current tbaa.struct)
  A struct node has a unique name plus a list of pairs (field name, field type).
  For example, struct node for "C" should look like
  !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
  where !3 is the struct node for "B", !2 is pointer type.

Given a field access
struct B *bp = ...;
bp->a.x = 5;
we annotate it as B::a.x.

In the case of multiple structures containing substructures, how are
you differentiating?

IE given

struct A {
struct B b;
}
struct C {
struct B b;
}

How do you know the above struct B *bp =...; is B::b from C and not B::b from A?

(I agree you can know in the case of direct aggregates, but I argue
you have no way to know in the case of pointer arguments without
interprocedural analysis)
It gets worse the more levels you have.

ie if you add
struct B {
struct E e;
}

and have struct E *e = ...
how do you know it's the B::e contained in struct C, or the B::e
contained in struct A?

Again, i agree you can do both scalar and direct aggregates, but not
aggregates and scalars through pointers.

I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.

Implementing the Hierarchy
--------------------------
We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
Each tag can be a sequence of nodes in the type DAG.
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]

This can get quite deep quite quickly.
Are there actually cases where knowing the outermost aggregate type +
{byte offset, size} of field access does not give you the exact same
disambiguation capability?

The answer is no. We should be able to deduce the access path from {byte offset, size}.
However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.

Thanks,
Manman

Well, that part is easy, it's just overlap, once you know they are
both contained in the same outermost type.

You have to track what the types are in the frontend to generate the
access path anyway, so you can compute the offsets + sizes into those
types at the same time.

Essentially, it it translating your representation into something that
can be intersected in constant time, unless i'm missing something
about how you are annotating the access paths.

Based on discussions with John McCall

We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.

Fundamental rules from C11
--------------------------
An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
1. a type compatible with the effective type of the object,
2. a qualified version of a type compatible with the effective type of the object,
3. a type that is the signed or unsigned type corresponding to the effective type of the object,
4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
6. a character type.

Example
-------
struct A {
  int x;
  int y;
};
struct B {
  A a;
  int z;
};
struct C {
  B b1;
  B b2;
  int *p;
};

Type DAG:
int <- A::x <- A
int <- A::y <- A <- B::a <- B <- C::b1 <- C
int <----------------- B::z <- B <- C::b2 <- C
any pointer <--------------------- C::stuck_out_tongue: <- C

The type DAG has two types of TBAA nodes:
1> the existing scalar nodes
2> the struct nodes (this is different from the current tbaa.struct)
A struct node has a unique name plus a list of pairs (field name, field type).
For example, struct node for "C" should look like
!4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
where !3 is the struct node for "B", !2 is pointer type.

Given a field access
struct B *bp = ...;
bp->a.x = 5;
we annotate it as B::a.x.

In the case of multiple structures containing substructures, how are
you differentiating?

IE given

struct A {
struct B b;
}
struct C {
struct B b;
}

How do you know the above struct B *bp =...; is B::b from C and not B::b from A?

(I agree you can know in the case of direct aggregates, but I argue
you have no way to know in the case of pointer arguments without
interprocedural analysis)
It gets worse the more levels you have.

ie if you add
struct B {
struct E e;
}

and have struct E *e = ...
how do you know it's the B::e contained in struct C, or the B::e
contained in struct A?

Again, i agree you can do both scalar and direct aggregates, but not
aggregates and scalars through pointers.

I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.

Implementing the Hierarchy
--------------------------
We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
Each tag can be a sequence of nodes in the type DAG.
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]

This can get quite deep quite quickly.
Are there actually cases where knowing the outermost aggregate type +
{byte offset, size} of field access does not give you the exact same
disambiguation capability?

The answer is no. We should be able to deduce the access path from {byte offset, size}.
However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.

Well, that part is easy, it's just overlap, once you know they are
both contained in the same outermost type.

How do you figure out the outermost common type of two access 'x' and 'y'?
Once that is figured out, it should be easy to check alias('x', 'y').

You have to track what the types are in the frontend to generate the
access path anyway, so you can compute the offsets + sizes into those
types at the same time.

Generating the offsets+sizes are not hard.

Essentially, it it translating your representation into something that
can be intersected in constant time, unless i'm missing something
about how you are annotating the access paths.

If we can query alias('x','y') in constant time, irrelevant to the depth of each access, that will be great.

-Manman

Based on discussions with John McCall

We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.

Fundamental rules from C11
--------------------------
An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
1. a type compatible with the effective type of the object,
2. a qualified version of a type compatible with the effective type of the object,
3. a type that is the signed or unsigned type corresponding to the effective type of the object,
4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
6. a character type.

Example
-------
struct A {
  int x;
  int y;
};
struct B {
  A a;
  int z;
};
struct C {
  B b1;
  B b2;
  int *p;
};

Type DAG:
int <- A::x <- A
int <- A::y <- A <- B::a <- B <- C::b1 <- C
int <----------------- B::z <- B <- C::b2 <- C
any pointer <--------------------- C::stuck_out_tongue: <- C

The type DAG has two types of TBAA nodes:
1> the existing scalar nodes
2> the struct nodes (this is different from the current tbaa.struct)
A struct node has a unique name plus a list of pairs (field name, field type).
For example, struct node for "C" should look like
!4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
where !3 is the struct node for "B", !2 is pointer type.

Given a field access
struct B *bp = ...;
bp->a.x = 5;
we annotate it as B::a.x.

In the case of multiple structures containing substructures, how are
you differentiating?

IE given

struct A {
struct B b;
}
struct C {
struct B b;
}

How do you know the above struct B *bp =...; is B::b from C and not B::b from A?

(I agree you can know in the case of direct aggregates, but I argue
you have no way to know in the case of pointer arguments without
interprocedural analysis)
It gets worse the more levels you have.

ie if you add
struct B {
struct E e;
}

and have struct E *e = ...
how do you know it's the B::e contained in struct C, or the B::e
contained in struct A?

Again, i agree you can do both scalar and direct aggregates, but not
aggregates and scalars through pointers.

I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.

Implementing the Hierarchy
--------------------------
We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
Each tag can be a sequence of nodes in the type DAG.
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]

This can get quite deep quite quickly.
Are there actually cases where knowing the outermost aggregate type +
{byte offset, size} of field access does not give you the exact same
disambiguation capability?

The answer is no. We should be able to deduce the access path from {byte offset, size}.
However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.

Well, that part is easy, it's just overlap, once you know they are
both contained in the same outermost type.

How do you figure out the outermost common type of two access 'x' and 'y'?

How are you annotating each access with a path as you go right now?
It should be the same thing to just annotate it then.
Without thinking too hard:
If it's all direct aggregate access in the original source, whatever
you see as the leftmost part should work.

I haven't thought hard about it, but I believe it should wrok.

Note that again, neither works for pointers unless you can guarantee
you have the entire access path, which is hard:

union u {
struct A a;
struct B b;
};

int foo(struct A *a, struct B *b) {
....

}

void bar() {
union u whoops;
foo(&whoops.a, &whoops.b);
}

If you annotated foo with a partial access path, you may or may not
get the right answer.
You would think your type dag solves this, because you'll see they
share a supertype.

But it's actually worse than that:

file: bar.h
union u {
struct A a;
struct B b;
};

int foo(struct A *, struct B*);

file: foo.c

int foo(struct A *a, struct B *b) {
....

}

file: bar.c

#include "bar.h"

void bar() {
union u whoops;
foo(&whoops.a, &whoops.b);
}

You will never see compiling foo.c that tells you both types are in a
union together, or how to annotate these access paths.

Hence the "you need to punt on pointers without more analysis" claim I made :slight_smile:

Based on discussions with John McCall

We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.

Fundamental rules from C11
--------------------------
An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
1. a type compatible with the effective type of the object,
2. a qualified version of a type compatible with the effective type of the object,
3. a type that is the signed or unsigned type corresponding to the effective type of the object,
4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
6. a character type.

Example
-------
struct A {
int x;
int y;
};
struct B {
A a;
int z;
};
struct C {
B b1;
B b2;
int *p;
};

Type DAG:
int <- A::x <- A
int <- A::y <- A <- B::a <- B <- C::b1 <- C
int <----------------- B::z <- B <- C::b2 <- C
any pointer <--------------------- C::stuck_out_tongue: <- C

The type DAG has two types of TBAA nodes:
1> the existing scalar nodes
2> the struct nodes (this is different from the current tbaa.struct)
A struct node has a unique name plus a list of pairs (field name, field type).
For example, struct node for "C" should look like
!4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
where !3 is the struct node for "B", !2 is pointer type.

Given a field access
struct B *bp = ...;
bp->a.x = 5;
we annotate it as B::a.x.

In the case of multiple structures containing substructures, how are
you differentiating?

IE given

struct A {
struct B b;
}
struct C {
struct B b;
}

How do you know the above struct B *bp =...; is B::b from C and not B::b from A?

(I agree you can know in the case of direct aggregates, but I argue
you have no way to know in the case of pointer arguments without
interprocedural analysis)
It gets worse the more levels you have.

ie if you add
struct B {
struct E e;
}

and have struct E *e = ...
how do you know it's the B::e contained in struct C, or the B::e
contained in struct A?

Again, i agree you can do both scalar and direct aggregates, but not
aggregates and scalars through pointers.

I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.

Implementing the Hierarchy
--------------------------
We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
Each tag can be a sequence of nodes in the type DAG.
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]

This can get quite deep quite quickly.
Are there actually cases where knowing the outermost aggregate type +
{byte offset, size} of field access does not give you the exact same
disambiguation capability?

The answer is no. We should be able to deduce the access path from {byte offset, size}.
However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.

Well, that part is easy, it's just overlap, once you know they are
both contained in the same outermost type.

How do you figure out the outermost common type of two access 'x' and 'y'?

How are you annotating each access with a path as you go right now?
It should be the same thing to just annotate it then.

I thought your suggestion was to replace
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
with
!C::b2.a.x := [ "tbaa.offset", !C, offset within C, size ]

!B::a := [ "tbaa.path", !B, "B:a", !A ]
with
!B::a := [ "tbaa.offset", !B, offset within B, size]

Then we already lost the access path at IR level.

Without thinking too hard:
If it's all direct aggregate access in the original source, whatever
you see as the leftmost part should work.

I haven't thought hard about it, but I believe it should work.

At llvm IR level, it is not that straight-forward to figure out the outermost common type of C::b2.a.x and B::a with tbaa.offset.
To me, the outermost common type should be "B", then we need to check the relative offset from B for both accesses.

-Manman

Based on discussions with John McCall

We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.

Fundamental rules from C11
--------------------------
An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
1. a type compatible with the effective type of the object,
2. a qualified version of a type compatible with the effective type of the object,
3. a type that is the signed or unsigned type corresponding to the effective type of the object,
4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
6. a character type.

Example
-------
struct A {
int x;
int y;
};
struct B {
A a;
int z;
};
struct C {
B b1;
B b2;
int *p;
};

Type DAG:
int <- A::x <- A
int <- A::y <- A <- B::a <- B <- C::b1 <- C
int <----------------- B::z <- B <- C::b2 <- C
any pointer <--------------------- C::stuck_out_tongue: <- C

The type DAG has two types of TBAA nodes:
1> the existing scalar nodes
2> the struct nodes (this is different from the current tbaa.struct)
A struct node has a unique name plus a list of pairs (field name, field type).
For example, struct node for "C" should look like
!4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
where !3 is the struct node for "B", !2 is pointer type.

Given a field access
struct B *bp = ...;
bp->a.x = 5;
we annotate it as B::a.x.

In the case of multiple structures containing substructures, how are
you differentiating?

IE given

struct A {
struct B b;
}
struct C {
struct B b;
}

How do you know the above struct B *bp =...; is B::b from C and not B::b from A?

(I agree you can know in the case of direct aggregates, but I argue
you have no way to know in the case of pointer arguments without
interprocedural analysis)
It gets worse the more levels you have.

ie if you add
struct B {
struct E e;
}

and have struct E *e = ...
how do you know it's the B::e contained in struct C, or the B::e
contained in struct A?

Again, i agree you can do both scalar and direct aggregates, but not
aggregates and scalars through pointers.

I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.

Implementing the Hierarchy
--------------------------
We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
Each tag can be a sequence of nodes in the type DAG.
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]

This can get quite deep quite quickly.
Are there actually cases where knowing the outermost aggregate type +
{byte offset, size} of field access does not give you the exact same
disambiguation capability?

The answer is no. We should be able to deduce the access path from {byte offset, size}.
However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.

Well, that part is easy, it's just overlap, once you know they are
both contained in the same outermost type.

How do you figure out the outermost common type of two access 'x' and 'y'?

How are you annotating each access with a path as you go right now?
It should be the same thing to just annotate it then.

I thought your suggestion was to replace
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
with
!C::b2.a.x := [ "tbaa.offset", !C, offset within C, size ]

!B::a := [ "tbaa.path", !B, "B:a", !A ]
with
!B::a := [ "tbaa.offset", !B, offset within B, size]

Yes, it is.

Then we already lost the access path at IR level.

But you are *generating* the above metadata at the clang level, no?
That is were I meant you should annotate them as you go.

Without thinking too hard:
If it's all direct aggregate access in the original source, whatever
you see as the leftmost part should work.

I haven't thought hard about it, but I believe it should work.

At llvm IR level, it is not that straight-forward to figure out the outermost common type of C::b2.a.x and B::a with tbaa.offset.

The metadata should already have the outermost common type when it's
output, so it should just be there at the LLVM level.

To me, the outermost common type should be "B", then we need to check the relative offset from B for both accesses

Agreed.

What cases does this proposal solve that the current analyses don't? Do you have a motivating example?

-Krzysztof

Based on discussions with John McCall

We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.

Fundamental rules from C11
--------------------------
An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
1. a type compatible with the effective type of the object,
2. a qualified version of a type compatible with the effective type of the object,
3. a type that is the signed or unsigned type corresponding to the effective type of the object,
4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
6. a character type.

Example
-------
struct A {
int x;
int y;
};
struct B {
A a;
int z;
};
struct C {
B b1;
B b2;
int *p;
};

Type DAG:
int <- A::x <- A
int <- A::y <- A <- B::a <- B <- C::b1 <- C
int <----------------- B::z <- B <- C::b2 <- C
any pointer <--------------------- C::stuck_out_tongue: <- C

The type DAG has two types of TBAA nodes:
1> the existing scalar nodes
2> the struct nodes (this is different from the current tbaa.struct)
A struct node has a unique name plus a list of pairs (field name, field type).
For example, struct node for "C" should look like
!4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
where !3 is the struct node for "B", !2 is pointer type.

Given a field access
struct B *bp = ...;
bp->a.x = 5;
we annotate it as B::a.x.

In the case of multiple structures containing substructures, how are
you differentiating?

IE given

struct A {
struct B b;
}
struct C {
struct B b;
}

How do you know the above struct B *bp =...; is B::b from C and not B::b from A?

(I agree you can know in the case of direct aggregates, but I argue
you have no way to know in the case of pointer arguments without
interprocedural analysis)
It gets worse the more levels you have.

ie if you add
struct B {
struct E e;
}

and have struct E *e = ...
how do you know it's the B::e contained in struct C, or the B::e
contained in struct A?

Again, i agree you can do both scalar and direct aggregates, but not
aggregates and scalars through pointers.

I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.

Implementing the Hierarchy
--------------------------
We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
Each tag can be a sequence of nodes in the type DAG.
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]

This can get quite deep quite quickly.
Are there actually cases where knowing the outermost aggregate type +
{byte offset, size} of field access does not give you the exact same
disambiguation capability?

The answer is no. We should be able to deduce the access path from {byte offset, size}.
However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.

Well, that part is easy, it's just overlap, once you know they are
both contained in the same outermost type.

How do you figure out the outermost common type of two access 'x' and 'y'?

How are you annotating each access with a path as you go right now?
It should be the same thing to just annotate it then.
Without thinking too hard:
If it's all direct aggregate access in the original source, whatever
you see as the leftmost part should work.

I haven't thought hard about it, but I believe it should wrok.

Note that again, neither works for pointers unless you can guarantee
you have the entire access path, which is hard:

union u {
struct A a;
struct B b;
};

int foo(struct A *a, struct B *b) {
....

}

void bar() {
union u whoops;
foo(&whoops.a, &whoops.b);
}

We should have the same issue with the current TBAA if replacing struct A with int and struct B with float.
Inside foo, we will assume access to *a will not alias with access to *b.
This issue is specific to TBAA, not struct-access-path aware TBAA.
Correct me if I am wrong :]

Manman

Based on discussions with John McCall

We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.

Fundamental rules from C11
--------------------------
An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
1. a type compatible with the effective type of the object,
2. a qualified version of a type compatible with the effective type of the object,
3. a type that is the signed or unsigned type corresponding to the effective type of the object,
4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
6. a character type.

Example
-------
struct A {
int x;
int y;
};
struct B {
A a;
int z;
};
struct C {
B b1;
B b2;
int *p;
};

Type DAG:
int <- A::x <- A
int <- A::y <- A <- B::a <- B <- C::b1 <- C
int <----------------- B::z <- B <- C::b2 <- C
any pointer <--------------------- C::stuck_out_tongue: <- C

The type DAG has two types of TBAA nodes:
1> the existing scalar nodes
2> the struct nodes (this is different from the current tbaa.struct)
A struct node has a unique name plus a list of pairs (field name, field type).
For example, struct node for "C" should look like
!4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
where !3 is the struct node for "B", !2 is pointer type.

Given a field access
struct B *bp = ...;
bp->a.x = 5;
we annotate it as B::a.x.

In the case of multiple structures containing substructures, how are
you differentiating?

IE given

struct A {
struct B b;
}
struct C {
struct B b;
}

How do you know the above struct B *bp =...; is B::b from C and not B::b from A?

(I agree you can know in the case of direct aggregates, but I argue
you have no way to know in the case of pointer arguments without
interprocedural analysis)
It gets worse the more levels you have.

ie if you add
struct B {
struct E e;
}

and have struct E *e = ...
how do you know it's the B::e contained in struct C, or the B::e
contained in struct A?

Again, i agree you can do both scalar and direct aggregates, but not
aggregates and scalars through pointers.

I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.

Implementing the Hierarchy
--------------------------
We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
Each tag can be a sequence of nodes in the type DAG.
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]

This can get quite deep quite quickly.
Are there actually cases where knowing the outermost aggregate type +
{byte offset, size} of field access does not give you the exact same
disambiguation capability?

The answer is no. We should be able to deduce the access path from {byte offset, size}.
However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.

Well, that part is easy, it's just overlap, once you know they are
both contained in the same outermost type.

How do you figure out the outermost common type of two access 'x' and 'y'?

How are you annotating each access with a path as you go right now?
It should be the same thing to just annotate it then.

I thought your suggestion was to replace
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
with
!C::b2.a.x := [ "tbaa.offset", !C, offset within C, size ]

!B::a := [ "tbaa.path", !B, "B:a", !A ]
with
!B::a := [ "tbaa.offset", !B, offset within B, size]

Yes, it is.

Then we already lost the access path at IR level.

But you are *generating* the above metadata at the clang level, no?
That is were I meant you should annotate them as you go.

I may not get what you are saying.
But if we don't keep it in the LLVM IR, then the information is not available at the IR level.
For alias query at IR level, we have to somehow regenerate the information if possible.

Without thinking too hard:
If it's all direct aggregate access in the original source, whatever
you see as the leftmost part should work.

I haven't thought hard about it, but I believe it should work.

At llvm IR level, it is not that straight-forward to figure out the outermost common type of C::b2.a.x and B::a with tbaa.offset.

The metadata should already have the outermost common type when it's
output, so it should just be there at the LLVM level.

The outermost common type is between two accesses, and the metadata is attached to each access.
I don't quite get why the metadata should have the outermost common type.

Thanks,
Manman

What cases does this proposal solve that the current analyses don't? Do you have a motivating example?

Given
  struct A {
    int x;
    int y;
  };
  struct B {
    A a;
    int z;
  };
  struct C {
    B b1;
    B b2;
  };
  struct D {
    C c;
  };

with struct-access-path aware TBAA, C::b1.a.x does not alias with D::c.b2.a.x.
without it, the 2 scalar accesses can alias since both have int type.

Manman

Based on discussions with John McCall

We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.

Fundamental rules from C11
--------------------------
An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
1. a type compatible with the effective type of the object,
2. a qualified version of a type compatible with the effective type of the object,
3. a type that is the signed or unsigned type corresponding to the effective type of the object,
4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
6. a character type.

Example
-------
struct A {
int x;
int y;
};
struct B {
A a;
int z;
};
struct C {
B b1;
B b2;
int *p;
};

Type DAG:
int <- A::x <- A
int <- A::y <- A <- B::a <- B <- C::b1 <- C
int <----------------- B::z <- B <- C::b2 <- C
any pointer <--------------------- C::stuck_out_tongue: <- C

The type DAG has two types of TBAA nodes:
1> the existing scalar nodes
2> the struct nodes (this is different from the current tbaa.struct)
A struct node has a unique name plus a list of pairs (field name, field type).
For example, struct node for "C" should look like
!4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
where !3 is the struct node for "B", !2 is pointer type.

Given a field access
struct B *bp = ...;
bp->a.x = 5;
we annotate it as B::a.x.

In the case of multiple structures containing substructures, how are
you differentiating?

IE given

struct A {
struct B b;
}
struct C {
struct B b;
}

How do you know the above struct B *bp =...; is B::b from C and not B::b from A?

(I agree you can know in the case of direct aggregates, but I argue
you have no way to know in the case of pointer arguments without
interprocedural analysis)
It gets worse the more levels you have.

ie if you add
struct B {
struct E e;
}

and have struct E *e = ...
how do you know it's the B::e contained in struct C, or the B::e
contained in struct A?

Again, i agree you can do both scalar and direct aggregates, but not
aggregates and scalars through pointers.

I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.

Implementing the Hierarchy
--------------------------
We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
Each tag can be a sequence of nodes in the type DAG.
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]

This can get quite deep quite quickly.
Are there actually cases where knowing the outermost aggregate type +
{byte offset, size} of field access does not give you the exact same
disambiguation capability?

The answer is no. We should be able to deduce the access path from {byte offset, size}.
However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.

Well, that part is easy, it's just overlap, once you know they are
both contained in the same outermost type.

How do you figure out the outermost common type of two access 'x' and 'y'?

How are you annotating each access with a path as you go right now?
It should be the same thing to just annotate it then.

I thought your suggestion was to replace
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
with
!C::b2.a.x := [ "tbaa.offset", !C, offset within C, size ]

!B::a := [ "tbaa.path", !B, "B:a", !A ]
with
!B::a := [ "tbaa.offset", !B, offset within B, size]

Yes, it is.

Then we already lost the access path at IR level.

But you are *generating* the above metadata at the clang level, no?
That is were I meant you should annotate them as you go.

I may not get what you are saying.
But if we don't keep it in the LLVM IR, then the information is not available at the IR level.

I agree, but it's not being regenerated from the LLVM IR, it's being
generated *into* the LLVM IR by clang or something.
I'm saying you can generate the offset, size info there as well.

In the interest of trying to not talk past each other, let's start at
the beginning a bit

1. When in the compilation process were you planning on generating the
tbaa.struct metadata
2. How were you planning on generating it (IE given a bunch of source
code or clang AST's, what was the algorithm being used to generate
it)?

For alias query at IR level, we have to somehow regenerate the information if possible.

Without thinking too hard:
If it's all direct aggregate access in the original source, whatever
you see as the leftmost part should work.

I haven't thought hard about it, but I believe it should work.

At llvm IR level, it is not that straight-forward to figure out the outermost common type of C::b2.a.x and B::a with tbaa.offset.

The metadata should already have the outermost common type when it's
output, so it should just be there at the LLVM level.

The outermost common type is between two accesses, and the metadata is attached to each access.
I don't quite get why the metadata should have the outermost common type.

You need some identifier to differentiate what it is an offset + size
into. You don't actually need the type, just a name.
The same way you have identifiers named "C" so that you can tell that
"C", "B", "B!a.x" is not the same path as "D", "B", "B!a.x".

You need to do the same thing here, so that if you have "C", "{0, 4}"
and "D", "{0, 4"}, you know that they don't alias because C != D.

Type-based alias analysis is intrinsically unsafe. I don't think it make big sense to see the whole program
being compiled and stitch together all type-include-graph together just for this optimization. We would otherwise
have to implement this optimization in interprocedural analysis phase (with whole program mode).

That said, if would be nice if front-end could figure out if a type is only accessiable inside a translation
unit; TBAA can take advantage this info to make it is unsafer.

Based on discussions with John McCall

We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.

Fundamental rules from C11
--------------------------
An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
1. a type compatible with the effective type of the object,
2. a qualified version of a type compatible with the effective type of the object,
3. a type that is the signed or unsigned type corresponding to the effective type of the object,
4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
6. a character type.

Example
-------
   struct A {
     int x;
     int y;
   };
   struct B {
     A a;
     int z;
   };
   struct C {
     B b1;
     B b2;
     int *p;
   };

Type DAG:
   int <- A::x <- A
   int <- A::y <- A <- B::a <- B <- C::b1 <- C
   int <----------------- B::z <- B <- C::b2 <- C
   any pointer <--------------------- C::stuck_out_tongue: <- C

The type DAG has two types of TBAA nodes:
1> the existing scalar nodes
2> the struct nodes (this is different from the current tbaa.struct)
    A struct node has a unique name plus a list of pairs (field name, field type).
    For example, struct node for "C" should look like
    !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
    where !3 is the struct node for "B", !2 is pointer type.

Given a field access
   struct B *bp = ...;
   bp->a.x = 5;
we annotate it as B::a.x.

In the case of multiple structures containing substructures, how are
you differentiating?

IE given

struct A {
struct B b;
}
struct C {
struct B b;
}

How do you know the above struct B *bp =...; is B::b from C and not B::b from A?

If I understand correct, the proposed graph is DAG, not tree, and it should be able to tackling the case
where a type is included more than once.

On the other hand, the info which is annotated to the memory access is kind of "immediate enclosing aggregate type",
which should be unique.

(I agree you can know in the case of direct aggregates, but I argue
you have no way to know in the case of pointer arguments without
interprocedural analysis)
It gets worse the more levels you have.

ie if you add
struct B {
struct E e;
}

and have struct E *e = ...
how do you know it's the B::e contained in struct C, or the B::e
contained in struct A?

Again, i agree you can do both scalar and direct aggregates, but not
aggregates and scalars through pointers.

That is exactly the power of TBAA. If the both memory accesses are direct load/store,
it dose not even need TBAA, check their base/offset/size is sufficient for disambiguation.

Based on discussions with John McCall

We currently focus on field accesses of structs, more specifically, on
fields that are scalars or structs.

Fundamental rules from C11
--------------------------
An object shall have its stored value accessed only by an lvalue
expression that has one of the following types: [footnote: The intent of
this list is to specify those circumstances in which an object may or may
not be aliased.]
1. a type compatible with the effective type of the object,
2. a qualified version of a type compatible with the effective type of
the object,
3. a type that is the signed or unsigned type corresponding to the
effective type of the object,
4. a type that is the signed or unsigned type corresponding to a
qualified version of the effective type of the object,
5. an aggregate or union type that includes one of the aforementioned
types among its members (including, recursively, a member of a subaggregate
or contained union), or
6. a character type.

Example
-------
   struct A {
     int x;
     int y;
   };
   struct B {
     A a;
     int z;
   };
   struct C {
     B b1;
     B b2;
     int *p;
   };

Type DAG:
   int <- A::x <- A
   int <- A::y <- A <- B::a <- B <- C::b1 <- C
   int <----------------- B::z <- B <- C::b2 <- C
   any pointer <--------------------- C::stuck_out_tongue: <- C

The type DAG has two types of TBAA nodes:
1> the existing scalar nodes
2> the struct nodes (this is different from the current tbaa.struct)
    A struct node has a unique name plus a list of pairs (field name,
field type).
    For example, struct node for "C" should look like
    !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3,
"C::p", metadata !2}
    where !3 is the struct node for "B", !2 is pointer type.

Given a field access
   struct B *bp = ...;
   bp->a.x = 5;
we annotate it as B::a.x.

In the case of multiple structures containing substructures, how are
you differentiating?

IE given

struct A {
struct B b;
}
struct C {
struct B b;
}

How do you know the above struct B *bp =...; is B::b from C and not B::b
from A?

If I understand correct, the proposed graph is DAG, not tree, and it should
be able to tackling the case
where a type is included more than once.

This was not clear from the proposal, whether the type hierarchy is
duplicated all the way down as necessary, or attempted-to-be-shared by
having types with multiple parents.

On the other hand, the info which is annotated to the memory access is kind
of "immediate enclosing aggregate type",
which should be unique.

(I agree you can know in the case of direct aggregates, but I argue
you have no way to know in the case of pointer arguments without
interprocedural analysis)
It gets worse the more levels you have.

ie if you add
struct B {
struct E e;
}

and have struct E *e = ...
how do you know it's the B::e contained in struct C, or the B::e
contained in struct A?

Again, i agree you can do both scalar and direct aggregates, but not
aggregates and scalars through pointers.

That is exactly the power of TBAA.

I understand quite a bit about TBAA :slight_smile:

My point was that his check will produce the wrong TBAA answer because
it will have an incomplete path in his representation. He will
declare no-alias, yet it it will be may-alias under the actual TBAA
rules.

If the both memory accesses are direct
load/store,
it dose not even need TBAA, check their base/offset/size is sufficient for
disambiguation.

False, in a lot of cases. For starters, roughly any union including
aggregate, you will have overlaps that can be disambiguated using TBAA
but not otherwise.

Maybe I were not clear in my previous mail, or maybe we are talking about different topics.

My point was, if the pair of memory accesses are direct access, in most cases, the base/offset/size
should be able to sufficient to figure out if they are alias or not. A tricky case would be
subscripped variables, like a.b[i][j] vs a.c[m][n] where a.c and b.c are enclosed in a common union.
  If mem-dep test failure to figure out if there is dependence, in that case, TBAA might help.

  It seems what we discussing here is somewhat deviating the original topic:-)

Based on discussions with John McCall

We currently focus on field accesses of structs, more specifically, on fields that are scalars or structs.

Fundamental rules from C11
--------------------------
An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
1. a type compatible with the effective type of the object,
2. a qualified version of a type compatible with the effective type of the object,
3. a type that is the signed or unsigned type corresponding to the effective type of the object,
4. a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
5. an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
6. a character type.

Example
-------
struct A {
int x;
int y;
};
struct B {
A a;
int z;
};
struct C {
B b1;
B b2;
int *p;
};

Type DAG:
int <- A::x <- A
int <- A::y <- A <- B::a <- B <- C::b1 <- C
int <----------------- B::z <- B <- C::b2 <- C
any pointer <--------------------- C::stuck_out_tongue: <- C

The type DAG has two types of TBAA nodes:
1> the existing scalar nodes
2> the struct nodes (this is different from the current tbaa.struct)
A struct node has a unique name plus a list of pairs (field name, field type).
For example, struct node for "C" should look like
!4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
where !3 is the struct node for "B", !2 is pointer type.

Given a field access
struct B *bp = ...;
bp->a.x = 5;
we annotate it as B::a.x.

In the case of multiple structures containing substructures, how are
you differentiating?

IE given

struct A {
struct B b;
}
struct C {
struct B b;
}

How do you know the above struct B *bp =...; is B::b from C and not B::b from A?

(I agree you can know in the case of direct aggregates, but I argue
you have no way to know in the case of pointer arguments without
interprocedural analysis)
It gets worse the more levels you have.

ie if you add
struct B {
struct E e;
}

and have struct E *e = ...
how do you know it's the B::e contained in struct C, or the B::e
contained in struct A?

Again, i agree you can do both scalar and direct aggregates, but not
aggregates and scalars through pointers.

I don't have immediate plan of pointer analysis. For the given example, we will treat field accesses from bp (pointer to struct B) as B::x.x, bp can be from either C or A.

Implementing the Hierarchy
--------------------------
We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.
Each tag can be a sequence of nodes in the type DAG.
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]

This can get quite deep quite quickly.
Are there actually cases where knowing the outermost aggregate type +
{byte offset, size} of field access does not give you the exact same
disambiguation capability?

The answer is no. We should be able to deduce the access path from {byte offset, size}.
However, I don't know an easy way to check alias(x,y) given {byte offset, size} for x and y.

Well, that part is easy, it's just overlap, once you know they are
both contained in the same outermost type.

How do you figure out the outermost common type of two access 'x' and 'y'?

How are you annotating each access with a path as you go right now?
It should be the same thing to just annotate it then.

I thought your suggestion was to replace
!C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
with
!C::b2.a.x := [ "tbaa.offset", !C, offset within C, size ]

!B::a := [ "tbaa.path", !B, "B:a", !A ]
with
!B::a := [ "tbaa.offset", !B, offset within B, size]

Yes, it is.

Then we already lost the access path at IR level.

But you are *generating* the above metadata at the clang level, no?
That is were I meant you should annotate them as you go.

I may not get what you are saying.
But if we don't keep it in the LLVM IR, then the information is not available at the IR level.

I agree, but it's not being regenerated from the LLVM IR, it's being
generated *into* the LLVM IR by clang or something.
I'm saying you can generate the offset, size info there as well.

I thought your suggestion was to replace tbaa.path because it "can get quite deep quite quickly", but here you are saying as well.
So clarification: you are suggesting to add {offset, size} into metadata tbaa.path, right :slight_smile:

I am going to add a few examples here:
Given
struct A {
   int x;
   int y;
};
struct B {
   A a;
   int z;
};
struct C {
   B b1;
   B b2;
   int *p;
};
struct D {
   C c;
};

with the proposed struct-access-path aware TBAA, we will say "C::b1.a" will alias with "B::a.x", "C::b1.a" will alias with "B",
"C::b1.a" does not alias with "D::c.b2.a.x".

The proposal is about the format of metadata in IR and how to implement alias queries in IR:

metadata for the type DAG

  1> the existing scalar nodes
  2> the struct nodes (this is different from the current tbaa.struct)
  A struct node has a unique name plus a list of pairs (field name, field type).
  For example, struct node for "C" should look like
  !4 = metadata !{"C", "C::b1", metadata !3, "C::b2", metadata !3, "C::p", metadata !2}
  where !3 is the struct node for "B", !2 is pointer type.

  An example type DAG:
  int <- "A::x" <- A
  int <- "A::y" <- A <- "B::a" <- B <- "C::b1" <- C
  int <------------------- "B::z" <- B <- "C::b2" <- C
  any pointer <-------------------------- "C::p" <- C

We can attach metadata to both scalar accesses and aggregate accesses. Let's call scalar tags and aggregate tags.

  Each tag can be a sequence of nodes in the type DAG.
  !C::b2.a.x := [ "tbaa.path", !C, "C::b2", !B, "B::a", !A, "A::x", !int ]
  where !C, !B, !A are metadata nodes in the type DAG, "C::b2" "B::a" "A::x" are strings.

I also had some discussion with Shuxin, we can start by a conservative and simpler implementation:

metadata for the type DAG

  1> the existing scalar nodes
  2> the struct nodes
  A struct node has a unique name plus a list of enclosed types
  For example, struct node for "C" should look like
  !4 = metadata !{"C", metadata !3, metadata !2}

  An example type DAG:
  int <------------- A <- B <-- C
  int <-------------------- B
  any pointer <----------- <- C

Each tag will have the last field access.

  !A.x := [ "tbaa.path", !A, "A::x", !int ] will be attached to field access C::b2.a.x
  The actual access has extra contextual information than the tag and the tag is more conservative than the actual access.
  We can increase number of accesses in the tag to get more accurate alias result.

-Manman