I'm new to the llvm community. I'm learning how things work. I noticed
that there has been some interest in improving how unions are handled. Bug
21725 is one example. I figured it might be a interesting place to start.
I discussed this with a couple people, and below is a suggestion on how to
represent unions. I would like some comments on how this fits in with how
things are done in llvm.
Later,
Steven Perron
Motivation
The main motivation for this is functional correctness of code using unions.
See Invalid Bug ID one example. The problem
is
that the basic alias analysis (or any other alias analysis) will not always
have
enough information to be able see that the two references definitely
overlap.
When this happens, the lack of alias information for unions could allow
undesired optimizations.
Current state
For this RFC, we are interested in how the type based aliasing is
represented
for non-scalar data types. In C/C++, this will be arrays, structs, classes,
and
unions. We will not mention classes again because for this purpose they are
the
exact same as structs.
We will start with structs. To help with the aliasing of structs, There is
a
format for the type based aliasing meta data known as the struct-path. Here
is
example IR:
%0 = load i32, i32* getelementptr inbounds (%struct.S, %struct.S* @s, i32
0, %i32 2), align 4, !tbaa !2
%1 = load i32, i32* getelementptr inbounds (%struct.S, %struct.S* @s, i32
0, %i32 3), align 4, !tbaa !7
...
!2 = !{!3, !6, i64 800}
!3 = !{!"S", !4, i64 0, !4, i64 400, !6, i64 800, !6, i64 804}
!4 = !{!"omnipotent char", !5, i64 0}
!5 = !{!"Simple C/C++ TBAA"}
!6 = !{!"int", !4, i64 0}
!7 = !{!3, !6, i64 804}
The two loads are loads of different int members of the same struct. The
type
based aliasing gives them different meta data tags (!2 and !7). This allows
the
analysis to determine that they are different solely from the type based
aliasing information. Looking at !2, it points to !3 with offset 800. This
offset is used to identify which member of the struct is being accessed.
Next we will consider arrays. When there is an array access, the type based
aliasing views it as an access of the type of the element. In general, this
is
not a problem. For C/C++, there is no real difference between an array
reference and a dereference of a pointer. However, this causes a lose of
information when accessing arrays that are part of a struct or union.
Here is an example:
%arrayidx = getelementptr inbounds [100 x i32], [100 x i32]* getelementptr
%inbounds (%struct.S, %struct.S* @s, i32 0, i32 0), i64 0, i64 %idxprom
%1 = load i32, i32* %arrayidx, align 4, !tbaa !1
...
!1 = !{!2, !2, i64 0}
!2 = !{!"int", !3, i64 0}
This is an access to an array of integers in a struct of type S. Note that
the
type on the load is "int". The type bases alias information has lost that
this
is a reference to a member of a struct.
We finish by considering unions. Like arrays, an access to a member of a
union
is treated as if is it type of the type of the member being accessed. Here
is
an example:
%4 = load i32, i32* getelementptr inbounds (%union.U, %union.U* @u, i32 0,
i32 %0), align 4, !tbaa !2
...
!2 = !{!3, !3, i64 0}
!3 = !{!"int", !4, i64 0}
Here we lose the information that the reference is part of a union, so we
cannot
tell that this memory may overlap with something of a different type. Clang
currently relies on other aliasing analysis like the basic alias analysis to
observer that the two references are aliased.
Proposed changes for arrays
To deal with array references, they should be treated as if all array
accesses
are accesses to the first member of the array. For the purpose of type
based
aliasing, this should be good enough. The offset in the tag would be the
offset of the start of the array plus, if applicable, the offset needed to
access the member of the array element as if the element was a top-level
struct
member at offset 0. Note that the values for the indexing of the array is
not
included. Thus, accesses to an element of the last dimension of a
multidimensional array appears the same as an access to a flattened version
of
the multidimensional array. Then, in the struct path, the array member's
offset
will be the starting offset of the array, as it is now. What will change in
the
struct path is the type node for the array. The type node will become the
type
of the array element instead of "omnipotent char", as it is now.
This change is a must to be able to get union aliasing correct because we
need
to know when an array is part of a union that may overlap with other
members.
Proposed changes for unions
To properly deal with unions, there are a few issues that need to be dealt
with.
1) Unions have to somehow be represented in the TBAA. As was pointed out
above,
this information is currently lost.
2) With the struct path TBAA, it is assumed that the length of a member
extends
to the start of the next. This will not be true for unions.
3) With the current struct path TBAA, you can determine which member is
reference by looking at the offset. That is not true with unions because
every
member starts at offset 0. We will need another way of doing this for
unions.
Any solution we come up with will have to get around these three problems.
To deal with unions, we will use the change in bug 21725 (see link above).
This
will treat a union as if it is a struct. However, the important change to
make
the unions work is that the length of the memory reference will have to be
taken
into consideration when looking at the type based aliasing. This should be
It was not clear how you'll take the length into consideration. Are
you talking about inferences like "I know the access was 4 bytes long,
so it can't be accessing the `short` field"? If so, we'd need to
verify (and maintain) that no passes ask AA about a subsegment of a
memory access (i.e. "does that store alias the first two bytes of this
i32 load"); or at least prevent them from using TBAA when they refine
their queries this way.