Alias analysis results

Hello,

I'm trying to get the whole picture of what we are doing in terms
of alias analysis in LLVM. Being new to this I may be missing
some well known things so my apologies if I seem to ask obvious
things.

There are lots of questions arise in my head as I read related
bug reports, sources and documentation, of which looks to be the
most simple is,

Why the current TBAA implementation returns NoAlias for pairs of
accesses for which it has no idea if they really alias?

Let me elaborate on this. My understanding of what the TBAA
implementation is supposed to respond is whether a given pair of
accesses is allowed to alias by the rules of the input language.
(Which in turn leads me to another question, that is, whether
we are really trying to perform C/C++ alias analysis in a
[language-neutral] codegen pass, but let's postpone this for
later.)

This consideration assumes that what the TBAA implementation
actually returns as a result is orthogonal to the
Must/No/MayAlias set, that used to be the result type of various
AA, including TBAA. I would expect that the result of the
complete alias analysis includes both the information on whether
given accesses alias and, as a separate element, whether they are
allowed to alias by the rules of the language. Then we may have
combinations like (MustAlias, AllowedAlias) that seem to be the
common case and combinations like (MustAlias, NotAllowedAlias)
that I would expect to a) generate the breaks-alias-rules kind of
warnings and b) proceed further as any other MustAlias case.

The latter combination is what I would expect for illegal type
puns, for whatever definition of "illegal". Note that for such
cases we currently get MustAlias from BasicAA and NoAlias from
TBAA. In some comments this is considered okay whereas I have to
admit that I don't understand how that may possibly be okay,
meaning it's hard to imagine how a pair of accesses can be
aliasing and non-aliasing at the same time. To me it appears
like that the root cause if this inconsistency is not that any of
these analyses is not correct, but that we are trying to
represent their responses in an inadequate form and we are trying
to use BasicAA as a faster replacement for TBAA in simple cases.

Any thoughts are highly appreciated.

Thanks,

+hal

Hi,

I’m not an expert on all of our AA bits, but I’ve been in the area some. If I’m wrong, I’m sure someone will jump in and we’ll both end up learning something. :slight_smile:

Why the current TBAA implementation returns NoAlias for pairs of
accesses for which it has no idea if they really alias?

For the same reason that LLVM assumes many other things it can’t prove. The language says “if you do ${x}, the behavior is undefined,” so optimizing on the assumption that the user isn’t doing ${x} is OK.

That said, optimizing on aliasing assumptions has been contentious in the past, so frontends that emit TBAA data (e.g. clang) generally support flags like -fno-strict-aliasing to relax this behavior.

that I would expect to a) generate […] warnings and […]

FWIW, it’s generally nontrivial to emit helpful warnings from LLVM. This is because LLVM doesn’t have the AST that the IR was generated from, so it has no source-level information to speak of (except the bits we encode in debuginfo, but that doesn’t exactly make for great diags).

In other words, a warning emitted by LLVM would probably be as helpful as “hey, somewhere in the function ‘_Z3fooi’, or one of the functions that we happened to inline into it, we found some pretty obvious type punning.”

it’s hard to imagine how a pair of accesses can be
aliasing and non-aliasing at the same time

As you’ve noted, this is an artifact of taking two logically independent pieces of information and merging them into one. While this might not be ideal from a theoretical point of view, I don’t see why an optimization would care about the difference between “proven NoAlias” and “proven illegal to alias.” In either case, it’s allowed to optimize based on the assumption that those things don’t alias. So, from a practical standpoint, I don’t understand what making that distinction would buy us.

we are trying to use BasicAA as a faster replacement for
TBAA in simple cases

AIUI, the idea of putting BasicAA in front of TBAA isn’t an efficiency hack: its purpose is to catch “obvious” type punning, and handle it gracefully for the user (your ‘b)’ above).

You might find Hal’s recent work on TySan (https://reviews.llvm.org/D32197) interesting; it’s a sanitizer that tries to catch and complain about type punning at run-time. :slight_smile:

George

(MustAlias, NotAllowedAlias)
The latter combination is what I would expect for illegal type
puns, for whatever definition of “illegal”. Note that for such
cases we currently get MustAlias from BasicAA and NoAlias from
TBAA. In some comments this is considered okay whereas I have to
admit that I don’t understand how that may possibly be okay,
meaning it’s hard to imagine how a pair of accesses can be
aliasing and non-aliasing at the same time. To me it appears
like that the root cause if this inconsistency is not that any of
these analyses is not correct, but that we are trying to
represent their responses in an inadequate form

There was another thread on how broken the handling of C/C++ union is with TBAA. TBAA does not model unions and therefore it used to return NoAlias between union member accesses which was the cause for few misoptimizations. I believe the hotfix for this was put into clang, disabling tbaa metadata for unions.

Kevin

Hello,

I'm trying to get the whole picture of what we are doing in terms
of alias analysis in LLVM. Being new to this I may be missing
some well known things so my apologies if I seem to ask obvious
things.

There are lots of questions arise in my head as I read related
bug reports, sources and documentation, of which looks to be the
most simple is,

Why the current TBAA implementation returns NoAlias for pairs of
accesses for which it has no idea if they really alias?

TBAA is broken in various ways, but in general, it is because it knows they
can't legally do so.

Let me elaborate on this. My understanding of what the TBAA
implementation is supposed to respond is whether a given pair of
accesses is allowed to alias by the rules of the input language.

No.
TBAA is a bad name for what LLVM is doing, because it's not really TBAA at
that point. I've objected to the name before as it confuses people.
It's really anti-alias sets/subsets structured as a tree. Loads/stores are
tagged with which set they belong to.

In C/C++, those sets are generated by TBAA rules. You could generate
them however you want.

(Which in turn leads me to another question, that is, whether

we are really trying to perform C/C++ alias analysis in a
[language-neutral] codegen pass, but let's postpone this for
later.)

No, we are not trying to do so.

This consideration assumes that what the TBAA implementation
actually returns as a result is orthogonal to the
Must/No/MayAlias set,

No, it is exactly a noalias set.

The latter combination is what I would expect for illegal type
puns, for whatever definition of "illegal". Note that for such
cases we currently get MustAlias from BasicAA and NoAlias from
TBAA.

This is, IMHO, very very broken and should be fixed in clang.

In some comments this is considered okay whereas I have to
admit that I don't understand how that may possibly be okay,
meaning it's hard to imagine how a pair of accesses can be
aliasing and non-aliasing at the same time.

Note that it will always be the case
It is entirely possible for one alias analysis to say must alias, and
another to say no-alias, depending on the source of the info.

To me it appears
like that the root cause if this inconsistency is not that any of
these analyses is not correct, but that we are trying to
represent their responses in an inadequate form and we are trying
to use BasicAA as a faster replacement for TBAA in simple cases.

I don't really see this at all.

THe use of BasicAA to go with TBAA is, IMHO, just broken.
it's not trying to be faster, it's doing it for correctness.