alias set collapse and LICM

I'm currently facing an issue related to AliasSetTracker/LICM: the
transitive closure of the alias sets is materially more conservative
than individual aliasing queries, and this conservatism leads to
generally worse optimization in LICM.

For instance, consider this module:

declare i32 @only_reads() readonly
declare void @escape(i32*, i32*, i32*)

define void @f(i32 %count) {
entry:
  %a = alloca i32
  %b = alloca i32
  %c = alloca i32
  call void @escape(i32* %a, i32* %b, i32* %c)
  %enter = icmp sle i32 0, %count
  br i1 %enter, label %loop, label %exit

loop:
  %idx = phi i32 [ 0, %entry ], [ %idx.inc, %loop ]
  %idx.inc = add i32 %idx, 1
  %take.be = icmp slt i32 %idx, %count

  %a.load = load i32, i32* %a
  store i32 %a.load, i32* %b

  %v = call i32 @only_reads()
  store i32 %v, i32* %c

  br i1 %take.be, label %loop, label %exit

exit:
  ret void
}

BasicAA knows that %a, %b and %c are all pairwise NoAlias, and given
that, LICM should be able to hoist `%a.load` out of the loop. That
does not happen because the read only call to `@only_reads` collapses
the alias sets for %a, %b and %c into one and so as far as LICM is
concerned, they all alias each other. The store to `%b` now
"clobbers" `%a.load`.

One potential fix is to do N^2 alias queries (every memory operation
in the loop against every other memory operation) before hoisting a
load, but I suspect AliasSetTracker exists mainly to make LICM faster
than N^2, so I'm ruling this solution out.

The solution I like: have LICM create two AliasSetTracker
objects. The first one tracks all memory operations, as the alias set
tracker does now. The second one only tracks mods. When hoisting a
load, we check whether any alias set in the mod tracker aliases the
load location -- if not, the load is okay to hoist to the preheader
from the data dependence point of view. In the worst case, this is 2x
the current work for building the alias sets, and no extra work when
checking whether we can hoist a load.
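To make the proposal concrete, here is a toy sketch of the two-tracker
idea in Python. This is not the LLVM API -- the `AliasSetTracker` class
and the `alias` oracle below are hypothetical stand-ins that only model
the set-merging behaviour described above:

```python
# Toy model: one tracker for all memory operations, a second tracker
# for mods (stores) only. Hypothetical classes, not LLVM's.

def alias(p, q):
    # BasicAA stand-in: distinct allocas are NoAlias; a readonly
    # call ("ANY") may alias every location.
    return p == q or "ANY" in (p, q)

class AliasSetTracker:
    def __init__(self):
        self.sets = []          # disjoint sets of locations

    def add(self, loc):
        merged = {loc}
        keep = []
        for s in self.sets:
            # Disjointness invariant: any set containing something
            # that may alias `loc` must be folded into loc's set.
            if any(alias(loc, q) for q in s):
                merged |= s
            else:
                keep.append(s)
        self.sets = keep + [merged]

    def aliases(self, loc):
        return any(any(alias(loc, q) for q in s) for s in self.sets)

# Loop body from the example: load %a, store %b, readonly call, store %c.
full = AliasSetTracker()
for loc in ["a", "b", "ANY", "c"]:
    full.add(loc)

mods = AliasSetTracker()
for loc in ["b", "c"]:            # stores only
    mods.add(loc)

print(len(full.sets))             # 1: collapsed, as described above
print(mods.aliases("a"))          # False: %a.load is safe to hoist
```

In the toy model the full tracker collapses to a single set exactly as
described above, while the mod-only tracker keeps %a's location
unaliased, so the load could be hoisted.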

Before I dive into implementing this, I have a few questions:

* is this even the right problem to solve? Or should I look at other
   passes for this optimization?

* do you think this will be too expensive? If so, does it make sense
   to do this only at -O3?

-- Sanjoy

From: llvmdev-bounces@cs.uiuc.edu [mailto:llvmdev-bounces@cs.uiuc.edu]
On Behalf Of Sanjoy Das
Subject: [LLVMdev] alias set collapse and LICM

> BasicAA knows that %a, %b and %c are all pairwise NoAlias, and given
> that, LICM should be able to hoist `%a.load` out of the loop. That
> does not happen because the read only call to `@only_reads` collapses
> the alias sets for %a, %b and %c into one

Can you explain why the alias sets are collapsed into one here? Can that collapsing simply be avoided for read-only calls without creating a second alias set?

- Chuck

> Can you explain why the alias sets are collapsed into one here?

If I'm reading the code correctly, it is because the readonly call
aliases all of %a, %b and %c. Since two pointers can be in two
different alias sets only if they do not alias, %a has to be in the
same alias set as the read only call and so does %b and %c.
Therefore, all of them have to be in the same alias set.

> Can that collapsing simply be avoided for read-only calls without creating a second alias set?

I have not yet come up with a simple scheme to solve that problem
within AliasSetTracker. Obviously, that does not mean that there
isn't one.

-- Sanjoy

[+Danny]

From: "Sanjoy Das" <sanjoy@playingwithpointers.com>
To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Cc: "Philip Reames" <listmail@philipreames.com>, "Hal Finkel" <hfinkel@anl.gov>, "Chandler Carruth"
<chandlerc@google.com>
Sent: Monday, April 27, 2015 4:58:09 PM
Subject: alias set collapse and LICM

> BasicAA knows that %a, %b and %c are all pairwise NoAlias, and given
> that, LICM should be able to hoist `%a.load` out of the loop. That
> does not happen because the read only call to `@only_reads` collapses
> the alias sets for %a, %b and %c into one and so as far as LICM is
> concerned, they all alias each other. The store to `%b` now
> "clobbers" `%a.load`.

Each alias set currently keeps track of its AccessType (Ref/Mod/ModRef). Maybe we should not collapse Ref access sets with Mod ones?

-Hal

> Each alias set currently keeps track of its AccessType (Ref/Mod/ModRef). Maybe we should not collapse Ref access sets with Mod ones?

AFAICT, then the read only call will have to be in two alias sets -- a
Ref alias set (also containing %a) and a Mod alias set (also
containing %b and %c). The collapse is driven by the fact that the
alias sets are disjoint -- to avoid collapsing I think we'll have to
break that invariant.

-- Sanjoy

From: "Sanjoy Das" <sanjoy@playingwithpointers.com>
To: "Charles R Caldarale" <Chuck.Caldarale@unisys.com>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Monday, April 27, 2015 5:44:18 PM
Subject: Re: [LLVMdev] alias set collapse and LICM

> Can you explain why the alias sets are collapsed into one here?

> If I'm reading the code correctly, it is because the readonly call
> aliases all of %a, %b and %c. Since two pointers can be in two
> different alias sets only if they do not alias, %a has to be in the
> same alias set as the read only call and so does %b and %c.
> Therefore, all of them have to be in the same alias set.

That's correct. OTOH, I think we can loosen this somewhat (having separate Mod and Ref alias sets, for example). We'd need to change the interface, however, and update the clients. Luckily, there are very few clients.

-Hal

From: "Sanjoy Das" <sanjoy@playingwithpointers.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Philip Reames" <listmail@philipreames.com>, "Chandler Carruth" <chandlerc@google.com>, "LLVM Developers Mailing
List" <llvmdev@cs.uiuc.edu>, "Daniel Berlin" <dberlin@dberlin.org>
Sent: Monday, April 27, 2015 5:55:31 PM
Subject: Re: alias set collapse and LICM

> Each alias set currently keeps track of its AccessType
> (Ref/Mod/ModRef). Maybe we should not collapse Ref access sets
> with Mod ones?

> AFAICT, then the read only call will have to be in two alias sets -- a
> Ref alias set (also containing %a) and a Mod alias set (also
> containing %b and %c). The collapse is driven by the fact that the
> alias sets are disjoint -- to not collapse I think we'll have to
> break that invariant.

That's correct. There are only a few passes that use AST, however, and I don't think updating them would be too much trouble.

-Hal

You can’t win here (believe me, i’ve tried, and better people than me have tried, for years :P).

No matter what you do, the partitioning will never be 100% precise. The only way to solve that in general is to pairwise query over the partitioning.

Your basic problem is that all of the alias analyses do not come up with the same partitioning of the heap.

Worse, some are not even transitive, or even symmetric.

This means one single variable may be in multiple disjoint alias sets in reality.

(For example, GEP, PHI will give you different AA results than PHI, GEP in some cases. That is mostly a FIXME, but there are plenty of cases you can come up with where info tells you something about one variable and its relationship to another, but not to a third. The non-symmetric case is rarer, but again, possible.)

Let’s say you modify AliasSetTracker to not collapse, and handle multiple disjoint partitions at once.

(Here is where I will draw the equivalence between this and what we've done before in GCC :) )

The only way to maintain correctness (in general) and not collapse is to maintain multiple alias sets, and say which are affected by which instructions.

(IE you have 1 instruction, and it may MOD 5 different alias sets, each of which represents a different partitioning of the heap).

You will discover very quickly that, for most partitionings you can think of, the cost of maintaining the alias sets over a set of instructions is greater than the cost of the additional queries you were trying to avoid in the first place.

In other words, depending on how precise your partitioning is, you may have variables that clobber 10 or 20 or 30 of the disjoint alias sets. Handling this is a lot more expensive than saying "hey, i want to move this one load up another instruction. Does it really conflict?". This is because you really only care about individual aliases in the alias set at any given time, but unless you go through all the addresses ahead of time, and have a way to say "okay alias set tracker, i only care about variables x, y, z, and even then, only x in block a, y in block b, etc", you are paying the cost of doing whatever queries are necessary to prove something about all variables in an alias set.

MemorySSA was essentially built with this problem in mind (and thus tries to provide chains that you can do querying on). You ask it about a load, and it can O(1) tell you the nearest dominating thing that MOD's it (IE the thing for which getModRefInfo(Instruction, Location) says Mod).

This means algorithms that use it to hoist, if they are looking to place things in positions that dominate the original load, can be done in O(N) overall.

It can O(N) tell you the nearest thing (not dominating) that MOD's it, where N is the number of MOD'ing accesses in between the dominating access and the actual MOD'ing access. We could make this O(1) for a space cost (we compute it but throw it away).

It can O(N) tell you the farthest down you can move something (and give you O(N) the nearest post-dominated thing, or the nearest thing).

It can O(1) tell you the nearest set of uses, and the next “MOD’ing access” (IE the next thing to which getModRefInfo(instruction) says Mod). The nearest common postdominator of all of these is a valid placement point for the store, but may not be the farthest down you can move it.

But if it’s outside the loop, you may not care to try to move it further.

This means store sinking algorithms are N^2 to sink as far down as you can (though you can get a cheap place, of course, for O(N)).

O(N) is the worst case, you can do some pretty good caching to get better.
It’s also N=Number of MOD’ing accesses, not N = Number of instructions. They may or may not be the same thing, most of the time they are not.

It's possible to get the same time bounds for store sinking as load sinking by building MemorySSI instead of MemorySSA :)
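A minimal sketch of the core idea (plain Python; `MemoryAccess` and the instruction names are hypothetical stand-ins, not LLVM's MemorySSA types): each memory access records its nearest dominating clobbering def up front, so the "who clobbers this load?" query is a single link lookup.

```python
# Toy MemorySSA-style chain (hypothetical names, not the LLVM API).
# Each memory access records the nearest dominating MemoryDef that
# may clobber it, so "who clobbers this load?" is one link lookup.

class MemoryAccess:
    def __init__(self, name, defining=None):
        self.name = name
        self.defining = defining   # nearest dominating clobbering def

live_on_entry = MemoryAccess("liveOnEntry")
# store to %b, itself defined by liveOnEntry
store_b = MemoryAccess("store %b", live_on_entry)
# load of %a: the store to %b is not a clobber, so the defining
# access skips past it to liveOnEntry
load_a = MemoryAccess("%a.load", live_on_entry)
# a later load of %b *is* clobbered by the store
load_b = MemoryAccess("%b.load", store_b)

def nearest_clobber(use):
    return use.defining.name     # O(1): follow the precomputed link

print(nearest_clobber(load_a))   # liveOnEntry -> candidate for hoisting
print(nearest_clobber(load_b))   # store %b -> not hoistable as-is
```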

> In other words, depending on how precise your partitioning you may have
> variables that clobber 10 or 20 or 30 of the disjoint alias sets.

This should be “instructions”, not variables.

> You can't win here (believe me, i've tried, and better people than me have
> tried, for years :P).
> No matter what you do, the partitioning will never be 100% precise. The
> only way to solve that in general is to pairwise query over the
> partitioning.
>
> Your basic problem is that all of the alias analyses do not come up with the
> same partitioning of the heap.
>
> Worse, some are not even transitive, or even symmetric.

Am I correct in saying that the lack of transitivity fundamentally
cannot be helped -- a MayAlias composed with a MayAlias is not
necessarily a MayAlias? The simplest example of this I can come up
with is {A, C?A:B, B} where A `NoAlias` B and C is a runtime value
that cannot be computed at compile time. Then A `MayAlias` C?A:B and
C?A:B `MayAlias` B but A `NoAlias` B.
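The example can be modelled directly with a toy oracle (plain Python, not LLVM's AA; `sel` stands for the C?A:B value and a value is represented as the set of addresses it may point to):

```python
# Toy alias oracle: a value is a set of addresses it may point to.
# `sel` models C ? A : B -- unknown until runtime, so it may point
# to either address.

A, B = "A", "B"
sel = frozenset({A, B})

def may_alias(p, q):
    ps = p if isinstance(p, frozenset) else frozenset({p})
    qs = q if isinstance(q, frozenset) else frozenset({q})
    return bool(ps & qs)

print(may_alias(A, sel))  # True
print(may_alias(sel, B))  # True
print(may_alias(A, B))    # False -- MayAlias does not compose
```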

> This means one single variable may be in multiple disjoint alias sets in
> reality.

Agreed. I did not propose solving the problem within ASTracker for
this very reason.

> MemorySSA was essentially built with this problem in mind (and thus
> tries to provide chains that you can do querying on). You ask it about
> a load, and it can O(1) tell you the nearest dominating thing that
> MOD's it (IE to which getModRefInfo(Instruction, Location) says Mod).

So my proposal is somewhat related to what you're suggesting --
tracking the stores through a separate ASTracker lets me basically ask
a conservative version of getModRefInfo over the entire loop body.
And given that the loop body will be strongly connected, I think the
list of clobbering stores reported by MemorySSA will be exactly equal
to the list of stores reported by AST; control flow should not matter
since everything in the loop body reaches everything else.

-- Sanjoy

> Am I correct in saying that the lack of transitivity fundamentally
> cannot be helped -- a MayAlias composed with a MayAlias is not a
> MayAlias?

Correct. I also imagine you can abuse noalias to get a noalias b, b
noalias c, a mayalias c.

(But i haven't thought too hard about it)

> The simplest example of this I can come up with is {A,
> C?A:B, B} where A `NoAlias` B and C is a runtime value that cannot be
> computed at compile-time. Then A `MayAlias` C?A:B and C?A:B
> `MayAlias` B but A `NoAlias` B.

Yes.

> So my proposal is somewhat related to what you're suggesting --
> tracking the stores through a separate ASTracker lets me basically ask
> a conservative version of getModRefInfo over the entire loop body.

Yes.
MemorySSA just lets you do this per-load instead.

> And given that the loop body will be strongly connected, I think the
> list of clobbering stores reported by MemorySSA will be exactly equal
> to the list of stores reported by AST, control flow should not matter
> since everything in the loop body reaches everything else.

Depends how good AST is.

Given a loop like this:

; Function Attrs: nounwind ssp
define void @test10(i32 %N, double* nocapture %G) #0 {
entry:
  %0 = add i32 %N, -1
  %1 = icmp sgt i32 %0, 1
  br i1 %1, label %bb.nph, label %return

bb.nph: ; preds = %entry
  %tmp = sext i32 %0 to i64
  %tmp8 = add i64 %tmp, -1
  br label %bb

bb: ; preds = %bb, %bb.nph
  %indvar = phi i64 [ 0, %bb.nph ], [ %tmp11, %bb ]
  %scevgep = getelementptr double, double* %G, i64 %indvar
  %tmp9 = add i64 %indvar, 2
  %scevgep10 = getelementptr double, double* %G, i64 %tmp9
  %tmp11 = add i64 %indvar, 1
  %scevgep12 = getelementptr double, double* %G, i64 %tmp11
  %2 = load double, double* %scevgep12, align 8
  %3 = load double, double* %scevgep10, align 8
  %4 = fadd double %2, %3
  %5 = load double, double* %scevgep, align 8
  %6 = fadd double %4, %5
  store double %6, double* %scevgep12, align 8
  %exitcond = icmp eq i64 %tmp11, %tmp8
  br i1 %exitcond, label %return, label %bb

return: ; preds = %bb, %entry
  ret void
}

MemorySSA will return that load scevgep12 is not clobbered in the loop
(in one path through the phi it leads all the way back to the entry
block; in the other path, it also leads back to the entry block
because scevgep12 becomes scevgep10 along that path), load scevgep10
traces back to live-on-entry and is not clobbered by the loop, and
load scevgep is clobbered by the store below it.

IE:
; Function Attrs: nounwind ssp
define void @test10(i32 %N, double* nocapture %G) #0 {
entry:
  %0 = add i32 %N, -1
  %1 = icmp sgt i32 %0, 1
  br i1 %1, label %bb.nph, label %return

bb.nph: ; preds = %entry
  %tmp = sext i32 %0 to i64
  %tmp8 = add i64 %tmp, -1
  br label %bb

bb: ; preds = %bb, %bb.nph
; 3 = MemoryPhi({bb.nph,liveOnEntry},{bb,1})
  %indvar = phi i64 [ 0, %bb.nph ], [ %tmp11, %bb ]
  %scevgep = getelementptr double, double* %G, i64 %indvar
  %tmp9 = add i64 %indvar, 2
  %scevgep10 = getelementptr double, double* %G, i64 %tmp9
  %tmp11 = add i64 %indvar, 1
  %scevgep12 = getelementptr double, double* %G, i64 %tmp11
; MemoryUse(liveOnEntry)
  %2 = load double, double* %scevgep12, align 8
; MemoryUse(liveOnEntry)
  %3 = load double, double* %scevgep10, align 8
  %4 = fadd double %2, %3
; MemoryUse(3)
  %5 = load double, double* %scevgep, align 8
  %6 = fadd double %4, %5
; 1 = MemoryDef(3)
  store double %6, double* %scevgep12, align 8
  %exitcond = icmp eq i64 %tmp11, %tmp8
  br i1 %exitcond, label %return, label %bb

return: ; preds = %bb, %entry
; 2 = MemoryPhi({entry,liveOnEntry},{bb,1})
  ret void
}

bb: ; preds = %bb, %bb.nph
; 3 = MemoryPhi({bb.nph,liveOnEntry},{bb,1})
  %indvar = phi i64 [ 0, %bb.nph ], [ %tmp11, %bb ]
  %scevgep = getelementptr double, double* %G, i64 %indvar
  %tmp9 = add i64 %indvar, 2
  %scevgep10 = getelementptr double, double* %G, i64 %tmp9
  %tmp11 = add i64 %indvar, 1
  %scevgep12 = getelementptr double, double* %G, i64 %tmp11
; MemoryUse(liveOnEntry)
  %2 = load double, double* %scevgep12, align 8
; MemoryUse(liveOnEntry)
  %3 = load double, double* %scevgep10, align 8

Note that it knows these are both, memory-wise, not clobbered by the
store in the loop, even though the operands are loop variant (and you
could not hoist these loads out of the loop).

Later, GVN will use this fact to prove the equivalence of these two
loads, and use it to PRE load scevgep12 into a phi of(new load in the
header, %3)

  %4 = fadd double %2, %3
; MemoryUse(3)
  %5 = load double, double* %scevgep, align 8

^^^^ Note here it's MemoryUse(3) because the PHI dominates it.
MemorySSA, while computing this, walks and hits the store below it (1
= MemoryDef(3)). But that store does not dominate it, so it is not
valid to say MemoryUse(1). Instead, it unwinds to the nearest
dominating thing.

(that is what i meant by we could make this query O(1), because it
computes the info and throws it away)

We later use this to replace load of scevgep with a phi of (new load
in the header, %6)

I never saw a good conclusion to this thread. Clearly, collapsing all of the alias sets whenever the loop contains a call with “AnyWhere” ModRef behavior is not an effective design. AliasSet partitions need to remain disjoint, but that doesn’t mean all memory access needs to be partitioned. There should be one special set of accesses that simply alias with everything without causing the partitions to collapse. I haven’t thought through how to make this change with our AliasSetTracker API though.

Andy

So, you can't have disjoint sets, and have one set that says "i
access everything". Because it would contain everything :)

Thus, I assume by your description you meant "we want to maintain
multiple disjoint partitions of the same set of accesses" (which is
not quite the same as just having one special partition).

When one partition is a refinement of the other (as should be the case
here), this is known as the nested set union problem, and in fact,
some very recent research was done into it:

http://www.cs.princeton.edu/~dhlarkin/papers/esa14-nested-set-union.pdf

> So, you can't have disjoint sets, and have one set that says "i
> access everything". Because it would contain everything :)
>
> Thus, I assume by your description you meant "we want to maintain
> multiple disjoint partitions of the same set of accesses" (which is
> not quite the same as just having one special partition).
>
> When one partition is a refinement of the other (as should be the case
> here), this is known as the nested set union problem, and in fact,
> some very recent research was done into it:
>
> http://www.cs.princeton.edu/~dhlarkin/papers/esa14-nested-set-union.pdf

I’m suggesting that the tracker have a special set that is not included in the partition and assumed to alias with everything in the partition. So we would have a partition over the subset of memory accesses that are actually interesting. All the sets in that partition are disjoint. The set outside of the partition is not. Every time you query for aliasing, you need to check if one of the accesses is in the special set.

Andy

Ah, so this seems identical to what Sanjoy suggested initially :)

I think he suggested maintaining two partitions, one for Refs and one for Mods. ModRef accesses would be in both partitions. That might also be helpful, but might not be necessary if all the readonly calls are simply removed from the partition.

Andy

Maybe it's me :)
I read his suggestion as "create one set that tracks all memory
operations", and one set that tracks "mods" he cared about.
That way, he could remove the readonly call from the set of "things he
cared about". Which is what i read your suggestion as.

(As an even more meta question, it's not clear to me why LICM is
using alias set tracker instead of memory dependence of some sort,
since it's really trying to discover where the memory dependences for
a given load/store are so it knows where it can place things. Instead,
it hands alias set tracker all the things in the loop and says "hey,
is this modded in the loop", which is, IMHO, the wrong way around :P)

I just want to point out that the readonly case is illustrative of a broader issue. The alias set tracker mechanism is currently inherently transitive. That is, a load can be considered clobbered even if it doesn't alias a store; if it aliases another load which aliases the store, it's considered clobbered. The readonly call is the case we've hit and the one that really matters to us, but it's worth keeping the broader problem in mind.

I came across this issue again this week and ended up implementing a much less principled fix in our local tree. For sufficiently small loops, we now resort to pairwise modref checks if AST reports a possible clobber. In several test cases I looked at, the precise aliasing enabled enough LICM for us to recognize vectorizable loops which were otherwise getting missed. As you can imagine, that's a major performance win on those benchmarks.

For upstream, I now know of four possible implementation strategies:

Option 1 - Resort to pairwise modref queries for sufficiently small loops. This has the advantage of being already implemented and easy to upstream, but would introduce a capped N^2 algorithm and thus a performance cliff. It might be worth upstreaming this with the threshold defaulted to 0 (i.e. nop) just as a testing mechanism for identifying places AST's imprecision is hurting us.

Option 2 - As a variant of option 1, we can keep track of potentially clobbering instructions seen during a single walk of the loop. With this, the above N^2 algorithm becomes #L*#S, and we can easily bound the maximum number of clobbering instructions we track. I think this is actually a reasonable compromise and is worth considering. This gives us fully precise aliasing for most loops within LICM at reasonable cost.
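A rough sketch of what Option 2 could look like (toy Python with hypothetical names, not LLVM code; `may_clobber` stands in for a precise pairwise mod/ref query, and a real version would also collect calls that may write):

```python
# Sketch of Option 2: one walk over the loop body collects up to
# MAX_CLOBBERS potentially-clobbering instructions (here, stores);
# each hoisting candidate is then checked pairwise against only that
# bounded list, giving #loads * #stores work instead of N^2.

MAX_CLOBBERS = 8

def may_clobber(store_loc, load_loc):
    # Stand-in for a precise pairwise mod/ref query.
    return store_loc == load_loc

def hoistable_loads(loop_body):
    clobbers = []
    for op, loc in loop_body:
        if op == "store":
            clobbers.append(loc)
            if len(clobbers) > MAX_CLOBBERS:
                return []        # give up: too many clobbers to track
    return [loc for op, loc in loop_body
            if op == "load"
            and not any(may_clobber(c, loc) for c in clobbers)]

# Loop from the original example: load %a, store %b, store %c.
body = [("load", "a"), ("store", "b"), ("store", "c")]
print(hoistable_loads(body))     # ['a']
```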

Option 3 - We can track readonly calls separately within AST. This wouldn't change the inherent purpose of AST (i.e. it would still be about transitive aliasing), but would enable more precise results in the specific case we care about. This only helps the readonly call case and does not help with general aliasing collapse issue. It also requires every consumer of the AST interface to be updated to explicit check the "alias all" set. (Note: This could help other "unknown instruction" cases as well as readonly calls. It just doesn't solve the imprecise aliasing transitivity for known instructions.)

Option 4 - We can split AST into two sets of alias sets. One set of sets is based on aliasing ref behaviour, the other on mod behaviour. The rules for adding to the first set of sets are the existing AST rules. A load would only be added to a mod-based set if a store within the set aliased it directly. We would still merge all of the sets that a load belongs to. The difference is that a load would not be considered to belong to a set unless a store within that set aliases it, not just another load. With LICM, we'd use the mod-set for hoisting of loads, and the ref-set for sinking of stores. This is a fairly involved change.

My mild preference would be for option 2, followed by 4, then 1, then 3. What do others think? I'm willing to implement any of 1, 2, and likely 3. Option 4 is likely more work than I can commit to given that the symptom biting me is now fixed locally.

Philip

> Maybe it's me :)
> I read his suggestion as "create one set that tracks all memory
> operations", and one set that tracks "mods" he cared about.
> That way, he could remove the readonly call from the set of "things he
> cared about". Which is what i read your suggestion as.
>
> (As an even more meta question, it's not clear to me why LICM is
> using alias set tracker instead of memory dependence of some sort,
> since it's really trying to discover where the memory dependences for
> a given load/store are so it knows where it can place things. Instead,
> it hands alias set tracker all the things in the loop and says "hey,
> is this modded in the loop", which is, IMHO, the wrong way around :P)

Rephrasing LICM's aliasing queries on top of MemorySSA might actually be reasonable. Doing it using the existing MDA would be a non-starter performance-wise. Where does MemorySSA stand at this point? I don't believe it's landed yet, has it?

Though, the only place LICM tries to place things is the loop preheader. The existing aliasing and guaranteed-to-execute logic there is actually pretty powerful. It's not clear that MemorySSA is strictly better here.