Alias-based Loop Versioning

There is work taking place by multiple people in this area, and more is expected to happen, so I’d like to make sure we’re working toward a common end goal.

I tried to collect the use-cases for run-time memory checks and the specific memchecks required for each:

  1. Loop Vectorizer: each memory access is checked against all other memory accesses in the loop (except read vs read)
  2. Loop Distribution: only memory accesses in different partitions are checked against each other. The loop vectorizer will add its own checks for the vectorizable distributed loops
  3. Loop Fusion: accesses from the to-be-merged loops should be checked against each other
  4. LICM: if hoisting a load, stores need to be checked. When sinking a store, all accesses are checked
  5. Load-elimination in GVN: all intervening stores need to be checked.
  6. Instruction scheduling for in-order targets: same as 5
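To make the vectorizer case concrete, here is a hand-written C++ sketch (purely illustrative, not actual compiler output) of what the versioned code effectively looks like for `a[i] = b[i] + c[i]`: the written array is checked against each read array, but the two reads are not checked against each other.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// The overlap test the versioning emits: true if two n-element double
// arrays cannot overlap.
static bool noOverlap(const double *p, const double *q, std::size_t n) {
  std::uintptr_t pa = reinterpret_cast<std::uintptr_t>(p);
  std::uintptr_t qa = reinterpret_cast<std::uintptr_t>(q);
  return pa + n * sizeof(double) <= qa || qa + n * sizeof(double) <= pa;
}

// The write `a` is checked against both reads; `b` vs `c` (read vs read)
// is not checked.
void addLoop(double *a, const double *b, const double *c, std::size_t n) {
  if (noOverlap(a, b, n) && noOverlap(a, c, n)) {
    // Fast path: pointers proven disjoint. In real compiler output this
    // clone would carry noalias metadata and get vectorized.
    for (std::size_t i = 0; i < n; ++i)
      a[i] = b[i] + c[i];
  } else {
    // Slow path: the original, conservatively ordered loop.
    for (std::size_t i = 0; i < n; ++i)
      a[i] = b[i] + c[i];
  }
}
```

Both paths are scalar here only for brevity; the point is the guard structure, with the fast path free of may-alias constraints.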

Currently only the first two are implemented. Ashutosh has a pending review for an LICM/independent pass. I am also working on loop-aware load-elimination requiring memchecks (5 from the list above).

The main question is whether to do this in a separate pass or locally in the passes that benefit from versioning. I tried to collect the pros and cons of each approach.

  1. Separate pass

The benefit of this approach is that current passes would not have to be modified to take advantage of the new freedom due to the increased independence of the memory access operations. AA will capture the noalias annotations inserted by this pass and present them to the passes.

Memchecks present an overhead at run time, so one question is how we ensure that the enabled optimizations will amortize the cost of these checks.

Depending on the optimization, answering this could be pretty involved (i.e. almost like running the pass itself). Consider the loop vectorizer. In order to answer whether versioning the loop would make it vectorizable, you’d have to run most of the legality and profitability logic from the pass.

Also, which accesses do we need to check? Depending on the optimization, we may not need to check each access against every other access, which, the full set being quadratic, can make a significant difference.

  2. Each pass performs its own versioning

Under this approach, each pass would make the calculation locally whether the benefit of versioning outweighs the overhead of the checks. The current Loop Vectorizer is a good example of this. It effectively assumes no may-alias, and if the number of required checks is under a certain threshold, it assumes that the vectorization gain will outweigh the cost of the checks.
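As a rough model of that local decision (the names and the threshold value here are made up, not the vectorizer's actual knobs):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cutoff; the real vectorizer's threshold is a tunable knob.
constexpr std::size_t MemcheckThreshold = 8;

// Checking N may-alias pointers pairwise takes N*(N-1)/2 checks,
// i.e. quadratic in N.
constexpr std::size_t numPairwiseChecks(std::size_t numPointers) {
  return numPointers * (numPointers - 1) / 2;
}

// Version only if the run-time checks stay cheap enough for the assumed
// vectorization gain to amortize them.
constexpr bool shouldVersion(std::size_t numPointers) {
  return numPairwiseChecks(numPointers) <= MemcheckThreshold;
}
```

E.g. four pointers cost six checks and pass the (hypothetical) threshold, while eight pointers cost twenty-eight and do not.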

Making the decision locally is not ideal in this approach, though: if we could amortize the cost of the same checks with a combination of optimizations from multiple passes, neither pass would decide locally to version.

Also, it’s probably beneficial to perform a single loop versioning even if multiple passes would like to check different accesses. E.g. rather than:

          Checks_1
          /      \
         /        \
  OrigLoop      Checks_2
      \          /    \
       \        /      \
        \  NoAlias_1  NoAlias_2
         \     |      /
          \    |     /
           \   |    /
             Join

But instead:

     Checks_1+Checks_2
          /      \
         /        \
  OrigLoop      NoAlias_1+2
      \          /
       \        /
        \      /
          Join

This effectively creates a fast-path and a slow-path version of the loop. We would probably need some metadata annotation so that subsequent passes could amend the same checking block.
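One way to picture the shared checking block is as a set of pointer pairs that each pass can amend; a hypothetical model (not an existing LLVM interface):

```cpp
#include <cassert>
#include <set>
#include <utility>

// Each pass registers the pointer pairs it needs checked; pairs are
// normalized so duplicates merge, and the versioned loop is guarded by
// the single combined block (Checks_1+Checks_2).
struct CheckingBlock {
  std::set<std::pair<int, int>> pairs; // abstract pointer ids

  void require(int a, int b) {
    if (a > b)
      std::swap(a, b); // normalize so (2,0) and (0,2) merge
    pairs.insert({a, b});
  }
};
```

E.g. if one pass needs (0,1) and (0,2), and a later pass needs (0,2) and (1,3), the merged block ends up with three checks rather than four.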

  3. There are also more futuristic ideas, such as always versioning fully: disambiguate all accesses and have the optimizers transform both versions of the loop. A later pass would then decide which checks were necessary and what additional optimizations were performed in the speculative version of the loop. Finally, it would make the cost decision of whether to keep the speculative version along with the checks or to remove them. This would probably need a fairly sophisticated set of metadata.

I think that a combination of 1 and 2 makes sense. This is where we will effectively end up after Ashutosh’s patch. It would hopefully give us the best of both worlds: the aggressiveness/precision of making the call locally in complex passes, and the simplicity/orthogonality of the separate pass.

Thoughts?

Adam

From: "Adam Nemet" <anemet@apple.com>
To: "Dev" <llvmdev@cs.uiuc.edu>, "Ashutosh Nema" <Ashutosh.Nema@amd.com>, "Hal Finkel" <hfinkel@anl.gov>
Sent: Thursday, May 21, 2015 11:53:08 AM
Subject: Alias-based Loop Versioning


I think this is where we'd like to end up. A loop invocation is likely either "nice" or not, and dealing with loops on that basis is probably the best performance vs. code size vs. compile-time trade-off. The only exception I suspect might be likely is loop splitting, and possibly alignment checks (when we have code that accounts for alignment during vectorization). That having been said, I'm not basing these statements on data, but rather on my experience that programmers tend to write loops intended to be hot without the kinds of dependencies for which we need to check.

We'll also need to think about how this can be done in general. Different transformations will benefit from different checks, and collecting the logic for this in one central location is probably not a good idea. A better idea, perhaps, is to have each of these transformation passes split off their profitability analysis into an actual analysis pass that can be used by the check-insertion pass in order to rank and select the checks desirable for each loop.

-Hal

I am not clear on what check_1+check_2 really means. For example: if check_2 is a subset of check_1 and check_1+check_2 implies their union, then the new check_1+check_2 may be overly conservative, resulting in lost opportunities.

From: "Dibyendu Das" <Dibyendu.Das@amd.com>
To: "Adam Nemet" <anemet@apple.com>, "Dev" <llvmdev@cs.uiuc.edu>, "Ashutosh Nema" <Ashutosh.Nema@amd.com>, "Hal
Finkel" <hfinkel@anl.gov>
Sent: Saturday, May 23, 2015 5:45:27 AM
Subject: RE: [LLVMdev] Alias-based Loop Versioning


That's correct. However, it is not clear that these opportunities are important in practice. There are other trade-offs here as well (code size, compile time, etc.), and FWIW, in this context code size can also be an important performance consideration: if you have too many copies of a loop, you'll affect inlining, icache locality, etc.

We'd really need data here to determine whether or not this is really a good idea.

-Hal

It’s a good thought in general Adam, but I’m worried about the following scenarios:

1) As Dibyendu already mentioned, Check1 + Check2 is not very clear. If your intent is the superset/union of check1 & check2, then I’m not sure it will always help passes that need smaller checks (i.e. loop distribution).

Every pass has different runtime-check needs, i.e. the vectorizer checks each memory access against all others (except read vs read), while loop distribution checks memory accesses in different partitions. Once we create a common superset memcheck, how will we ensure there is no loss of opportunity for passes that have smaller checks (i.e. loop distribution)?

Consider that the vectorizer generates a memcheck comparing addresses A, B, C, D against each other, while loop distribution generates a memcheck for A,B vs C,D. Loop distribution doesn’t actually check memory in the same partition. If we rely on the common memcheck (in this case the vectorizer’s memcheck), then we end up checking pointers in the same partition as well.
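This example can be sketched by enumerating the two check sets (A, B, C, D here are placeholder pointer names, not from any real loop):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using PtrPair = std::pair<char, char>;

// What the vectorizer needs: every pointer against every other.
std::vector<PtrPair> allPairs(const std::string &ptrs) {
  std::vector<PtrPair> out;
  for (std::size_t i = 0; i < ptrs.size(); ++i)
    for (std::size_t j = i + 1; j < ptrs.size(); ++j)
      out.push_back({ptrs[i], ptrs[j]});
  return out;
}

// What loop distribution needs: only cross-partition pairs.
std::vector<PtrPair> crossPartition(const std::string &part1,
                                    const std::string &part2) {
  std::vector<PtrPair> out;
  for (char a : part1)
    for (char b : part2)
      out.push_back({a, b});
  return out;
}
```

The superset check over A, B, C, D has six pairs, while distributing {A,B} from {C,D} only needs four; the extra in-partition checks (A,B) and (C,D) are pure overhead for distribution.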

2) Running the legality and profitability analysis of other optimizations in a loop-versioning pass may not always be helpful; memcheck-based optimizations are scheduled at different places in the pipeline. Say we run loop versioning early, it in turn checks the profitability of other optimizations (i.e. the vectorizer), and it generates a versioned loop & memcheck. If some later optimization then changes a few instructions in a way that makes vectorization illegal, the cost of the generated versioned loop & memcheck is no longer justified.

Thanks,
Ashutosh

Thanks for the feedback. It sounds like at this point in time we can’t really settle on a single strategy. We probably want to support all of these use-cases:

1. A common early loop-versioning pass, probably fairly conservative initially (e.g. if a single memcheck removes all may-aliasing from a high-trip-count loop, it’s probably a good idea to version). Even if the pass were not on by default, it would allow for quick experimentation and headroom studies.

2. Pass-specific loop versioning scheduled right before the pass. One example here could be the LICM-specific versioning pass, although currently it’s not written that way. In this case the versioning decision is made based on the pass’s profitability analysis. This is where Hal’s idea may come in for designing passes whose legality/profitability analysis is split from the transform part.

3. Versioning within the actual transform pass. Loop vectorization and loop distribution are the examples here. LV is probably the stronger example of why it sometimes makes sense to do this within a pass: besides failed memchecks, LV also falls back on the original loop if the induction variable may overflow, so the versioning is further amended by the pass.

Given this I think the way forward is to create a utility class for loop versioning that we could use from all of these places.

The class would get the initial set of candidate pointers from LoopAccessAnalysis, just like now. It should allow its user to filter the memchecks down to those required for the given transformation. Knowing the number of memchecks, the pass can then decide whether to go ahead with the versioning. If yes, the class emits the checks, clones the loop, and adds the noalias metadata.

Optionally, the class could also try to locate a prior versioning of the loop and amend the checks and the noalias metadata. (This is the idea from my original post: version only once and keep amending the checking block, i.e. checks_1+checks_2.)
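A plain-C++ mock-up of that flow (all names are hypothetical; this is not the actual interface) might look like:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for a pair of pointers that LoopAccessAnalysis flagged as
// needing a run-time check.
struct PointerCheck {
  int a, b;
};

class LoopVersioning {
  std::vector<PointerCheck> checks;
  bool versioned = false;

public:
  // Candidate checks would come from LoopAccessAnalysis.
  explicit LoopVersioning(std::vector<PointerCheck> candidates)
      : checks(std::move(candidates)) {}

  // The client pass filters down to the checks its transformation needs.
  template <class Pred> void filterChecks(Pred keep) {
    std::vector<PointerCheck> kept;
    for (const PointerCheck &c : checks)
      if (keep(c))
        kept.push_back(c);
    checks = std::move(kept);
  }

  // ...so it can make the cost decision before committing.
  std::size_t getNumChecks() const { return checks.size(); }

  // Would emit the checks, clone the loop, and add noalias metadata.
  void versionLoop() { versioned = true; }
  bool isVersioned() const { return versioned; }
};
```

A pass would construct this, filter, inspect `getNumChecks()` against its own threshold, and only then call `versionLoop()`.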

Adam

Thanks Adam for listing these points.

In addition to your points, I was thinking of keeping this under an option:

We can force memcheck-dependent passes to reuse the memcheck that was created earlier. But this is conservative, because the common memcheck will be the superset check (check1+check2+...+checkN).

We could keep this under an option, i.e. ‘-use-common-memcheck’. Under this option, loop versioning would create a full memcheck considering all memory locations, and it would also attach a metadata property to the loop so it can be identified later. Other memcheck-dependent passes would look for the metadata property, and if a versioned loop and memcheck have already been created, they wouldn’t create a new memcheck & versioned loop.

Regards,
Ashutosh