RFC: Loop versioning for LICM

I would like to propose a new loop multi-versioning optimization for LICM.

For now I have kept this for LICM only, but it can be used in multiple places.

The main motivation is to unblock optimizations that are stuck because of memory alias dependencies. Alias analysis is often unsure about memory accesses and reports may-alias; this uncertainty prevents some memory-based optimizations from proceeding further. We observed some cases with LICM where this aliasing uncertainty is the only thing standing in the way.

In cases where alias analysis is unsure, we would like to use loop versioning as an alternative.

Loop versioning creates two versions of the loop: one with aggressive aliasing and the other with conservative (default) aliasing. In the aggressive-alias version all memory accesses are marked as no-alias. The two versions are preceded by a runtime memory check, which consists of bound checks for all unique memory locations accessed in the loop and establishes whether any of them alias. Based on the result of this check at runtime, one of the two loops is executed: if the memory does not alias, the aggressive-alias loop runs; otherwise the conservatively-aliased version runs.

Marking the memory accesses in the aggressive-alias version as no-alias enables other optimizations to proceed further.
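Conceptually, the transformed code has the following shape (a source-level sketch only, with placeholder pointers a and b; the actual pass works on LLVM IR and marks the accesses with !alias.scope/!noalias metadata):

   /* Runtime memory check: do the byte ranges [a, a+n) and [b, b+n)
      touched by the loop overlap? */
   void sketch(int *a, int *b, unsigned n) {
     if (a + n <= b || b + n <= a) {
       /* Ranges proven disjoint at runtime: aggressive-alias version;
          LICM and friends may hoist/sink/promote freely here. */
       for (unsigned i = 0; i < n; i++)
         a[i] = b[i] + 1;
     } else {
       /* Ranges may overlap: original, conservatively-aliased loop. */
       for (unsigned i = 0; i < n; i++)
         a[i] = b[i] + 1;
     }
   }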

Following are the top-level steps:

  1. Perform a feasibility check for loop versioning.

  2. If the loop is a candidate for versioning, create a memory bound check covering all the memory accesses in the loop body (a sketch of such a check follows this list).

  3. Clone the original loop and mark all memory accesses in the new loop as no-alias.

  4. Make the original loop and the versioned loop the two branch targets of the runtime check result.

  5. Call LICM on the aggressive-alias version of the loop (for now LICM is scheduled later and not called directly from the LoopVersioning pass).
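A sketch of what the bound check in step 2 amounts to, assuming each tracked pointer has a precomputed byte range [Start, End) covering everything the loop may access through it (names are placeholders, not code from the patch):

   // Returns true if any two tracked ranges overlap, in which case the
   // conservative (original) loop must be executed.
   bool anyRangesOverlap(const char *Start[], const char *End[], unsigned NumPtrs) {
     for (unsigned i = 0; i < NumPtrs; ++i)
       for (unsigned j = i + 1; j < NumPtrs; ++j)
         if (Start[i] < End[j] && Start[j] < End[i]) // half-open intervals intersect
           return true;
     return false;
   }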

Consider the following test:

     1 int foo(int * var1, int * var2, int * var3, unsigned itr) {
     2   unsigned i = 0, j = 0;
     3   for(; i < itr; i++) {
     4     for(; j < itr; j++) {
     5       var1[j] = itr + i;
     6       var3[i] = var1[j] + var3[i];
     7     }
     8   }
     9 }

At line #6 the store to var3 could be moved out of the loop by LICM (promoteLoopAccessesToScalars), but because alias analysis is unsure about the memory accesses, LICM is unable to move it.
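For reference, this is roughly what the promotion would look like at the source level once var1 and var3 are known not to alias (a sketch only, ignoring the zero-trip-count case that LICM has to guard against):

   int tmp = var3[i];        /* load of the invariant location hoisted above the loop */
   for (; j < itr; j++) {
     var1[j] = itr + i;
     tmp = var1[j] + tmp;    /* var3[i] is kept in a register inside the loop */
   }
   var3[i] = tmp;            /* single store sunk below the loop */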

IR after loop versioning:

for.body3.loopVersion:                  ; preds = %for.body3.loopVersion.preheader, %for.body3.loopVersion
  %indvars.iv.loopVersion = phi i64 [ %indvars.iv.next.loopVersion, %for.body3.loopVersion ], [ %2, %for.body3.loopVersion.preheader ]
  %arrayidx.loopVersion = getelementptr inbounds i32* %var1, i64 %indvars.iv.loopVersion
  store i32 %add, i32* %arrayidx.loopVersion, align 4, !tbaa !1, !alias.scope !11, !noalias !11
  %indvars.iv.next.loopVersion = add nuw nsw i64 %indvars.iv.loopVersion, 1
  %lftr.wideiv.loopVersion = trunc i64 %indvars.iv.loopVersion to i32
  %exitcond.loopVersion = icmp eq i32 %lftr.wideiv.loopVersion, %0
  br i1 %exitcond.loopVersion, label %for.inc11.loopexit38, label %for.body3.loopVersion

for.body3:                              ; preds = %for.body3.lr.ph, %for.body3
  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body3 ], [ %2, %for.body3.lr.ph ]
  %arrayidx = getelementptr inbounds i32* %var1, i64 %indvars.iv
  store i32 %add, i32* %arrayidx, align 4, !tbaa !1
  %8 = load i32* %arrayidx7, align 4, !tbaa !1
  %add8 = add nsw i32 %8, %add
  store i32 %add8, i32* %arrayidx7, align 4, !tbaa !1
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %lftr.wideiv = trunc i64 %indvars.iv to i32
  %exitcond = icmp eq i32 %lftr.wideiv, %0
  br i1 %exitcond, label %for.inc11, label %for.body3

In the versioned loop the difference is visible: one store has been moved out of the loop body.

Following are some high-level details about the current implementation:

  • LoopVersioning

LoopVersioning is the main class that holds the multi-versioning functionality.

  • LoopVersioning::isVersioningBeneficial

A member of ‘LoopVersioning’ that does the feasibility check for loop versioning:
a) checks the layout of the loop,
b) instruction-level checks,
c) memory checks.

  • LoopVersioning::versionizeLoop

a) Clones the original loop.
b) Creates a runtime memory check.
c) Adds both loops as targets of the runtime check result.

  • RuntimeMemoryCheck

This class takes care of the runtime memory check.

  • RuntimeMemoryCheck::createRuntimeCheck

It creates the runtime memory check.

In this patch the maximum loop-nest threshold used is 2, and the maximum number of pointers allowed in the runtime memory check is 5.
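If these thresholds are made configurable, the natural way would be cl::opt options; a sketch with placeholder option names (not the patch's actual flags):

   #include "llvm/Support/CommandLine.h"
   using namespace llvm;

   static cl::opt<unsigned> LoopVersioningMaxDepth(
       "loop-versioning-max-depth", cl::init(2), cl::Hidden,
       cl::desc("Maximum loop nest depth considered for loop versioning"));

   static cl::opt<unsigned> LoopVersioningMaxPointers(
       "loop-versioning-max-pointers", cl::init(5), cl::Hidden,
       cl::desc("Maximum number of pointers in the runtime memory check"));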

Later I would like to turn this into a utility so that others can use it.

Please go through the patch for the detailed approach.

Patch available at http://reviews.llvm.org/D7900

Suggestions and comments are welcome.

Regards,

Ashutosh

Hi Ashutosh,

Have you been following the recent Loop Access Analysis work? LAA was split out from the Loop Vectorizer, which has been performing the kind of loop versioning that you describe. The main reason was to be able to share this functionality with other passes.

Loop Access Analysis is an analysis pass that computes basic memory dependence and the runtime checks. The versioning decision and then performing the transformation are left to the transform passes using this analysis.

If we decide that a stand-alone memcheck-based loop-versioning is desired we should probably use this analysis and possibly extend it instead of duplicating the code.

Adam

Hi Adam,

Thanks for looking into LoopVersioning work.

I have gone through the recent LoopAccessAnalysis changes and found that some of the functionality overlaps (i.e. the runtime memory check, loop access analysis, etc.). LoopVersioning can use some of these things from LAA.

LoopVersioning is a memory-check-based multi-versioning optimization; it simply creates an aggressive-alias version of the loop preceded by a memory check. It is not concerned with the order of instructions or the detailed dependency checking that the LoopVectorizer does. It performs some basic loop structure checks, loop instruction checks and memory checks.

In general I found the LAA work is more inclined towards the LoopVectorizer. Some of the potentially reusable functions are biased towards the LoopVectorizer and have condition checks specific to it. It would be good to make some of these classes and functions more generic and reusable.

I will cover some of these points in this mail.

RuntimeCheckEmitter

“RuntimeCheckEmitter::addRuntimeCheck”

While creating the runtime check, I found that some pointer pairs are filtered out and not considered:

  1. No need to check if two read only pointers intersect.

  2. Only need to check pointers between two different dependency sets.

  3. Only need to check pointers in the same alias set

If we would like this to be used by other optimizations, not all of them will appreciate the above filters. Specifically, LoopVersioning does not care about them; it expects all the pointers in a loop to be considered for the memory check, and it does not care about different dependency sets or different alias sets.

I suggest we make these checks optional and give users of this class the flexibility to set them.

For this I suggest the following changes:

class RuntimeCheckEmitter {
  …………
  …………
  /// Consider readonly pointer intersection in memcheck.
  bool CheckReadOnlyPointersIntersection;
  /// Consider pointers in same dependency sets for memcheck.
  bool CheckPointersInSameDependencySet;
  /// Consider pointers in different alias sets for memcheck.
  bool CheckPointersInDifferentAliasSet;

Add the above three variables to the class and allow its users to set them.

In “RuntimeCheckEmitter::addRuntimeCheck” the following three conditions need to be controlled by the above variables.

Change the following check:
   // No need to check if two readonly pointers intersect.
   if (!PtrRtCheck->IsWritePtr[i] && !PtrRtCheck->IsWritePtr[j])
     continue;
To:
   // No need to check if two readonly pointers intersect.
   if (!CheckReadOnlyPointersIntersection && !PtrRtCheck->IsWritePtr[i] &&
       !PtrRtCheck->IsWritePtr[j])
     continue;

Change the following check:
   // Only need to check pointers between two different dependency sets.
   if (PtrRtCheck->DependencySetId[i] == PtrRtCheck->DependencySetId[j])
     continue;
To:
   // Only need to check pointers between two different dependency sets.
   if (!CheckPointersInSameDependencySet &&
       PtrRtCheck->DependencySetId[i] == PtrRtCheck->DependencySetId[j])
     continue;

Change the following check:
   // Only need to check pointers in the same alias set.
   if (PtrRtCheck->AliasSetId[i] != PtrRtCheck->AliasSetId[j])
     continue;
To:
   // Only need to check pointers in the same alias set.
   if (!CheckPointersInDifferentAliasSet &&
       PtrRtCheck->AliasSetId[i] != PtrRtCheck->AliasSetId[j])
     continue;

This makes RuntimeCheckEmitter more flexible and gives its users more control over how it is used.

LoopAccessAnalysis::analyzeLoop

Here again it is very specific to the LoopVectorizer.

The way it handles stores and loads may not suit other optimizations that expect different treatment. I suggest we think about giving the user the flexibility to override the load and store handling; we could provide virtual methods for it (i.e. analyzeLoads & analyzeStores). Also, some optimizations may not tolerate call instructions at all, while others may want to analyze calls further; we should make some provision along those lines as well.

AccessAnalysis & LoopAccessAnalysis are tied to the dependency check. If some analysis needs the same functionality except for the dependency check, there should be a provision for that. For example, LoopVersioning needs similar functionality but without the dependency analysis; for now the only possibility is to extend and rewrite the functions, removing the dependency checks.

Regards,

Ashutosh

I’m concerned about this approach. LICM is deliberately somewhat conservative about aliasing to avoid compile time impact. In practice, we defer some of the harder cases to GVN in the form of full redundancy and partial redundancy elimination.

It’s not clear to me that versioning a loop in LICM is a good use of either compile time or code size. In particular, I’m concerned about the code size increase that can result from recursive application of loop versioning. I think this is an interesting idea, but the profitability heuristics will need to be incredibly carefully tuned. It would be really really easy to have a net negative impact on the quality of generated code while doing something that seems like a good idea within LICM. Can you spell out the profitability heuristics you’re planning on using?

You might want to look at what Sanjoy has been doing with IRCE. This is a specialized form of Index Set Splitting with roughly the same problem to solve.

Have you looked at trying to make LICM more precise with regards to aliasing? I was looking at this a while back and it seemed like there was plenty of room for cheap changes with payoff here. (One example: a read only call in a loop collapses all alias sets. Oops.)

(It’s not clear to me whether you’re proposing a new pass, or extending LICM. If the former, much of my concern disappears. If the latter, you’ll need to really justify the complexity.)

If you’re expecting review, this needs to go to llvm-commits. I have not really looked at it, but one high level comment: you’ve got a couple of really long functions that need to be broken up into semantic pieces. Reviewing in the current form would be hard.

Hi Adam,

Thanks for looking into LoopVersioning work.

I have gone through recent LoopAccessAnalysis changes and found some of the stuff
overlaps (i.e. runtime memory check, loop access analysis etc.). LoopVersioning can
use some of the things from LAA.

LoopVersioning is a memory check based multi versioning optimization, it simply creates
aggressive alias version of loop preceded by a memory check. It’s not concerned about
the order of instructions and detailed dependency check that LoopVectorizer does.
It does some basic loop structure check, loop instruction checks & memory checks.

In general found LAA work is more inclined towards LoopVectorizer.
Found some of the possible reusable functions are biased towards LoopVectorizer,
they has specific condition checks for it.

I am about to post the patches to make LAA suitable for Loop Distribution. As you will hopefully find this will make the LAA more generic. I will cc you on the patches.

It’s good to make some of the classes & function more generic and reusable.
Will be covering some of the points in this mail.

RuntimeCheckEmitter
“RuntimeCheckEmitter::addRuntimeCheck”
While creating runtime check I have found, some of the things are not getting considered.
1) No need to check if two read only pointers intersect.
2) Only need to check pointers between two different dependency sets.
3) Only need to check pointers in the same alias set

I’m sure if we like this to be used by other optimization then not all optimization appreciate
above checks. Specifically LoopVersioning does not care about this, it expects all the pointers
in a loop should be considered for a memory check. Also it does not care about different
dependency set & different alias sets.

I suggest we can make these checks optional, and give flexibility to users of this class to set it.

I am not sure I follow. The logic is meant to reduce the number of memory checks necessary to ensure the independence of may-alias accesses. We can omit 1-3 but then we would end up with unnecessary checks that could unnecessarily slow things down.

For the same suggesting following change:
1)
class RuntimeCheckEmitter {
   …………
   …………
  /// Consider readonly pointer intersection in memcheck
  bool CheckReadOnlyPointersIntersection;
  /// Consider pointers in same dependency sets for memcheck.
  bool CheckPointersInSameDependencySet;
  /// Consider pointers in different Alias sets for memcheck
  bool CheckPointersInDifferentAliasSet

Add the above 3 variables to class, and allow users of this class to set it.

2)
In "RuntimeCheckEmitter::addRuntimeCheck" following 3 condition needs to
controlled by above conditional variables.

>
   Change Following Check:
      // No need to check if two readonly pointers intersect.
      if (!PtrRtCheck->IsWritePtr[i] && !PtrRtCheck->IsWritePtr[j])
        continue;
   To:
      // No need to check if two readonly pointers intersect.
      if (!CheckReadOnlyPointersIntersection && !PtrRtCheck->IsWritePtr[i] &&
            !PtrRtCheck->IsWritePtr[j])
        continue;
        
>
   Change Following Check:
      // Only need to check pointers between two different dependency sets.
      if (PtrRtCheck->DependencySetId[i] == PtrRtCheck->DependencySetId[j])
       continue;
   To:
      // Only need to check pointers between two different dependency sets.
      if (!CheckPointersInSameDependencySet &&
             PtrRtCheck->DependencySetId[i] == PtrRtCheck->DependencySetId[j])
       continue;

>
   Change Following Check:
      // Only need to check pointers in the same alias set.
      if (PtrRtCheck->AliasSetId[i] != PtrRtCheck->AliasSetId[j])
        continue;
   To:
      // Only need to check pointers in the same alias set.
      if (!CheckPointersInDifferentAliasSet &&
            PtrRtCheck->AliasSetId[i] != PtrRtCheck->AliasSetId[j])
        continue;
        
By this we allowing RuntimeCheckEmitter as more flexible and providing user
more control to use it.

LoopAccessAnalysis::analyzeLoop
Here again its very specific to LoopVectorizer.
The way it handles stores & loads may not be appreciated by other optimization
expecting other treatment. I suggest we should think on flexibility for user to
override load & store handling. We can provide virtual methods for load & store
handling (i.e. analyzeLoads & analyzeStores). Also some of the optimization may not
like call instruction, or further they like to analyze call. We should also think on those
lines to make some provision.

Can you please elaborate how you want to analyze function calls?

AccessAnalysis & LoopAccessAnalysis are tied up dependency check, If some analysis
needs same functionality except dependency check then there should be provision available.
i.e. LoopVersioning needs similar stuff except dependency analysis, for now possibility is
extend & rewrite functions by removing dependency checks.

I actually consider the coupling of dependency analysis and memcheck generation as a feature. Since dependence analysis can further disambiguate memory accesses (as in the DependencySet except you mentioned above) it can reduce the number of run-time memory checks necessary.

Adam

Thanks Philip for looking into LoopVersioning.

It’s not clear to me whether you’re proposing a new pass, or extending LICM

It’s a new pass, and we are not changing existing LICM functionality.

LoopVersioning is a memory-check-based multi-versioning optimization. It creates one version of the loop with aggressive aliasing and another with conservative (default) aliasing. The aggressive-alias version has all memory accesses marked as no-alias; because of that, alias analysis is certain about the memory accesses and reports no-alias for all accesses in the alias sets. This helps some optimizations to proceed further. One case is LICM's PromoteAliasSets (recently refactored as promoteLoopAccessesToScalars).

After multi-versioning, in the same pass, we are planning to call “promoteLoopAccessesToScalars” on the aggressive-alias versioned loop.

In particular, I’m concerned about the code size increase that can result from

recursive application of loop versioning.

I think this is an interesting idea, but the profitability heuristics will need to be

incredibly carefully tuned.

We are also concerned about the profitability heuristics, and we are trying to minimize any negative impact of this pass. Code size increase is a concern, but we plan to touch only the cases where the gain from this optimization is good enough to justify it. For now we have made this pass optional.

The profitability analysis contains basic loop structure analysis, loop instruction analysis and loop memory access analysis. We also try to ensure that the new versioned loop will actually benefit from LICM. Apart from this there are a few thresholds (a rough sketch of the combined gate follows the list):

  1. The maximum loop nest depth allowed is 2 (planning to make this configurable).

  2. The maximum number of pointers allowed in the memcheck is 5 (planning to make this configurable).
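A rough sketch of how these checks could be combined into a single gate (names and exact conditions are placeholders, not code from the patch):

   // Illustrative only; the real pass also does structural, instruction and
   // memory analysis before versioning.
   bool isVersioningBeneficial(unsigned LoopNestDepth, unsigned NumMemcheckPointers,
                               unsigned NumPromotableAccesses) {
     const unsigned MaxNestDepth = 2; // threshold 1 above
     const unsigned MaxPointers = 5;  // threshold 2 above
     return LoopNestDepth <= MaxNestDepth &&
            NumMemcheckPointers <= MaxPointers &&
            NumPromotableAccesses > 0; // expect LICM to actually gain something
   }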

I understand that profitability plays a vital role here; in the current implementation the checks added may not be enough for a proper profitability analysis. I accept there is scope for improvement and, as suggested, will look into Sanjoy's work on IRCE.

I have not really looked at it, but one high level comment. You’ve got a couple of really long

functions that need to be broken up into semantic pieces. Reviewing in the current form would be hard.

At this point I’m not expecting a formal review, just emphasizing the functionality as part of the RFC.

Sure, I will split up the long functions.

Regards,

Ashutosh


I am about to post the patches to make LAA suitable for Loop Distribution. As you will hopefully find this will make the LAA more generic. I will cc you on the patches.

Sure Adam.

RuntimeCheckEmitter
“RuntimeCheckEmitter::addRuntimeCheck”
While creating runtime check I have found, some of the things are not getting considered.
1) No need to check if two read only pointers intersect.
2) Only need to check pointers between two different dependency sets.
3) Only need to check pointers in the same alias set

I’m sure if we like this to be used by other optimization then not all optimization appreciate
above checks. Specifically LoopVersioning does not care about this, it expects all the pointers
in a loop should be considered for a memory check. Also it does not care about different
dependency set & different alias sets.

I suggest we can make these checks optional, and give flexibility to users of this class to set it.

I am not sure I follow. The logic is meant to reduce the number of memory checks necessary to ensure the independence of may-alias accesses. We can omit 1-3 but then we would end up with unnecessary checks that could unnecessarily slow things down.

I just wanted to keep the flexibility open.
We can give LoopVersioning a try while keeping the checks from points 1 & 3, but I’m not sure about point 2.
I will give your upcoming patch a try.

LoopAccessAnalysis::analyzeLoop
Here again its very specific to LoopVectorizer.
The way it handles stores & loads may not be appreciated by other optimization
expecting other treatment. I suggest we should think on flexibility for user to
override load & store handling. We can provide virtual methods for load & store
handling (i.e. analyzeLoads & analyzeStores). Also some of the optimization may not
like call instruction, or further they like to analyze call. We should also think on those
lines to make some provision.

Can you please elaborate how you want to analyze function calls?
For calls I am expecting an overridable method, giving the user the flexibility to redefine the behavior. There are a few possible cases: a client may not want any call in the loop body, may allow only a few specific calls, or may not want any pointer/address to escape the loop through a call. A rough sketch of such a hook is below.
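One possible shape for such a hook (purely illustrative; this is not existing LAA API, and the names are placeholders):

   #include "llvm/IR/Instructions.h"

   // A client of the analysis overrides the predicate to state which calls it
   // can tolerate in the loop body.
   struct LoopAccessClientHooks {
     virtual ~LoopAccessClientHooks() {}
     // Return true if the analysis may proceed despite this call.
     virtual bool isCallAcceptable(const llvm::CallInst *CI) const {
       return false; // conservative default: any call blocks the analysis
     }
   };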

AccessAnalysis & LoopAccessAnalysis are tied up dependency check, If some analysis
needs same functionality except dependency check then there should be provision available.
i.e. LoopVersioning needs similar stuff except dependency analysis, for now possibility is
extend & rewrite functions by removing dependency checks.

I actually consider the coupling of dependency analysis and memcheck generation as a feature. Since dependence analysis can further disambiguate memory accesses (as in the DependencySet except you mentioned above) it can reduce the number of run-time memory checks necessary.
As mentioned above, I’m not sure this will be useful for LoopVersioning,
but I would like to give your upcoming patch a try.

Thanks,
Ashutosh


Hi Ashutosh,

I think this is a really interesting idea, and I'd like to encourage you to continue working on it.
Thanks Hal.
Regarding profitability, I think you'll want to check that you'll be able to hoist/sink a significant number of the memory accesses inside the new "versioned" loop. If a loop has a large number of accesses, and the new loop body only differs by the hoisting/sinking of a few, then you're unlikely to see the difference.
Yes, it's a good point; we can add this to the profitability analysis.
Part of this change overlaps with the Loop Access Analysis work by Adam Nemet.
I will reuse his changes once they are accepted.
Regards,
Ashutosh


Hi Ashutosh,

My changes are committed now. LoopAccessAnalysis is an analysis pass, so it has the advantage that the result of the analysis is cached until it gets invalidated (i.e. when the loop changes).

For an example of how to use it, you can look at either the loop-vectorizer in the tree or the WIP patch for the loop-distribution pass in http://reviews.llvm.org/D6930.

Please let me know if you have any questions.

Adam

Hi Adam,


Hi Ashutosh,

My changes are committed now. LoopAccessAnalysis is an analysis pass, so it has the advantage that the result of the analysis is cached until it gets invalidated (i.e. when the loop changes).

For an example of how to use it, you can look at either the loop-vectorizer in the tree or the WIP patch for the loop-distribution pass in http://reviews.llvm.org/D6930.

Please let me know if you have any questions.

I have gone through the current LAA and found a few gaps for reusing it, e.g.:

 928 void LoopAccessInfo::analyzeLoop(const ValueToValueMap &Strides) {
1029     if (isUniform(Ptr)) {
1030       emitAnalysis(
1031           LoopAccessReport(ST)
1032           << "write to a loop invariant address could not be vectorized");
1033       DEBUG(dbgs() << "LAA: We don't allow storing to uniform addresses\n");
1034       CanVecMem = false;
1035       return;
1036     }

LAA gives up on cases where a store is to a loop-invariant address; it sets the memory as non-vectorizable.

In loop versioning for LICM we are more interested in exactly these loop-invariant cases: an invariant store is one of the important cases for “LICM::promoteLoopAccessesToScalars”, and we only do loop versioning if some loop-invariant access exists.

This check looks specific to the loop vectorizer; with it in place we can't use LAA for LICM loop versioning. If we want to make LAA reusable, we probably need to remove this check or make it conditional.

Perhaps clients of LAA could implement such a check themselves instead of LAA doing it.

Not really. Don’t you need memchecks for loop-invariant addresses as well?

I think we should just teach the analysis to also emit run-time checks for loop-invariant addresses. Assuming the memchecks pass we should have valid dependence analysis results so we could for example distribute the loop.

We couldn’t vectorize thus the analysis should provide a new piece of information about the loop having accesses to loop-invariant addresses.

Adam

Thanks Adam for your reply.


Not really. Don’t you need memchecks for loop-invariant addresses as well?
We do need memchecks for loop-invariant addresses, but with the current implementation, if there is any invariant store, LAA simply returns after setting CanVecMem to false.

I think we should just teach the analysis to also emit run-time checks for loop-invariant addresses. Assuming the memchecks pass we should have valid dependence analysis results so we could for example distribute the loop.
Yes, I think that functionality exists, but the only blocker to it is the check mentioned in my previous mail.

So, instead of returning and setting ‘CanVecMem’ for a loop-invariant store, can't we maintain a separate variable, i.e. hasLoopInvariantStore, in LoopAccessInfo and set it accordingly?

Yes, this is what I was proposing above and here ;):

We couldn’t vectorize thus the analysis should provide a new piece of information about the loop having accesses to loop-invariant addresses.

Also we can provide a getter function for the same.

i.e.
 928 void LoopAccessInfo::analyzeLoop(const ValueToValueMap &Strides) {
1029     if (isUniform(Ptr)) {
1030       hasLoopInvariantStore = true;
1031     }

So later optimizations can use this information in their legality analysis and take appropriate action, e.g. in the LoopVectorizer:

4002 bool LoopVectorizationLegality::canVectorizeMemory() {
4008   if (LAI->hasLoopInvariantStore) {
4009     emitAnalysis(VectorizationReport()
4010                  << "write to a loop invariant address could not be vectorized");
4011     DEBUG(dbgs() << "LV: We don't allow storing to uniform addresses\n");
4012     return false;
4013   }

No, not hasLoopInvariantStore but hasAccessToLoopInvariantAddress.

You will also need to generate the memchecks for such accesses in canCheckPtrAtRT and addRuntimeCheck. Without those memchecks passing, the result of the dependence analysis is incorrect.

Adam

Yes, this is what I was proposing above and here ;):

Thanks Adam for confirming :)

No, not hasLoopInvariantStore but hasAccessToLoopInvariantAddress.

It's only for invariant stores [not loads], so using ‘hasLoopInvariantStore’ (or a name mentioning invariant store) makes more sense to me than ‘hasAccessToLoopInvariantAddress’.

You will also need to generate the memchecks for such accesses in

canCheckPtrAtRT and addRuntimeCheck. Without those memchecks passing,

the result of the dependence analysis is incorrect.

I did not understand this point correctly; I feel the current functionality takes care of it and we do not need any new handling for these invariant stores in “canCheckPtrAtRT” and “addRuntimeCheck”.

“canCheckPtrAtRT” considers all the pointers because it works over alias sets, and it adds the qualified pointers to “RuntimePointerCheck”.

“addRuntimeCheck” works on the pointers collected by “canCheckPtrAtRT”. For a non-invariant pointer it creates a range check using the start and end; for an invariant pointer the start and end are considered the same. This is exactly the behavior we need for an invariant store.

By debugging sample programs I can see that it is adding the pointers of invariant stores to the runtime check.

Can you provide more details on what you meant by the above comment?

Regards,

Ashutosh

> Yes, this is what I was proposing above and here ;):
Thanks Adam for confirming :)

NP :).

> No, not hasLoopInvariantStore but hasAccessToLoopInvariantAddress.
Its only for invariant stores[not loads], Using ‘hasLoopInvariantStore’ (or a name with invariant store) makes more sense over ‘hasAccessToLoopInvariantAddress’.

Right but it’s the address that’s invariant not the store so hasLoopInvariantStore is a misleading name. How about hasStoreToLoopInvariantAddress?

> You will also need to generate the memchecks for such accesses in
> canCheckPtrAtRT and addRuntimeCheck. Without those memchecks passing,
> the result of the dependence analysis is incorrect.
I did not understood this point correctly, I feel the current functionality take cares of it
And we do not need any new handling for these invariant stores in “canCheckPtrAtRT” and “addRuntimeCheck”.

“canCheckPtrAtRT” consider all the pointers because its working over AliasSets.
It adds qualified pointers to “RuntimePointerCheck”.

“addRuntimeCheck” works on collected pointers by “canCheckPtrAtRT”
For non-invariant it creates range check by considering start & end.
But for invariant start & end are considered as same.
This same behavior we need for invariant store.

By debugging sample programs I can see its adding invariant stores pointers
to runtime check.

Can you provide more details, what you meant by above comment.

I would have expected canCheckPtrAtRT/hasComputableBounds to give up when the pointer was not a SCEVAddRecExpr. How do we get past that point?

Adam

Right but it’s the address that’s invariant not the store so hasLoopInvariantStore

is a misleading name. How about hasStoreToLoopInvariantAddress?

‘hasStoreToLoopInvariantAddress’ sounds good :)

I would have expected canCheckPtrAtRT/hasComputableBounds to give up when

the pointer was not a SCEVAddRecExpr. How do we get passed that point?

You can compute bounds for an invariant store, just like for invariant loads.

It's a ‘SCEVAddRecExpr’; that's why ‘hasComputableBounds’ is fine with these.

Thanks,

Ashutosh


As an example you can try the test case mentioned in my first email:

     1 int foo(int * var1, int * var2, int * var3, unsigned itr) {
     2   unsigned i = 0, j = 0;
     3   for(; i < itr; i++) {
     4     for(; j < itr; j++) {
     5       var1[j] = itr + i;
     6       var3[i] = var1[j] + var3[i];
     7     }
     8   }
     9 }

Regards,

Ashutosh

>> I would have expected canCheckPtrAtRT/hasComputableBounds to give up when
>> the pointer was not a SCEVAddRecExpr. How do we get passed that point?
> You can compute bounds out of invariant store, just like invariant loads.
> It’s a ‘SCEVAddRecExpr’, that’s why ‘hasComputableBounds’ is fine with these.

As an example you can try, test case mentioned in my first email.
     1 int foo(int * var1, int * var2, int * var3, unsigned itr) {
     2 unsigned i = 0, j = 0;
     3 for(; i < itr; i++) {
     4 for(; j < itr; j++) {
     5 var1[j] = itr + i;
     6 var3[i] = var1[j] + var3[i];
     7 }
     8 }
     9 }

You just get lucky because var1[j] is a SCEVAddRecExpr in the *outer loop*. This is not the case in general for all loop-invariant addresses.

Adam
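A minimal illustration of such a case (a sketch, not code from the thread's patch): here the stored-to address q is invariant in the only enclosing loop, and SCEV models it as a plain unknown pointer value rather than an add recurrence, so AddRec-based bound computation has nothing to work with.

   // &a[i] is an add recurrence in the loop, but q itself is not.
   void bar(int *a, int *q, unsigned n) {
     for (unsigned i = 0; i < n; i++)
       *q += a[i];   // store to a loop-invariant address
   }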

You just get lucky because var1[j] is a SCEVAddRecExpr in the outer loop.

This is not the case in general for all loop-invariant address.

Thanks Adam, I got your point.

LAA only adds a memcheck when it finds that all the addresses in the loop body have computable bounds; it does the same for the LoopVectorizer. If any non-computable bound appears in the loop body, LAA won't add the runtime check. We simply want to use the existing functionality: only if all the addresses in the loop have computable bounds would we like to go for loop versioning.

As part of LAA, do you have suggestions for handling and adding a memcheck for a non-SCEVAddRecExpr?

Regards,

Ashutosh