[Proposal][RFC] Cache aware Loop Cost Analysis

Hi,

This is a proposal to implement an analysis that calculates loop cost based on cache data. The primary motivation is to write profitability measures for cache-related optimizations like interchange, fusion, fission, pre-fetching and others.

I have implemented a prototypical version at http://reviews.llvm.org/D21124.
The patch basically creates groups of references that would lie in the same cache line. Each group is then analysed with respect to the innermost loop, considering cache lines. The penalty for a reference is:
a. 1, if the reference is invariant with respect to the innermost loop,
b. TripCount, for a non-unit-stride access,
c. TripCount / CacheLineSize, for a unit-stride access.
Loop Cost is then calculated as the sum of the reference penalties times the product of the loop bounds of the outer loops. This loop cost can then be used as a profitability measure for cache-reuse-related optimizations. This is just a brief description; please refer to [1] for more details.

A primary drawback of the above patch is the hard-coded cache line size. I wish to get this data from tblgen/TTI, and I am happy to submit patches for it.

Finally, if the community is interested in the above work, I have the following work plan:
a. Update the loop cost patch to a consistent, committable state.
b. Add cache data information to tblgen/TTI.
c. Rewrite Loop Interchange profitability to use Loop Cost.

Please share your valuable thoughts.

Thank you.

References:
[1] [Carr-McKinley-Tseng] http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf

From: "Vikram TV via llvm-dev" <llvm-dev@lists.llvm.org>
To: "DEV" <llvm-dev@lists.llvm.org>
Sent: Wednesday, June 8, 2016 2:58:17 AM
Subject: [llvm-dev] [Proposal][RFC] Cache aware Loop Cost Analysis


A quick comment on your pass: You'll also want to add the ability to get the loop trip count from profiling information from BFI when available (by dividing the loop header frequency by the loop predecessor-block frequencies).

The patch basically creates groups of references that would lie in
the same cache line. Each group is then analysed with respect to
innermost loops considering cache lines. Penalty for the reference
is:
a. 1, if the reference is invariant with the innermost loop,
b. TripCount for non-unit stride access,
c. TripCount / CacheLineSize for a unit-stride access.
Loop Cost is then calculated as the sum of the reference penalties
times the product of the loop bounds of the outer loops. This loop
cost can then be used as a profitability measure for cache reuse
related optimizations. This is just a brief description; please
refer to [1] for more details.

This sounds like the right approach, and is certainly something we need. One question is how the various transformations you name above (loop fusion, interchange, etc.) will use this analysis to evaluate the effect of the transformation they are considering performing. Specifically, they likely need to do this without actually rewriting the IR (speculatively).

A primary drawback in the above patch is the static use of Cache Line
Size. I wish to get this data from tblgen/TTI and I am happy to
submit patches on it.

Yes, this sounds like the right direction. The targets obviously need to provide this information.

Finally, if the community is interested in the above work, I have the
following work-plan:
a. Update loop cost patch to a consistent, commit-able state.
b. Add cache data information to tblgen/TTI.
c. Rewrite Loop Interchange profitability to start using Loop Cost.

Great!

This does not affect loop interchange directly, but one thing we need to also consider is the number of prefetching streams supported by the hardware prefetcher on various platforms. This is generally, per thread, some number around 5 to 10. Splitting loops requiring more streams than supported, and preventing fusion from requiring more streams than supported, is also an important cost metric. Of course, so is ILP and other factors.

-Hal

a. 1, if the reference is invariant with the innermost loop,
b. TripCount for non-unit stride access,
c. TripCount / CacheLineSize for a unit-stride access.
Loop Cost is then calculated as the sum of the reference penalties times
the product of the loop bounds of the outer loops. This loop cost can then
be used as a profitability measure for cache reuse related optimizations.
This is just a brief description; please refer to [1] for more details.


I think you need finer granularity here. At least you need to distinguish
between stride-c (for some reasonable constant, say c = 2) access and
non-strided access as in indirect indexing (a[b[x]]) or otherwise
unpredictable index.



A quick comment on your pass: You'll also want to add the ability to get
the loop trip count from profiling information from BFI when available (by
dividing the loop header frequency by the loop predecessor-block
frequencies).

Sure. I will add it.


This sounds like the right approach, and is certainly something we need.
One question is how the various transformations you name above (loop
fusion, interchange, etc.) will use this analysis to evaluate the effect of
the transformation they are considering performing. Specifically, they
likely need to do this without actually rewriting the IR (speculatively).

For Interchange: Loop cost can be used to build a desired ordering, and interchange can try to match the nearest possible ordering. A few internal kernels, e.g. matmul, are working fine with this approach.
For Fission/Fusion: Loop cost can provide the cost before and after fusion. The loop cost for a fused loop can be obtained by building reference groups as if the loop were already fused; the fission case is similar, but with partitions.
These require no prior IR changes.


This does not affect loop interchange directly, but one thing we need to
also consider is the number of prefetching streams supported by the
hardware prefetcher on various platforms. This is generally, per thread,
some number around 5 to 10. Splitting loops requiring more streams than
supported, and preventing fusion from requiring more streams than
supported, is also an important cost metric. Of course, so is ILP and other
factors.

Cache-based cost computation is one of the metrics and, yes, more metrics need to be added.

I think 'c' should be <= CacheLineSize. I will keep this as a to-do, as only canonical loops are supported for now.

A primary drawback in the above patch is the static use of Cache Line
Size. I wish to get this data from tblgen/TTI and I am happy to submit
patches on it.

Yes, this sounds like the right direction. The targets obviously need to
provide this information.

I'd like to help review this, as it'll be necessary to implement http://wg21.link/p0154r1, which is (likely) in C++17. It adds two constexpr values, std::hardware_{constructive,destructive}_interference_size, because in some configurations you don't have a precise cache line size but rather a range, based on what multiple architectures have implemented for the same ISA. I think you'll want something similar.

I hope the proposal is fine with the community. Should I wait on this thread a little longer before I start updating the patch?

Ping!

Hi Vikram,

Is the analysis result specific to a loop nest or to a loop nest together with a set of reference groups?

Adam


The result is specific to each loop in the loop nest and the calculations
are based on the references in the loop nest.


Sorry, can you please rephrase/elaborate? I don't understand what you mean. Analysis results are retained unless a transformation invalidates them. My question is how the reference groups affect all this.

You could probably describe how you envision this analysis would be used with something like Loop Fusion.

Thanks,
Adam


Sorry, I now understand that you meant LLVM's analysis result (I thought you meant how the analysis calculates the result). The analysis result is specific to each loop in its loop nest. Reference groups are necessary during cost computation only.

You could probably describe how you envision this analysis would be used
with something like Loop Fusion.

For Loop Interchange, the cost calculated for each loop can provide a desired ordering of the loops. Loop interchange can start interchanging the loops to match the desired order iff the interchange legality check passes.
E.g., for the matrix multiplication case:
       for (i = 0 to 5000)
         for (j = 0 to 5000)
           for (k = 0 to 5000)
             C[i,j] = C[i,j] + A[i,k] * B[k,j]
Here, based on the cache line size and the cache penalties for the references (please refer to the first mail of the thread or the reference paper for the different penalties), the following costs result:
       Loop i cost: 2.501750e+11
       Loop j cost: 6.256252e+10
       Loop k cost: 1.563688e+11
The loop cost here is nothing but the total cache penalty due to all references, including penalties due to outer loops. So a lower-cost loop is a better fit for the innermost position because it has better cache locality. This results in the desired loop order i, k, j, and LoopInterchange can interchange the 'j' and 'k' loops, given the legality check passes (the nearest ordering for which legality holds is also an optimal loop ordering).

For fusion/fission, loop costs due to cache can be calculated by assigning cache penalties to references with respect to the loops before and after fusion/fission. The costs then serve as a profitability measure for fusion/fission.

I think the use-case description was very brief in a previous mail, so I have elaborated this time. More details can also be found in http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf.

Thanks

Good time...
Vikram TV
CompilerTree Technologies
Mysore, Karnataka, INDIA

Hi Adam,
Do you have any concerns/suggestions regarding the analysis results? I just want to reiterate that the analysis calculates the cost for each loop in its loop nest. Reference groups affect this only during computation of the loop cost, i.e. in order to compute the cost, the analysis needs to analyse each of the references in the nest.

Please do let me know. Thanks!

Hi Vikram,

Let me ask you differently. This analysis can be thought of as having two inputs: (1) a loop (or loops?) and (2) the reference group. If a transformation pass requires some analysis, the analysis result has to be computed unless it has already been computed and no transformation has invalidated it since the last use.

My question is how this applies to this pass. Consider this sequence of passes:

  1. loop fusion of loops L1 and L2

This computes Loop Cost for L1 and L2 with the reference group that includes accesses from both loops.

Let's assume here that the transformation does not occur in the end, so the analysis result is still valid at this point.

  2. loop distribution/fission for loop L1

Does this have to recompute the Loop Cost for L1, because now the reference group only includes accesses from L1?

To put it differently, my question is what the tag/key is when caching the analysis result. Is it the loop only, or also the reference group?

Adam

Hi Adam,

First, I sincerely apologize for stalling this thread.

I understand your point about the loop nest and reference groups. Both the loop nest and the reference groups form the key to cache the analysis result. In this case, isn't it better not to cache anything and to just compute the cost on demand, based on the loop nest and references provided as input by the calling transformation (fission/fusion/etc.)?

Thank you


I think you need finer granularity here. At least you need to distinguish
between stride-c (for some reasonable constant, say c = 2) access and
non-strided access as in indirect indexing (a[b[x]]) or otherwise
unpredictable index.

I think, 'c' should be <= CacheLineSize. I will have this as a to-do as
only canonical loops are supported for now.

It seems that you are working on this again, so I thought I'd follow up. I agree that the change I asked for could be done in a follow-up iteration. I don't understand the relation to canonical loops, though. IIUC, the patterns I mentioned here can appear in a canonical loop too.