[RFC][LV][VPlan] Proposal for Outer Loop Vectorization Implementation Plan

One area that needs a bit of attention before work on this proceeds much further is testing. The introduction of VPlan appears to have caused a couple of bugs and exposed a couple of others. Most of those are now fixed, but the process did point out a lack of test coverage around the changes, which is concerning.

I'd like to hear what plan is in place to ensure we don't destabilize the vectorizer while working on this.

One thing we could consider is leveraging the new IR fuzzer to help find assertion failures either before submission or shortly thereafter. Another might be to introduce changes under feature flags to ease the revert/reintroduce/revert cycle.

Philip

Another might be to introduce changes under feature flags to ease the revert/reintroduce/revert cycle.

This is essentially the first guard. We plan to have flags/settings to control which types of outer loops are handled.
The new code path is initially exclusive to outer loop vectorization. If we disable all types of outer loops
(and that's the initial default), LV continues to be the good old innermost loop vectorizer.
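
Roughly, such a guard could look like the usual cl::opt pattern; the flag name and the helper below are hypothetical and only illustrate the idea, they are not an actual patch.

```cpp
// Sketch only: the flag name and shouldUseOuterLoopPath() are hypothetical,
// illustrating the kind of guard described above, not actual LLVM code.
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Support/CommandLine.h"

using namespace llvm;

static cl::opt<bool> EnableOuterLoopVectorization(
    "enable-outer-loop-vectorize", cl::init(false), cl::Hidden,
    cl::desc("Enable the experimental VPlan-based outer loop code path"));

// With the flag off (the initial default), every loop falls through to the
// existing innermost-loop vectorizer.
static bool shouldUseOuterLoopPath(const Loop &L) {
  if (!EnableOuterLoopVectorization)
    return false;
  // Additional per-kind flags would refine which outer loops are handled.
  return !L.getSubLoops().empty();
}
```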

I'd like to hear what plan is in place to ensure we don't destabilize the vectorizer while working on this.

W.r.t. this project, this matters when we touch code that is shared between innermost loop vectorization
and outer loop vectorization. It's certainly good to ensure that such places have good test coverage
at the time of commit.

This also matters (and matters greatly) when we start guiding innermost loops towards the new code path (when
we are ready). For this, another flag would be available for everyone to try before flipping the default could
land in trunk (somewhat analogous to Chandler asking people to test the new Pass Manager prior to the switch).
We might be able to control which kinds of innermost loops take the new code path, but that's TBD.

Fuzz testing or not, I fully agree that good test coverage of the vectorizer is desired.

Thanks,
Hideki

From http://lists.llvm.org/pipermail/llvm-dev/2017-December/119567.html

That sounds like an excellent idea! Any concrete ideas/plans for how people could get involved, besides doing reviews?

Let's talk about this in the RFC context. http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html.
The Divergence Analysis work mentioned there is a good example.

One of the big things we are hoping to see people outside of our Intel team contribute to is dependence analysis --- for outer
loop auto-vectorization (and most of that can also be used for auto-parallelization of outer-level loops). The plan
outlined below aims for explicit vectorization, but we'd certainly want to hook this up for outer loop auto-vectorization.
Having sent out the second cleanup patch (https://reviews.llvm.org/D41045), I'm currently looking at the LoopVectorizationLegality class
and thinking about how to 1) make it more modular, 2) take it out of LoopVectorize.cpp, and 3) move it to the Analysis directory (though not
necessarily as a separate Analysis pass yet). Part of the Legality is the dependence checking, which needs to be upgraded to deal
with outer loops. Any interest in working in this area? It isn't really dependent on the VPlan progress and can be started right away.
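
To illustrate the kind of separation being described (all names below are hypothetical, not the actual interface), the Legality split might end up looking roughly like this:

```cpp
// Hypothetical sketch of the modularization described above; none of these
// class or method names are actual LLVM interfaces.
#include "llvm/Analysis/LoopInfo.h"

namespace llvm {

// Would live under lib/Analysis rather than inside LoopVectorize.cpp.
class LoopVectorizationLegalityBase {
public:
  virtual ~LoopVectorizationLegalityBase() = default;

  // True if all memory dependences in the loop (nest) permit vectorization.
  virtual bool canVectorizeMemory(const Loop &L) = 0;
};

// Today's behaviour: dependence checking for innermost loops only.
class InnerLoopLegality : public LoopVectorizationLegalityBase {
public:
  bool canVectorizeMemory(const Loop &L) override;
};

// The piece contributors could pick up: dependence checking that also
// understands loop nests, usable for outer-loop vectorization (and, largely,
// for outer-loop auto-parallelization as well).
class OuterLoopLegality : public LoopVectorizationLegalityBase {
public:
  bool canVectorizeMemory(const Loop &L) override;
};

} // namespace llvm
```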

Thanks,
Hideki

To revive the discussion around vectorizer testing, here's a quick sample of a few of the issues hit recently in the loop vectorizer. I want to be careful to say that I am not stating these are the result of any recent work, just that they're issues that have been triaged down to the loop vectorizer doing something incorrect or questionable from a performance perspective.

https://bugs.llvm.org/show_bug.cgi?id=35282
https://bugs.llvm.org/show_bug.cgi?id=35687
https://bugs.llvm.org/show_bug.cgi?id=35734
https://bugs.llvm.org/show_bug.cgi?id=35773
https://reviews.llvm.org/D41939

I also see another 10 or so correctness-related ones in a simple search: https://bugs.llvm.org/buglist.cgi?quicksearch=LoopVectorize&list_id=131629

The loop vectorizer is currently a serious pain point in our regression testing. As the rest of the optimizer changes, we are finding that existing bugs in the vectorizer are being exposed with disturbing frequency and that some new regressions are landing as well.

Philip

To revive the discussion around vectorizer testing, here's a quick
sample of a few of the issues hit recently in the loop vectorizer. I
want to be careful to say that I am not stating these are the result
of any recent work, just that they're issues that have been triaged
down to the loop vectorizer doing something incorrect or questionable
from a performance perspective.

https://bugs.llvm.org/show_bug.cgi?id=35282
https://bugs.llvm.org/show_bug.cgi?id=35687
https://bugs.llvm.org/show_bug.cgi?id=35734
https://bugs.llvm.org/show_bug.cgi?id=35773
https://reviews.llvm.org/D41939

I also see another 10 or so correctness-related ones in a simple
search:
https://bugs.llvm.org/buglist.cgi?quicksearch=LoopVectorize&list_id=131629

The loop vectorizer is currently a serious pain point in our
regression testing. As the rest of the optimizer changes, we are
finding that existing bugs in the vectorizer are being exposed with
disturbing frequency and that some new regressions are landing as well.

I certainly understand what you're saying, but, as you point out, many
of these are existing bugs that are being exposed by other changes (and
they're seemingly all over the map). My general feeling is that the more
limited the applicability of a particular transform, the buggier it will
tend to be. The work here to allow vectorization of an ever-wider set
of inputs will really help to expose, and thus help us eliminate, bugs.
As such, one of the largest benefits of adding the
function-vectorization work (https://reviews.llvm.org/D22792), and
outer-loop vectorization capabilities, will be making it easier to throw
essentially-arbitrary inputs at the vectorizer (and have it do
something), and thus, hit it more effectively with automated testing.

Maybe we can do a better job, even with the current capabilities, of
automatically generating data-parallel loops with reductions of various
kinds? I'm thinking about automated testing because, AFAIK, the
vectorizer is already run through almost all of the relevant benchmarks
and test suites, and even if we add a few more, we probably need to
increase the test coverage by a lot more than that.
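
As a made-up example of what such a generator could emit, varying the element type, the reduction operator, the stride and the trip count already yields a large family of inputs the vectorizer ought to handle:

```cpp
// Illustrative only: the kind of data-parallel reduction loop an automated
// generator could emit, permuting element type, reduction operator, stride
// and trip count to stress the vectorizer.
float strided_sum(const float *a, const float *b, int n, int stride) {
  float acc = 0.0f;
  for (int i = 0; i < n; i += stride)
    acc += a[i] * b[i]; // multiply-add reduction candidate
  return acc;
}
```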

Do you have ideas about how we can have better testing in this area
otherwise?

-Hal

If you’re finding bugs that have been extant for a while, is there anything special you’re doing that’s uncovering them? If so, is there a way we could replicate that kind of IR input to the vectorizer to better test the corner cases?

Amara

I certainly understand what you're saying, but, as you point out, many
of these are existing bugs that are being exposed by other changes (and
they're seemingly all over the map). My general feeling is that the more
limited the applicability of a particular transform, the buggier it will
tend to be. The work here to allow vectorization of an ever-wider set
of inputs will really help to expose, and thus help us eliminate, bugs.

Absolutely agreed.

We haven't stopped working on the vectoriser but for the past few
years it feels as if we're always trading performance numbers all over
the place and not progressing.

We need better, more consistent analysis. We need a generic approach.
We need more powerful approaches. We need a single pipeline that can
decide between alternatives based on clear costs, not luck.

The work Hideki/Ayal/Gil are doing covers most of those topics. It implements:

1. VPlan: which will help us model costs more accurately and pick
the best choice, not the first profitable one
2. Outer loop: which will allow us to look at the loop nest as a whole,
not as a bunch of inner loops

The work Tobi & the Polly guys are doing covers:

3. Fantastic analysis and powerful transformations
4. Exposing that code so other passes can profit from it

Linaro is looking at HPC workloads (mainly core loops [1]) and we
found that loop distribution would be very profitable to ease register
allocation in big loops, but that needs whole-loop analysis.
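
As a small made-up illustration of that point: a large loop computing several independent results can be split so that each resulting loop keeps fewer values live at once, but proving the split legal requires analysing the whole loop.

```cpp
// Made-up before/after sketch of loop distribution; real candidates are much
// larger, which is exactly when register pressure becomes the problem.
void fused(float *x, float *y, const float *a, const float *b, int n) {
  for (int i = 0; i < n; ++i) {
    x[i] = a[i] * 2.0f + b[i]; // first independent computation
    y[i] = a[i] - b[i] * 0.5f; // second independent computation
  }
}

void distributed(float *x, float *y, const float *a, const float *b, int n) {
  // Legal only if whole-loop analysis shows the two bodies don't interfere
  // (e.g. x and y don't alias a or b in a way that matters).
  for (int i = 0; i < n; ++i)
    x[i] = a[i] * 2.0f + b[i];
  for (int i = 0; i < n; ++i)
    y[i] = a[i] - b[i] * 0.5f;
}
```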

But, as was said before, we're lacking in understanding and
organisation. To be able to profit from all of those advances, we need
to understand the loop better, and for that, powerful analysis needs
to happen.

Polly has some powerful analyses, but we haven't plugged them in properly.
The work to plug Polly into LLVM and make it an integral part of the
pipeline is important so that we can use parts of its analysis tools
to benefit other passes.

But we also need more alias analysis, inter-procedural access pattern
analysis etc.

As such, one of the largest benefits of adding the
function-vectorization work (https://reviews.llvm.org/D22792), and
outer-loop vectorization capabilities, will be making it easier to throw
essentially-arbitrary inputs at the vectorizer (and have it do
something), and thus, hit it more effectively with automated testing.

Function vectorisation is important, but whole-loop analysis
(including outer-loop, distribution, fusion) can open more doors to
new patterns.

Now, the real problem here is below...

Maybe we can do a better job, even with the current capabilities, of
automatically generating data-parallel loops with reductions of various
kinds? I'm thinking about automated testing because, AFAIK, the
vectorizer is already run through almost all of the relevant benchmarks
and test suites, and even if we add a few more, we probably need to
increase the test coverage by a lot more than that.

Without a wider understanding of what's missing and how to improve,
it's hard to know what path to take.

For a while I was running simple things (like Livermore Loops) and
digging into specific details, and at every step I realised that I needed
better analysis, but ended up settling for a simplified version just
to get that one case vectorised.

After I stopped working on this, I continued reviewing performance
patches, and what ended up happening is that we're always accepting
changes that give a positive geomean but can also push some of
the past gains down considerably.

So much so that my current benchmarks show LLVM 5 performing worse in
almost all cases compared to LLVM 4. This is worrying.

None of those benchmarks were done in a standard way, so I'm not sure
how to replicate them, which means they're worthless; in summary, I have
wasted a lot of time.

That is why our current focus is to make sure we're all benchmarking
the things that make sense in a way that makes sense.

Our HCQC tool [1] is one way of doing that, but we need more and better
analysis of benchmark results, as well as agreement on which benchmarks
we care about and how we run them.

Every arch / sub-arch / vendor / board tuple has special rules, and we
need to compare the best on each, not the standard on all. But once we
have the numbers, we need to compare them apples to apples, and that's
sometimes not possible.

Do you have ideas about how we can have better testing in this area
otherwise?

I'd like to gather information about which benchmarks we really care
about, how we run them and what analysis we should all do on them. I can't
share my raw numbers with you, but if we agree on a method, we can
share gains and know that they're as close to meaningful as possible.

For this thread, we can focus on loop benchmarks. Our team is focusing
on HPC workloads, so we will worry more about heavy loops than
"Hacker's Delight" transformations, but we have to make sure we don't
break other people's stuff, so we need them *all* in one bundle.

I think the test-suite in benchmark mode has a lot of potential to
become the package that we run for validation (before commit), but it
needs a lot of love before we can trust its results. We need more
relevant benchmarks, and better-suited results and analysis, so that it
can work with the fantastic visualisation LNT already gives us.

This, IMO, together with whole-loop analysis, should be our priority
for LLVM 7 (because 6 is gone... :) )

cheers,
--renato

[1] https://github.com/Linaro/hcqc