RFC: Extending LV to vectorize outerloops

Proposal for extending the Loop Vectorizer to handle Outer Loops

Hi Ayal-

Can you elaborate a bit more on the modelling of the control flow for innermost loop vectorization ? Especially I am interested in knowing whether you are tending to a full mask-based approach as suggested by Karrenberg et al or you will evaluate ( as a cost/benefit ) models like BOSCC ( Shin et al ) or kind-of-BOSCC models by El Hajj (DYNAMIC LOOP VECTORIZATION FOR EXECUTING OPENCL KERNELS ON CPUS ). Because different vectorization factors, ISA overheads and control flow probabilities may dictate which one may turn out to be the best for a particular situation.

-Thx

Dibyendu

Sure Dibyendu. The modelling of the control-flow inside the vectorized loop is designed to support both cases where this control-flow is mandatory and cases where it’s the result of optimization. The former include predicated scalarized instructions, and inner loops when vectorizing an outer loop. The latter include various static and dynamic techniques you referred to, and possibly others such as our “Predicate Vectors If You Must”. Indeed as you point out these are subject to cost/benefit considerations, similar to many other optimizations; these should fit within the proposed “Plan Optimize Step 2.b” and “Cost Step 3”.

Thx Ayal.

Proposal for extending the Loop Vectorizer to handle Outer Loops

1. The resulting vectorized loop is inherently designed to contain a
*single* basic block. This poses an issue today, as innermost loops may
benefit from retaining some internal branches when vectorized. For
outerloops this clearly cannot hold – the resulting vectorized loop will
contain more than a single basic block as it will contain innerloops.

There's some space for if-conversion, but it is true that the loop
vectoriser gives up too easily on anything but canonical loops. That's
why we even have loop canonicalisation steps.

Though, the alternative would be to expand the solution space beyond
anything we can do in low-polinomial time and space. I also don't want
us having completely different strategies for simple inner loops,
conditionalised inner loops, outerloops with one inner loop, more than
one inner loop, multiply-nested loops, etc. That' would be a
maintenance nightmare.

We haven't focused on outer loops for a reason. Doing one of those
strategies is not that complex, but doing one without penalising the
other (or all others), is. We need to at least have a plan for a few,
even if we implement only at a time.

1. Cost Step tries to predict what the vectorized loop will look like and
how much it will cost, independently of what the Transform Step will
eventually do. It’s hard to keep the two in sync, enforcing the comment
placed at the beginning of Transform Step:
// Notice: any optimization or new instruction that go
// into the code below should be also be implemented in
// the cost-model.

This is *particularly* bad. Your proposal was put forward during the
first year of the vectorisation and was deemed too complex for our
initial goals. I think now is the time to start thinking again about
this.

1. Legal Step: check if loop can be legally vectorized, encode constraints
by initiating Vectorization Plan(s) if so.

Not much different than today. We'd need to inspect the evolution of
all inner-loops to be able to infer anything about the outer-loop
access, but that shouldn't be too complex.

2. Plan Step:
a. Build initial vectorization plans complying with all legal constraints.
b. Optimize vectorization plans.

Are you proposing we at least generate some straw-man vectorised loop
in IR form? Or is this still just an improved cost-model?

3. Cost Step: compute the relative cost of each plan. This step can be
applied repeatedly by Plan Optimize Step 2.b.

It's cheap to calculate the costs (mostly linear), but it's not cheap
to come up with multiple plans on step 2 to feed a voting system.

4. Transform Step: materialize the best plan. Note that only this step
modifies the IR, as in the current Loop Vectorizer.

Modifies is a dubious term, I just want to make sure we're on the same page.

You can create as many basic blocks as you want, but unless you
replace the old ones with the new ones, the "IR" isn't changed.

So, if I got ir right, you want to:

1. check legal, pass.
2. create some structure with an actual implementation of the
vectorisation (IR?)
3. min(cost[i])
4. Implement (or IR replace) implementation "i".

If 2 is similar to 4, than this will be very costly. If not, then we
have two *different* representations, and we go back to having a bad
cost-model.

The former can be justified if you pass an option like
"-faggressive-vectorizer". The latter can make our "diverging cost
model" problem much worse.

The Vectorization Plan is a recipe describing the vectorization candidate,
including for instance its internal control-flow. It serves for both
estimating the cost and for performing the translation, and facilitates
dealing with multiple vectorization candidates. One can compare with LLVM’s
existing SLP vectorizer, where TSLP [3] adds step 2.b.

We sort of do that already, since we pick the costs of all VFs and
choose the "best". By far not perfect, but I envisioned an improved
cost-model to having more than just VF/UF and trying skew factors
(with information provided by Polly) or fiddling with the induction
sequence, before/after inner-loop hoisting, etc.

Since the cost model is really cheap, we can do many of them each time
and even be very creative around them. But the more creative we are,
the harder it'll be to get them close to the implementation phase.

This proposal to extend the Loop Vectorizer to
outer loops supports and raises the importance of explicit vectorization
beyond the current capabilities of Clang and LLVM. Namely, from currently
forcing the vectorization of innermost loops according to prescribed width
and/or interleaving count, to supporting OpenMP’s “#pragma omp simd”
construct and associated clauses, including our earlier proposal for
vectorizing across function boundaries [2].

We already have all those pragmas, some Clang-specific, some borrowed
from OpenMP SIMD.

As Hal said, extending the pragmas to indicate recommended trip
counts, guaranteeing no-wrapping, etc. shouldn't be too hard.

cheers,
-renato

From: Renato Golin [mailto:renato.golin@linaro.org]
Sent: Friday, September 23, 2016 02:04
To: Zaks, Ayal <ayal.zaks@intel.com>
Cc: LLVM Dev <llvm-dev@lists.llvm.org>
Subject: Re: [llvm-dev] RFC: Extending LV to vectorize outerloops

> 1. The resulting vectorized loop is inherently designed to contain a
> *single* basic block. This poses an issue today, as innermost loops
> may benefit from retaining some internal branches when vectorized. For
> outerloops this clearly cannot hold – the resulting vectorized loop
> will contain more than a single basic block as it will contain innerloops.

There's some space for if-conversion, but it is true that the loop vectoriser
gives up too easily on anything but canonical loops. That's why we even
have loop canonicalisation steps.

Though, the alternative would be to expand the solution space beyond
anything we can do in low-polinomial time and space. I also don't want us
having completely different strategies for simple inner loops,
conditionalised inner loops, outerloops with one inner loop, more than one
inner loop, multiply-nested loops, etc. That' would be a maintenance
nightmare.

We haven't focused on outer loops for a reason. Doing one of those
strategies is not that complex, but doing one without penalising the other
(or all others), is. We need to at least have a plan for a few, even if we
implement only at a time.

Our first step is to make LV fully conscious of the control-flow that will result from vectorizing an (innermost) loop, allowing this control-flow to comprise of more than a single basic-block. A next step is to consider retaining branches rather than fully if-converting them. This could indeed entail a large solution space, but simple heuristics are usually employed to prune it effectively. When we reach the ability to handle outerloops the solution space grows further, and we fully recognize that early pruning of candidates will be important. As Hal noted, some heuristics can indeed be employed here too.

> 1. Cost Step tries to predict what the vectorized loop will look like
> and how much it will cost, independently of what the Transform Step
> will eventually do. It’s hard to keep the two in sync, enforcing the
> comment placed at the beginning of Transform Step:
> // Notice: any optimization or new instruction that go // into the
> code below should be also be implemented in // the cost-model.

This is *particularly* bad. Your proposal was put forward during the first
year of the vectorisation and was deemed too complex for our initial goals. I
think now is the time to start thinking again about this.

> 1. Legal Step: check if loop can be legally vectorized, encode
> constraints by initiating Vectorization Plan(s) if so.

Not much different than today. We'd need to inspect the evolution of all
inner-loops to be able to infer anything about the outer-loop access, but
that shouldn't be too complex.

Hopefully, based on SCEV's analysis that considers nested loops.

> 2. Plan Step:
> a. Build initial vectorization plans complying with all legal constraints.
> b. Optimize vectorization plans.

Are you proposing we at least generate some straw-man vectorised loop in
IR form? Or is this still just an improved cost-model?

This step is designed to generate a straw-man vectorized loop in "shadow" form, whose cost can be directly computed (3), and which facilitates generating the actual vector IR instructions (4).

> 3. Cost Step: compute the relative cost of each plan. This step can
> be applied repeatedly by Plan Optimize Step 2.b.

It's cheap to calculate the costs (mostly linear), but it's not cheap to come up
with multiple plans on step 2 to feed a voting system.

("cheap" - in terms of compile-time I take it)
Care must be taken therefore when deciding how many plans to draft. Step 2.b can consult the cost of partial plans being considered, for early plan pruning.

> 4. Transform Step: materialize the best plan. Note that only this
> step modifies the IR, as in the current Loop Vectorizer.

Modifies is a dubious term, I just want to make sure we're on the same page.

You can create as many basic blocks as you want, but unless you replace the
old ones with the new ones, the "IR" isn't changed.

This relates to the "shadow" instructions and basic-blocks discussed recently in http://lists.llvm.org/pipermail/llvm-dev/2016-August/104317.html.

So, if I got ir right, you want to:

1. check legal, pass.
2. create some structure with an actual implementation of the vectorisation
(IR?) 3. min(cost[i]) 4. Implement (or IR replace) implementation "i".

If 2 is similar to 4, than this will be very costly. If not, then we have two

("very costly" - in terms of compile-time I take it; "cost" is overloaded here)

*different* representations, and we go back to having a bad cost-model.

The former can be justified if you pass an option like "-faggressive-
vectorizer". The latter can make our "diverging cost model" problem much
worse.

Computing cost before having an explicit vectorization plan is indeed more compile-time efficient than first constructing the plan, but inherently less accurate and harder to keep consistent. Such a "divergent" predictive cost could be used to prune candidates, when considered decisive, to save compile-time. Having both a "-faggressive-vectorizer" path that builds a plan and weighs it before implementing it, while keeping the current path that vectorizes directly w/o a plan, may be quite a maintenance burden. A "-faggressive-vectorizer" option could be used to control early pruning of candidates. Let's first see how much compile-time, and improved accuracy, are actually involved.

> The Vectorization Plan is a recipe describing the vectorization
> candidate, including for instance its internal control-flow. It serves
> for both estimating the cost and for performing the translation, and
> facilitates dealing with multiple vectorization candidates. One can
> compare with LLVM’s existing SLP vectorizer, where TSLP [3] adds step 2.b.

We sort of do that already, since we pick the costs of all VFs and choose the
"best". By far not perfect, but I envisioned an improved cost-model to having
more than just VF/UF and trying skew factors (with information provided by
Polly) or fiddling with the induction sequence, before/after inner-loop
hoisting, etc.

Since the cost model is really cheap, we can do many of them each time and
even be very creative around them. But the more creative we are, the harder
it'll be to get them close to the implementation phase.

OTOH, computing the "non-divergent" costs of explicit plans could lead to the development of pruning techniques, or creative improvements to the cost-model. That is, when these costs turn out to be very high or very low.

Note that some decisions are currently taken w/o fully consulting cost, e.g.:
// FIXME: The interleaved load group with a huge gap could be even more
// expensive than scalar operations. Then we could ignore such group and
// use scalar operations instead.
and the hardwired NumberOfStoresToPredicate threshold.

> This proposal to extend the Loop Vectorizer to outer loops supports
> and raises the importance of explicit vectorization beyond the current
> capabilities of Clang and LLVM. Namely, from currently forcing the
> vectorization of innermost loops according to prescribed width and/or
> interleaving count, to supporting OpenMP’s “#pragma omp simd”
> construct and associated clauses, including our earlier proposal for
> vectorizing across function boundaries [2].

We already have all those pragmas, some Clang-specific, some borrowed
from OpenMP SIMD.

As Hal said, extending the pragmas to indicate recommended trip counts,
guaranteeing no-wrapping, etc. shouldn't be too hard.

cheers,
-renato

Thanks!
Ayal.

Our first step is to make LV fully conscious of the control-flow that will result from vectorizing an (innermost) loop, allowing this control-flow to comprise of more than a single basic-block. A next step is to consider retaining branches rather than fully if-converting them. This could indeed entail a large solution space, but simple heuristics are usually employed to prune it effectively. When we reach the ability to handle outerloops the solution space grows further, and we fully recognize that early pruning of candidates will be important. As Hal noted, some heuristics can indeed be employed here too.

Sounds like a plan.

This step is designed to generate a straw-man vectorized loop in "shadow" form, whose cost can be directly computed (3), and which facilitates generating the actual vector IR instructions (4).

I'm worried about how different will this be to actual IR. Semantics
is a hard thing to keep when representing data in different formats,
and an important task will be to make sure this internal format isn't
breaking any assumptions, or it'll be hard to fix bugs int the
vectorised code that don't exist in the scalar IR, if you go (scalar
IR -> internal rep -> vector IR).

Early pruning will indeed help the cases where we're not sure of the
semantics, but there will be a period that this internal format won't
be able to represent all the cases we have today, meaning there could
be some regressions on the kind of loops we can vectorise just because
the internal representation is not rich enough yet.

Unless, of course, we use this internal representation *in addition*
to the current vectoriser if all else fails, which means we'll run the
vectoriser twice in those cases.

("cheap" - in terms of compile-time I take it)
Care must be taken therefore when deciding how many plans to draft. Step 2.b can consult the cost of partial plans being considered, for early plan pruning.

Yes, compile time. Indeed, pruning will need to be harsh initially,
then relaxed as we improve the code / representation. That's why I
fear regressions.

("very costly" - in terms of compile-time I take it; "cost" is overloaded here)

Sorry, yes, compile time.

Computing cost before having an explicit vectorization plan is indeed more compile-time efficient than first constructing the plan, but inherently less accurate and harder to keep consistent.

Yup. :slight_smile:

Such a "divergent" predictive cost could be used to prune candidates, when considered decisive, to save compile-time. Having both a "-faggressive-vectorizer" path that builds a plan and weighs it before implementing it, while keeping the current path that vectorizes directly w/o a plan, may be quite a maintenance burden. A "-faggressive-vectorizer" option could be used to control early pruning of candidates. Let's first see how much compile-time, and improved accuracy, are actually involved.

Agree, we're getting a bit ahead of ourselves... :slight_smile:

OTOH, computing the "non-divergent" costs of explicit plans could lead to the development of pruning techniques, or creative improvements to the cost-model. That is, when these costs turn out to be very high or very low.

I see what you mean. This could indeed work faster than I'm expecting,
if the pruning happens at the same time as building the internal
representation, we can have many early exits and fall backs.

cheers,
--renato