LoopVectorizer: shufflevectors

Hi,

I have been discussing with Sanjay how to handle the poor sequences of shufflevector instructions produced by the loop vectorizer, and he suggested we bring this up on llvm-dev.

I have run into this in the past as well, and it surprised me to see again (on SystemZ) that the vectorized loop did many seemingly unnecessary shuffles. In this case (see https://bugs.llvm.org/show_bug.cgi?id=38792), one group of interleaved loads got shuffled into vector operands, which were then used by an interleaved group of stores, which in turn did its own shuffling. The loop vectorizer did not attempt to combine these shuffles, and unfortunately no later pass did so either.
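To illustrate why these shuffles are redundant, here is a small Python model (rather than actual LLVM IR) of a stride-2 case where a load interleave group feeds a store interleave group directly:

```python
# Model of a stride-2 interleaved load group feeding an interleaved
# store group. The load group deinterleaves a wide vector into its
# even and odd lanes; the store group concatenates the two narrow
# vectors and re-interleaves them. Composing the two shuffle stages
# shows the result is just the original wide vector: the shuffles
# cancel and could be removed entirely.

def shuffle(vec, mask):
    """shufflevector with a single source: pick vec[i] for each mask index."""
    return [vec[i] for i in mask]

wide = [f"lane{i}" for i in range(8)]   # result of the wide interleaved load

# Load group: deinterleave into the two strided members.
even = shuffle(wide, [0, 2, 4, 6])      # member 0 (stride 2, offset 0)
odd  = shuffle(wide, [1, 3, 5, 7])      # member 1 (stride 2, offset 1)

# Store group: concatenate the members and interleave them back.
reinterleaved = shuffle(even + odd, [0, 4, 1, 5, 2, 6, 3, 7])

print(reinterleaved == wide)            # True: the two shuffle stages cancel
```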

This seems to be due to keeping instcombine simple and fast, combined with a conservative policy of not producing any new shuffles that were not already in the input program (see the comment in InstCombiner::visitShuffleVectorInst). For reasons that are somewhat unclear to me, the backend then gets into trouble.

Should improved optimization of shufflevector instructions handle all of them globally, or just the new ones produced by the vectorizers? At least in the code produced by the vectorizers, it seems fair to combine the shuffles to minimize them. If we want to limit this to auto-vectorized code, then maybe it could be done with a common utility called by a vectorizer on its result? If, on the other hand, we want to optimize everything, a new separate IR pass for vector operations could run once after the vectorizers. Such a pass could handle vector instructions in general more extensively than instcombine. Would that make it possible to avoid the current problems the backend is having?
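One building block such a utility or pass would need is the classic two-shuffle fold. The sketch below is an illustrative single-source Python version of that fold, not instcombine's actual implementation:

```python
# shuffle(shuffle(x, m1), m2) == shuffle(x, compose(m1, m2)):
# two shuffles in a row can always be rewritten as one by composing
# their masks. -1 models an undef mask element, as in LLVM IR.

UNDEF = -1

def compose(m1, m2):
    """Mask of the combined shuffle: apply m1 first, then m2."""
    return [UNDEF if i == UNDEF else m1[i] for i in m2]

def shuffle(vec, mask):
    return [None if i == UNDEF else vec[i] for i in mask]

x = list("abcdefgh")
m1 = [6, 4, 2, 0]          # reverse the even lanes
m2 = [3, 2, UNDEF, 0]      # permute again, one lane undef

two_steps = shuffle(shuffle(x, m1), m2)
one_step  = shuffle(x, compose(m1, m2))
print(two_steps == one_step)   # True: one shuffle replaces two
```

The open question in this thread is not whether this fold is valid, but whether applying it aggressively produces shuffles the backend can no longer select efficiently.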

Or does this really have to be done on the DAG by each backend? Or perhaps this is really just a local issue with the loop vectorizer?

Please weigh in on how best to proceed with improving the loop vectorizer's code.

Jonas

> I have run into this in the past as well, and it surprised me to see
> again (on SystemZ) that the vectorized loop did many seemingly
> unnecessary shuffles.

Hi Jonas,

This is a known side-effect of vectorisers (not just LLVM's) that have
to handle multiple hardware architectures. GCC has its own set of bad
patterns, too.

> This seems to be due to keeping instcombine simple and fast, combined
> with a conservative policy of not producing any new shuffles that were
> not already in the input program (see the comment in
> InstCombiner::visitShuffleVectorInst). For reasons that are somewhat
> unclear to me, the backend then gets into trouble.

Specifically, interleaved access generation will invariably lead to
*additional* shuffles, because it is trying to create the pattern that
will later be selected into one or a few instructions.

The middle-end relies on the back-end knowing how to select the large
patterns, as well as on other middle-end passes not destroying them.
Cleaning that up may very well lead to poorer code.

> Should improved optimization of shufflevector instructions handle all
> of them globally, or just the new ones produced by the vectorizers?

It's probably a lot simpler to improve the SystemZ model to have
the same arch flags / cost-model completeness as the other targets.

The transformations done by the vectoriser are target-agnostic, but
they still ask the targets if certain patterns are possible and
profitable before doing so.

> Or does this really have to be done on the DAG by each backend? Or
> perhaps this is really just a local issue with the loop vectorizer?

It's a dance between what the middle-end enquires of the target info
and what the back-end can actually generate.

You may have to expose some flags (to turn certain behaviour on), then
tune the cost model (to make it profitable in most cases), then
implement the pattern recognition in DAGISel (and GlobalISel), so
that the generated code can be optimally selected.

If LLVM were compiling for a single target, emitting IR that conforms
to one specific pattern would be the most logical choice (don't
pollute, simplify further passes, reduce pattern complexity), which is
why this may look a lot simpler in arch-specific compilers.

But in a target-agnostic compiler you need to "emulate" that using the
three steps above: target info, cost model, ISel patterns.
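As a toy model of that three-step handshake (all names here are invented for illustration, not LLVM APIs): the vectoriser asks the target whether a pattern is legal, then whether it is profitable, and only then emits IR that the backend's ISel patterns must know how to select.

```python
# Toy model of the legality -> profitability -> selection handshake.
# Names and numbers are invented; real queries go through
# TargetTransformInfo and real selection through DAGISel/GlobalISel.

class TargetInfo:
    def __init__(self, supports_interleaved, interleave_cost, scalar_cost):
        self.supports_interleaved = supports_interleaved  # step 1: arch flag
        self.interleave_cost = interleave_cost            # step 2: cost model
        self.scalar_cost = scalar_cost

def pick_strategy(target):
    # Step 1: legality query.
    if not target.supports_interleaved:
        return "scalar"
    # Step 2: profitability query.
    if target.interleave_cost < target.scalar_cost:
        # Step 3: the backend must have ISel patterns for this form.
        return "interleaved"
    return "scalar"

print(pick_strategy(TargetInfo(True, 4, 10)))   # prints interleaved
print(pick_strategy(TargetInfo(False, 4, 10)))  # prints scalar
```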

Hope that helps.

Hi Renato,

> It's probably a lot simpler to improve the SystemZ model to have
> the same arch flags / cost-model completeness as the other targets.

I thought they were - anything particular in mind?

> The transformations done by the vectoriser are target-agnostic, but
> they still ask the targets if certain patterns are possible and
> profitable before doing so.

I have a patch for the LoopVectorizer that makes it recognize this particular case of a load interleave group being used only by a store interleave group. I then pass this flag to TTI.getInterleavedMemoryOpCost(), so that the target can return an accurate cost. During my experiments on SystemZ, I added the cost of shuffling the vector(s) only on the load side, so that the store group did not get that cost at all.
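In other words (a hypothetical sketch of the accounting, with invented names and numbers, not the actual TTI interface): when the load group's only user is a store group, the member shuffles are assumed to fold away, so their cost is charged once instead of twice.

```python
# Hypothetical model of the cost tweak described above: when a load
# interleave group is only used by a store interleave group, charge
# the member-shuffle cost once (on the load side) instead of twice.

def interleaved_group_cost(mem_op_cost, shuffle_cost, feeds_store_group):
    load_cost = mem_op_cost + shuffle_cost
    # With the flag set, the store group's shuffles are assumed to be
    # folded away, so it only pays for the memory operation itself.
    store_cost = mem_op_cost if feeds_store_group else mem_op_cost + shuffle_cost
    return load_cost + store_cost

default = interleaved_group_cost(2, 3, feeds_store_group=False)  # 10
tweaked = interleaved_group_cost(2, 3, feeds_store_group=True)   # 7
print(default, tweaked)
```

A lower cost like this makes the vectoriser pick interleaving in more cases, which is exactly what triggered the backend problems described below.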

This then made many more cases of interleaving happen (~450 cases on SPEC, IIRC). The only problem was that the SystemZ backend could not handle all of those shuffles equally well. To me that looked like something to be fixed at the IR level, and after discussions with Sanjay I got the impression that this was the case...

To me, this looks like something the LoopVectorizer is neglecting and should be combining. I suppose that with my patch for the load -> store groups, I could also add handling of recomputed indices so that the load group produces a vector that fits the store group directly. But if I understand you correctly, even this is not so wise? And if so, then improving the SystemZ DAGCombiner is indeed the only alternative left, I guess...

> But in a target-agnostic compiler you need to "emulate" that using the
> three steps above: target info, cost model, ISel patterns.

But isn't having the cost functions available enough to drive a later IR pass to optimize the generated vector code? I mean, if the target indicated which shuffles were expensive, those could then easily be avoided.

Thanks,

Jonas

>> It's probably a lot simpler to improve the SystemZ model to have
>> the same arch flags / cost-model completeness as the other targets.

> I thought they were - anything particular in mind?

I have no idea about SystemZ, sorry. :-)

From your post and response, it seems that improving both the target
info and the cost model would open up new ways to vectorise on SystemZ.

That's what I was referring to.

> This then made many more cases of interleaving happen (~450 cases on
> SPEC, IIRC). The only problem was that the SystemZ backend could not
> handle all of those shuffles equally well. To me that looked like
> something to be fixed at the IR level, and after discussions with
> Sanjay I got the impression that this was the case...

Right. But being fixed at the IR level and that being done in the
vectoriser are two different things.

Our current implementation is too monolithic to experiment with
branching off the beaten path, and we're in the process of moving away
from it (which can still take years), so I don't recommend big
refactorings of the code.

You could probably find a number of simplifications, taking target
info into consideration, that can later be ported to VPlan, but that
will require testing the vectorisation on the supported targets.

We don't need to re-benchmark everything, just make sure the
code doesn't change for them, or if it does, to know why.

> To me, this looks like something the LoopVectorizer is neglecting and
> should be combining.

It's not up to the vectoriser to combine code.

But it could be up to the vectoriser to generate less bloated code,
given it's a small change.

That's my point above.

> I suppose that with my patch for the load -> store groups, I could
> also add handling of recomputed indices so that the load group
> produces a vector that fits the store group directly. But if I
> understand you correctly, even this is not so wise?

It will depend on how much that changes other targets, because what
looks less bloated can also mean patterns that are no longer
recognised by other back-ends.

> And if so, then improving the SystemZ DAGCombiner is indeed the only alternative left, I guess...

You'll probably have to do that anyway, but I wouldn't try it unless I
had no other choice. :-)

> But isn't having the cost functions available enough to drive a later
> IR pass to optimize the generated vector code? I mean, if the target
> indicated which shuffles were expensive, those could then easily be
> avoided.

Sure, but "expensive" is a relative term and it's intimately linked to
what the back-end can combine.

If you're lucky enough that a mid-end change just happens to unbloat
shuffles and be correctly lowered, without breaking other targets,
then that's a big win.