I have been discussing with Sanjay how to handle the poor sequences of shufflevector instructions produced by the loop vectorizer, and he suggested we bring this up on llvm-dev.
I have run into this before, and I was surprised to see again (on SystemZ) that the vectorized loop contained many seemingly unnecessary shuffles. In this case (see https://bugs.llvm.org/show_bug.cgi?id=38792), one group of interleaved loads was shuffled into vector operands, which were then used by an interleaved group of stores that did its own shuffling. The loop vectorizer did not attempt to combine these shuffles, and unfortunately no later pass did so either.
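As a rough illustration of the redundancy, here is a small Python model of the shufflevector semantics (not LLVM code). The interleave factor of 2, the vector width of 4 and the element names are assumptions chosen for brevity; the actual IR in the bug report is larger and has other instructions around the shuffles.

```python
# Toy model of LLVM's shufflevector: each mask index selects a lane from
# the concatenation of the two operands; None stands for an undef lane.
def shufflevector(a, b, mask):
    src = list(a) + list(b)
    return [None if i is None else src[i] for i in mask]

UNDEF = [None] * 8

# One wide load covering an interleaved group of factor 2 (assumed sizes).
wide = [f"m{i}" for i in range(8)]

# Load side: de-interleave into the even and odd members of the group.
even = shufflevector(wide, UNDEF, [0, 2, 4, 6])
odd = shufflevector(wide, UNDEF, [1, 3, 5, 7])

# Store side: re-interleave the two members before the wide store.
stored = shufflevector(even, odd, [0, 4, 1, 5, 2, 6, 3, 7])

# The three shuffles cancel out: the stored vector is the loaded one.
print(stored == wide)
```

In this simplified case all three shuffles could be dropped; with real operations between the load and store groups only part of the chain would be expected to fold.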
This seems to be a consequence of keeping instcombine simple and fast, together with a conservative policy of not creating any shuffles that are not already present in the input program (see the comment in InstCombiner::visitShuffleVectorInst). For reasons that are not entirely clear to me, the backend would otherwise get into trouble.
Should improved optimization of shufflevector instructions handle all of them globally, or just the new ones produced by the vectorizers? At least for the code produced by the vectorizers, it seems fair to combine the shuffles in order to minimize their number. If we want to limit this to auto-vectorized code, then maybe it could be done by a common utility that a vectorizer calls on its result? If, on the other hand, we want to optimize everything, a new, separate IR pass for vector ops could be run just once after the vectorizers. Such a pass could handle vector instructions in general more extensively than instcombine does. Would it then be possible to avoid the current problems the backend is having?
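To sketch what such a shuffle-combining utility might do at the mask level, the following Python fragment folds a shuffle whose two operands are themselves shuffles of the same source pair into a single mask. The function names are hypothetical and this is not an existing LLVM API; it only assumes the usual shufflevector indexing, with None standing for an undef lane.

```python
# Hypothetical helper: given
#   x = shufflevector(a, b, mask_x)
#   y = shufflevector(a, b, mask_y)
#   r = shufflevector(x, y, outer_mask)
# return a mask m such that r == shufflevector(a, b, m).
def fold_shuffle_of_shuffles(mask_x, mask_y, outer_mask):
    inner = list(mask_x) + list(mask_y)
    return [None if i is None else inner[i] for i in outer_mask]

# Hypothetical helper: True if the mask selects the first n lanes of the
# concatenated operands in order, so the shuffle can be dropped when the
# result width matches the source.
def is_identity(mask, n):
    return len(mask) == n and all(i == k for k, i in enumerate(mask))

# De-interleave followed by re-interleave (factor 2, assumed sizes).
folded = fold_shuffle_of_shuffles([0, 2, 4, 6], [1, 3, 5, 7],
                                  [0, 4, 1, 5, 2, 6, 3, 7])
print(folded, is_identity(folded, 8))
```

Whether a fold like this should live in a vectorizer utility or in a separate pass is exactly the open question; the sketch only shows that the mask arithmetic itself is cheap.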
Or does this really have to be done on the DAG by each backend? Or perhaps this is really just a local issue with the loop vectorizer?
Please weigh in on how best to proceed with improving the code produced by the loop vectorizer.