LoopVectorizer: shufflevectors

To me, this looks like something the LoopVectorizer is neglecting and
should be combining.

It's not up to the vectoriser to combine code.

But it could be up to the vectoriser to generate less bloated code,
given it's a small change.

That's my point above.
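
For concreteness, here is a hypothetical sketch in LLVM IR (not the IR
from the original report; names are made up) of the kind of shuffle-heavy
code an interleaved access group can produce after vectorization: when
the group members are stored back unchanged, the de-interleaving and
re-interleaving shuffles cancel out, yet LV itself does not try to fold
them.

  ; hypothetical VF=4 vectorization of a loop copying interleaved
  ; {even, odd} pairs of i32
  define void @copy_pairs(ptr %src, ptr %dst) {
    %wide = load <8 x i32>, ptr %src, align 4
    ; de-interleave the wide load into the two group members
    %even = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
    %odd  = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
    ; re-interleave the (unchanged) members for the wide store
    %cat  = shufflevector <4 x i32> %even, <4 x i32> %odd, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
    store <8 x i32> %cat, ptr %dst, align 4
    ret void
  }

The three shuffles compose to the identity, so a plain wide load plus
wide store would do; something later in the pipeline has to notice that.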

We should note that
1) The Loop Vectorizer is not the only place that generates vectorized IR. For example, a programmer's intrinsic vector code, after inlining etc., might show the same problem. Any optimizations added within LV won't be applied when other parts of the compiler are generating vectorized IR.
2) The vectorizer's main job is generating widened vector code that is easier to optimize later on, not necessarily generating highly optimized vector code on its own (see the before/after sketch following this list).
3) Correct cost modeling (and, as a result, choosing a good VF) is a more important problem than performing the optimization within the vectorizer itself.
4) If the cost modeling takes the optimization into account, LV has a chance of generating optimized code. That doesn't necessarily mean LV should be the one doing it, which takes us back to 1).
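
To illustrate 2), a hypothetical before/after (again LLVM IR, names made
up): a generic IR-level cleanup, such as an instcombine-style
shuffle-of-shuffle fold, can tighten this kind of code no matter whether
LV, the SLP vectorizer, or hand-written vector intrinsics produced it.

  ; before: two shuffles in a row, as a vectorizer might emit them
  define <4 x i32> @before(<8 x i32> %v) {
    %t0 = shufflevector <8 x i32> %v, <8 x i32> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
    %t1 = shufflevector <8 x i32> %t0, <8 x i32> poison, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
    ret <4 x i32> %t1
  }

  ; after: the composed mask reduces to a single shuffle
  define <4 x i32> @after(<8 x i32> %v) {
    %t = shufflevector <8 x i32> %v, <8 x i32> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
    ret <4 x i32> %t
  }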

The last thing we want is to turn LV into a gigantic, monolithic optimizer that is hard to maintain.

I think we should talk about how much complexity we would be adding for a general "vectorized load/store optimization", and whether we should have a separate post-vectorizer optimizer do it (while LV would still need to understand the cost-modeling aspect of that optimization in order to choose the right VF). This should include a discussion about moving the interleaved memory access optimization from LV to such a pass. Adding a small new optimization here and there to LV can have a snowball effect.
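
To make the cost-modeling point concrete: for the store side of a
stride-2 interleaved group, LV emits one interleaving shuffle plus a
wide store (hypothetical sketch below, names made up). A later
target-level pass, e.g. the codegen InterleavedAccess pass used on
AArch64/ARM, may turn that pattern into a structured store such as st2,
so the real cost of the group depends on the target. That per-target
cost is what LV still has to model when choosing the VF, even if the
rewrite itself lives outside LV.

  ; store side of a stride-2 interleaved group at VF=4
  define void @store_pair(<4 x i32> %a, <4 x i32> %b, ptr %dst) {
    ; interleave the two members: a0, b0, a1, b1, ...
    %cat = shufflevector <4 x i32> %a, <4 x i32> %b, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
    store <8 x i32> %cat, ptr %dst, align 4
    ret void
  }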

Thanks,
Hideki

I think we should talk about how much complexity we would be adding for a general "vectorized load/store optimization", and whether we should have a separate post-vectorizer optimizer do it (while LV would still need to understand the cost-modeling aspect of that optimization in order to choose the right VF).

I imagine it would be a lot easier to plug loop-vectorisation-specific
clean-up passes into a VPlan model than it is today. But, as you said,
LV's output is only part of the vectorised code the middle end
generates.

While LV could (potentially) generate less bloated code, which would
also help the clean-up passes do their jobs better, any such change
would have to be very conservative and extensively tested.

This should include a discussion about moving the interleaved memory access optimization from LV to such a pass. Adding a small new optimization here and there to LV can have a snowball effect.

I agree that interleaved access handling is not exclusive to loop
vectorisation and that it should be moved somewhere more general in the
pipeline (some of your patches earlier this year come to mind).

But, as I said back then, before we do so, we need to understand
exactly where to put it. That will depend on which other passes would
actually use it, and on whether we want it to be a utility class, an
analysis pass, or both.

Have you compiled a list of passes that could benefit from such a move?

cheers,
--renato