To me, this looks like something the LoopVectorizer is neglecting and
should be combining.

It's not up to the vectoriser to combine code.

But it could be up to the vectoriser to generate less bloated code,
given it's a small change.

That's my point above.

We should note that
1) The Loop Vectorizer is not the only place that generates vectorized IR. For example, a programmer's vector intrinsic code might show the same problem after inlining etc. Any optimization added within LV won't be applied when other parts of the compiler generate vectorized IR.
2) The vectorizer's main job is to generate widened vector code that is easier to optimize later on, not necessarily to generate highly optimized vector code on its own.
3) Modeling cost correctly (and, as a result, choosing a good VF) is a more important problem than performing the optimization within the vectorizer itself.
4) If cost modeling takes the optimization into account, LV has a chance of generating optimized code. That doesn't necessarily mean LV should do so itself; see 1).
The last thing we want is to make LV a gigantic monolithic optimizer that is hard to maintain.
I think we should talk about how much complexity we would be adding for general "vectorized load/store optimization", and whether we should have a separate post-vectorizer optimizer doing it (while LV would still need to understand the cost modeling aspect of that optimization in order to choose the right VF). This should include a discussion about moving the interleaved memory access optimization from LV to there. Adding a small new optimization here and there to LV can have a snowball effect.
Thanks,
Hideki