[Proposal][RFC] Strided Memory Access

Ashutosh,

First, I'm all for enabling general stride load/store support for all targets; it should be just a matter of proper
cost modeling. For that matter, we should enable general gather/scatter support for all targets.

About the specific approach taken by this RFC:
1) It's best to first state that this is for small constant strides, where a single wide load/store can cover two or more
     valid stride load/store elements. If a wide load/store contains just one valid element, there isn't any need
     to do a wide load/store.
2) Optimal code generation for strided memory accesses is target dependent.
     Breaking one gather/scatter w/ constant index vector into several IR instructions like below might hurt such optimization,
     especially when the vectorization factor is greater than the target legal vector width. This can happen
     in mixed data type code (int mixed with double, etc.).
3) Make sure that the last element of an unmasked load is a valid stride load element; otherwise,
     an out-of-bounds SEGV can happen. Use of the masked load intrinsic most likely degrades any benefit over
     the gather intrinsic w/ constant index.
4) Unfortunately, 3) most likely implies that the wide load used here needs to be based on the target HW physical
     vector width. There are reasons why such type legalization in the vectorizer should be avoided.
5) In general, I think leaking this much complexity into IR hurts (rather than enables) optimization.
     It's best to keep the vectorizer's output as simple as vector code can be, so that other optimizers can do their best
     in optimizing vector code.
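
To make 1) and 3) concrete, here is a small Python model of the index arithmetic. It is only a sketch, not LLVM code: the function names and parameters are hypothetical, widths are counted in elements, and the wide loads are assumed to be issued back to back starting at the first valid element.

```python
def valid_per_wide_load(width, stride):
    """Count the valid stride elements (offsets 0, stride, 2*stride, ...)
    that fall inside one wide load of `width` elements."""
    return (width - 1) // stride + 1


def overrun_elements(vf, width, stride):
    """Count the elements read past the last valid stride element when
    `vf` stride elements are fetched with unmasked wide loads."""
    last_valid = (vf - 1) * stride               # offset of the last element needed
    num_loads = -(-(last_valid + 1) // width)    # ceiling division
    return num_loads * width - (last_valid + 1)


if __name__ == "__main__":
    for stride in (2, 3, 4):
        print(stride,
              valid_per_wide_load(4, stride),
              overrun_elements(4, 4, stride))
```

With a width of 4, stride 2 gives two valid elements per wide load, while stride 4 gives only one, which is the case in 1) where a wide load buys nothing. With vf = 4, width 4 and stride 3, the third load reads two elements beyond the last valid one; a nonzero overrun like this is where the fault risk in 3) comes from.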

If I remember correctly, Elena Demikhovsky (my colleague) attempted to introduce a stride load/store intrinsic
some time ago and got pushback. I think it's best to address the following questions.
a) What can this enable over gather/scatter intrinsic w/ constant index vector?
b) What can this enable over a hypothetical "stride load/store" intrinsic?
c) Any chance of introducing stride load/store as a form of wide load/store?
    I understand that there is a big hurdle here, but the reward should also be pretty good here.
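
For reference on a): a stride load is semantically a gather whose index vector is the compile-time constant <0, s, 2s, ...>. A toy Python model of that equivalence follows; `gather` and `strided_load` are hypothetical helper names and do not correspond to any LLVM API.

```python
def gather(memory, base, indices):
    """Read one element per lane from memory[base + index]."""
    return [memory[base + i] for i in indices]


def strided_load(memory, base, stride, vf):
    """A stride load expressed as a gather with a constant index vector."""
    constant_indices = [lane * stride for lane in range(vf)]
    return gather(memory, base, constant_indices)


if __name__ == "__main__":
    memory = list(range(100, 116))
    print(strided_load(memory, 1, 3, 4))
```

Since the index vector is fully known at compile time, a target-specific lowering is free to pick wide loads plus shuffles, a native gather, or scalar loads for it.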

One might claim that the cost modeling can be easier, but that is also the downside since
the generated cost may not match the cost of optimal code.

All these things considered, my suggestion is to start from enabling optimal code generation (or optimal lowering
in CG prepare) from the gather/scatter intrinsic and reflect that in the vectorizer's cost model.

As a vectorizer-centric person, I'd like to eventually see gather/scatter/stride-load/store supported
by IR instructions (i.e., not intrinsics), but I'll leave that discussion for another time so that I won't pollute
Ashutosh's RFC.

Thanks,
Hideki Saito
Technical Lead of Vectorizer Development
Intel Compiler and Languages