An update on my experiment with the first loop:
For the first loop, if I change the pragma to “#pragma clang loop vectorize_width(4) interleave_count(2)” and force the legality check in isStridedPtr() to pass, the loop gets vectorized and runs faster too.
So in summary, the issue with vectorizing the first loop seems to be (1) a legality check that is too strict and does not understand that the index cannot really overflow, and (2) a cost computation that says it's not profitable to vectorize the loop.
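
For anyone who wants to try the same pragma, here is a minimal sketch of how it attaches to a loop. The loop itself is hypothetical and is not the loop from this thread; the function name scale_even and the stride-2 access are assumptions, chosen only because a 32-bit unsigned index multiplied by a stride is the kind of expression whose possible wrap-around can make a strided-pointer legality check conservative.

```c
/* Hypothetical example, not the loop discussed above.
   src is read with stride 2 through a 32-bit unsigned index, so the
   expression 2 * i could wrap in principle for very large n. */
void scale_even(float *dst, const float *src, unsigned n) {
  /* Request a vector width of 4 and an interleave count of 2.
     This overrides the cost model's choice of width; it does not
     bypass the legality checks. */
#pragma clang loop vectorize_width(4) interleave_count(2)
  for (unsigned i = 0; i < n; ++i)
    dst[i] = src[2 * i] * 2.0f;
}
```

Building with clang and -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize should report whether a loop like this was vectorized and, if not, why. Whether this particular sketch trips the same legality check depends on the compiler version and target.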
