enabling interleaved access loop vectorization

Thanks Ayal!

So, at least for this example, it looks like we actually want to vectorize with -enable-interleaved-mem-accesses, we just need the backend to generate good code for the vector types that produces, specifically, in this case, <12 x i8>. The details are in PR29025.

The upshot of this is that for the original program (with an outer loop around it):

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe
real 0m2.229s
user 0m2.224s

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m2.590s

user 0m2.584s

This indicates that we do have a slight cost modeling issue - the cost model is not quite conservative enough in case we really do use inserts and extracts. One thing we’re probably not accounting for is a bunch of GPR spills - although I’m not sure why we end up spilling so much. So perhaps this should also be fixed in regalloc.

But if we modify the program by adding “*out++ = 0” right after “*out++ = q;” (thus eliminating the pesky <12 x i8>), we get:

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe

real 0m2.257s
user 0m2.256s

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m0.958s

user 0m0.956s

So turns out it is a full reproducer after all (choosing to vectorize on AVX), good.

The details are in PR29025.

Interesting. (So we should carefully insert unconditional branches inside shuffle sequences, eh? :wink:

But if we modify the program by adding “*out++ = 0” right after “*out++ = q;” (thus eliminating the pesky <12 x i8>), we get:

Indeed such padding is a known (programmer) optimization to effectively have power-of-2 strides and/or alignment.

So, unfortunately, it turns out I don’t have access to DENBench.

If you like we could test your patch to see how it (mis)behaves.

Yes, carefully inserting branches is the way to go!

Seriously though - you probably saw that I just committed a fix for PR29025 (r280418).
For the reproducer you provided, we now have (without forcing vectorization, and without “padding” to have power-of-2 stride):

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe
real 0m2.290s

user 0m2.289s
sys 0m0.003s
$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m1.095s

user 0m1.095s
sys 0m0.002s

Care to give it a spin internally?

Note that this is not a full solution - we still won’t vectorize PR27619, and force-vectorizing it is still a bad idea. Getting that right will require more lowering improvements as well as cost model adjustments. But hopefully post-r280418 things should be good enough to avoid regressions for the cases we will vectorize.
If you still see regressions, more reproducers will be appreciated. :slight_smile:
If there are no more regressions, let me know, and I’ll post a patch to enable interleaved access for x86.

Thanks,
Michael

Seriously though - you probably saw that I just committed a fix for PR29025 (r280418).

Care to give it a spin internally?

Sure; spinning with r280423 and the patch below (*) indeed takes care of the slowdowns observed in 32 bit mode for AVX J.

If you still see regressions, more reproducers will be appreciated. :slight_smile:

If there are no more regressions, let me know, and I’ll post a patch to enable interleaved access for x86.

Unfortunately, we’re still observing severe slowdowns in 32 bit mode for SSE with -march=slm for the same rgb conversion workloads. Seems like we’ll need a different reproducer for that, as rgb2yik.c below is left unvectorized when compiled to slm.

Ayal.

(*) used the following in anticipation of your patch(?), effectively equivalent to -enable-interleaved-mem-accesses:

Index: lib/Target/X86/X86TargetTransformInfo.cpp

Ayal, we’re going on a month now waiting to enable a feature because of regressions you reported without a reproducer. Please prioritize getting a reproduction, even if it isn’t reduced. I don’t think it is reasonable for upstream to continually delay enabling this feature when all of the test cases and reproductions we have access to are good…