Thanks Ayal!
So, at least for this example, it looks like we actually want to vectorize with -enable-interleaved-mem-accesses, we just need the backend to generate good code for the vector types that produces, specifically, in this case, <12 x i8>. The details are in PR29025.
The upshot of this is that for the original program (with an outer loop around it):
$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe
real 0m2.229s
user 0m2.224s
$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m2.590s
user 0m2.584s
This indicates that we do have a slight cost modeling issue - the cost model is not quite conservative enough in case we really do use inserts and extracts. One thing we’re probably not accounting for is a bunch of GPR spills - although I’m not sure why we end up spilling so much. So perhaps this should also be fixed in regalloc.
But if we modify the program by adding “*out++ = 0” right after “*out++ = q;” (thus eliminating the pesky <12 x i8>), we get:
$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe
real 0m2.257s
user 0m2.256s
$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m0.958s
user 0m0.956s
So turns out it is a full reproducer after all (choosing to vectorize on AVX), good.
The details are in PR29025.
Interesting. (So we should carefully insert unconditional branches inside shuffle sequences, eh?
But if we modify the program by adding “*out++ = 0” right after “*out++ = q;” (thus eliminating the pesky <12 x i8>), we get:
Indeed such padding is a known (programmer) optimization to effectively have power-of-2 strides and/or alignment.
So, unfortunately, it turns out I don’t have access to DENBench.
If you like we could test your patch to see how it (mis)behaves.
Yes, carefully inserting branches is the way to go!
Seriously though - you probably saw that I just committed a fix for PR29025 (r280418).
For the reproducer you provided, we now have (without forcing vectorization, and without “padding” to have power-of-2 stride):
$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe
real 0m2.290s
user 0m2.289s
sys 0m0.003s
$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m1.095s
user 0m1.095s
sys 0m0.002s
Care to give it a spin internally?
Note that this is not a full solution - we still won’t vectorize PR27619, and force-vectorizing it is still a bad idea. Getting that right will require more lowering improvements as well as cost model adjustments. But hopefully post-r280418 things should be good enough to avoid regressions for the cases we will vectorize.
If you still see regressions, more reproducers will be appreciated.
If there are no more regressions, let me know, and I’ll post a patch to enable interleaved access for x86.
Thanks,
Michael
Seriously though - you probably saw that I just committed a fix for PR29025 (r280418).
Care to give it a spin internally?
Sure; spinning with r280423 and the patch below (*) indeed takes care of the slowdowns observed in 32 bit mode for AVX J.
If you still see regressions, more reproducers will be appreciated.
If there are no more regressions, let me know, and I’ll post a patch to enable interleaved access for x86.
Unfortunately, we’re still observing severe slowdowns in 32 bit mode for SSE with -march=slm for the same rgb conversion workloads. Seems like we’ll need a different reproducer for that, as rgb2yik.c below is left unvectorized when compiled to slm.
Ayal.
(*) used the following in anticipation of your patch(?), effectively equivalent to -enable-interleaved-mem-accesses:
Index: lib/Target/X86/X86TargetTransformInfo.cpp
Ayal, we’re going on a month now waiting to enable a feature because of regressions you reported without a reproducer. Please prioritize getting a reproduction, even if it isn’t reduced. I don’t think it is reasonable for upstream to continually delay enabling this feature when all of the test cases and reproductions we have access to are good…