Hi all,
Attached is a fairly simple LLVM IR file (shuffle.ll) containing a lot of vector shuffles.
When I run llc with -O3 -mcpu=core-avx2, the first shuffle sequence, which operates on 128-bit-wide vector types, gets reduced to a single shuffle, whereas the second shuffle sequence, which operates on 256-bit-wide vector types, does not get reduced to a single shuffle instruction in the resulting X86 code (shuffle.s, attached).
The second sequence is identical to the first, just widened to a longer vector length.
Can this difference be explained? And where in the machine lowering passes does this simplification happen?
Thanks
shuffle.ll (12.2 KB)
shuffle.s (8.53 KB)