Inconsistent simplification of vector shuffle chains when lowering to X86 instructions

Hi all,

Attached is a fairly simple LLVM IR file (shuffle.ll) containing many vector shuffles.

When I run llc with -O3 -mcpu=core-avx2, the first shuffle sequence, which uses 128-bit-wide types, is reduced to a single shuffle, whereas the second shuffle sequence, which uses 256-bit-wide types, is not reduced to a single shuffle instruction in the resulting X86 code (Shuffle.s attached).

The second sequence is identical to the first, except that it has been widened to the larger vector length.
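
To make "reduced to a single shuffle" concrete, here is a minimal Python sketch of the underlying idea: a chain of single-input shuffles is equivalent to one shuffle whose mask is the composition of the individual masks. The masks and the helper names (apply_shuffle, compose_masks, widen_mask) are hypothetical and are not taken from shuffle.ll; the widening rule (repeat the 4-element pattern in the upper half) is likewise only an assumption about what the widened sequence looks like. It models shuffle semantics only and says nothing about which X86 instructions llc actually selects.

```python
def apply_shuffle(vec, mask):
    """Single-input shuffle: result[i] = vec[mask[i]]."""
    return [vec[i] for i in mask]


def compose_masks(first, second):
    """Return one mask equivalent to applying `first`, then `second`."""
    return [first[i] for i in second]


def widen_mask(mask):
    """Assumed widening: repeat the pattern on the upper half of a
    vector with twice as many elements."""
    n = len(mask)
    return mask + [m + n for m in mask]


# Hypothetical 4-element masks (e.g. <4 x i32>, 128 bits wide).
mask_a = [1, 0, 3, 2]
mask_b = [2, 3, 0, 1]

v4 = [10, 11, 12, 13]
chained_4 = apply_shuffle(apply_shuffle(v4, mask_a), mask_b)
combined_4 = compose_masks(mask_a, mask_b)
single_4 = apply_shuffle(v4, combined_4)

# The same chain widened to 8 elements (e.g. <8 x i32>, 256 bits wide).
wide_a, wide_b = widen_mask(mask_a), widen_mask(mask_b)
v8 = list(range(20, 28))
chained_8 = apply_shuffle(apply_shuffle(v8, wide_a), wide_b)
combined_8 = compose_masks(wide_a, wide_b)
single_8 = apply_shuffle(v8, combined_8)

print(combined_4, chained_4 == single_4)
print(combined_8, chained_8 == single_8)
```

Under these assumptions the composed mask exists at both widths, so in principle both chains can be expressed as one shuffle.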

Can anyone explain this, and where in the machine lowering passes does this simplification happen?


Hi Charith,

After taking a quick look it seems we could do better for the 256-bit shuffles.

Can you please open a bug report (product=libraries, component=backend: X86) for this? It would be helpful if you minimized shuffle.ll to, say, two functions: one performing the 128-bit shuffles and the other performing the 256-bit shuffles.

Thanks, Zvi