Vectorization goes overboard because the throughput cost model used by the
vectorizers doesn't reflect that the 6 IR instructions correspond to a
single x86 rotate instruction. Instead, we have:
The broken cost model also affects unrolling and inlining. Size costs are
overestimated for a target that has a rotate instruction.
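For reference, here is a sketch of the source pattern in question (my own
example, not the exact code from this thread): the standard UB-free rotate
idiom in C, which mid-level IR expands to roughly six instructions while
x86 needs a single rol:

```c
#include <stdint.h>

/* Portable, UB-free rotate-left idiom. Without a rotate intrinsic,
   this lowers to roughly six IR ops (and, shl, neg, and, lshr, or),
   all of which a single x86 rol instruction implements. */
static uint32_t rotl32(uint32_t x, uint32_t n) {
    /* Masking both shift amounts keeps n == 0 well-defined. */
    return (x << (n & 31)) | (x >> (-n & 31));
}
```

The `-n & 31` form avoids the undefined `x >> 32` that the naive
`x >> (32 - n)` hits when n is 0.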
This cost problem isn't limited to rotate patterns (it has come up in the
context of min/max/abs/fma too). But it would be simpler if we had an
intrinsic, and the 6-to-1 margin is the biggest I've seen.
Given that this is a general problem that occurs with other instruction
sequences as well, wouldn't it make more sense to make the cost model
handle more than one instruction, as suggested in PR31274?
Yes, I think everyone agrees that we need to improve the cost model
interface. Note that there's a proposal for review for the 1 IR -> many
machine ops variant of this question:
In all these cases (rotate, min, max, abs, fma, add-with-overflow, and
probably many more) there's a tradeoff between elaborating them as simpler
IR instructions and modelling them as their own instruction / intrinsic. A
big disadvantage of introducing new instructions / intrinsics is that all
optimizations have to be taught about them, increasing the compiler code
base and the maintenance burden. On the other hand, too few instructions
can make optimization difficult as well (in theory, one instruction like
"subtract and branch if not equal to zero" could emulate all the others,
but this wouldn't be very helpful for optimization). Since you put a lot
of thought into how to canonicalize IR, can you elaborate more on this
tradeoff? Can we find a set of criteria to determine whether an
instruction pattern should get an intrinsic or not?
I don't have formal criteria for this. Someone more knowledgeable may give
us an answer.
An informal metric might be: if the operation is supported as a primitive
op or built-in in source languages, and it is supported as a single target
instruction, can we guarantee a 1-to-1 translation through optimization?
We are failing that test in the loop-invariant C example presented here.
We're also failing that test when operands have different sizes. Rotate
goes in, logic and shifts come out.
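To illustrate the mixed-width case (a hypothetical sketch, not the exact
example from this thread): once the shift amount has a different type than
the value being rotated, the extra extensions in IR can keep the backend
from matching a single rotate:

```c
#include <stdint.h>

/* Hypothetical mixed-width rotate: the 8-bit shift amount gets
   zero-extended in IR, and the resulting pattern may no longer be
   recognized as a rotate -- logic and shifts come out instead. */
static uint64_t rotl64_narrow(uint64_t x, uint8_t n) {
    return (x << (n & 63)) | (x >> (-(uint64_t)n & 63));
}
```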
We even failed that test for a more basic case before this fix:
We may still be failing the basic case for other source languages/flavors,
perhaps because encoding this operation with existing IR ops isn't obvious.
Another informal measure is: how do programmers express this operation
without a builtin? If they're resorting to inline asm, that's a strong sign
that the compiler needs an intrinsic.
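A sketch of that fallback, assuming x86 and GNU inline-asm syntax (with
the portable shift-or idiom as the non-x86 path). The asm is opaque to the
optimizer, which is exactly the problem an intrinsic would solve:

```c
#include <stdint.h>

static uint32_t rotl32_fallback(uint32_t x, uint32_t n) {
#if defined(__x86_64__) || defined(__i386__)
    /* Force a single rol; the compiler can't see through this,
       so no IR-level optimization of the rotate is possible. */
    __asm__("roll %%cl, %0" : "+r"(x) : "c"(n));
    return x;
#else
    /* Portable idiom for other targets. */
    return (x << (n & 31)) | (x >> (-n & 31));
#endif
}
```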
Another metric could be: what is the probability that the optimizer can
improve the codegen if the operation is broken down into multiple IR
instructions? As noted here, it's unlikely for rotate. If it is possible,
then adding folds to instcombine for this intrinsic isn't hard. Are any
other passes affected?
For reference, these are the current target-independent bit-manipulation
intrinsics - bswap, bitreverse, ctpop, ctlz, cttz:
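For comparison, those intrinsics are reachable from C via compiler
builtins (GCC/Clang names shown; llvm.bitreverse maps to the Clang-only
__builtin_bitreverse32, omitted here for portability):

```c
#include <stdint.h>

/* Demonstrates the existing target-independent bit-manipulation
   intrinsics from C. Comments assume x = 0x00ff0000. */
static uint32_t bit_ops_demo(uint32_t x) {
    uint32_t swapped = __builtin_bswap32(x); /* llvm.bswap -> 0x0000ff00 */
    int pop = __builtin_popcount(x);         /* llvm.ctpop -> 8 */
    int lz  = __builtin_clz(x);              /* llvm.ctlz  -> 8 (x != 0) */
    int tz  = __builtin_ctz(x);              /* llvm.cttz  -> 16 (x != 0) */
    return swapped + (uint32_t)(pop + lz + tz);
}
```

Note that __builtin_clz/__builtin_ctz are undefined for a zero argument,
mirroring the poison behavior of the underlying intrinsics.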
The LLVM cost for the proposed rotate intrinsic should be in the same range
as those? Note that we would not just be adding code to support an
intrinsic: there are already ~200 lines of DAG matching code for rotate, so
we're already paying a cost for trying to get this optimized. We can remove
that code after adding canonicalization to the intrinsic in IR.