getScalarizationOverhead()

Hi,

I wonder why getScalarizationOverhead() does not take into account the number of operands of the instruction? This should influence the number of extracts needed, so instead of

Scalarization cost = NumEls * (insert + extract)

it would be better to do

Scalarization cost = NumEls * (insert + (extract * numOperands))

/ Jonas
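
To make the difference between the two formulas concrete, here is a minimal standalone sketch. It is not the code in BasicTTIImpl.h: the `ElementCosts` struct and both function names are hypothetical, and the fixed per-element costs stand in for whatever the target would report.

```cpp
#include <cassert>

// Hypothetical per-element costs; a real target would report these through
// its cost model rather than through fixed fields.
struct ElementCosts {
  unsigned Insert;  // cost of inserting one scalar result into the vector
  unsigned Extract; // cost of extracting one scalar operand from a vector
};

// Formula as described in the thread: one extract per element, no matter
// how many operands the instruction has.
unsigned scalarizationCostCurrent(unsigned NumEls, ElementCosts C) {
  assert(NumEls > 0 && "expected at least one vector element");
  return NumEls * (C.Insert + C.Extract);
}

// Proposed formula: one extract per element for each operand.
unsigned scalarizationCostProposed(unsigned NumEls, unsigned NumOperands,
                                   ElementCosts C) {
  assert(NumEls > 0 && "expected at least one vector element");
  return NumEls * (C.Insert + C.Extract * NumOperands);
}
```

With a single operand the two formulas agree; they diverge as soon as the instruction has two or more operands.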

I suspect this is an oversight (although we need to be a bit careful here because if two operands are the same, which is not uncommon, we don't want to double the cost).

  -Hal

In the case of an identical operand, do you want to count just a cost of "1" for a register move instead of the "extraction cost"?

/Jonas

There should be no cost to reusing the operand. (mul a, a) should only extract a once; the fact that it is used twice should not increase the cost.

  -Hal
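
One way to read the suggestion above is to charge extracts per distinct operand rather than per operand. The following is only a sketch under that assumption: operands are modelled as plain strings and both function names are hypothetical, whereas real code would compare the instruction's operand values directly.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Count distinct operands, so that (mul a, a) is charged for one extract
// of a, not two. Operand names are a stand-in for real operand values.
unsigned countUniqueOperands(const std::vector<std::string> &Operands) {
  std::set<std::string> Unique(Operands.begin(), Operands.end());
  return static_cast<unsigned>(Unique.size());
}

// Proposed formula from the thread, with numOperands replaced by the number
// of distinct operands.
unsigned scalarizationCostUniqueOperands(
    unsigned NumEls, unsigned InsertCost, unsigned ExtractCost,
    const std::vector<std::string> &Operands) {
  assert(NumEls > 0 && "expected at least one vector element");
  return NumEls * (InsertCost + ExtractCost * countUniqueOperands(Operands));
}
```

Under this scheme (mul a, a) costs the same as a one-operand instruction, while (mul a, b) pays for two extracts per element.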

There appears to be a similar issue within the x86 AVX1 cost tables for cases where we have to split the 256-bit integer operations. Some binops add 1*extract_subvector + 1*insert_subvector to the 2*128-binop costs whilst others don’t bother adding anything at all. We need to try harder to determine if we should add 1 (duplicate input or constant folded extract) or 2 extracts to the final cost.
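
As an illustration of the 1-versus-2 extract decision described above, the sketch below computes a split cost from the pieces named in that paragraph. It is not taken from the X86 cost tables: the `BinopOperands` struct, the function names, and the unit costs passed in are all hypothetical.

```cpp
#include <cassert>

// Hypothetical description of the two inputs of a 256-bit integer binop.
struct BinopOperands {
  bool SameOperand;   // both inputs are the same value, e.g. (mul a, a)
  bool OneIsConstant; // one input is a constant, so its extract folds away
};

// One subvector extract suffices for a duplicated input or a constant-folded
// extract; otherwise both inputs need one.
unsigned numSubvectorExtracts(BinopOperands Ops) {
  return (Ops.SameOperand || Ops.OneIsConstant) ? 1 : 2;
}

// Cost of splitting a 256-bit binop into two 128-bit binops: the two narrow
// operations, the required extracts, and one insert to rebuild the result.
unsigned splitBinopCost(unsigned Cost128, unsigned ExtractSubvecCost,
                        unsigned InsertSubvecCost, BinopOperands Ops) {
  assert(!(Ops.SameOperand && Ops.OneIsConstant) &&
         "a duplicated constant input is not modelled here");
  return 2 * Cost128 + numSubvectorExtracts(Ops) * ExtractSubvecCost +
         InsertSubvecCost;
}
```

With unit costs throughout, a binop with a duplicated or constant input comes to 4 and one with two distinct variable inputs comes to 5.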

What's more, this method seems to be duplicated in several places: BasicTTIImpl.h has one definition, which is duplicated in LoopVectorize.cpp (with just one added line that checks for void type), and the X86 target has its own copy with an int return type instead of unsigned. I tried to refactor this and also started to improve the functionality. Please take a look at: https://reviews.llvm.org/D29017

/ Jonas