Even abs() comes with a performance penalty

--- bugs-bunny.c ---

cmov has been 1 cycle since Sandy Bridge. Moves execute in the register renamer since Ivy Bridge. So mov+neg+cmov should be faster than cdq+add+xor on modern CPUs. Furthermore, cdq really ties the hands of the register allocator so probably doesn’t make sense in a larger function with abs mixed with other code.

Sorry. I made a mistake. Cmov has been 1 cycle since Broadwell.

> cmov has been 1 cycle since Sandy Bridge.

That doesn't matter here. It's the data dependency it introduces.

> Moves execute in the register renamer since Ivy Bridge.

That's why my code shows 2 of them.

> So mov+neg+cmov should be faster than cdq+add+xor on modern CPUs.

But you forgot sbb+test, and the data dependency: how well does the CPU
speculate about cmovs?

> Furthermore, cdq really ties the hands of the register allocator so probably
> doesn't make sense in a larger function with abs mixed with other code.

The optimiser is free to use mov+sar then, at the expense of +4 or +6 bytes.

Ever heard of a trade-off?

Stefan

> Sorry. I made a mistake. Cmov has been 1 cycle since Broadwell.

Doesn't matter, no need to worry: all instructions used below run in 1 cycle on
recent CPUs ... just like Jcc. The real question is whether the CPU can and does
speculate ahead.

Stefan

I was mostly speaking to abssi2. What data dependency exists for cmov that doesn’t exist for cdq+add+xor?
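
For reference, here is a minimal C sketch of the two abs idioms being compared. The function names abs_mask and abs_select are made up for illustration, and which instructions a given compiler actually emits for either one is an assumption that depends on target and flags. As I read it, the sign-mask form is a serial chain (cdq, then add, then xor each needs the previous result), while in the select form the neg and the copy of the input are independent and only the final cmov needs both.

```c
#include <limits.h>

/* Sign-mask form: the C analogue of cdq+add+xor (or mov+sar+add+xor).
 * Assumes >> on a negative int is an arithmetic shift, which is
 * implementation-defined in C but holds for GCC and Clang on x86.
 * The add is done in unsigned arithmetic to avoid signed overflow. */
int abs_mask(int x)
{
    unsigned m = (unsigned)(x >> (sizeof(int) * CHAR_BIT - 1)); /* 0 or all ones */
    return (int)(((unsigned)x + m) ^ m);
}

/* Select form: a compiler may turn this into mov+neg+cmov, or into a
 * branch, depending on target and options (assumption, not guaranteed).
 * Undefined for INT_MIN, just like the standard abs(). */
int abs_select(int x)
{
    return x < 0 ? -x : x;
}
```

Both functions return 5 for an input of -5 and leave non-negative inputs unchanged; compiling each with -O2 -S is a quick way to see which sequence your compiler picks.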