Interesting. Do you happen to have some examples laying around?
Sorry, no, I'll have to re-run. I felt that the select-heavy results were
sort of humorous and clever at first, but then just annoying.
Except for decode-limited situations, in general decreasing the critical
path length is more important
than eliminating instructions. The critical path length is a
target-independent lower bound on the
maximum achievable execution speed; the number of instructions is not.
Sounds easy enough. I'll give it a try. I've observed many situations
where Souper can shorten the critical path but I hadn't really thought
about making that into a top-level goal.
What I said is sort of misleading in that the targets have a variety of
"exotic" instructions that can effectively collapse multiple nodes in the
critical path e.g. x86 BMI, PPC rlwinm, etc. Even stuff like whether the
target has ANDN will affect the critical path (or an instruction taking a
general truth table). Also the precise cost in a critical path of the
individual instructions can vary a fair amount.to the extent that except
for getting rid of mul's or div's it's not a universal win (even with mul,
there are often situations where the mul latency doesn't cost you anything,
so the mul is effectively "cheap as an add").
The above notwithstanding, the critical path is still likely to be a better
indicator than number than instructions. Ultimately the hardware
implementation of these instructions is critical-path-limited, so besides
actually reducing the critical path of instructions, optimizing for
critical path is more likely to land you near something which might
actually be a single cycle instruction in the hardware. That's my
intuition, at least.
I guess another way to select interesting transformations could be to look
for sequences where the result uses a "subset" of the input instruction
sequence. E.g. if the input instruction sequence involves a CMP, then
reducing the input sequence to a single CMP is a win. If the input contain
a shift-imm, then reducing the entire sequence to a shift-imm ("absorbing"
into it) is a win. Maybe there is a generalization of this intuition to
multiple-instruction result sequences?
%0:i32 = var
%1:i32 = addnw 1:i32, %0
%2:i32 = shl 1:i32, %1
%3:i32 = shl 2:i32, %0
%0:i8 = var
%1:i1 = ult 191:i8, %0
%2:i1 = slt 255:i8, %0
%3:i1 = or %1, %2
%4:i1 = sle 192:i8, %0
%0:i32 = var
%1:i32 = shlnsw %0, 1:i32
%2:i32 = or 1:i32, %1
%3:i1 = slt %1, %2
"or" and "and" are basically interchangeable as far as cost is concerned,
so this one also seems like a "subset":
%0:i32 = var
%1:i32 = addnsw 4294967295:i32, %0
%2:i32 = var
%3:i32 = or 1:i32, %2
%4:i32 = addnsw %1, %3
%5:i32 = and 4294967294:i32, %2
%6:i32 = add %0, %5
-- Sean Silva