Interesting. Do you happen to have some examples laying around?

Sorry, no, I'll have to re-run. I felt that the select-heavy results were

sort of humorous and clever at first, but then just annoying.

Except for decode-limited situations, in general decreasing the critical

path length is more important

than eliminating instructions. The critical path length is a

target-independent lower bound on the

maximum achievable execution speed; the number of instructions is not.

Sounds easy enough. I'll give it a try. I've observed many situations

where Souper can shorten the critical path but I hadn't really thought

about making that into a top-level goal.

What I said is sort of misleading in that the targets have a variety of

"exotic" instructions that can effectively collapse multiple nodes in the

critical path e.g. x86 BMI, PPC rlwinm, etc. Even stuff like whether the

target has ANDN will affect the critical path (or an instruction taking a

general truth table). Also the precise cost in a critical path of the

individual instructions can vary a fair amount.to the extent that except

for getting rid of mul's or div's it's not a universal win (even with mul,

there are often situations where the mul latency doesn't cost you anything,

so the mul is effectively "cheap as an add").

The above notwithstanding, the critical path is still likely to be a better

indicator than number than instructions. Ultimately the hardware

implementation of these instructions is critical-path-limited, so besides

actually reducing the critical path of instructions, optimizing for

critical path is more likely to land you near something which might

actually be a single cycle instruction in the hardware. That's my

intuition, at least.

I guess another way to select interesting transformations could be to look

for sequences where the result uses a "subset" of the input instruction

sequence. E.g. if the input instruction sequence involves a CMP, then

reducing the input sequence to a single CMP is a win. If the input contain

a shift-imm, then reducing the entire sequence to a shift-imm ("absorbing"

into it) is a win. Maybe there is a generalization of this intuition to

multiple-instruction result sequences?

%0:i32 = var

%1:i32 = addnw 1:i32, %0

%2:i32 = shl 1:i32, %1

infer %2

%3:i32 = shl 2:i32, %0

result %3

%0:i8 = var

%1:i1 = ult 191:i8, %0

%2:i1 = slt 255:i8, %0

%3:i1 = or %1, %2

infer %3

%4:i1 = sle 192:i8, %0

result %4

%0:i32 = var

%1:i32 = shlnsw %0, 1:i32

%2:i32 = or 1:i32, %1

%3:i1 = slt %1, %2

infer %3

result 1:i1

"or" and "and" are basically interchangeable as far as cost is concerned,

so this one also seems like a "subset":

%0:i32 = var

%1:i32 = addnsw 4294967295:i32, %0

%2:i32 = var

%3:i32 = or 1:i32, %2

%4:i32 = addnsw %1, %3

infer %4

%5:i32 = and 4294967294:i32, %2

%6:i32 = add %0, %5

result %6

-- Sean Silva