some superoptimizer results

We (the folks working on Souper) would appreciate any feedback on these IR-level superoptimizer results:

   Souper Optimizations

My impression is that while there's clearly plenty of material in here that doesn't want to get implemented in an opt pass, there are a number of gems hiding in there that are worth implementing.

Blog post containing additional explanation and caveats is here:

   A Few Synthesizing Superoptimizer Results – Embedded in Academia

Thanks!

John

I was reading about these; it looks really interesting! One thing that I worry about is that we want to make sure not to IR-transform into “weird constructs” that might be very suboptimal on certain targets (and hard or impossible to invert later); e.g. you noted that selects had cost 3 in your model, but I work on a GPU where select_cc (fcmp/icmp + select) has cost 1. Doing sneaky bit math to avoid a select might be good in some cases, but not others. Maybe it could use target hooks of some sort so it only does transformations that make sense given a certain cost model?

—escha

> Maybe it could use target hooks of some sort so it only does
> transformations that make sense given a certain cost model?

Even on x86, materializing i1's residing in the condition flags is a pain
(e.g. to sext) but CMOVcc based on them is easy.

-- Sean Silva

Are you taking into account critical path length? Because e.g. for:

%0:i64 = var
%1:i1 = slt 18446744073709551615:i64, %0
%2:i64 = subnsw 0:i64, %0
%3:i64 = select %1, %0, %2
infer %3
%4:i64 = ashr %0, 63:i64
%5:i64 = add %0, %4
%6:i64 = xor %5, %4
result %6

In the former case, the cmp and sub are independent, so can be executed in parallel, while in the latter case all 3 instructions are dependent. So the former case can execute in 2 cycles while the latter takes 3. Modern OoO chips do in fact exploit this kind of thing.
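The two forms above can be checked side by side in Python (a sketch, not Souper output: Python's arbitrary-precision ints model the i64 two's-complement operations used here, provided x stays in range so the negation doesn't overflow):

```python
# The two abs(x) lowerings from the example above, as plain functions.
# Python ints behave like i64 two's complement for these operations as
# long as x stays in the i64 range (and x != INT64_MIN, where the
# negation would overflow in real i64 arithmetic).

def abs_select(x):
    # %1 = slt -1, %0 ; %2 = subnsw 0, %0 ; %3 = select %1, %0, %2
    # The compare and the negate are independent: dependence depth 2.
    neg = -x
    return x if x > -1 else neg

def abs_shift(x):
    # %4 = ashr %0, 63 ; %5 = add %0, %4 ; %6 = xor %5, %4
    # ashr -> add -> xor is a serial chain: dependence depth 3.
    m = x >> 63          # arithmetic shift: 0 if x >= 0, else -1
    return (x + m) ^ m

for x in (0, 1, -1, 7, -7, 2**62, -(2**62)):
    assert abs_select(x) == abs_shift(x) == abs(x)
```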

– Sean Silva

One thing that is important to consider is where in the pipeline these kinds of optimizations fit. We normally try to put the IR into a canonical simplified form in the mid-level optimizer. This form is supposed to be whatever is most useful for exposing other optimizations, and for lowering, but only in a generic sense. We do have some optimizations near the end of our pipeline (vectorization, partial unrolling, etc.) that consider target-specific properties, but only because the alternative is doing those loop optimizations after instruction selection.

Considering ILP and other pipeline-level costs is something we generally do only in the SelectionDAG and after. If these are IR optimizations, then I'm not sure that considering ILP, etc. is the right metric -- so long as the transformations are sufficiently reversible to allow for efficient lowering afterward.

-Hal

Considering ILP and other pipeline-level costs is something we generally
do only in the SelectionDAG and after.

Agreed. It might just be that these initial results are from the "burn-in"
specifically targeting short simple sequences, but most of the
transformations in the link seem to be things that, if applicable, we would
want to do in the backend instead of in the middle-end.

-- Sean Silva

Looking through the items, I see a number which are suitable for mid-level canonicalization. For example, the two for converting and/cmps into truncs seem like good candidates. We need to make sure to apply judgement here, but not all of these are backend specific.

I took a shot at implementing this and it looks like instcombine is actually canonicalizing in the opposite direction. All truncs to i1 are being converted to an and/cmp pattern. Anyone know why this is? I expected that as a lowering in SelectionDAG, but doing it in InstCombine seems too early… I guess I can see an argument that we're not actually losing any information (i.e. ComputeKnownBits will handle the and/cmp just fine), but this still seems a bit surprising. Anyone have thoughts here? Anyways, at least those half dozen items from the list appear to be WONTFIX. :-)


Looking through the items, I see a number which are suitable for mid level
canonicalization. For example, the two for converting and/cmps into truncs
seem like good candidates. We need to make sure to apply judgement here,
but not *all* of these are backend specific.

Hence "most". From the blog post it seems like this batch is deliberately
obtained by running Souper on particularly short sequences; I'm looking
forward to seeing what it does with longer sequences, especially ones
containing "phis and path conditions".

-- Sean Silva

I just noticed: most of the results in this batch seem to be about exploiting [zs]ext i1 having cost 1 in order to replace a select of cost 3.

Could you do a run where select has cost 1 and [zs]ext i1 (and trunc to i1) has cost 2 or 3?

– Sean Silva

Doing sneaky bit math to avoid a select might be good in some cases, but not others. Maybe it could use target hooks of some sort so it only does transformations that make sense given a certain cost model?

Yeah, I would love to explore that, when Souper is being invoked as a pass.

On the other hand, here I was only hoping to suggest some IR-level optimizations that might be useful...

John

Are you taking into account critical path length?

No, but of course that's not hard. There are a lot of things that could be taken into account in the cost functions, but I was hoping to keep it really simple, so that the results would not be opaque.

In the latter case all 3 instructions are dependent. So the former case can execute in 2 cycles while the
latter takes 3. Modern OoO chips do in fact exploit this kind of thing.

Sure. One way to deal with this would be to use the HW as its own model: run a lot of code fragments on real chips under a benchmark harness and then use the results to train a cost model.
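A minimal sketch of that idea (entirely hypothetical: the `fragment_cost` helper and the fragment strings are illustrative, and interpreter overhead means Python timings are only a stand-in for real hardware latencies):

```python
# Hypothetical harness: time candidate fragments on the actual machine
# and use the measurements as costs. The two fragments below are the
# select and shift forms of abs(x) discussed earlier in the thread.
import timeit

def fragment_cost(stmt, setup="x = -12345", number=200_000):
    # Take the best of several runs to damp scheduler/cache noise.
    return min(timeit.repeat(stmt, setup=setup, number=number, repeat=5))

select_time = fragment_cost("x if x > -1 else -x")
shift_time = fragment_cost("(x + (x >> 63)) ^ (x >> 63)")

# A real cost model would be fit to many (fragment, timing) pairs
# gathered this way, rather than read off a single measurement.
```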

John

Hence "most". From the blog post it seems like this batch is deliberately obtained by running Souper on
particularly short sequences; I'm looking forward to seeing what it does with longer sequences,
especially ones containing "phis and path conditions".

I can easily provide these (and will) but I have been concerned that Souper's output can be really hard to understand. I was hoping that these shallow expressions would provide you folks with the right kind of information.

In any case, my only goal here is to do whatever is most broadly useful to LLVM. I suspect a fair amount of iteration will be required.

I might work on a backend superoptimizer next. I was not interested in doing that one first since it sounds somewhat more complex.

John

I just noticed: most of the results in this batch seem to be about exploiting `[zs]ext i1` having cost 1
in order to replace a select of cost 3.
Could you do a run where select has cost 1 and [zs]ext i1 (and trunc to i1) has cost 2 or 3?

I tried this (or something quite similar) earlier and it caused Souper to introduce *a lot* of selects. So the problem is that Souper's preferences become skewed too far in the other direction.

How about for the next run I give everything (except maybe div/rem) cost 1? Then the only thing Souper will tell us about is when it can eliminate instructions, which seems to be a generally desirable thing.

As I said in the blog post, in the next batch Souper will exploit known bits. I'm excited to see what kind of results come out of Philip's value-tracking-dom-conditions.

John


I tried this (or something quite similar) earlier and it caused Souper to
introduce *a lot* of selects. So the problem is that Souper's preferences
become skewed too far in the other direction.

Interesting. Do you happen to have some examples laying around?

How about for the next run I give everything (except maybe div/rem) cost 1?

That seems reasonable. Probably give mul >1 cost too due to the increased
latency.

  Then the only thing Souper will tell us about is when it can eliminate
instructions, which seems to be a generally desirable thing.

Except for decode-limited situations, in general decreasing the critical
path length is more important than eliminating instructions. The critical
path length is a target-independent lower bound on the maximum achievable
execution speed; the number of instructions is not.

(assuming our chips don't start doing significant algebraic simplifications
on the fly, which seems like a safe bet)
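As a toy illustration of the difference between the two metrics (a hypothetical sketch, not anything Souper computes): model each sequence as a dependence DAG with unit latencies and take the longest path; the select form of abs from earlier in the thread scores 2 while the shift form scores 3, even though both contain 3 instructions.

```python
# Toy critical-path metric: longest path through the dependence DAG,
# assuming unit latency per instruction. Instructions are listed in
# topological order, each mapped to the instructions it depends on.

def critical_path(deps):
    depth = {}
    for inst, preds in deps.items():
        depth[inst] = 1 + max((depth[p] for p in preds), default=0)
    return max(depth.values())

# select form of abs: cmp and sub both depend only on the input
select_form = {"cmp": [], "sub": [], "select": ["cmp", "sub"]}
# shift form of abs: ashr -> add -> xor is fully serial
shift_form = {"ashr": [], "add": ["ashr"], "xor": ["add", "ashr"]}

assert critical_path(select_form) == 2   # 3 instructions, depth 2
assert critical_path(shift_form) == 3    # 3 instructions, depth 3
```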

-- Sean Silva

Interesting. Do you happen to have some examples laying around?

Sorry, no, I'll have to re-run. I felt that the select-heavy results were sort of humorous and clever at first, but then just annoying.

Except for decode-limited situations, in general decreasing the critical path length is more important
than eliminating instructions. The critical path length is a target-independent lower bound on the
maximum achievable execution speed; the number of instructions is not.

Sounds easy enough. I'll give it a try. I've observed many situations where Souper can shorten the critical path but I hadn't really thought about making that into a top-level goal.

Thanks,

John


Sounds easy enough. I'll give it a try. I've observed many situations
where Souper can shorten the critical path but I hadn't really thought
about making that into a top-level goal.

What I said is sort of misleading in that the targets have a variety of
"exotic" instructions that can effectively collapse multiple nodes in the
critical path e.g. x86 BMI, PPC rlwinm, etc. Even stuff like whether the
target has ANDN will affect the critical path (or an instruction taking a
general truth table). Also, the precise cost of the individual
instructions along a critical path can vary a fair amount, to the extent
that, except for getting rid of muls or divs, it's not a universal win
(even with mul, there are often situations where the mul latency doesn't
cost you anything, so the mul is effectively as cheap as an add).

The above notwithstanding, the critical path is still likely to be a better
indicator than the number of instructions. Ultimately the hardware
implementation of these instructions is critical-path-limited, so besides
actually reducing the critical path of instructions, optimizing for
critical path is more likely to land you near something which might
actually be a single cycle instruction in the hardware. That's my
intuition, at least.

I guess another way to select interesting transformations could be to look
for sequences where the result uses a "subset" of the input instruction
sequence. E.g. if the input instruction sequence involves a CMP, then
reducing the input sequence to a single CMP is a win. If the input contains
a shift-imm, then reducing the entire sequence to a shift-imm ("absorbing"
into it) is a win. Maybe there is a generalization of this intuition to
multiple-instruction result sequences?

%0:i32 = var
%1:i32 = addnw 1:i32, %0
%2:i32 = shl 1:i32, %1
infer %2
%3:i32 = shl 2:i32, %0
result %3

%0:i8 = var
%1:i1 = ult 191:i8, %0
%2:i1 = slt 255:i8, %0
%3:i1 = or %1, %2
infer %3
%4:i1 = sle 192:i8, %0
result %4

%0:i32 = var
%1:i32 = shlnsw %0, 1:i32
%2:i32 = or 1:i32, %1
%3:i1 = slt %1, %2
infer %3
result 1:i1

"or" and "and" are basically interchangeable as far as cost is concerned,
so this one also seems like a "subset":

%0:i32 = var
%1:i32 = addnsw 4294967295:i32, %0
%2:i32 = var
%3:i32 = or 1:i32, %2
%4:i32 = addnsw %1, %3
infer %4
%5:i32 = and 4294967294:i32, %2
%6:i32 = add %0, %5
result %6
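Incidentally, rewrites at these small widths are easy to brute-force. A quick sanity check of two of the examples above (a sketch assuming the usual Souper semantics: ult/slt are unsigned/signed compares on w-bit two's-complement values; the nsw flags are ignored, which only makes the check stricter):

```python
# Brute-force check of two of the rewrites above at width 8.
W = 8
MASK = (1 << W) - 1

def signed(x):
    # Reinterpret a W-bit unsigned value as two's complement.
    return x - (1 << W) if x & (1 << (W - 1)) else x

# (ult 191, x) or (slt 255, x)  ->  sle 192, x    (i8)
for x in range(1 << W):
    lhs = (x > 191) or (signed(255) < signed(x))
    rhs = signed(192) <= signed(x)
    assert lhs == rhs

# (x + -1) + (y | 1)  ->  x + (y & ~1)    (checked here mod 2^8)
for x in range(1 << W):
    for y in range(1 << W):
        lhs = (((x + MASK) & MASK) + (y | 1)) & MASK
        rhs = (x + (y & (MASK ^ 1))) & MASK
        assert lhs == rhs
```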

-- Sean Silva

I guess another way to select interesting transformations could be to look for sequences where the
result uses a "subset" of the input instruction sequence.

Yeah, I had been noticing those subsets too. It sounds like it's worth a try doing a run that looks only for those.

One nice thing is that if we're just looking for subsets, we will not even give the synthesizer the option to use instructions that aren't in the subset -- this should give a nice speedup, and probably will even make synthesis more likely to succeed (assuming that an optimization exists within the subset).

John

One last cost function idea: the cost is the number of inputs. So here the LHS has cost 1 and the RHS has cost 0:

%0:i32 = var
%1:i32 = shlnsw %0, 1:i32
%2:i32 = or 1:i32, %1
%3:i1 = slt %1, %2
infer %3
result 1:i1

The rationale, of course, is that these are likely to enable subsequent optimizations. There aren't that many of these so perhaps a separate run would be reasonable?

John

Yeah, I had been noticing those subsets too. It sounds like it's worth a try doing a run that looks only for those.

One nice thing is that if we're just looking for subsets, we will not even give the synthesizer the option to use instructions that aren't in the subset -- this should give a nice speedup, and probably will even make synthesis more likely to succeed (assuming that an optimization exists within the subset).

+1 - This sounds like a very interesting approach. I suspect it might also find cases which are easier to hand generalize, but that's just a guess.

Another option would be to exclude extensions of i1 and non-cmov-like selects from the candidate set. A lot of the examples in this set end up converting condition codes to register values, which is non-ideal. I saw a couple of examples which had obvious cmov implementations which just weren't expressed that way in the output.

A couple of small suggestions on the output formatting:
- Give each entry a unique ID (result set 3 #25 would be fine). That would make them much easier to discuss.
- Exclude examples where only a prefix changes. There were several instances where the last instruction in the sequence was duplicated exactly. This is essentially an instance of synthesizing the subproblem and results in duplicates in the output.
- Group all inputs at the beginning. Follow with any instructions which are shared between input and output. This'll make it easier to scan the output.