Improving codegen for not-quite-constant-folding

Hi all,

Some Facebook colleagues and I are looking at poor codegen for the
C++20 spaceship operator (i.e. operator <=>, which performs 3-way
comparisons), when used to implement simple comparisons (e.g.
operator<). A simple example is at
Compiler Explorer, which desugars into something
like Compiler Explorer. By comparison, a
hand-written operator< would probably look like one of
Compiler Explorer, or
Compiler Explorer, or
Compiler Explorer.

A reduced example with the same issue is here:
Compiler Explorer. Broadly, a tree of select/phi
nodes with constants at the leaves isn't subject to constant folding,
even though the reduced example could be rewritten as something like:
Compiler Explorer (this saves a multiply, and in
"real" examples enables further simplifications).

GCC's GVN/PRE pass does a better job here with -O2 -ftree-partial-pre
("partial pre" is not a typo; it's "partial partial redundancy
elimination", in addition to regular partial redundancy elimination),
or -O3 (which turns on -ftree-partial-pre). -ftree-partial-pre tracks
partial anticipation of expressions (those that are evaluated on some
path to exit) on a per-BB level, and inserts the necessary phis to
ensure that a partially anticipated expression is available in a
temporary if temporaries with the same value number are available in
all a BB's predecessors. It does constant folding after phi
translation before checking for availability, and regards all
constants as always available. After phi-translating the result into
each branch of the ternary operators, we get constants, so the
optimization kicks in and we can replace each leaf constant with the
result of multiplying it by 123.

A simpler approach that could catch cases like this would be to find
instances of trees in which the leaves are constants, and the non-leaf
nodes are phis or selects. If the root of such a tree is the input
into an operation we could otherwise constant fold (if that root were
a constant), we constant-fold each leaf of the tree instead. This
would miss some optimization opportunities that the GCC approach would
catch, however.

The former feels tricky to implement on top of GVN or NewGVN (and,
it's unclear that it's always a net improvement). The latter seems
reasonable, although it would require a one-off pass targeted at this
relatively narrow case, which feels heavyweight. My inclination is to
attempt the simpler approach, but wanted a sanity check that this
seems reasonable and is not better done elsewhere (or with a different
approach entirely).


There's a bit of prior work in this area inspired by the java icmp and lcmp bytecode semantics which are fairly similar to C++ space ship operators.

See matchThreeWayIntCompare in InstCombine. This recognizes the select form after all the control flow has been eliminated. We relied on jump threading and simplify cfg to eliminate the control flow, so that doesn't do anything for the question you posed.

Your example also involves multiple element comparisons which complicates the matching significantly. (I suspect you know this, just stating it for the record.)

On the partial PRE, fitting that into the existing logic in GVN wouldn't be terrible. However, I really question the profitability of this transform in general. It's effectively hoisting an instruction into path where it didn't previously execute. Doing that without regards to relative frequencies can be quite painful.

You could also try to PRE the condition back into the predecessors. We explicitly avoid that today due to profitability problems, but maybe you can find a sub-case which is profitable and enable it selectively?

Another observation is that we appear to not be canonicalizing the operands of the selects. In the optimized IR, I see:

%10 = icmp eq i64 %1, %3, !dbg !42
%11 = icmp ult i64 %1, %3, !dbg !44
%12 = select i1 %11, i8 -1, i8 1, !dbg !44
%13 = select i1 %10, i8 0, i8 %12, !dbg !44
br label %14, !dbg !44

14: ; preds = %6, %9
%15 = phi i8 [ %8, %6 ], [ %13, %9 ], !dbg !33
%16 = icmp slt i8 %15, 0, !dbg !45

What's interesting here is that 1 and 0 are indistinguishable. Not sure what would happen if we rewrote the i8 1 to be i8 0, but maybe that unlocks some progress? Actually, I think it does. Once that's done, anding the two conditions becomes more obviously profitable. Maybe we could back propagate constant ranges and canonicalize constants in some way? (Actually, now that I write that, why can't simplify demanded bits do exactly that?!)