InstCombine's select optimizations don't trigger reliably

Looking at foldSelectCttzCtlz, I noticed that the optimization fails to apply in cases where it should. E.g. it triggers for:

  int f1(uint64_t n) { return n == 0 ? 64 : __builtin_ctzll(n); }

but not for:

  int f2(uint64_t n) { uint64_t cn = ~n; return cn == 0 ? 64 : __builtin_ctzll(cn); }

And then, looking further, I noticed that the same problem affects many (maybe even most?) of the other select optimizations in InstCombine. For example, foldSelectZeroOrMul should trigger on

  int f3(int n, int m) { int cn = ~n; return cn == 0 ? 0 : cn * m; }

but doesn't, for the same reason.

The reason this doesn't work seems to be a combination of inflexibility in the pattern matching and the order in which instructions are folded.

Going back to f2, before InstCombine the IR looks like:

  %neg = xor i64 %n, -1
  %cmp = icmp eq i64 %neg, 0
  %0 = call i64 @llvm.cttz.i64(i64 %neg, i1 false)
  %cast = trunc i64 %0 to i32
  %cond = select i1 %cmp, i32 64, i32 %cast

And, as written, this should trigger the optimization. However, InstCombine visits the icmp instruction first, before it gets to the select (where the cttz optimization would trigger), and folds the icmp into the equivalent instruction:

  %cmp = icmp eq i64 %n, -1

After that, the IR no longer matches the required pattern: the first operand of the icmp is no longer %neg (which is what has to match the first argument of @llvm.cttz), and the constant operand is -1 rather than the expected 0.
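
For reference, after that fold the full sequence looks roughly like this (a sketch assembled from the snippets above; the exact value names may differ):

  %neg = xor i64 %n, -1
  %cmp = icmp eq i64 %n, -1
  %0 = call i64 @llvm.cttz.i64(i64 %neg, i1 false)
  %cast = trunc i64 %0 to i32
  %cond = select i1 %cmp, i32 64, i32 %cast

The select's condition no longer refers to %neg at all, so the fold can't connect the compare to the cttz operand.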

Is this problem already being tracked? I didn't find a pre-existing report, but it's hard to search for. Is there some good way to solve this issue in general? E.g. allowing matches against an arbitrarily-folded icmp, or deferring icmp folding until after the select pattern matchers have run, or… something?

In general, we can't bypass a fold in InstCombine and have a complete solution. That's because, as I think you've shown with the alternate example, the source might already be in the form that we wanted to avoid.

We probably just need to account for more variations in the general pattern. We may be able to break the larger sequences down into smaller folds and reduce the extra matching requirements.

For example, we don’t seem to have this toggle for ctlz/cttz:

I don’t know of any bug reports tracking this set of problems. If you can file issues, that would be great, so we can track progress/fixes.

Yes, the source may indeed already be in the "result of icmp fold" form. However, to me, that feels like a lesser problem than the fact that a given source idiom is unreliably optimized. It would be better if n == -1 ? 64 : __builtin_ctzll(~n) also got its branch removed, but that form is not likely to be commonly written, so missing it seems only marginally unfortunate. But n == 0 ? 64 : __builtin_ctzll(n) is basically the required idiom, so having its optimization trigger unreliably, depending on which operation produced "n", seems significantly more undesirable.

Yet, I don't see how simply accounting for more variations is a real solution: there are just too many possible icmp folds for us to be able to account for all of them manually in every one of the select folds. (We could add a few more special cases, sure, and maybe that's worthwhile doing, but it doesn't actually solve the problem.)

It feels like we need some generic way to determine that when a select fold wants to match an instruction icmp %input, 0, it must actually instead try to match an icmp %other_value, SomeConstant instruction, because that other instruction would be the result of the icmp fold on the instruction we wanted to match. But I'm not sure how that would actually work…
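
To restate that in terms of the f2 example above (this just spells out the correspondence from the earlier snippets, not a proposal for how the matcher would be implemented), %other_value would be %n and SomeConstant would be -1:

  ; condition the cttz fold wants to see (%neg is the cttz operand):
  %cmp = icmp eq i64 %neg, 0

  ; condition that is actually present after the icmp fold,
  ; given %neg = xor i64 %n, -1:
  %cmp = icmp eq i64 %n, -1

So the generic mechanism would somehow have to recognize, for whatever condition it finds, which pre-fold icmp it could have come from.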