Div/rem over select-of-constants

Consider the following code:

template <typename T>
T div(T val, bool b) {
    T a = b ? 3 : 4;
    return val / a;
}

template <typename T>
T mod(T val, bool b) {
    T a = b ? 3 : 4;
    return val % a;
}

With -O3 and -ffast-math enabled, every signed and unsigned integral specialization still emits a div/idiv instruction (-Os does not change this). In this case the div/rem could be folded into the select so that each arm performs its own div-by-constant optimization; however, I noticed that InstCombine deliberately avoids this fold. (See commonIRemTransforms in llvm/Transforms/InstCombine/InstCombineMulDivRem.cpp and srem_select_of_constants_divisor in llvm/test/Transforms/InstCombine/rem.ll.) commonIRemTransforms folds the operation into a select-of-constants divisor only when the dividend is also a constant.
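To make the requested fold concrete, here is a minimal source-level sketch of what it would look like if done by hand. The names div_select, div_folded and mod_folded are hypothetical and chosen only for illustration; this shows the semantics of the fold and is not a statement about what InstCombine does or what any particular compiler will emit.

```cpp
// Original shape: one division whose divisor is a select of two constants.
template <typename T>
constexpr T div_select(T val, bool b) {
    T a = b ? 3 : 4;
    return val / a;
}

// Hand-folded shape: the division is pushed into both arms of the select,
// so each arm divides by a compile-time constant.
template <typename T>
constexpr T div_folded(T val, bool b) {
    return b ? val / 3 : val / 4;
}

// Same idea for the remainder.
template <typename T>
constexpr T mod_folded(T val, bool b) {
    return b ? val % 3 : val % 4;
}
```

Both shapes compute the same result for every input; the only difference is that the folded one exposes a constant divisor in each arm at the cost of a second division in the code.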

Is there any rationale for this design decision?

I also noticed another thing, which probably belongs to a different topic (the x86 backend?). For 64-bit integral types, codegen speculates on whether the dividend fits in 32 bits and branches to (i)divl when it does. This happens with the default architecture (and with architectures up to skylake, if specified). If the point is to save cycles when the dividend fits in 32 bits, shouldn't that case be placed on the branch-not-taken path?

unsigned long div<unsigned long>(unsigned long, bool):                        # @unsigned long div<unsigned long>(unsigned long, bool)
        movq    %rdi, %rax
        movl    %esi, %edx
        movl    $4, %ecx
        subq    %rdx, %rcx
        movq    %rdi, %rdx
        shrq    $32, %rdx
        je      .LBB1_1
        xorl    %edx, %edx
        divq    %rcx
        xorl    %edx, %edx
        divl    %ecx
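As a rough source-level reading of the speculation described above, the check could be sketched as follows. The function name udiv64_bypass is hypothetical, and the 32-bit divisor parameter is an assumption made for this example (the divisor in the thread is always 3 or 4); this is not the backend's actual implementation.

```cpp
#include <cstdint>

// Sketch: pick the narrower division when the 64-bit dividend has no bits
// set in its upper half. The divisor is assumed to fit in 32 bits already.
constexpr std::uint64_t udiv64_bypass(std::uint64_t n, std::uint32_t d) {
    if ((n >> 32) == 0) {
        // Dividend fits in 32 bits: a 32-bit divide is sufficient.
        return static_cast<std::uint32_t>(n) / d;
    }
    // Otherwise fall back to the full 64-bit divide.
    return n / d;
}
```

The question raised in the post is about which of these two paths ends up as the fall-through and which as the taken branch in the generated code.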

Regarding your first question: folding the div into the select can potentially increase the critical path length, so it is only worthwhile when you know that both branches can be simplified after folding. Alternatively, if profile data exists, the simplification can be applied to the hot branch; in your example, that is when the ‘4’ branch is hot.

The second one looks weird. GCC generates cleaner code. You may want to post the question in the x86-backend group.

To clarify the hot-branch point: this works when combined with @apostolakis's cmov optimization, which turns the select into a branch.

folding the div into the select can potentially increase the critical path length

I don’t see how this increases the critical path length. We can detect whether the BB coming into the PHI is itself part of a loop, and if it is not, the instruction count on each path is the same whether or not we fold. That is the rationale behind foldOpIntoPhi, which is used in many places (and it even permits one incoming value that is not a constant).

It is true that folding into the PHI increases the total instruction count (by the number of incoming edges minus 1), but so does replacing a div by a power of 2 with two instructions.
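For reference, the power-of-2 replacement mentioned above can be sketched for a signed divide by 4. The name sdiv4 is hypothetical, and the code assumes that right-shifting a negative value is an arithmetic shift (guaranteed since C++20, implementation-defined but common before that). It illustrates the usual bias-then-shift idea rather than the exact sequence any compiler emits.

```cpp
#include <cstdint>

// Signed division by 4 without a divide instruction.
// Assumption: >> on a negative int32_t is an arithmetic shift.
constexpr std::int32_t sdiv4(std::int32_t x) {
    // x >> 31 is all ones for negative x and zero otherwise; keeping the low
    // two bits of that mask gives a bias of 3 for negative x, 0 otherwise.
    std::int32_t bias =
        static_cast<std::int32_t>(static_cast<std::uint32_t>(x >> 31) >> 30);
    // Adding the bias makes the shift round toward zero, matching x / 4.
    return (x + bias) >> 2;
}
```

The single divide becomes a short sequence of cheap operations, which is the trade of instruction count for latency that the post compares the PHI fold to.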

You are right, what I said was not clear. It does increase the dynamic instruction count and can potentially increase the critical path length due to the added resource pressure. For idiv there is also a cost related to operand setup and so on, so it needs to be handled with more care.