[IndVarSimplify] Narrow IVs are not eliminated, resulting in inefficient code

Hi,

Could you please take a look at an IndVarSimplify / SCEV problem I ran into with the latest LLVM?

IndVarSimplify sometimes fails to eliminate narrow IVs even though it would be possible to do so. This can affect other LLVM passes and result in inefficient code. The reproducing test ‘indvar_test.cpp’ is attached.

The problem is with the second ‘for’ loop that accesses array elements with different indexes on each iteration.
The latest LLVM fails to reuse array element values from previous iterations and generates an unnecessary GEP. The generated IR is shown in the attached file ‘bad.ir’.
This happens because IndVarSimplify fails to eliminate ‘%idxprom7’ and ‘%idxprom10’.
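
For readers without the attachments, the general pattern can be illustrated with a sketch like the one below. This is not the attached ‘indvar_test.cpp’; the function name ‘fill_from_previous’, the array and the bounds are invented for illustration. It only shows the shape described above: a 32-bit ‘int’ counter left over from a first loop is used, with small negative offsets, to index an array in a second loop. On a 64-bit target each such index has to be sign-extended before the address computation unless the counter is widened, and the elements read on one iteration were written on earlier ones.

```cpp
#include <cassert>

// Hypothetical illustration only; NOT the attached indvar_test.cpp.
// 'i' is a narrow (32-bit) induction variable on a 64-bit target.
void fill_from_previous(int *a, int n) {
    assert(a != nullptr || n <= 0);
    int i = 0;
    // First loop: its final value of 'i' is the start of the second loop
    // (roughly the role an '%inc.lcssa'-style value plays in the IR).
    for (; i < n && i < 2; ++i)
        a[i] = 1;
    // Second loop: indexes i, i - 1 and i - 2 are all derived from the
    // narrow counter; a[i - 1] and a[i - 2] were stored on earlier iterations.
    for (; i < n; ++i)
        a[i] = a[i - 1] + a[i - 2];
}
```

If the narrow counter is widened to 64 bits, the per-access sign extensions are no longer needed and the values from previous iterations are easier for later passes to reuse.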

The clang command line we use:
clang++ -mllvm -debug -S -emit-llvm -O3 --target=aarch64-linux-elf indvar_test.cpp -o bad.ir

I found that ‘WidenIV::widenIVUse’ (IndVarSimplify.cpp) may fail to widen narrow IV uses.

When the function gets a NarrowUse such as ‘{(-2 + %inc.lcssa),+,1}<%for.body3>’, it first tries to get a wide recurrence for it via the ‘getWideRecurrence’ call.
‘getWideRecurrence’ returns a recurrence like this: ‘{(sext i32 (-2 + %inc.lcssa) to i64),+,1}<%for.body3>’, which is fine by itself.

Then a wide use operation is generated by ‘cloneIVUser’. The generated wide use evaluates to ‘{(-2 + (sext i32 %inc.lcssa to i64)),+,1}<%for.body3>’, which differs from the ‘getWideRecurrence’ result (please note the position of -2). ‘widenIVUse’ sees the difference and returns nullptr.
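
The two SCEV forms differ only in whether the sign extension is applied before or after adding -2. They denote the same value exactly when the 32-bit addition does not wrap in the signed sense, which is why no-wrap information matters here. A small standalone check in plain C++ (no LLVM APIs; the helper names ‘sext_of_sum’ and ‘sum_of_sext’ are made up for this sketch):

```cpp
#include <cstdint>

// sext(x + c): add in 32 bits with wraparound, then sign-extend to 64 bits.
int64_t sext_of_sum(int32_t x, int32_t c) {
    // Unsigned arithmetic gives well-defined modulo-2^32 wraparound.
    uint32_t wrapped = static_cast<uint32_t>(x) + static_cast<uint32_t>(c);
    // Conversion back to int32_t assumes two's complement
    // (guaranteed since C++20, universal in practice before that).
    return static_cast<int64_t>(static_cast<int32_t>(wrapped));
}

// sext(x) + sext(c): sign-extend both operands first, then add in 64 bits.
int64_t sum_of_sext(int32_t x, int32_t c) {
    return static_cast<int64_t>(x) + static_cast<int64_t>(c);
}
```

For x = 10 and c = -2 both helpers return 8. For x = INT32_MIN and c = -2 the 32-bit addition wraps, so the results differ; moving the extension across the addition is therefore only safe when the addition is known not to wrap.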

I attached a test patch ‘indvar.patch’, which is not correct for all cases, but it fixes the specific ‘indvar_test.cpp’ scenario to demonstrate the efficient code that could have been generated (good.ir).
It transforms expressions like ‘(sext i32 (-2 + %inc.lcssa) to i64)’ into ‘-2 + (sext i32 %inc.lcssa to i64)’ so that the expression comparison succeeds. The IVs are then eliminated, as can be seen in the ‘-mllvm -debug’ output.

The problem with the patch is that it uses the wrong extension logic for the ‘-2’ operand: the operand must be sign- or zero-extended depending on the context.
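
To make that last point concrete: the 32-bit constant -2 widens to two very different 64-bit values depending on the kind of extension, so hoisting it out of an extend is only valid with the matching kind. A plain C++ sketch (the helper names ‘sign_extend’ and ‘zero_extend’ are hypothetical and unrelated to the LLVM API):

```cpp
#include <cstdint>

// Sign extension: replicate the top bit of the 32-bit value into the upper 32 bits.
int64_t sign_extend(int32_t v) {
    return static_cast<int64_t>(v);
}

// Zero extension: reinterpret the 32 bits as unsigned and fill the upper 32 bits with zeros.
uint64_t zero_extend(int32_t v) {
    return static_cast<uint64_t>(static_cast<uint32_t>(v));
}
```

Here ‘sign_extend(-2)’ yields -2, while ‘zero_extend(-2)’ yields 4294967294 (0xFFFFFFFE), so picking the wrong extension for the constant changes the widened start value of the recurrence.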

Could you please check and confirm the problem, and give any hints on how it might be fixed properly?

Thank you.

bad.ir (2.73 KB)

good.ir (1.88 KB)

indvar.patch (989 Bytes)

indvar_test.cpp (229 Bytes)

Hi Oleg,

I think the problem here is that SCEV forgets to propagate no-wrap
flags when folding "{S,+,X}+T ==> {S+T,+,X}".
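
As a reminder of what that fold does: an add recurrence {S,+,X} has the value S + i*X on iteration i, so adding a loop-invariant T to it gives {S+T,+,X}. The toy model below is plain C++ with invented types (‘AddRec’, ‘foldInvariant’); it is not the ScalarEvolution API and it ignores overflow entirely. Which no-wrap flags may be kept on the inner S+T is exactly the question under discussion, and the sketch does not model it.

```cpp
#include <cstdint>

// Toy affine add recurrence {Start,+,Step}; not llvm::SCEVAddRecExpr.
struct AddRec {
    int64_t Start;
    int64_t Step;
    // Value on the i-th loop iteration (i counts from 0).
    int64_t at(int64_t i) const { return Start + i * Step; }
};

// {S,+,X} + T  ==>  {S+T,+,X} for a loop-invariant T:
// the invariant is folded into the start value, the step is unchanged.
AddRec foldInvariant(const AddRec &AR, int64_t T) {
    return AddRec{AR.Start + T, AR.Step};
}
```

For example, folding T = -2 into {5,+,1} gives {3,+,1}, and on every iteration the folded recurrence equals the original value minus 2.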

I haven't carefully thought about the implications and whether the
change is even correct, but the appended patch fixes the test case
you've attached. I'll give it some more thought and if it holds up
I'll check it in in the next few days. Meanwhile if you have a larger
test case that you extracted indvar_test.cpp from, I'd be interested
in hearing if this change works there as well.

diff --git a/lib/Analysis/ScalarEvolution.cpp b/lib/Analysis/ScalarEvolution.cpp
index 39ced1e..2e87902 100644
--- a/lib/Analysis/ScalarEvolution.cpp
+++ b/lib/Analysis/ScalarEvolution.cpp
@@ -2274,19 +2274,19 @@ const SCEV *ScalarEvolution::getAddExpr(SmallVectorImpl<const SCEV *> &Ops,
        }

      // If we found some loop invariants, fold them into the recurrence.
      if (!LIOps.empty()) {
        // NLI + LI + {Start,+,Step} --> NLI + {LI+Start,+,Step}
        LIOps.push_back(AddRec->getStart());

        SmallVector<const SCEV *, 4> AddRecOps(AddRec->op_begin(),
                                               AddRec->op_end());
-       AddRecOps[0] = getAddExpr(LIOps);
+       AddRecOps[0] = getAddExpr(LIOps, Flags);

        // Build the new addrec. Propagate the NUW and NSW flags if both the
        // outer add and the inner addrec are guaranteed to have no overflow.
        // Always propagate NW.
        Flags = AddRec->getNoWrapFlags(setFlags(Flags, SCEV::FlagNW));
        const SCEV *NewRec = getAddRecExpr(AddRecOps, AddRecLoop, Flags);

        // If all of the other operands were loop invariant, we are done.
        if (Ops.size() == 1) return NewRec;

Thanks!
-- Sanjoy

Hi Sanjoy,

Thank you for looking into this!
Yes, your patch fixes my larger test case too. My algorithm gets a 2x performance improvement with the patch, because the loop now contains fewer instructions and is unrolled without any extra #pragmas.

I also ran the LLVM tests against the patch. There are 6 new failures:
Analysis/LoopAccessAnalysis/number-of-memchecks.ll
Analysis/LoopAccessAnalysis/reverse-memcheck-bounds.ll
Analysis/ScalarEvolution/flags-from-poison.ll
Analysis/ScalarEvolution/nsw-offset-assume.ll
Analysis/ScalarEvolution/nsw-offset.ll
Analysis/ScalarEvolution/nsw.ll

I haven’t inspected these failures in detail yet, but it’s likely the tests merely need to be adjusted to handle the new no-wrap flags the patch introduced. I will double-check this soon.

Kind regards,
Oleg

Hi Sanjoy,

Attached is a patch that fixes the LLVM test regressions caused by this fix. It only adds to the tests the entries that the fix has introduced.

Would you have a couple of minutes to check it, please?

I would also like to share two differences in the opt logs I noticed for these tests.

1. With the fix, LLVM may split the sext of a sum into a sum of two sexts during SCEV analysis. This does not affect the final IR, however: it is the same for all the patched tests with and without your fix. This can be seen, for example, in Analysis/ScalarEvolution/flags-from-poison.ll. LLVM without the fix prints this:

scev-tests.patch (8.65 KB)

Hi Sanjoy,

Could you let me know if you have had a chance to check the patch and the LLVM test corrections, please?

Thanks!

Hi Oleg,

Sorry for letting this slide -- the easiest way to shame me into looking at a patch is to put it on Phabricator [1] and add me as a reviewer. That way the patch will keep showing up in my "pending action" queue till I review it. :)

On the surface, the test corrections look fine to me, and I think the original fix is correct. Do you mind taking all of this and putting it up on Phabricator? Phabricator will also make it easy for me to more thoroughly check if the test fixes are correct.

[1]: http://llvm.org/docs/Phabricator.html

Thanks,
-- Sanjoy

Oleg Ranevskyy wrote: