How to best deal with undesirable Induction Variable Simplification?

Hello,
Recently I’ve come across two instances where Induction Variable Simplification led to noticeable performance regressions.

In one case, the removal of an extra IV led to the inability to reschedule instructions in a tight loop to reduce stalls. There were enough registers to spare, so using an extra register for the extra induction variable was preferable, since it reduced dependencies in the loop.
In the second case, there was a big nested loop made even bigger after unswitching. However, the inner loop body was rather simple, of the form:

```
loop {
  p += n;
  p += n;
}
use p;
```

Due to unswitching there were several such loops, each with a different number of p += n ops, so when the IndVars pass rewrote all exit values, it added a lot of slightly different offsets to the main loop header that could not fit in the available registers, which led to unnecessary spills/reloads.
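
To make the shape concrete, here is a minimal compilable sketch (the function and variable names are invented for this example and are not taken from the real code):

```cpp
// Hypothetical shape only, not the real code: after unswitching, each
// version of the inner loop advances p a different number of times,
// and p is used after the loop.
void use(int *p); // placeholder for the real consumer of p

void after_unswitch(int *p, long n, long trip_count, bool variant) {
  if (variant) {
    for (long i = 0; i < trip_count; ++i) {
      p += n;
    }
  } else {
    for (long i = 0; i < trip_count; ++i) {
      p += n;
      p += n;
    }
  }
  // Reusing the final p from the loop costs nothing extra. When IndVars
  // rewrites the exit values instead, each variant gets something like
  //   p_exit = p_start + k * n * trip_count   // k = number of p += n ops
  // and those slightly different offsets, expanded up in the outer loop
  // header, each need their own register across the whole nest.
  use(p);
}
```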

I am wondering what the usual strategy is for dealing with such “pessimizations”. Is it possible to somehow modify the IndVarSimplify pass to take those issues into account (for example, tell it that adding an offset computation + GEP is potentially more expensive than simply reusing the last value from the loop), or should this be recovered in some later pass? If so, is there an easy way to revert IV elimination? Has anyone dealt with similar issues before?

Hello,
Recently I’ve come across two instances where Induction Variable Simplification led to noticeable performance regressions.

In one case, the removal of an extra IV led to the inability to reschedule instructions in a tight loop to reduce stalls. There were enough registers to spare, so using an extra register for the extra induction variable was preferable, since it reduced dependencies in the loop.

Since r139579, IndVarSimplify (the pass) should not normalize
induction variables without a reason anymore (a reason would be that
the loop can be deleted). Could you file a bug report, attach a
minimal .ll file and mention what output you would expect?

Due to unswitching there were several such loops, each with a different number of p += n ops, so when the IndVars pass rewrote all exit values, it added a lot of slightly different offsets to the main loop header that could not fit in the available registers, which led to unnecessary spills/reloads.

Since after unswitching only one of the resulting loops is executed,
the register usage should be the maximum of those loops, which ideally
is at most the register usage of the pre-unswitched loop. In your
case, p could be in the same register in all unswitched loops.
However, other optimizations might increase register pressure again
and the register allocation is not optimal in all cases.

Again, could you file a bug report, include a minimal reproducer and
what output you expect?

I am wondering what the usual strategy is for dealing with such “pessimizations”. Is it possible to somehow modify the IndVarSimplify pass to take those issues into account (for example, tell it that adding an offset computation + GEP is potentially more expensive than simply reusing the last value from the loop), or should this be recovered in some later pass? If so, is there an easy way to revert IV elimination? Has anyone dealt with similar issues before?

Ideally, we prefer such pessimizations not to occur at all, which is
what r139579 did. However, the transformation might also be an IR
normalization that enables other transformations. In that case,
another pass further down the pipeline would transform the normalized
form into an optimized one. For instance, LoopSimplify inserts a loop
preheader that SimplifyCFG would remove again. What is considered
normalization depends on the case. If you can show that a change
generally improves performance (not just for your code) and has at
most minor regressions, then any approach is worth considering.

Michael

Since r139579, IndVarSimplify (the pass) should not normalize induction variables without a reason anymore (a reason would be that the loop can be deleted). Could you file a bug report, attach a minimal .ll file and mention what output you would expect?

The IV is removed there by replaceCongruentIVs. That is what I'd probably expect when looking at the IR alone, but, as I've mentioned, this prevents latency masking later down the line, since certain ops now use a single common register.
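
To illustrate the kind of pattern I mean (a made-up example, not the actual kernel): with two pointers stepping in lockstep, the loads have independent address dependencies and can be scheduled apart; once the second pointer is rewritten in terms of the first, both loads hang off the same register.

```cpp
// Hypothetical illustration only. q is congruent to p (same stride,
// constant offset), so IndVars is entitled to fold q into p, at which
// point both loads chain on a single address register.
long sum_pairs(const long *p, long count /* assumed even */) {
  const long *q = p + 1;  // second IV, offset by one element
  long acc0 = 0, acc1 = 0;
  for (long i = 0; i < count; i += 2) {
    acc0 += *p;  // address depends only on p
    acc1 += *q;  // with its own register, independent of p
    p += 2;
    q += 2;
  }
  return acc0 + acc1;
}
```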

Since after unswitching only one of the resulting loops is executed, the register usage should be the maximum of those loops, which ideally is at most the register usage of the pre-unswitched loop. In your case, p could be in the same register in all unswitched loops.

However, other optimizations might increase register pressure again and the register allocation is not optimal in all cases.

It looks like, for some reason, when IndVars rewrote all loop exit values (which were just pointers incremented in the loop body) from simple single-value phis to GEPs with a recomputed offset (backedge count * increment inside the loop), it expanded this offset computation in the main outermost loop (pre?)header even when the value was used only inside one of the unswitched loops’ exits. Later passes failed to sink these computations for whatever reason, so in the end, instead of max(unswitched loop regs), it became max(unswitched loop regs) + Const * number of loops (for the offsets, even though many were shared).

I'll see if I can come up with a minimal reproducer for some in-tree target.

Hi Hal,

I see. So LSR could theoretically counteract undesirable Ind Var transformations but it’s not implemented at the moment?

I think I’ve managed to come up with a small reproducer that can also exhibit similar problem on x86, here it is: https://godbolt.org/z/_wxzut

As you can see, when rewriteLoopExitValues is not disabled, Clang generates worse code due to additional spills, because IndVars rewrites all exit values of ‘a’ to recompute its value instead of reusing the value from the loop body. This requires extra registers for the new “a after the loop” value (since it is not simply reused) and also to store the new “offset”, which leads to the extra spills, since they all live across the big loop body. When exit values are not rewritten, ‘a’ stays in its `r15d` register with no extra cost.

I see. So LSR could theoretically counteract undesirable Ind Var transformations but it’s not implemented at the moment?

I believe that’s correct.

I think I’ve managed to come up with a small reproducer that can also exhibit similar problem on x86, here it is: https://godbolt.org/z/_wxzut

Thanks. Can you please file a bug on https://bugs.llvm.org/? (if you can’t, let us know, and one of us can do it).

-Hal

I have to ask a further question here. Why are the spills/fills problematic? If they happen outside said loops - as you’d expect from the example - there is at worst a code size impact. Is there something more going on? (i.e. are the loops very short-running or something?)

Hi Hal,

I see. So LSR could theoretically counteract undesirable Ind Var
transformations but it’s not implemented at the moment?

I think I’ve managed to come up with a small reproducer that can also
exhibit similar problem on x86, here it is: https://godbolt.org/z/_wxzut

As you can see, when rewriteLoopExitValues is not disabled, Clang
generates worse code due to additional spills, because IndVars
rewrites all exit values of ‘a’ to recompute its value instead of
reusing the value from the loop body. This requires extra registers
for the new “a after the loop” value (since it is not simply reused)
and also to store the new “offset”, which leads to the extra spills,
since they all live across the big loop body. When exit values are
not rewritten, ‘a’ stays in its `r15d` register with no extra cost.

This hits on a point I've thought some about, but haven't tried to
implement.

I think there might be room for a late pass which undoes the exit value
rewriting. As an analogy, we have MachineLICM which sometimes undoes
the transforms performed by LICM, but we still want the IR form to hoist
aggressively for ease of optimization and analysis.

Maybe this should be part of LSR, or maybe separate. Haven't thought
about that part extensively.

It's worth noting that the SCEV for the exit value of the in-loop
value and the SCEV for the rewritten exit value should be identical.
So recognizing the case for potential rewriting is quite
straightforward. The profitability reasoning might be more involved,
but the legality part should essentially be handled by SCEV, and
should be able to reuse exactly the same code as RLEV.
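
To make the recognition step concrete, here is a rough sketch (my placeholder structure and names, not the actual POC) that only flags candidates, using existing SCEV queries:

```cpp
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Flag loop-invariant computations in the exit block whose SCEV matches
// the exit value of something computed inside the loop, i.e. candidates
// for undoing exit-value rewriting.
static void findRedundantExitComputations(Loop &L, ScalarEvolution &SE) {
  BasicBlock *Exit = L.getUniqueExitBlock();
  if (!Exit)
    return;

  for (BasicBlock *BB : L.blocks()) {
    for (Instruction &Inst : *BB) {
      if (!SE.isSCEVable(Inst.getType()))
        continue;
      // The value of Inst as seen from outside L (its exit value); this is
      // the same query rewriteLoopExitValues itself uses.
      const SCEV *ExitSCEV = SE.getSCEVAtScope(&Inst, L.getParentLoop());
      if (isa<SCEVCouldNotCompute>(ExitSCEV) ||
          !SE.isLoopInvariant(ExitSCEV, &L))
        continue;

      // SCEVs are uniqued, so a pointer compare is enough to spot an
      // exit-block instruction that merely recomputes Inst.
      for (Instruction &Cand : *Exit) {
        if (isa<PHINode>(Cand) || !SE.isSCEVable(Cand.getType()))
          continue;
        if (SE.getSCEV(&Cand) == ExitSCEV) {
          // Candidate: Cand could be replaced by an LCSSA phi of Inst.
          // Profitability reasoning and the actual rewrite go here.
        }
      }
    }
  }
}
```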

I’ve reported the bug at https://bugs.llvm.org/show_bug.cgi?id=42965

I’ve noticed that there was an attempt to mitigate the ExitValues problem in https://reviews.llvm.org/D12494 that went nowhere. Were there particular issues with that approach?

Wasn’t aware of this patch. No, I don’t see an obvious reason why it wasn’t followed up on.

Philip

Thanks. I’ve rebased this patch on top of the recent LLVM (it was straightforward) and applied it in my fork.

It seems to have solved one of the problems I was having. Would LLVM be interested if I submitted the updated version for review?

I’d be happy to review a patch, but you need to make sure you have buy-in from the original patch author. I’m not entirely sure of the licensing implications of you submitting someone else’s code, so it’s best to avoid the discussion if we can. (If anyone can give a conclusive answer here, please chime in.)

Philip

I did a fresh implementation of the idea, having deliberately not looked at the previous patch much. I only took it as far as a POC, but you’re welcome to take it the rest of the way if you want.

I may find some time to keep playing with this, or I may not.

Philip

Thanks!

From a quick glance, it looks like it tries to replace instruction operands case by case instead of identifying loop-invariant exit values (created by IndVars) and replacing all their uses with the value from the loop, similar to what the other attempt did. I don’t know how these approaches compare in practice. Anyway, thanks for coming up with this. I’ve left some comments in the review.
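
For reference, my reading of that other approach is that, once a candidate pair has been found, the rewrite itself boils down to something like the following (my own sketch with placeholder names, not code from either patch):

```cpp
#include "llvm/IR/Instructions.h"
#include "llvm/Transforms/Utils/Local.h"
using namespace llvm;

// Given a loop-invariant recomputation created by exit value rewriting and
// the LCSSA phi that carries the same value out of the loop, redirect every
// use to the in-loop value and clean up. Finding the pair and proving the
// replacement profitable is the hard part and is assumed to have been done.
static void reuseLoopValue(Instruction *Recomputed, PHINode *LCSSAPhi) {
  Recomputed->replaceAllUsesWith(LCSSAPhi);
  RecursivelyDeleteTriviallyDeadInstructions(Recomputed);
}
```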