[RFC] New pass: LoopExitValues

Hello LLVM,
This is a proposal for a new pass that improves performance and code
size in some nested loop situations. The pass is target independent.

From the description in the file header:

This optimization finds loop exit values that are reevaluated after the
loop executes and replaces them with the corresponding exit values when
they are available. Such sequences can arise after the
SimplifyIndVars+LoopStrengthReduce passes. This pass should be run
after LoopStrengthReduce.
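
To illustrate the pattern described above, here is a small hypothetical
example of my own (not part of the file header or the patch):

// Hypothetical illustration only -- not a test case from the patch.
// The caller guarantees that Base points at at least N + 1 elements.
unsigned sum_then_peek(const unsigned *Base, unsigned N) {
  unsigned Sum = 0;
  for (unsigned i = 0; i < N; ++i)  // LSR may rewrite this to use a pointer IV
    Sum += Base[i];
  // The address Base + N recomputes the exit value the pointer IV already
  // holds when the loop terminates; that leftover value is what the pass
  // tries to reuse.
  return Sum + Base[N];
}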

A former colleague created this pass back in LLVM 2.9 and we've been
using it ever since. I've done some light refactoring and
modernization.

This pass broke 4 existing tests that were sensitive to generated
code. I've corrected all these, but please give them special
scrutiny.

The patch is available here: D12494 (New IR pass: LoopExitValues)

Please advise.

Regards,
-steve

Do you have some specific performance measurements?

Jake

Averaging 4 runs of 10000 iterations each of Coremark on my X86_64
desktop showed:

-O2 performance: 2.9% faster with the L.E.V. pass
-Os size: 1.5% smaller with the L.E.V. pass

In the case of Coremark, the benefit comes mainly from the matrix
portion of the benchmark, which uses nested loops. Similarly, I used a
matrix multiplication for the regression test, as shown below. The
L.E.V. pass eliminated 4 instructions.

void matrix_mul(unsigned int Size, unsigned int *Dst,
                unsigned int *Src, unsigned int Val) {
  for (int Outer = 0; Outer < Size; ++Outer)
    for (int Inner = 0; Inner < Size; ++Inner)
      Dst[Outer * Size + Inner] = Src[Outer * Size + Inner] * Val;
}

[assembly listings with and without the LoopExitValues pass omitted]

Hi,

Coremark really isn’t a good enough test - have you run the LLVM test suite with this patch, and what were the performance differences?

I’m still a bit confused about what pattern exactly this pass is supposed to trigger on. I understand the mechanics, but I still can’t quite see what patterns it would be useful on. You’ve mentioned matrix multiply - how does this pass alter the IR? What value is it avoiding being recomputed? How does this pass affect register pressure?

Also, your example just removes a mov and an add - the push/pops are just register allocation (unless your pass in fact reduces register pressure?)

A bit more clarification would be great.

Cheers,

James

Hi,

> Coremark really isn't a good enough test - have you run the LLVM test suite
> with this patch, and what were the performance differences?

For the test suite's single-source benchmarks, 235 tests improved
performance, 2 regressed, and 705 were unchanged. That seems very
optimistic. Comparing consecutive runs with identical settings shows
there is a lot of noise in the performance data. Tips for getting
stable results would be appreciated.

> I'm still a bit confused about what pattern exactly this pass is supposed to
> trigger on. I understand the mechanics, but I still can't quite see what
> patterns it would be useful on. You've mentioned matrix multiply - how does
> this pass alter the IR?

Here's the before-and-after IR for the matrix_mul example. Notice the
two bitcasts, %1 and %2, generated in the for.cond.cleanup.3 block.
The L.E.V. pass replaces these with scevgep values that already exist.

*** Code after LSR ***

; Function Attrs: nounwind optsize
define void @matrix_mul(i32 %Size, i32* nocapture %Dst, i32* nocapture readonly %Src, i32 %Val) #0 {
entry:
  %cmp.25 = icmp eq i32 %Size, 0
  br i1 %cmp.25, label %for.cond.cleanup, label %for.body.4.lr.ph.preheader

for.body.4.lr.ph.preheader:                       ; preds = %entry
  %0 = shl i32 %Size, 2
  br label %for.body.4.lr.ph

for.body.4.lr.ph:                                 ; preds = %for.body.4.lr.ph.preheader, %for.cond.cleanup.3
  %lsr.iv5 = phi i32* [ %Src, %for.body.4.lr.ph.preheader ], [ %2, %for.cond.cleanup.3 ]
  %lsr.iv1 = phi i32* [ %Dst, %for.body.4.lr.ph.preheader ], [ %1, %for.cond.cleanup.3 ]
  %Outer.026 = phi i32 [ %inc10, %for.cond.cleanup.3 ], [ 0, %for.body.4.lr.ph.preheader ]
  %lsr.iv56 = bitcast i32* %lsr.iv5 to i1*
  %lsr.iv12 = bitcast i32* %lsr.iv1 to i1*
  br label %for.body.4

for.cond.cleanup.loopexit:                        ; preds = %for.cond.cleanup.3
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
  ret void

for.cond.cleanup.3:                               ; preds = %for.body.4
  %inc10 = add nuw nsw i32 %Outer.026, 1
  %scevgep = getelementptr i1, i1* %lsr.iv12, i32 %0
  %1 = bitcast i1* %scevgep to i32*
  %scevgep7 = getelementptr i1, i1* %lsr.iv56, i32 %0
  %2 = bitcast i1* %scevgep7 to i32*
  %exitcond27 = icmp eq i32 %inc10, %Size
  br i1 %exitcond27, label %for.cond.cleanup.loopexit, label %for.body.4.lr.ph

for.body.4:                                       ; preds = %for.body.4, %for.body.4.lr.ph
  %lsr.iv8 = phi i32* [ %scevgep9, %for.body.4 ], [ %lsr.iv5, %for.body.4.lr.ph ]
  %lsr.iv3 = phi i32* [ %scevgep4, %for.body.4 ], [ %lsr.iv1, %for.body.4.lr.ph ]
  %lsr.iv = phi i32 [ %lsr.iv.next, %for.body.4 ], [ %Size, %for.body.4.lr.ph ]
  %3 = load i32, i32* %lsr.iv8, align 4, !tbaa !1
  %mul5 = mul i32 %3, %Val
  store i32 %mul5, i32* %lsr.iv3, align 4, !tbaa !1
  %lsr.iv.next = add i32 %lsr.iv, -1
  %scevgep4 = getelementptr i32, i32* %lsr.iv3, i32 1
  %scevgep9 = getelementptr i32, i32* %lsr.iv8, i32 1
  %exitcond = icmp eq i32 %lsr.iv.next, 0
  br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4
}

*** Code after Loop Exit Values Optimization ***

; Function Attrs: nounwind optsize
define void @matrix_mul(i32 %Size, i32* nocapture %Dst, i32* nocapture readonly %Src, i32 %Val) #0 {
entry:
  %cmp.25 = icmp eq i32 %Size, 0
  br i1 %cmp.25, label %for.cond.cleanup, label %for.body.4.lr.ph.preheader

for.body.4.lr.ph.preheader:                       ; preds = %entry
  br label %for.body.4.lr.ph

for.body.4.lr.ph:                                 ; preds = %for.body.4.lr.ph.preheader, %for.cond.cleanup.3
  %lsr.iv5 = phi i32* [ %Src, %for.body.4.lr.ph.preheader ], [ %scevgep9, %for.cond.cleanup.3 ]
  %lsr.iv1 = phi i32* [ %Dst, %for.body.4.lr.ph.preheader ], [ %scevgep4, %for.cond.cleanup.3 ]
  %Outer.026 = phi i32 [ %inc10, %for.cond.cleanup.3 ], [ 0, %for.body.4.lr.ph.preheader ]
  br label %for.body.4

for.cond.cleanup.loopexit:                        ; preds = %for.cond.cleanup.3
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
  ret void

for.cond.cleanup.3:                               ; preds = %for.body.4
  %inc10 = add nuw nsw i32 %Outer.026, 1
  %exitcond27 = icmp eq i32 %inc10, %Size
  br i1 %exitcond27, label %for.cond.cleanup.loopexit, label %for.body.4.lr.ph

for.body.4:                                       ; preds = %for.body.4, %for.body.4.lr.ph
  %lsr.iv8 = phi i32* [ %scevgep9, %for.body.4 ], [ %lsr.iv5, %for.body.4.lr.ph ]
  %lsr.iv3 = phi i32* [ %scevgep4, %for.body.4 ], [ %lsr.iv1, %for.body.4.lr.ph ]
  %lsr.iv = phi i32 [ %lsr.iv.next, %for.body.4 ], [ %Size, %for.body.4.lr.ph ]
  %0 = load i32, i32* %lsr.iv8, align 4, !tbaa !1
  %mul5 = mul i32 %0, %Val
  store i32 %mul5, i32* %lsr.iv3, align 4, !tbaa !1
  %lsr.iv.next = add i32 %lsr.iv, -1
  %scevgep4 = getelementptr i32, i32* %lsr.iv3, i32 1
  %scevgep9 = getelementptr i32, i32* %lsr.iv8, i32 1
  %exitcond = icmp eq i32 %lsr.iv.next, 0
  br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4
}
> What value is it avoiding being recomputed?

I'm not precisely sure, but it's residue from LSR. When a loop exits,
the pass checks all computable SCEV values, and in this case it found
GEPs with the same value.
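
As a very rough sketch of the mechanics -- my own paraphrase of the
idea, not the code in D12494, with a made-up helper name and simplifying
assumptions (a single exit block, matching whole instructions by their
exit-scope SCEVs) -- the core check looks something like this:

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/IR/Dominators.h"
using namespace llvm;

// Sketch only: for each instruction in the loop's exit block, look for an
// in-loop instruction whose value at loop exit (its SCEV evaluated in the
// parent scope) is the same uniqued SCEV.  If that in-loop value dominates
// the post-loop user, reuse it instead of recomputing; the now-dead
// recomputation is left for later cleanup.
static bool reuseLoopExitValues(Loop *L, ScalarEvolution &SE,
                                DominatorTree &DT) {
  BasicBlock *ExitBB = L->getExitBlock(); // assumption: unique exit block
  if (!ExitBB)
    return false;
  bool Changed = false;
  for (Instruction &Post : *ExitBB) {
    if (!SE.isSCEVable(Post.getType()))
      continue;
    const SCEV *Wanted = SE.getSCEV(&Post);
    Instruction *Avail = nullptr;
    for (BasicBlock *BB : L->blocks()) {
      for (Instruction &InLoop : *BB) {
        if (InLoop.getType() != Post.getType())
          continue;
        // Exit value of the in-loop computation, as seen from outside L.
        const SCEV *AtExit = SE.getSCEVAtScope(&InLoop, L->getParentLoop());
        if (AtExit == Wanted && DT.dominates(&InLoop, &Post)) {
          Avail = &InLoop;
          break;
        }
      }
      if (Avail)
        break;
    }
    if (Avail) {
      Post.replaceAllUsesWith(Avail);
      Changed = true;
    }
  }
  return Changed;
}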

> How does this pass affect register pressure?
> Also, your example just removes a mov and an add - the push/pops are just
> register allocation (unless your pass in fact *reduces* register pressure?)

Right, the computation eliminated is the mov and add. The register
savings are a byproduct.

Regards,
-steve

Hello LLVM,
It seems this thread has gone cold. Is there some low-risk way for
the community to take the new pass for a test drive?
Regards,
-steve

Which cases does this pass handle that aren't otherwise optimized out by passes like GlobalValueNumbering or DeadCodeElimination?

Thanks,
Jake VanAdrighem

The case uniquely handled here is one in which a computation after a
loop happens to be redundant with a value left over from loop
execution. At each nesting level of N nested loops, the pass compares
computable SCEVs looking for matches. This sounds specialized to me,
but I'm not knowledgeable enough about LLVM's many passes to know which
one comes closest to this sort of optimization. My colleague who
originally wrote the pass noted that the redundancies found by this
pass pop up after loop transforms like LSR.

Previously, I showed examples of the type of optimization performed by
this pass -- they're bulky so I won't cut and paste here. In
particular, check out the before-and-after IR example James asked for.

Regards,
-steve

Hi Steve

it seems the general consensus is that the patch feels like a work-around for a problem with LSR (and possibly other loop transformations) that introduces redundant instructions. It is probably best to file a bug with a few of your test cases.

Thanks
Gerolf

Hi Steve,

I believe that others have been looking recently at a very similar set of problems. For example, Wei Mi posted this patch:

  http://reviews.llvm.org/D12090

and I wonder if this would also address your use cases. If not, I think we'd like to understand why. Also, if these redundant expressions involve induction variables, then that's something that IndVarSimplify is already supposed to do, and if it is missing cases, then we should improve that pass (and, thus, folding what you've done into IndVarSimplify might be the right way to go).

-Hal

Hal, thanks for the info. Turns out IndVarSimplify intentionally
creates redundant calculation of loop exit values!

// 2. Any use outside of the loop of an expression derived from the indvar
// is changed to compute the derived value outside of the loop, eliminating
// the dependence on the exit value of the induction variable. If the only
// purpose of the loop is to compute the exit value of some derived
// expression, this transformation will make the loop dead.
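
For concreteness, here is a tiny illustration of my own (not taken from
IndVarSimplify) of the rewrite that comment describes:

unsigned count_up(unsigned N) {
  unsigned i = 0;
  for (; i < N; ++i)
    ;              // loop body intentionally empty
  // IndVarSimplify rewrites this post-loop use of i to the closed-form
  // exit value N, removing the dependence on the induction variable; the
  // now-useless loop can then be deleted.
  return i;        // effectively becomes: return N;
}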

I understand the stated purpose for the redundancy, but there is no
mention of undoing the damage. The LoopExitValues pass seems to be
exactly the cleanup that was missing. Do passes coming after IVS like
LSR need the redundant calculations in place to identify dead loops or
should IVS have cleaned up after itself immediately?

Regards,
-steve

The basic idea of your proposed transform makes sense to me. Reversing the canonicalization above to remove redundant computation seems like a worthwhile goal. However, your transformation would have the effect of increasing the live range of the variable in the loop and introducing an extra scheduling dependency. I'm not entirely sure how that would work out in practice if the use outside the loop is a ways from the loop exit. Have you given this any thought? I suspect that the profitability heuristic would need tuning to get right (or at least not wrong for key cases).

I was also under the impression that LSR was supposed to catch exactly these types of cases. It might be that your pass should be part of LSR instead. Anyone with more knowledge of LSR (Andy, Sanjoy, Hal) have an opinion here?

Philip

I tried the patch on the matrix_mul test case and saw that the
redundancy was removed. My patch in http://reviews.llvm.org/D12090
cannot handle this case; I feel it is a different problem.

void matrix_mul(unsigned int Size, unsigned int *Dst,
                unsigned int *Src, unsigned int Val) {
  for (int Outer = 0; Outer < Size; ++Outer)
    for (int Inner = 0; Inner < Size; ++Inner)
      Dst[Outer * Size + Inner] = Src[Outer * Size + Inner] * Val;
}

"Outer * Size + Inner" is an induction variable in inner loop. Its
SCEV is {{0,+,%Size}<outerloop>,+,1}<innerloop>, so the initial value
of the induction variable for inner loop is {0, +, %Size}<outerloop>,
.i.e, an "lsr.iv.next = lsr.iv + %Size" will be generated at the end
of every iteration of outerloop to prepare the initial value of the
induction variable in the next outerloop iteration.

However, the iteration number of inner loop happens to be "Size" too,
so the exit value of "Outer * Size + Inner" in inner loop happens to
be the same with the initial value of "Outer * Size + Inner" in the
next iteration of outerloop. So the exit value of the induction
variable can be reused and "lsr.iv.next = lsr.iv + %Size" which is
used to compute the initial value of "Outer * Size + Inner" can be
omitted.

If I change Inner < Size to Inner < Size1, the compiler generates the
same code with or without the patches.
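
For reference, the modified test case would look like this (the
function name and the extra Size1 parameter are just for illustration):
with the inner bound changed, the exit value of "Outer * Size + Inner",
i.e. Outer * Size + Size1, no longer matches the initial value
(Outer + 1) * Size in the next outer iteration, so there is nothing to
reuse.

void matrix_mul_2(unsigned int Size, unsigned int Size1, unsigned int *Dst,
                  unsigned int *Src, unsigned int Val) {
  for (int Outer = 0; Outer < Size; ++Outer)
    for (int Inner = 0; Inner < Size1; ++Inner)   /* was: Inner < Size */
      Dst[Outer * Size + Inner] = Src[Outer * Size + Inner] * Val;
}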

I tried gcc, and it couldn't remove the redundancy either. It looks
like the optimization doesn't fit the current LSR very well, because it
needs to consider the relationship between the initial and exit values
of the induction variable across different iterations of the outer loop.

This is interesting work, because the code pattern of implementing a
multi-dimensional array as a single linear array is common.

Thanks,
Wei.

From: "Steve King" <steve@metrokings.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Gerolf Hoflehner" <ghoflehner@apple.com>, "llvm-dev" <llvm-dev@lists.llvm.org>, "Wei Mi" <wmi@google.com>
Sent: Monday, September 14, 2015 1:08:08 PM
Subject: Re: [llvm-dev] [RFC] New pass: LoopExitValues

>> Also, if these redundant expressions involve induction variables,
>> then that's something that IndVarSimplify is already supposed to do,
>> and if it is missing cases, then we should improve that pass (and,
>> thus, folding what you've done into IndVarSimplify might be the
>> right way to go).
>
> Hal, thanks for the info. Turns out IndVarSimplify intentionally
> creates redundant calculation of loop exit values!
>
> // 2. Any use outside of the loop of an expression derived from the indvar
> //    is changed to compute the derived value outside of the loop, eliminating
> //    the dependence on the exit value of the induction variable. If the only
> //    purpose of the loop is to compute the exit value of some derived
> //    expression, this transformation will make the loop dead.
>
> I understand the stated purpose for the redundancy, but there is no
> mention of undoing the damage. The LoopExitValues pass seems to be
> exactly the cleanup that was missing. Do passes coming after IVS like
> LSR need the redundant calculations in place to identify dead loops or
> should IVS have cleaned up after itself immediately?

IndVarSimplify deletes dead instructions, and then dead PHIs, but these depend on the in-this-case-temporary redundancy to trivialize dead loops (I assume that SimplifyCFG later removes the remaining now-empty loop blocks).

-Hal

From: "Philip Reames" <listmail@philipreames.com>
To: "Steve King" <steve@metrokings.com>, "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Wei Mi" <wmi@google.com>
Sent: Monday, September 14, 2015 1:57:06 PM
Subject: Re: [llvm-dev] [RFC] New pass: LoopExitValues

>> Also, if these redundant expressions involve induction variables,
>> then that's something that the IndVarSimplify is already supposed
>> to do, and if it is missing cases, then we should improve that
>> pass (and, thus, folding what you've done into IndVarSimplify
>> might be the right way to go).
> Hal, thanks for the info. Turns out IndVarSimplify intentionally
> creates redundant calculation of loop exit values!
>
> // 2. Any use outside of the loop of an expression derived from
> the indvar
> // is changed to compute the derived value outside of the
> loop, eliminating
> // the dependence on the exit value of the induction variable.
> If the only
> // purpose of the loop is to compute the exit value of some
> derived
> // expression, this transformation will make the loop dead.
>
> I understand the stated purpose for the redundancy, but there is no
> mention of undoing the damage. The LoopExitValues pass seems to be
> exactly the cleanup that was missing. Do passes coming after IVS
> like
> LSR need the redundant calculations in place to identify dead loops
> or
> should IVS have cleaned up after itself immediately?
> The basic idea of your proposed transform makes sense to me. Reversing
> the canonicalization above to remove redundant computation seems like a
> worthwhile goal. However, your transformation would have the effect of
> increasing the live range of the variable in the loop and introducing
> an extra scheduling dependency. I'm not entirely sure how that would
> work out in practice if the use outside the loop is a ways from the
> loop exit. Have you given this any thought? I suspect that the
> profitability heuristic would need tuning to get right (or at least
> not wrong for key cases).

I agree with this, and I think we should pursue the kind of cleanup being proposed. We might end up finding issues with live ranges, spilling, rematerialization, etc., but we'll deal with those if/when they come up. We might end up needing to weigh the recomputation cost against the spill/reload cost.

> I was also under the impression that LSR was supposed to catch exactly
> these types of cases. It might be that your pass should be part of LSR
> instead. Anyone with more knowledge of LSR (Andy, Sanjoy, Hal) have an
> opinion here?

I don't think it will do anything with exit values and, more generally, it is concerned only with IV users within a single loop (and maybe only pointer-typed ones?).

-Hal

Assuming we should not interfere with the (hopefully) temporary
redundancy created by IndVarSimplify, why would the live range created
here be of special concern compared with the live ranges of all the
other variables in a given function?

Thanks for experimenting with the patch! If I understand correctly,
you showed that the redundancy is inherent in the loop and not a
byproduct of other passes. Following this insight, changing just one
line of the IR produces assembly *identical* to the effect of the new
LEV pass. Specifically, in the outer loop, the multiply can be
replaced with a PHI node that recycles the %add value coming from the
inner loop.

Should the front end be responsible for this sort of optimization?

Original:
  %mul = mul i32 %Outer.026, %Size

Replacing it with this line has the same effect as the LEV pass:
  %mul = phi i32 [ %add, %for.cond.cleanup.3 ], [ 0, %entry ]

The original IR for the matrix_mul.c test case:

define void @matrix_mul(i32 %Size, i32* nocapture %Dst, i32* nocapture readonly %Src, i32 %Val) #0 {
entry:
  %cmp.25 = icmp eq i32 %Size, 0
  br i1 %cmp.25, label %for.cond.cleanup, label %for.body.4.lr.ph

for.body.4.lr.ph:                                 ; preds = %entry, %for.cond.cleanup.3
  %Outer.026 = phi i32 [ %inc10, %for.cond.cleanup.3 ], [ 0, %entry ]
;****************
  %mul = mul i32 %Outer.026, %Size
;****************
  br label %for.body.4

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.3, %entry
  ret void

for.cond.cleanup.3:                               ; preds = %for.body.4
  %inc10 = add nuw nsw i32 %Outer.026, 1
  %exitcond28 = icmp eq i32 %inc10, %Size
  br i1 %exitcond28, label %for.cond.cleanup, label %for.body.4.lr.ph

for.body.4:                                       ; preds = %for.body.4, %for.body.4.lr.ph
  %Inner.024 = phi i32 [ 0, %for.body.4.lr.ph ], [ %inc, %for.body.4 ]
  %add = add i32 %Inner.024, %mul
  %arrayidx = getelementptr inbounds i32, i32* %Src, i32 %add
  %0 = load i32, i32* %arrayidx, align 4, !tbaa !1
  %mul5 = mul i32 %0, %Val
  %arrayidx8 = getelementptr inbounds i32, i32* %Dst, i32 %add
  store i32 %mul5, i32* %arrayidx8, align 4, !tbaa !1
  %inc = add nuw nsw i32 %Inner.024, 1
  %exitcond = icmp eq i32 %inc, %Size
  br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4
}

Hi folks, let's keep this optimization alive. To summarize: several
folks voiced general support, but with questions about why existing
optimizations do not already catch this case. A deep dive by Wei Mi
showed that the optimization is most likely not a clean-up of LSR
sloppiness, but something new. My own follow-up confirmed that the
redundancy eliminated by the LoopExitValues pass exists in the original
IR from Clang.

As a next step, can we come to consensus on where the LoopExitValues
optimization belongs?

Regards,
-steve

Maybe it can follow the "Delete dead loops" pass, which runs after the
Induction Variable Simplification pass, so it will not interfere with
the existing exit-value rewriting in Induction Variable Simplification
that is used to find and delete dead loops?

Existing pass pipeline:
        Loop-Closed SSA Form Pass
        Loop Pass Manager
          Induction Variable Simplification
          Recognize loop idioms
          Delete dead loops
                                      <=== New LoopExitValues
        Function TargetTransformInfo
        Loop Pass Manager
          Unroll loops
        Memory Dependence Analysis

I have the same worry as Philip and Hal that the new LoopExitValues
pass may increase some live ranges significantly in certain cases,
because it reuses values across outer-loop iterations. In the
following hypothetical case, the value reuse will create a live range
that spans loop 2, loop 3, and so on. But we can add some simple logic
to avoid such cases.

for (int Outer = 0; Outer < Size; ++Outer) {
   for (int Inner = 0; Inner < Size; ++Inner)
      Dst[Outer * Size + Inner] = Src[Outer * Size + Inner] * Val;

   // for loop 2
   // for loop 3
   // ...
}

Thanks,
Wei.

Thanks, Wei. Can you please share your ideas about the logic needed to
catch problematic live-range growth?