[InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

Hi,

We found that today's 17.30%/11.37% performance regressions in LNT SingleSource/Benchmarks/Shootout/sieve on LNT-AArch64-A53-O3__clang_DEV__aarch64 and LNT-Thumb2v7-A15-O3__clang_DEV__thumbv7 (http://llvm.org/perf/db_default/v4/nts/daily_report/2017/1/20?filter-machine-regex=aarch64|arm|thumb|green) are caused by the InstCombine change rL292492:

https://reviews.llvm.org/D28406 ("[InstCombine] icmp sgt (shl nsw X, C1), C0 --> icmp sgt X, C0 >> C1")
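
For illustration, with made-up constants that are not from the benchmark, the fold rewrites, e.g.:

  %shl = shl nsw i32 %x, 1
  %cmp = icmp sgt i32 %shl, 100   ; (x << 1) > 100

into:

  %cmp = icmp sgt i32 %x, 50      ; x > (100 >> 1); valid because 'nsw' rules out signed overflow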

After this change, the Loop Vectorizer generates code with more instructions; note the redundant computations below (e.g. 'add i64 %indvar, 2' is emitted four times):

==== Loop Vectorizer from rL292492 ====
for.body5: ; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader
  %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ]
  %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1, %for.cond.preheader ]
  %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ]
  %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2, %for.cond.preheader ]
  %2 = add i64 %indvar, 2
  %3 = shl i64 %indvar, 1
  %4 = add i64 %3, 4
  %5 = add i64 %indvar, 2
  %6 = shl i64 %indvar, 1
  %7 = add i64 %6, 4
  %8 = add i64 %indvar, 2
  %9 = mul i64 %indvar, 3
  %10 = add i64 %9, 6
  %11 = icmp sgt i64 %10, 8193
  %smax = select i1 %11, i64 %10, i64 8193
  %12 = mul i64 %indvar, -2
  %13 = add i64 %12, -5
  %14 = add i64 %smax, %13
  %15 = add i64 %indvar, 2
  %16 = udiv i64 %14, %15
  %17 = add i64 %16, 1
  %tobool7 = icmp eq i8 %1, 0
  br i1 %tobool7, label %for.inc16, label %if.then

Thanks for letting me know about this problem!

There’s no ‘shl nsw’ visible in the earlier (r292487) code, so it would be better to see exactly what the IR looks like before that added transform fires.

But I see a red flag:
%smax = select i1 %11, i64 %10, i64 8193

The new icmp transform allowed us to create an smax, but we have this hack in InstCombiner::visitICmpInst():

// Test if the ICmpInst instruction is used exclusively by a select as
// part of a minimum or maximum operation. If so, refrain from doing
// any other folding. This helps out other analyses which understand
// non-obfuscated minimum and maximum idioms, such as ScalarEvolution
// and CodeGen. And in this case, at least one of the comparison
// operands has at least one user besides the compare (the select),
// which would often largely negate the benefit of folding anyway.

…so that prevented folding the icmp into the earlier math.
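
For reference, a generic sketch (not taken from the benchmark) of the idiom that comment protects:

  %cmp = icmp sgt i64 %a, %b              ; compare used only by the select below
  %max = select i1 %cmp, i64 %a, i64 %b   ; canonical smax(%a, %b)

If the icmp gets folded into something else, SCEV and CodeGen no longer see the pair as an smax.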

I am actively working on trying to get rid of that bail-out by improving min/max value tracking and icmp/select folding. In fact, we might be able to remove it right now, but I don’t know the history of that code or what cases it was supposed to help.

If it’s possible, can you remove that check locally, rebuild, and try the benchmark again on your system? I’d love to know if that change alone would solve the problem.

Hi Sanjay,

The benchmark source file: http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Shootout/sieve.c?view=markup

Clang options used to produce the initial IR: clang -DNDEBUG -O3 -DNDEBUG -mcpu=cortex-a53 -fomit-frame-pointer -O3 -DNDEBUG -w -Werror=date-time -c sieve.c -S -emit-llvm -mllvm -disable-llvm-optzns --target=aarch64-arm-linux

Opt options: opt -O3 -o /dev/null -print-before-all -print-after-all sieve.ll >& sieve.log
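
If it helps to isolate the InstCombine difference, running just that pass over the attached IR should also show the new fold firing, e.g. (using the legacy single-pass flag): opt -instcombine -S sieve.ll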

I used the IR (in the attached sieve.zip) created with r292487.

The attached archive contains the output of ‘-print-before-all -print-after-all’ for both r292487 and rL292492.

If it’s possible, can you remove that check locally, rebuild, and try the benchmark again on your system? I’d love to know if that change alone would solve the problem.

Do you mean to remove the hack in InstCombiner::visitICmpInst()?

Kind regards,
Evgeny Astigeevich
Senior Compiler Engineer
Compilation Tools
ARM

sieve.zip (39.2 KB)

Do you mean to remove the hack in InstCombiner::visitICmpInst()?

Yes. Although (this just came up in D28625 too) we might need to remove multiple versions of that in order to unlock optimization:

https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4338

https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L470

https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstructionCombining.cpp#L803

https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L409

Thank you for the information.

I’ll build clang without the hack and re-run the benchmark tomorrow.

-Evgeny

I tried an experiment to remove the integer min/max bailouts from InstCombine, and it doesn’t appear to change the IR in the attachment, so I doubt there’s going to be any improvement.

If I haven’t messed up this example, this is amazing:
https://godbolt.org/g/yzoxeY

I am actively working on trying to get rid of that bail-out by improving min/max value tracking and icmp/select folding. In fact, we might be able to remove it right now, but I don't know the history of that code or what cases it was supposed to help.

I think the primary reason is SCEV. It has SCEVSMax/SCEVUMax expressions.
Our generic select handling for things like knownbits is going to be less
precise.
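
Roughly speaking, for the select in the dump above, SCEV can build a real max expression, printed along the lines of (8193 smax (6 + (3 * %indvar))) (print format from memory); once the compare is folded into other math, SCEV has to fall back to less precise generic select handling.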

I confirm there is no change in the IR if the hack is disabled in the sources.

David wrote that these instructions are created by SCEV.

Are other targets affected by the changes, e.g. X86?

Kind regards,
Evgeny Astigeevich
Senior Compiler Engineer
Compilation Tools
ARM

All targets are likely affected in some way by the icmp+shl fold introduced with r292492. It’s a basic pattern that occurs in lots of code. Did you see any perf wins on your targets with this commit?

Sadly, it is also likely that many (all?) targets are negatively impacted on the particular test (SingleSource/Benchmarks/Shootout/sieve) that you have pointed out here because the IR is now decidedly worse.

IMO, we should not revert the commit because it exposed shortcomings in the optimizer. It’s an “obvious” fold/canonicalization, and the related ‘nuw’ variant of this fold has existed in trunk since:
https://reviews.llvm.org/rL285729

We need to dissect what analysis/folds are missing to restore the IR to the better form that existed before, but this is probably going to be a long process because we treat min/max like an optimization fence.

If this is gonna be a long process to recover, this looks like something to be reverted in the 4.0 branch (unless I missed that there is a correctness fix involved?).

Nope - this is just about perf, not correctness. Of course, the intent was that this transform should only improve perf, so I wonder if we can pin down any other perf changes from this commit.

I'm new to using the LNT site, but this should be the full set of results
for the A53 machine in question with a baseline (r292491) before this patch
and current (r292522) :
http://llvm.org/perf/db_default/v4/nts/107364

If these are reliable results, we have 2 perf wins (puzzle, gramschmidt) on
the A53 machine. How do we determine the importance of the sieve benchmark
vs. the rest of the suite?

An x86 machine doesn't show any regressions from this change:
http://llvm.org/perf/db_default/v4/nts/107353

Are there target-scope-based guidelines for when something is bad enough to
revert?

Also, we've absolutely destroyed perf (-48%) on the sieve benchmark on that A53 target since the baseline (r256803), so there are probably multiple things to fix before we can truly recover.

Regardless of whether we revert or not, I am looking at how to claw back the IR from the r292492 regression. Here's one step towards that:
https://reviews.llvm.org/D29053

If we get lucky, we may be able to sidestep the min/max problem by folding
harder before we reach that point in the optimization pipeline.

I don’t think we have any guidelines.

I think my suggestion was more about other regressions that we would discover after the release; it was more of a “maturity” call: if we just notice a problem with a commit right before the release, it may not have been in tree long enough to get enough scrutiny.

(Also, I thought this thread included a compile-time regression, but on re-reading it doesn’t.)

That makes sense. I have no stake in any particular branch, so I have no
objection to revert from the release branch if that's what people would
like to do. My preference is to keep it in trunk though because it should
be a win in theory and reverting there would make it harder to find and
debug problems like this one.

+Hans for advising on what to do on clang-4.0.

Right, trunk seems fine to me at this point.

The commit in question, r292492, occurred after the branch point
(r291814), so the branch is not affected by this.

Thanks,
Hans

Hi Sanjay,

Thank you for your analysis. It’s interesting that the x86 machine is not affected. Maybe the x86 backend is smarter than the AArch64 backend, or it might be due to micro-architectural differences.

I don’t mind keeping the changes on trunk.

What I’d like to see is who will/should be involved in solving the issue. What kind of help/support is needed? Should we (ARM Compilation Tools) start digging into the issue and fixing it ourselves, since it affects our future ARM Compiler 6 releases?

We can provide help with validating that patches fix the issue and don’t introduce new ones.

Kind regards,

Evgeny Astigeevich

Senior Compiler Engineer
Compilation Tools
ARM

Hi Evgeny,

I started looking at the log files that you attached, and I’m confused. The code that is supposedly causing the perf regression is created by the loop vectorizer, right? Except the bad code is not in the “vector.body”, so is there something peculiar about this benchmark that the hot loop is not the vector loop? But there’s another mystery: there are no vector ops in the “vector.body”!

Although I’d hope that InstCombine and friends can clean up any mess made by the vectorizer, maybe the faster way to solve this problem is to stop the vectorizer from doing bogus (and in this case, harmful) work? I don’t know the vectorizer code, so I can’t offer much help on that side. Can you check how the post-vectorizer IR differs between the A53 and an x86 target, since x86 didn’t regress?
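
(For example, assuming the per-pass print flags work as I remember: opt -O3 -mtriple=aarch64-- -print-after=loop-vectorize -S sieve.ll, and the same command with -mtriple=x86_64--, should give a focused diff of just the vectorizer’s output.)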

I haven't been following this conversation very closely, but the vectorizer
can choose to just unroll the loop rather than vectorize it, so you can see
a "vector.body" block with no vector instructions. I took a brief look at
the dumped IR. The code inserted by the vectorizer into the "other" loop is
part of the minimum iterations run-time check. This was probably created
and put there by a SCEV expander.

-- Matt

I haven't looked at this particular example, but this isn't very rare.

1) The vectorizer actually doubles as an unroller, under some circumstances, so it's conceivable to have a vector.body without vector instructions (see the sketch below).

2) From what was pasted above, the "bad" code is in the runtime checks that
LV generates.
The general problem is that if we don't know what the trip count of a loop is going to be, we assume it's high, so the cost of the runtime checks is negligible. This fails when a loop with a low (but statically unknown) trip count is nested inside a hot loop with a high trip count. In that case, the overhead from the runtime checks, which end up inside the outer loop, may cost more than what we gain from vectorization.
To try to avoid this, we have a threshold that prevents us from generating
too costly runtime checks, but I guess this particular check is not
considered complex enough.
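
A generic sketch (not from the actual dump) of what an interleaved-but-not-vectorized vector.body can look like:

  vector.body:                                  ; interleaved x2, no vector ops
    %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
    %idx1 = or i64 %index, 1                    ; %index is always even here
    %gep0 = getelementptr inbounds i8, i8* %flags, i64 %index
    %gep1 = getelementptr inbounds i8, i8* %flags, i64 %idx1
    store i8 0, i8* %gep0
    store i8 0, i8* %gep1
    %index.next = add i64 %index, 2
    %done = icmp eq i64 %index.next, %n.vec
    br i1 %done, label %middle.block, label %vector.body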

Also, the difference between X86 and AArch64 is that on X86, the vectorizer decides not to unroll.