Missing vectorization of loop due to load late in the loop

mikaelholmen · May 5, 2020, 1:25pm

Hi,

TL;DR: A loop doesn't get vectorized due to the interaction of loop-
rotate, licm and instcombine. What to do about it?

Full story:

In the benchmarks for our out-of-tree target we have a case that we
would like to get vectorized, but currently it isn't. I've done some
digging to see why and have some kind of idea what prevents it, but I
don't know what the best way to fix it would be so I thought I'd share
a reduced version of it to see if anyone here have ideas.

So, what happens can be reproduced on trunk with
opt -O3 -S -o - bbi-39227-reduced.ll

The input program consists of two nested loops where the inner loop
loads a value and does some calculations and then the outer loop writes
the calculated value somewhere.

With
-debug-only=loop-vectorize

I see this printout from the vectorizer for the inner loop:
LV: Not vectorizing: Found an unidentified PHI %h.15 = phi i32 [
%h.11, %inner.cond.preheader ], [ %h.1, %inner.body ]

When we run the vectorizer the inner loop looks like

inner.body: ; preds =
%inner.cond.preheader, %inner.body
  %h.15 = phi i32 [ %h.11, %inner.cond.preheader ], [ %h.1, %inner.body
]
  %h.pn4 = phi i32* [ %h, %inner.cond.preheader ], [ %hp.1, %inner.body
]
  %j.03 = phi i16 [ 0, %inner.cond.preheader ], [ %j.1, %inner.body ]
  %real.02 = phi i32 [ 0, %inner.cond.preheader ], [ %sub, %inner.body
]
  %hp.1 = getelementptr inbounds i32, i32* %h.pn4, i64 1
  %0 = shl i32 %h.15, 16
  %conv7 = ashr exact i32 %0, 16
  %add = sub i32 %real.02, %h.15
  %sub = add i32 %add, %conv7
  %j.1 = add nuw nsw i16 %j.03, 1
  %h.1 = load i32, i32* %hp.1, align 1, !tbaa !4
  %cmp3 = icmp ult i16 %j.03, 99
  br i1 %cmp3, label %inner.body, label %inner.end, !llvm.loop !8

And the vectorizer bails out since the load is placed "late" in the
loop, so
RecurrenceDescriptor::isFirstOrderRecurrence
returns false.

If we just move the load before the definition of %0 in the vectorizer
input, then we instead get
LV: We can vectorize this loop!

Originally the loop had two loads, and then one of them was actually
placed early in the loop block. That was done by instcombine, by
InstCombiner::FoldPHIArgLoadIntoPHI.

The reason this doesn't happen for the load we see in the reduced
example is because after loop rotation licm hoists the start value of
the PHI, so when instcombine tries to do FoldPHIArgLoadIntoPHI, the
start value isn't placed in the direct predecessor block of the PHI,
and the folding is aborted.

Therefore I tried to squeeze in an additional run of instcombine before
the licm run that does that hoisting, and then instcombine does the
folding and the load in the inner loop is done early instead of late in
the loop. The vectorizer then is happy and accepts to vectorize the
loop. I have no idea if inserting another run of instcombine is a good
idea though and I see some mixed results in other benchmarks.

So, the question is what really to do about this...

Should the vectorizer vectorize the loop anyway in this case?

Should licm not hoist the initial value of the PHI (but hoisting is in
general nice, so...).

Should instcombine realize it can do something about this case even if
the initial value of the PHI is "far" from the PHI.

Something completely different?

Any ideas?

Thanks,
Mikael

bbi-39227-reduced.ll (2.79 KB)

Finkel_Hal_J · May 6, 2020, 10:35pm

That just seems like a bug. Unless I’m missing something, the load depends only on %hp.1, there are no use-def or memory dependencies in between the load and the phi. The semantics are exactly the same between the two placements of the load instruction. The order of independent SSA calculations should not affect whether or not the vectorizer is able to vectorize the loop.

-Hal

mikaelholmen · May 7, 2020, 6:03am

Hi,

> From: llvm-dev <llvm-dev-bounces@lists.llvm.org> on behalf of
> Mikael Holmén via llvm-dev <llvm-dev@lists.llvm.org>
> Sent: Tuesday, May 5, 2020 8:25 AM
> To: llvm-dev@lists.llvm.org <llvm-dev@lists.llvm.org>
> Subject: [llvm-dev] Missing vectorization of loop due to load late
> in the loop
>
> Hi,
>
> TL;DR: A loop doesn't get vectorized due to the interaction of
> loop-
> rotate, licm and instcombine. What to do about it?
>
>
> Full story:
>
> In the benchmarks for our out-of-tree target we have a case that we
> would like to get vectorized, but currently it isn't. I've done
> some
> digging to see why and have some kind of idea what prevents it, but
> I
> don't know what the best way to fix it would be so I thought I'd
> share
> a reduced version of it to see if anyone here have ideas.
>
> So, what happens can be reproduced on trunk with
> opt -O3 -S -o - bbi-39227-reduced.ll
>
> The input program consists of two nested loops where the inner loop
> loads a value and does some calculations and then the outer loop
> writes
> the calculated value somewhere.
>
> With
> -debug-only=loop-vectorize
>
> I see this printout from the vectorizer for the inner loop:
> LV: Not vectorizing: Found an unidentified PHI %h.15 = phi i32 [
> %h.11, %inner.cond.preheader ], [ %h.1, %inner.body ]
>
> When we run the vectorizer the inner loop looks like
>
> inner.body: ; preds =
> %inner.cond.preheader, %inner.body
> %h.15 = phi i32 [ %h.11, %inner.cond.preheader ], [ %h.1,
> %inner.body
> ]
> %h.pn4 = phi i32* [ %h, %inner.cond.preheader ], [ %hp.1,
> %inner.body
> ]
> %j.03 = phi i16 [ 0, %inner.cond.preheader ], [ %j.1, %inner.body
> ]
> %real.02 = phi i32 [ 0, %inner.cond.preheader ], [ %sub,
> %inner.body
> ]
> %hp.1 = getelementptr inbounds i32, i32* %h.pn4, i64 1
> %0 = shl i32 %h.15, 16
> %conv7 = ashr exact i32 %0, 16
> %add = sub i32 %real.02, %h.15
> %sub = add i32 %add, %conv7
> %j.1 = add nuw nsw i16 %j.03, 1
> %h.1 = load i32, i32* %hp.1, align 1, !tbaa !4
> %cmp3 = icmp ult i16 %j.03, 99
> br i1 %cmp3, label %inner.body, label %inner.end, !llvm.loop !8
>
> And the vectorizer bails out since the load is placed "late" in the
> loop, so
> RecurrenceDescriptor::isFirstOrderRecurrence
> returns false.
>
> If we just move the load before the definition of %0 in the
> vectorizer
> input, then we instead get
> LV: We can vectorize this loop!

That just seems like a bug. Unless I'm missing something, the load
depends only on %hp.1, there are no use-def or memory dependencies in
between the load and the phi. The semantics are exactly the same
between the two placements of the load instruction. The order of
independent SSA calculations should not affect whether or not the
vectorizer is able to vectorize the loop.

Great! That's what I thought too, but as I don't know the vectorizer
I'm not sure what requirements are reasonable on its input.

When the load is placed early, it's this piece of code in
LoopVectorizationLegality::canVectorizeInstrs() that saves us:

    if (RecurrenceDescriptor::isFirstOrderRecurrence(Phi, TheLoop,
                                                     SinkAfter, DT)) {
      AllowedExit.insert(Phi);
      FirstOrderRecurrences.insert(Phi);
      continue;
    }

but when it's placed late it doesn't and we end up with "Found an
unidentified PHI".

The vectorizer behavior can be reproduced with
opt -loop-vectorize -S -o - bbi-39227-lv.ll

I'll write a PR about this then. And if anyone has ideas about how the
LoopVectorizationLegality checks can be relaxed to allow this case too,
please chime in.

Thanks!
Mikael

bbi-39227-lv.ll (1.73 KB)

Topic		Replies	Views
loop vectorizer issue LLVM Dev List Archives	5	101	November 4, 2013
LoopVectorize fails to vectorize more complex loops LLVM Dev List Archives	2	196	July 7, 2018
[Vectorization] Mis match in code generated LLVM Dev List Archives	9	103	November 11, 2014
loop vectorizer: this loop is not worth vectorizing LLVM Dev List Archives	2	84	November 1, 2013
Question about the loop vectorizer LLVM Dev List Archives	1	115	January 14, 2013

Missing vectorization of loop due to load late in the loop

Related Topics