TL;DR: A loop doesn't get vectorized due to the interaction of loop-
rotate, licm and instcombine. What to do about it?
In the benchmarks for our out-of-tree target we have a case that we
would like to get vectorized, but currently it isn't. I've done some
digging to see why and have some kind of idea what prevents it, but I
don't know what the best way to fix it would be so I thought I'd share
a reduced version of it to see if anyone here has ideas.
So, what happens can be reproduced on trunk with
opt -O3 -S -o - bbi-39227-reduced.ll
The input program consists of two nested loops where the inner loop
loads a value and does some calculations and then the outer loop writes
the calculated value somewhere.
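For readers without the attachment, the loop nest might look roughly like the C below. This is a hypothetical reconstruction from the IR shown further down, not the actual benchmark source: the names `kernel`, `sext16`, `INNER_TRIPS`, `rows` and `out` are assumptions made for illustration. Only the inner trip count of 100 and the arithmetic (subtract the loaded value, add back its sign-extended low 16 bits) are taken from the IR. Unsigned arithmetic is used so that wraparound is well defined, mirroring the plain `sub`/`add` on `i32`.

```c
#include <stddef.h>
#include <stdint.h>

enum { INNER_TRIPS = 100 }; /* matches the `icmp ult i16 %j.03, 99` exit test */

/* Sign-extend the low 16 bits of v (the shl 16 / ashr 16 pair in the IR). */
static uint32_t sext16(uint32_t v) {
    return ((v & 0xFFFFu) ^ 0x8000u) - 0x8000u;
}

/* Hypothetical source shape: the inner loop loads and accumulates,
   the outer loop stores one result per row. */
void kernel(const uint32_t *h, uint32_t *out, size_t rows) {
    for (size_t i = 0; i < rows; i++) {
        uint32_t real = 0;
        for (int16_t j = 0; j < INNER_TRIPS; j++) {
            uint32_t v = *h++;           /* the load inside the inner loop */
            real = real - v + sext16(v); /* %add and %sub in the IR */
        }
        out[i] = real;                   /* the write done by the outer loop */
    }
}
```

Written this way the load is naturally the first thing in the inner loop body; the problematic shape only appears after the loop has been rotated.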
I see this printout from the vectorizer for the inner loop:
LV: Not vectorizing: Found an unidentified PHI %h.15 = phi i32 [
%h.11, %inner.cond.preheader ], [ %h.1, %inner.body ]
When we run the vectorizer the inner loop looks like
inner.body: ; preds =
%h.15 = phi i32 [ %h.11, %inner.cond.preheader ], [ %h.1, %inner.body ]
%h.pn4 = phi i32* [ %h, %inner.cond.preheader ], [ %hp.1, %inner.body ]
%j.03 = phi i16 [ 0, %inner.cond.preheader ], [ %j.1, %inner.body ]
%real.02 = phi i32 [ 0, %inner.cond.preheader ], [ %sub, %inner.body ]
%hp.1 = getelementptr inbounds i32, i32* %h.pn4, i64 1
%0 = shl i32 %h.15, 16
%conv7 = ashr exact i32 %0, 16
%add = sub i32 %real.02, %h.15
%sub = add i32 %add, %conv7
%j.1 = add nuw nsw i16 %j.03, 1
%h.1 = load i32, i32* %hp.1, align 1, !tbaa !4
%cmp3 = icmp ult i16 %j.03, 99
br i1 %cmp3, label %inner.body, label %inner.end, !llvm.loop !8
And the vectorizer bails out since the load is placed "late" in the
loop block.
If we just move the load before the definition of %0 in the vectorizer
input, then we instead get
LV: We can vectorize this loop!
Originally the loop had two loads, and one of them was actually
placed early in the loop block. That was done by instcombine.
The reason this doesn't happen for the load we see in the reduced
example is that after loop rotation, licm hoists the start value of
the PHI, so when instcombine tries to do FoldPHIArgLoadIntoPHI, the
start value isn't placed in the direct predecessor block of the PHI,
and the folding is aborted.
Therefore I tried to squeeze in an additional run of instcombine before
the licm run that does that hoisting, and then instcombine does the
folding and the load in the inner loop is done early instead of late in
the loop. The vectorizer is then happy to vectorize the
loop. I have no idea whether inserting another run of instcombine is a good
idea though, and I see mixed results in other benchmarks.
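In C terms, the effect of the fold on the inner loop can be pictured as below. This is only a sketch of the data flow: C cannot express where an instruction sits inside a basic block, and the function names `inner_late`, `inner_folded` and `step` are made up for illustration. It assumes FoldPHIArgLoadIntoPHI turns a PHI of loaded values into a load through a PHI of pointers, which is what the "early" load corresponds to here.

```c
#include <stdint.h>

enum { TRIPS = 100 };

/* One iteration of the calculation: real - v + sext(low 16 bits of v). */
static uint32_t step(uint32_t real, uint32_t v) {
    uint32_t lo = ((v & 0xFFFFu) ^ 0x8000u) - 0x8000u;
    return real - v + lo;
}

/* Unfolded shape: the loaded value is carried across the back edge
   (the %h.15 PHI) and the load for the next iteration sits at the end
   of the body. Like the IR, this loads one element more than it uses,
   so h must have at least TRIPS + 1 elements. */
uint32_t inner_late(const uint32_t *h) {
    uint32_t real = 0;
    uint32_t v = h[0];          /* start value, loaded before the loop */
    for (int j = 0; j < TRIPS; j++) {
        real = step(real, v);
        v = *++h;               /* %h.1 = load ..., feeds the PHI */
    }
    return real;
}

/* Folded shape: only the pointer is carried, and each iteration starts
   by loading its own value. */
uint32_t inner_folded(const uint32_t *h) {
    uint32_t real = 0;
    const uint32_t *p = h;      /* PHI of pointers */
    for (int j = 0; j < TRIPS; j++) {
        uint32_t v = *p++;      /* load at the top of the body */
        real = step(real, v);
    }
    return real;
}
```

Both functions compute the same result; the difference is only whether a loaded value or a pointer flows around the loop.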
So, the question is what really to do about this...
Should the vectorizer vectorize the loop anyway in this case?
Should licm not hoist the initial value of the PHI? (Hoisting is in
general nice, though.)
Should instcombine realize it can do something about this case even if
the initial value of the PHI is "far" from the PHI?
Something completely different?
bbi-39227-reduced.ll (2.79 KB)