From: "Michael Kuperstein" <mkuper@google.com>
To: "Ayal Zaks" <ayal.zaks@intel.com>
Cc: "Nadav Rotem" <nadav.rotem@me.com>, "Hal Finkel"
<hfinkel@anl.gov>, "Elena Demikhovsky"
<elena.demikhovsky@intel.com>, "Adam Nemet" <anemet@apple.com>,
"Sanjoy Das" <sanjoy@playingwithpointers.com>, "James Molloy"
<james.molloy@arm.com>, "Matthew Simpson" <mssimpso@codeaurora.org>,
"Sanjay Patel" <spatel@rotateright.com>, "Chandler Carruth"
<chandlerc@google.com>, "David Li" <davidxl@google.com>, "Wei Mi"
<wmi@google.com>, "Dehao Chen" <dehao@google.com>, "Cong Hou"
<congh@google.com>, "Llvm Dev" <llvm-dev@lists.llvm.org>
Sent: Thursday, June 16, 2016 11:09:09 AM
Subject: Re: [RFC] Allow loop vectorizer to choose vector widths that
generate illegal types
Thanks, Ayal!
> Some thoughts:
> o To determine the VF for a loop with mixed data sizes, choosing the
> smallest ensures each vector register used is full; choosing the
> largest will minimize the number of vector registers used. Which
> one’s better, or some size in between, depends on the target’s costs
> for the vector operations, availability of registers and possibly
> control/memory divergence and trip count. “This is a question of
> cost modeling” and its associated compile-time, but in general good
> vectorization of loops with mixed data sizes is expected to be
> important, especially when larger scopes are vectorized. BTW, SLP
> followed this a year ago:
> [llvm] r241760 - [SLPVectorizer] Try different vectorization factors for store chains
Yes, I agree completely.
The approach we have right now is that availability of registers is a
hard upper bound, and I'm not planning on changing that (e.g. by
modeling spill cost) at the moment.
We chatted about this on IRC, but I'll add here that I'm in favor of this. I think that it will put (useful) pressure on improving the cost model, and our register-pressure heuristics, and the quality of our legalization code. In the end, I think that the representation makes sense.
I'm not too worried about the compile time impact. I haven't measured
it yet, but one thing that may mitigate this is the fact that
postponing interleaving until the legalizer will result in smaller
IR coming out of the vectorizer. So the increased compile-time cost
of the TTI queries may be offset by the decreased amount of work for
post-vectorizer IR passes and pre-legalization ISel. Anyway, this is
all idle talk right now; as you and Nadav said, it needs to be
measured.
> o As for increasing VF beyond maximize-bandwidth, one could argue
> that a vectorizer should focus on tapping the SIMD capabilities of
> the target, up to maximize-bandwidth, and that its vectorized loop
> should later be subject to a separate independent
> unroller/interleaver pass. One suggestion, regardless, is to use the
> term “unroll-and-jam”, which traditionally applies to loops
> containing control-flow and nested loops but is quite clear for
> innermost loops too, instead of the overloaded term “interleaving”.
> Admittedly loop vectorization conceptually applies unroll-and-jam
> followed by packetization into vectors, so why unroll-and-jam twice.
> As noted, the considerations for best unroll factor are different
> from those of best VF for optimal usage of SIMD capabilities. Indeed
> representing in LLVM-IR a loop with vectors longer than
> maximize-bandwidth looks more appealing than replicating its ‘legal’
> vectors, easier produced by the vectorizer than by an unroll-and-jam
> pass. BTW, taken to the extreme, one could vectorize to the full
> trip count of the loop, as in
> http://impact.crhc.illinois.edu/shared/Papers/tr2014.mxpa.pdf ,
> where memory spatial locality is deemed more important to optimize
> than register usage.
"Why unroll-and-jam twice" is precisely the motivation behind
increasing VF beyond maximize-bandwidth.
Both getting good code from the legalizer and getting good cost
modeling for illegal types are required to increase the VF up to the
width implied by the smallest scalar type. And if that works out, then
going beyond maximize-bandwidth seems like it should require fairly
little additional work. I think once we go beyond maximize-bandwidth,
and assume the legalizer will split things back up, the considerations
for the best unroll factor and the best VF become essentially the
same, since increasing the VF, in effect, increases the unroll
factor.
It's possible that we'll need two different cost estimates, one up to
max-bandwidth, and one beyond max-bandwidth - and in this case, I'm
not sure the exercise is worthwhile.
In any case, I mostly see this as a bonus; what's really important
to me is getting maximize-bandwidth to work well.
As to the terminology - I agree "unroll-and-jam" is the correct
technical term, but it's not currently used in the vectorizer, and I
wanted to keep the terminology here consistent with the code.
Yea, we came up with interleaving to differentiate it from the concatenation unrolling that the unrolling pass performs. If we'd like to rename this to be akin to a jamming factor, for consistency with the literature, I don't object. As I recall, interleaving was the least bad option we discussed at the time.
-Hal