How to broaden the SLP vectorizer's search

The BB vectorizer has an option 'bb-vectorizer-search-limit'. Is there a similar option for the SLP vectorizer? Or maybe an analysis pass whose scope can be widened?

I have large basic blocks with instructions that should be merged into packed versions, but the blocks are optimized independently of each other. If the instructions to be merged aren't too far apart, the SLP vectorizer successfully packs them within a block. Beyond a certain distance between the instructions, however, the SLP vectorizer no longer detects the groups of instructions that can be packed. There is no interleaved data dependency that would prevent vectorizing. I am almost certain the vectorizer must have a maximum distance set somewhere, either directly or through dependent analysis passes.

I tried the BB vectorizer, but for whatever reason it starts emitting extractelement/shufflevector instructions which I am sure aren't necessary. So, until I investigate further, I'd like to stick with SLP.

Ideally, I'd like the maximum distance to be unlimited, removing this restriction entirely. I am aware of the resulting growth in search time.

Does anyone have any insights?

Frank

Frank,
The SLP vectorizer limits the amount of recursion when building the
expression tree. Unfortunately, there's no command-line option to control
this limit at this time.

Line 73:
http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp?view=markup

You might consider filing a bug (llvm.org/bugs) requesting a flag, but I
don't know if the code owners want to expose such a flag.

Hope this helps.

Regards,
  Chad

I'm not sure raw access to that limit is a good idea, as there are no
guarantees that it will stay the same. But a flag turning on some
"aggressive" behaviour in SLP, which would then change that threshold
(and maybe some others), would be a good flag to have, I think.

This could be a way to deprecate the BB vectorizer faster than we
otherwise would. But that would depend on all missing BB-vectorizer
features being implemented in SLP.

An item in bugzilla seems appropriate.

cheers,
--renato

I changed the max recursion depth to 36, and then tried 1000 (from the original value of 12), and it did not improve SLP's optimization capabilities on my input function. For example, the attached function is (by design) perfectly vectorizable into 4-packed single-precision SIMD load/add/store. The SLP vectorizer just does nothing with it.

I ran

opt -default-data-layout="e-m:e-i64:64-f80:128-n8:16:32:64-S128" -basicaa -slp-vectorizer -S < mod_vec_p_vec.ll

with RecursionMaxDepth = 12, 36, and 1000.

Thanks,
Frank

mod_vec_p_vec.ll (5.65 KB)

The data layout that you're specifying doesn't include a vector type specification. Maybe adding one will help.

http://llvm.org/docs/LangRef.html#data-layout
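
For example, the invocation might be extended like this (untested; the `v<size>:<abi>:<pref>` component is the piece being added, and whether it changes the vectorizer's decisions depends on your target):

```shell
opt -default-data-layout="e-m:e-i64:64-f80:128-v128:128:128-n8:16:32:64-S128" \
    -basicaa -slp-vectorizer -S < mod_vec_p_vec.ll
```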

Cheers,
  Roel

Hi Frank,

Thanks for working on this. Please look at vectorizeStoreChains. In this function we process all of the stores in the function in buckets of 16 elements, because finding consecutive stores is implemented using an O(n^2) algorithm. You can try increasing this threshold to 128 and see if it helps.
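
A rough way to see why that bucket size matters (this is only an illustrative cost sketch, not the actual LLVM code; `pairwiseQueries` is a hypothetical helper): chunking bounds the quadratic pairwise work, at the price of never pairing stores that fall into different chunks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Count the pairwise "are these two stores consecutive?" queries needed
// when stores are scanned in windows ("buckets") of ChunkSize, mimicking
// the chunked O(n^2) scan described above.
std::size_t pairwiseQueries(std::size_t NumStores, std::size_t ChunkSize) {
  std::size_t Queries = 0;
  for (std::size_t I = 0; I < NumStores; I += ChunkSize) {
    std::size_t Len = std::min(ChunkSize, NumStores - I);
    Queries += Len * (Len - 1); // every ordered pair inside one window
  }
  return Queries;
}
```

With 8192 stores, a window of 16 costs 512 * 240 = 122,880 queries, while a single 8192-wide window costs 8192 * 8191, roughly 67 million, which matches the blow-up reported later in this thread.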

I also agree with Renato and Chad that adding a flag to tell the SLP-vectorizer to put more effort (compile time) into the problem is a good idea.

Thanks,
Nadav

Hi Nadav,

increasing the number of instructions per bucket indeed results in a completely vectorized version of the given small function. For a medium-sized function I had to increase the bucket size to 8192 to achieve full vectorization.

If I then try this setup on one of my larger functions (containing one huge basic block), it seems that the O(n^2) algorithm you were talking about hits me hard. After 10 minutes in SLP I sent the termination signal by hand.

I remember the documentation saying that SLP is a quite versatile vectorizer. I was wondering: if I only need to vectorize the loads/stores/arithmetic of a single basic block and nothing else, are there any parts of SLP that I could deactivate in order to reduce the optimization time?

Thanks,
Frank

Hi Frank,

This means that you had at least 8192 stores in that medium-sized function. :) I wonder why it is necessary to vectorize two stores with thousands of other store instructions in between. Are you sure you need such a high number?

The SLP vectorizer starts by finding sequences of consecutive stores. It sounds like most of the time is spent in the code that finds consecutive store sequences. You could probably verify this assumption using a profiler.
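
As a coarse first check before reaching for an external profiler, opt's built-in -time-passes flag reports per-pass execution time, which would at least confirm that SLP dominates the run:

```shell
opt -time-passes -basicaa -slp-vectorizer -S < mod_vec_p_vec.ll -o /dev/null
```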

I think that one way to move forward would be to improve the routine that finds consecutive stores. At the moment we iterate over all possible combinations and use SCEV to figure out if two stores are consecutive. We can't simply 'sort' the stores based on the distance from some location because there could be multiple base indices. I think that one way to improve the search is to sort the store instructions into buckets according to their memory type.
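
A minimal sketch of that idea, under the assumption that each store can be split into a base pointer and a constant byte offset (in the real pass this split would come from SCEV, and aliasing/overflow questions are ignored here; `StoreRec` and `longestConsecutiveChain` are hypothetical names, not LLVM API): bucket stores by base, sort each bucket by offset, and read off consecutive runs in O(n log n) instead of asking about every pair.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical store record: a base-pointer id plus a constant byte offset.
struct StoreRec {
  int Base;      // stand-in for the base pointer
  long Offset;   // byte offset from the base
  unsigned Size; // store width in bytes
};

// Bucket by base pointer (so multiple base indices are handled), sort each
// bucket by offset, and scan for runs where each store starts exactly where
// the previous one ended. Returns the longest consecutive chain found.
std::size_t longestConsecutiveChain(std::vector<StoreRec> Stores) {
  std::map<int, std::vector<StoreRec>> Buckets;
  for (const StoreRec &S : Stores)
    Buckets[S.Base].push_back(S);

  std::size_t Best = 0;
  for (auto &KV : Buckets) {
    std::vector<StoreRec> &B = KV.second;
    std::sort(B.begin(), B.end(),
              [](const StoreRec &A, const StoreRec &C) {
                return A.Offset < C.Offset;
              });
    std::size_t Run = 1;
    for (std::size_t I = 1; I < B.size(); ++I) {
      if (B[I].Offset == B[I - 1].Offset + B[I - 1].Size)
        ++Run;
      else
        Run = 1;
      Best = std::max(Best, Run);
    }
    Best = std::max(Best, Run); // covers single-element buckets
  }
  return Best;
}
```

The same bucketing could of course key on memory type rather than base pointer, as suggested above; the point is only that grouping first makes the per-group work sortable.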

Thanks,
Nadav

From: "Frank Winter" <fwinter@jlab.org>
To: llvmdev@cs.uiuc.edu
Sent: Thursday, August 7, 2014 10:58:47 AM
Subject: [LLVMdev] How to broaden the SLP vectorizer's search

I tried the BB vectorizer, but for whatever reason it starts emitting
extractelement/shufflevector instructions which I am sure aren't
necessary. So, until further investigation I'd like to stick to SLP.

A quick comment on this point: the BB vectorizer has always required a run of InstCombine and EarlyCSE (or GVN) afterward to produce reasonable code. That having been said, development focus is now squarely on the SLP vectorizer in this space, and I suggest you focus there regardless.

-Hal