SPMD Autovectorizer

Hi,

Are there any plans to integrate an autovectorizer for SPMD programs into LLVM? For example, there were previous discussions about integrating the whole function vectorizer (WFV) from Ralf Karrenberg into LLVM.

Thanks,
Zack

From: "Zack Waters" <zswaters@gmail.com>
To: "llvmdev" <llvmdev@cs.uiuc.edu>
Sent: Monday, July 6, 2015 1:03:47 PM
Subject: [LLVMdev] SPMD Autovectorizer

Hi,

Are there any plans to integrate an autovectorizer for SPMD programs
into LLVM? For example, there were previous discussions about
integrating the whole function vectorizer (WFV) from Ralf Karrenberg
into LLVM.

I don't know of any concrete plans, but this is something I'd still like to see happen. If no one gets to it first, I may even work on it at some point.

-Hal

pocl's kernel compiler's parallel region analysis could
also provide a starting point. It generates parallel
loops from the regions between barriers via static
analysis; the vectorizers then vectorize/parallelize those
loops as they see fit for the target at hand (with
assistance from the parallel loop annotations).

The loops could even be (selectively) converted to OpenMP
parallel loops for flexible parallelization, e.g. when there
are not too many work-items and an SMP target has multiple
cores. The point is that the way those loops are mapped to
hardware is up to other (target-dependent) passes(*).

However, the current upstream version is somewhat rusty with
some known issues, and I'd like to try to find time after
the summer to clean up an experimental private version and
push it to pocl master.

Maybe that could serve as a basis for LLVM upstreaming?
Previously I had assumed that this part of the pocl kernel
compiler could not easily be upstreamed to LLVM, as it's
not needed for C/C++ (and someone said other languages
such as OpenCL C are very low priority), but if there's
interest, I'd be happy to get that part upstreamed
as well.

(*) http://link.springer.com/article/10.1007/s10766-014-0320-y

BR,

Wouldn't OpenMP account for some of that? At least on a single
machine, could you have both parallel and SIMD optimisations done on
the same loop? For MPI solutions, I believe just vectorising the code
per architecture would achieve a similar goal.

What other points could be relevant here?

cheers,
--renato

The point of autovectorizing an SPMD program description
(e.g. CUDA or OpenCL C) is to automatically produce something
like OpenMP parallel loops or SIMD pragmas from the single
thread/WI description, while adhering to its barrier
synchronization semantics etc.

That is, the output of this pass could also be converted to
OpenMP SIMD constructs, if wanted. In pocl's case the output
is simply a new kernel function (which we call a "work-group
function") that executes all WIs using parallel loops (which
can be autovectorized more easily, or even multithreaded if
seen fit, or both).
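
To illustrate, here is a hand-written sketch of the concept only
(hypothetical names, not pocl's actual output). Given an OpenCL C
kernel with a barrier:

  __kernel void scale(__global float *a, __global float *b) {
      int i = get_global_id(0);
      b[i] = a[i] * 2.0f;
      barrier(CLK_GLOBAL_MEM_FENCE);
      a[i] = b[i] + 1.0f;
  }

the region analysis splits the kernel at the barrier and emits a
work-group function with one parallel WI-loop per region:

  /* Sketch of the generated work-group function. */
  void scale_workgroup(float *a, float *b, int wg_base, int wg_size) {
      for (int wi = 0; wi < wg_size; ++wi) {  /* parallel region 1 */
          int i = wg_base + wi;
          b[i] = a[i] * 2.0f;
      }
      /* The barrier is now implicit: all WIs finish region 1
         before any WI starts region 2. */
      for (int wi = 0; wi < wg_size; ++wi) {  /* parallel region 2 */
          int i = wg_base + wi;
          a[i] = b[i] + 1.0f;
      }
  }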

If you're going to "autopar" (turn a loop into threads which run on
many cores or something) then please don't add a dependency on OMP.
While it may seem enticing, that will just add a layer of overhead that
in the end you won't need (and probably won't want). Just lower to
pthreads on Linux and whatever the equivalent is on Windows.

I wouldn't; I'd simply utilize the parallel loop metadata that
was originally designed for this purpose. What is done with
that MD is up to other passes.
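
For concreteness, clang exposes roughly this metadata from C through
its loop pragmas; a sketch (assuming a clang recent enough to support
assume_safety, which, as I understand it, maps to the llvm.loop /
llvm.mem.parallel_loop_access metadata):

  void saxpy(int n, float a, float *x, float *y) {
      /* Assert there are no loop-carried memory dependences, i.e. a
         parallel loop; the loop vectorizer can then skip its own
         dependence checks. */
  #pragma clang loop vectorize(assume_safety) interleave(assume_safety)
      for (int i = 0; i < n; ++i)
          y[i] = a * x[i] + y[i];
  }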

Yes, that's what I was suggesting. Sorry if that wasn't clear.

Now, IIRC, OpenCL had a lot of trouble with odd-sized vector
types in IR that the middle end would not understand, especially the
vectorizers. The solution, at least as of two years ago, was to
serialise everything and let the CL back-end vectorize it.

Since CL back-ends are normally very different from each other, with
very different cost models, and some secretive implementation details,
it's very hard to do that generically in the LLVM middle-end.

Also, if you have different domains (like many SIMD cores), sharing
the operations across cores and across lanes may be completely
different from, say, pthreads vs. AVX, so the model may not even apply
here. If you need write-back loops, non-trivial synchronization
barriers between cores, and other crazy stuff, adding all that to the
vectorizer would bloat the code beyond usability. On the other hand,
maybe not.

I'd be interested in knowing what kind of changes we'd need to get the
OMP+SIMD model into CL-type code, if that's what you're proposing...

cheers,
--renato

Hi Renato,

> Now, IIRC, OpenCL had a lot of trouble with odd-sized vector
> types in IR that the middle end would not understand, especially the
> vectorizers. The solution, at least as of two years ago, was to
> serialise everything and let the CL back-end vectorize it.

Perhaps you are referring to the problem of autovectorizing
work-groups with kernels that use vector datatypes
internally?

Yes, this can be done with (selective) scalarization or
with a vector-variable-aware vectorizer. AFAIK,
there's already a Scalarizer pass in upstream LLVM for this.
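
For example (a hand-written sketch, not actual Scalarizer output):

  /* Before: the kernel uses an explicit vector type internally. */
  __kernel void twice(__global const float4 *in, __global float4 *out) {
      int i = get_global_id(0);
      out[i] = in[i] * 2.0f;
  }

  /* After scalarization only scalar operations remain, so a
     work-group autovectorizer is free to use the SIMD lanes for
     packing work-items instead: */
  __kernel void twice_scalarized(__global const float *in,
                                 __global float *out) {
      int i = get_global_id(0);
      out[4*i+0] = in[4*i+0] * 2.0f;
      out[4*i+1] = in[4*i+1] * 2.0f;
      out[4*i+2] = in[4*i+2] * 2.0f;
      out[4*i+3] = in[4*i+3] * 2.0f;
  }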

> Since CL back-ends are normally very different from each other, with
> very different cost models, and some secretive implementation details,
> it's very hard to do that generically in the LLVM middle-end.

Of course it's impossible to always cover everything. What pocl
tries to do is the very minimum to make it easier for later passes
to do their job, while reusing the standard cost models etc. (such
as those already in the LLVM vectorizers) whenever possible.

> Also, if you have different domains (like many SIMD cores), sharing
> the operations across cores and across lanes may be completely
> different from, say, pthreads vs. AVX, so the model may not even apply
> here. If you need write-back loops, non-trivial synchronization
> barriers between cores, and other crazy stuff, adding all that to the
> vectorizer would bloat the code beyond usability. On the other hand,
> maybe not.

Instead of implementing a monolithic SPMD-specific kernel vectorizer
with lots of code duplicated from the simpler loop vectorizers, what
pocl does is quite the opposite. All it does is identify the
parallel regions between barriers, mark them as parallel loops, and
let the other passes do what they like with the loops.

Currently we apply the inner loop vectorizer for CPU+SIMD
targets (hopefully the loop interchange and other ongoing work
will soon improve it), VLIW-style schedule the inner loops for
static multi-issue using custom backends, and just leave the
original SPMD representation as-is for GPU-like "SPMD targets"
(briefly tested in ongoing experimental HSA support work).

Adding a mode where some of the parallel loop iterations are
executed in SIMD lanes and some on multiple cores via the target's
supported threading mechanism is something to consider, but not yet
done (in pocl). The original question was only about autovectorization
so I'd not go there yet. OpenMP was just a side note from me, sorry
for the possible confusion.
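
For illustration, such a hybrid mode would essentially tile the
WI-loop so that chunks go to threads while the inner loop stays
vectorizable. A sketch using pthreads (hypothetical names, not pocl
code; assumes a barrier-free region and nthreads <= 16):

  #include <pthread.h>

  typedef struct { float *a; int begin, end; } chunk_t;

  static void *run_chunk(void *p) {
      chunk_t *c = (chunk_t *)p;
      /* Inner loop is still a parallel loop: SIMD lanes can take it. */
      for (int wi = c->begin; wi < c->end; ++wi)
          c->a[wi] *= 2.0f;
      return 0;
  }

  void workgroup_hybrid(float *a, int wg_size, int nthreads) {
      pthread_t t[16];
      chunk_t c[16];
      int step = (wg_size + nthreads - 1) / nthreads;
      for (int i = 0; i < nthreads; ++i) {
          c[i].a = a;
          c[i].begin = i * step < wg_size ? i * step : wg_size;
          c[i].end = (i + 1) * step < wg_size ? (i + 1) * step : wg_size;
          pthread_create(&t[i], 0, run_chunk, &c[i]);
      }
      for (int i = 0; i < nthreads; ++i)
          pthread_join(t[i], 0);
  }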

> I'd be interested in knowing what kind of changes we'd need to get the
> OMP+SIMD model into CL-type code, if that's what you're proposing...

I'm not sure what you mean by the "OMP+SIMD" model. I was simply
proposing using the existing parallel loop MD like pocl does
to keep the pass responsibilities organized.

What I suggested was to consider upstreaming a part of the pocl
compiler (or preferably an improved implementation of it) that
statically identifies the parallel regions and generates a new
function that wraps the parallel regions in parallel loops
(which are then vectorized, or whatever is best for the target
at hand, by other passes, to keep the chain modular).

From the IR, I think it at minimum needs a notion of a "barrier instruction" with which it can do its CFG analysis to identify the regions. We simply use a dummy function declaration for this now.
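
In source terms the marker looks like this (the symbol name here is
hypothetical, not pocl's actual one): a declaration without a body,
whose call sites the region-formation pass uses to split the CFG:

  extern void __spmd_barrier(void);  /* dummy, never defined */

  void kernel_body(float *a, float *b, int i) {
      b[i] = a[i] * 2.0f;
      __spmd_barrier();  /* region boundary found by the CFG analysis */
      a[i] = b[i] + 1.0f;
  }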

BR,

> Yes, this can be done with (selective) scalarization or
> with a vector-variable-aware vectorizer. AFAIK,
> there's already a Scalarizer pass in upstream LLVM for this.

There is; I was just wondering how much work it would be to get our
main vectorizers to be vector-variable aware, or if it is really worth
pursuing that.

> Instead of implementing a monolithic SPMD-specific kernel vectorizer
> with lots of code duplicated from the simpler loop vectorizers, what
> pocl does is quite the opposite. All it does is identify the
> parallel regions between barriers, mark them as parallel loops, and
> let the other passes do what they like with the loops.

Ah! Excellent!

> I'm not sure what you mean by the "OMP+SIMD" model. I was simply
> proposing using the existing parallel loop MD like pocl does
> to keep the pass responsibilities organized.

My question was whether the OMP+SIMD MD would help pocl identify
kernels and vectorisable areas, but it seems you already have that in
place.

> What I suggested was to consider upstreaming a part of the pocl
> compiler (or preferably an improved implementation of it) that
> statically identifies the parallel regions and generates a new
> function that wraps the parallel regions in parallel loops
> (which are then vectorized, or whatever is best for the target
> at hand, by other passes, to keep the chain modular).

Is this an IR pass? I think that'd be interesting...

> From the IR, I think it at minimum needs a notion of a "barrier
> instruction" with which it can do its CFG analysis to identify the
> regions. We simply use a dummy function declaration for this now.

As usual, ok.

Maybe we should discuss a builtin with the larger community. Probably
best in a different thread, as this one's stale. :-)

cheers,
--renato

> Is this an IR pass? I think that'd be interesting...

Right, an IR pass mostly working at the CFG level. In addition to
creating the parallel loops, it modifies the IR mainly to inject
"context arrays" for temporarily storing variables whose live ranges
cross parallel regions (loops).
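
As a hand-written sketch of the context-array idea (hypothetical
names): when a value defined before a barrier is used after it, each
WI's copy must survive the region split:

  /* Original kernel body: 'x' is live across the barrier.
   *     float x = a[i] * 2.0f;
   *     barrier(CLK_GLOBAL_MEM_FENCE);
   *     a[i] = x + b[i];
   */
  #define MAX_WG_SIZE 256

  void workgroup(float *a, float *b, int wg_size) {
      float x_ctx[MAX_WG_SIZE];  /* context array: one slot per WI */
      for (int wi = 0; wi < wg_size; ++wi)  /* parallel region 1 */
          x_ctx[wi] = a[wi] * 2.0f;
      /* barrier boundary */
      for (int wi = 0; wi < wg_size; ++wi)  /* parallel region 2 */
          a[wi] = x_ctx[wi] + b[wi];
  }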

> Maybe we should discuss a builtin with the larger community. Probably
> best in a different thread, as this one's stale. :-)

Yes. Simply an intrinsic "llvm.spmd_barrier" or similar, with
"all threads or none reach me" semantics, might do for starters.