LoopVectorizer in OpenCL C work group autovectorization

Hi,

I started to play with the LoopVectorizer of LLVM trunk
on the work-item loops produced by pocl's OpenCL C
kernel compiler, in hopes of implementing multi-work-item
work group autovectorization in a modular manner.

The vectorizer seems to refuse to vectorize a loop if it sees
multiple writes to the same memory object within the
same iteration. In the case of parallel loops such as
the work-item loops, it could just assume vectorization is doable
from the data dependency point of view -- no matter what kind of
memory accesses a single iteration does.

What would be the cleanest way to communicate the parallel loop
information to the vectorizer? There was some discussion of
parallelism information in LLVM some time ago on this list, but
it somehow died out. Was adding parallelism information to
the LLVM IR deemed a bad idea? Was any conclusion reached?

Another thing with OpenCL C autovectorization is that the
language itself has vector datatypes. In order to autovectorize
multi-WI work groups efficiently, it might be beneficial to
break the vectors in a single work item into scalars to get more
efficient vector hardware utilization. Is there an existing pass
that breaks vectors into scalars and works at the LLVM IR level?
There seems to be one at the code generation level, according to
this blog post: http://blog.llvm.org/2011/12/llvm-31-vector-changes.html

Thanks,

Hi Pekka,

Hi,

I started to play with the LoopVectorizer of LLVM trunk
on the work-item loops produced by pocl's OpenCL C
kernel compiler, in hopes of implementing multi-work-item
work group autovectorization in a modular manner.

Thanks for checking the Loop Vectorizer, I am interested in hearing your feedback. The Loop Vectorizer does not fit here. OpenCL vectorization is completely different because the language itself is data-parallel. You don't need all of the legality checks that the loop vectorizer has. Moreover, OpenCL has lots of language specific APIs such as "get_global_id" and builtin function calls, and without knowledge of these calls it is impossible to vectorize OpenCL.

The vectorizer seems to refuse to vectorize a loop if it sees
multiple writes to the same memory object within the
same iteration. In the case of parallel loops such as
the work-item loops, it could just assume vectorization is doable
from the data dependency point of view -- no matter what kind of
memory accesses a single iteration does.

Yep.

What would be the cleanest way to communicate the parallel loop
information to the vectorizer? There was some discussion of
parallelism information in LLVM some time ago on this list, but
it somehow died out. Was adding parallelism information to
the LLVM IR deemed a bad idea? Was any conclusion reached?

You need to implement something like Whole Function Vectorization (http://dl.acm.org/citation.cfm?id=2190061). The loop vectorizer can't help you here. Ralf Karrenberg open sourced his implementation on github. You should take a look.

Another thing with OpenCL C autovectorization is that the
language itself has vector datatypes.

Unfortunately yes. And OpenCL compilers scalarize these vector operations at some point in the compilation pipeline.

In order to autovectorize
multi-WI work groups efficiently, it might be beneficial to
break the vectors in a single work item into scalars to get more
efficient vector hardware utilization. Is there an existing pass
that breaks vectors into scalars and works at the LLVM IR level?

No. But this pass needs to be OpenCL specific because you want to scalarize function calls. OpenCL is "blessed" with lots of function calls, even for trivial type conversions.

There seems to be one at the code generation level, according to
this blog post: http://blog.llvm.org/2011/12/llvm-31-vector-changes.html

Yes, but you can't use it because you need to do this at the IR level.

- Nadav

It'd be great to have this in LLVM, though some care must be taken for it to
remain relevant (unlike the C back-end, for example). There are lots of
secrets around GPUs and concrete OpenCL implementations, which could make it
very hard to predict or model costs for each different GPU.

cheers,
--renato

Thanks for checking the Loop Vectorizer, I am interested in hearing your
feedback. The Loop Vectorizer does not fit here. OpenCL vectorization is
completely different because the language itself is data-parallel. You
don't need all of the legality checks that the loop vectorizer has.

I'm aware of this and it was my point in the original post.
However, I do not see why the loop vectorizer wouldn't fit
this use case, given how pocl's "kernel compiler" is structured.

How I see it, the data parallel input simply makes the vectorizer's job
easier (skip some of the legality checks) while reusing most of the
implementation (e.g. cost estimation, unrolling decisions, the
vector instruction formation itself, predication/if-conversion,
speculative execution+blend, etc.).

Now pocl's kernel compiler detects the "parallel regions" (the
regions between work group barriers) and generates a new function suitable
for executing multiple work items (WI) in the work group. One method to
generate such functions is to generate embarrassingly parallel "for-loops"
(wiloops) that produce the multi-WI DLP execution. That is, the loop
executes the code in the parallel regions for each work item in the work
group.

This step is needed to make the multi-WI kernel executable on
non-SIMD/SIMT platforms (read: CPUs). On the "SPMD-tailored" processors
(many GPUs) this step is not always necessary, as they can take the single
work-item kernel instructions as input and do the "spreading" on the fly.
We have a different method to generate the WG functions for such targets.

Moreover, OpenCL has lots of language specific APIs such as
"get_global_id" and builtin function calls, and without knowledge of these
calls it is impossible to vectorize OpenCL.

In pocl the whole kernel is "flattened", that is, the processed kernel code
does not usually have function calls. Well, printf() and some intrinsic
calls might be exceptions. In such cases vectorization could simply be
skipped and the parallelization attempted using some other
method (e.g. pure unrolling), as usual.

get_local_id is converted to regular iteration variables (local id space x,
y, z) in the wiloop.
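
To make this concrete, the generated wiloop is roughly of this shape (just an
illustrative C-like sketch, not the exact code pocl emits; the local_size_*
names here stand for the local work group dimensions):

static void work_group_sketch(int local_size_x, int local_size_y,
                              int local_size_z) {
  for (int z = 0; z < local_size_z; ++z)
    for (int y = 0; y < local_size_y; ++y)
      for (int x = 0; x < local_size_x; ++x) {
        /* the parallel region's code, with get_local_id(0) -> x,
           get_local_id(1) -> y and get_local_id(2) -> z */
      }
}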

I played yesterday a bit by kludge-hacking the LoopVectorizer code to
skip the canVectorizeMemory() check for these wiloop constructs and it
managed to vectorize a kernel as expected.

You need to implement something like Whole Function Vectorization
(http://dl.acm.org/citation.cfm?id=2190061). The loop vectorizer can't
help you here. Ralf Karrenberg open sourced his implementation on github.
You should take a look.

I think the WFV paper has plenty of good ideas that could be applied to
*improve* the vectorizability of DLP code/parallel loops (e.g. the mask
generation for diverging branches where the traditional if-conversion won't
do, especially for intra-kernel for-loops), but the actual vectorization
could be modularized into generic passes to, e.g., allow the choice of
target-specific parallelization methods later on.

From: "Pekka Jääskeläinen" <pekka.jaaskelainen@tut.fi>
To: "Nadav Rotem" <nrotem@apple.com>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Friday, January 25, 2013 5:35:16 AM
Subject: Re: [LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

> I played yesterday a bit by kludge-hacking the LoopVectorizer code to
> skip the canVectorizeMemory() check for these wiloop constructs and it
> managed to vectorize a kernel as expected.

Based on this experience, can you propose some metadata that would allow this to happen (so that the LoopVectorizer would be generally useful for POCL)? I suspect this same metadata might be useful in other contexts (such as implementing iteration-independence pragmas).

-Hal

I cannot yet. In this hack I simply changed LoopVectorizer to assume
all loops the vectorizer sees are parallel (as the kernels I tried
didn't have loops inside) to see where the other potential
vectorization obstacles are.

I'm planning to try next an approach where I add metadata
to the loop header basic block that simply marks that the loop is parallel.
The loop vectorizer, when it sees such metadata in the loop can then
skip cross-iteration memory dependency checks. If you think this is a
dead-end, please let me know. Otherwise, I'll try and see how it
works.

BR,

From: "Pekka Jääskeläinen" <pekka.jaaskelainen@tut.fi>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>, "Nadav Rotem" <nrotem@apple.com>
Sent: Friday, January 25, 2013 8:14:57 AM
Subject: Re: [LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

>> Based on this experience, can you propose some metadata that would allow
>> this to happen (so that the LoopVectorizer would be generally useful for
>> POCL)? I suspect this same metadata might be useful in other contexts
>> (such as implementing iteration-independence pragmas).
>
> I cannot yet. In this hack I simply changed LoopVectorizer to assume
> all loops the vectorizer sees are parallel (as the kernels I tried
> didn't have loops inside) to see where the other potential
> vectorization obstacles are.

Okay, I understand.

> I'm planning to try next an approach where I add metadata
> to the loop header basic block that simply marks that the loop is parallel.
> The loop vectorizer, when it sees such metadata in the loop can then
> skip cross-iteration memory dependency checks. If you think this is a
> dead-end, please let me know. Otherwise, I'll try and see how it
> works.

My point is that I specifically think that you should try it. I'm curious to see how what you come up with might apply to other use cases as well.

-Hal

Hi Pekka,

How I see it, the data parallel input simply makes the vectorizer's job
easier (skip some of the legality checks) while reusing most of the
implementation (e.g. cost estimation, unrolling decisions, the
vector instruction formation itself, predication/if-conversion,
speculative execution+blend, etc.).

What you need is outer loop vectorization while the loop vectorizer is an inner loop vectorizer.
If you decide to use the Loop Vectorizer then you won't be able to vectorize kernels that have inner loops or kernels that have barriers in them.
If you look at the AMD OpenCL SDK you will see that most of the workloads have barriers and inner loops.

Another problem that you may run into is 'early exits'. In many kernels you will see something like " if (get_global_id(0) > N) return; "

Not to mention that it will be very important for you to vectorize function calls. Vectorization needs to happen before inlining because you don't want to vectorize a cloud of instructions when you can convert a single function call. Think about image samplers or slightly more complex builtins that have control flow in them.

OK, attached is the first quick attempt towards this. I'm not
proposing committing this, but would like to get comments
to possibly move towards something committable.

It simply looks for a metadata named 'parallel_for' in any of the
instructions in the loop's header and assumes the loop is a parallel
one if such is found. This metadata is added by pocl's wiloops
generation routine. It passes the pocl test suite when enabled, but
probably cannot vectorize many kernels (at least) due to the missing
intra-kernel vector scalarizer.
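
The detection itself is roughly of this form (a simplified sketch of the idea
rather than the patch as such; the include paths follow current trunk and may
differ for other versions):

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"

// Treat the loop as parallel if any instruction in its header block
// carries the 'parallel_for' metadata added by pocl.
static bool hasParallelForMetadata(llvm::BasicBlock *Header) {
  for (llvm::BasicBlock::iterator I = Header->begin(), E = Header->end();
       I != E; ++I)
    if (I->getMetadata("parallel_for"))
      return true;
  return false;
}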

Some known problems that need addressing:

- Metadata can only be attached to Instructions (not Loops or even
   BasicBlocks), therefore the brute-force approach of marking all
   instructions in the header BB is used, in hopes that the optimizers
   retain at least one of them. E.g., a special intrinsic call
   might be a better solution.

- The loop header can potentially be shared between multilevel loops where the
   outer or inner levels might not be parallel. Not a problem in the pocl use
   case as the wiloops are fully parallel at all three levels, but it needs
   to be sorted out in a general solution.

   Perhaps it would be better to attach the metadata to the iteration
   count increment/check instruction(s) or similar to better identify the
   parallel (for) loop in question.

- Are there optimizations that might push code *illegally* into the parallel
   loop from outside of it? If there is, e.g., a non-parallel loop inside
   a parallel loop, loop invariant code motion might move code from the
   inner loop to the parallel loop's body. That should be a safe optimization,
   to my understanding (it preserves the ordering semantics), but I wonder if
   there are others that might cause breakage.

llvm-3.3-loopvectorizer-parallel_for-metadata-detection.patch (1.72 KB)

Hi Nadav,

What you need is outer loop vectorization while the loop vectorizer is an
inner loop vectorizer. If you decide to use the Loop Vectorizer then you
won't be able to vectorize kernels that have inner loops or kernels that have
barriers in them. If you look at the AMD OpenCL SDK you will see that most of
the workloads have barriers and inner loops.

Barriers are handled by the "parallel region formation" phase of
pocl. That is a problem distinct from the actual parallelization method
such as vectorization (or, e.g., unroll+VLIW scheduling).

Kernel loops with non-divergent iteration counts can be executed in lock step
and also vectorized. The parallel region/wiloop can be formed
inside the kernel loop, which can then be vectorized:

kernel_for_loop {
    parallel_wiloop over x {
      .. the original kernel loop body ..
    }
}

Vectorizing divergent loops needs masking or similar, e.g., as presented
in the WFV paper, but this doesn't need to be an OpenCL-specific optimization
as it helps vectorization in general.

An example is the case where the inner-loop iteration count depends on the
outer loop:

parallel_wiloop over x {
    kernel_for_loop i := 0...x { // or similar variable range depending on x
      ...
    }
}

to

kernel_for_loop {
    parallel_wiloop {
       // the whole body predicated with the kernel_for_loop condition
       // that includes 'x' somewhere
    }
}

Another problem that you may run into is 'early exits'. In many kernels you
will see something like " if (get_global_id(0) > N) return; "

Now in pocl this ends up being a parallel region similar to this:

parallel_wiloop over x {
    if (x > N) goto ret;
    ... kernel code here
ret:
}

Not the easiest case to parallelize but might be doable because N
can be used to modify the wiloop iteration range.

parallel_wiloop x:= 0...N-1 {
    ... kernel code here
}

Anyway, one cannot be expected to defeat all bad kernel coding practices.

Not to mention that it will be very important for you to vectorize function
calls. Vectorization needs to happen before inlining because you don't want
to vectorize a cloud of instructions when you can convert a single function
call. Think about image samplers or slightly more complex builtins that have
control flow in them.

I think function calls are one thing, builtins/intrinsics another.

Vectorizing builtins is something that is partially OpenCL-specific (if the
builtins themselves are OpenCL-specific), but I think there should be benefit
in a generic implementation of that case also, i.e., converting
builtin/intrinsic calls to their vector counterparts, if available.

Say,

for_loop {
   call @llvm.sin.f32(a[x]);
   ...
}

is useful to vectorize if the target ISA can do a SIMD sinf.

In any case, it's clear some kernels are not vectorizable (at least
beneficially so), e.g. due to non-predicateable (huh!) control flow, but
that does not have much to do with the actual vectorizing method or the input
language used.

BR,

Pekka,

I am in favor of adding metadata to control different aspects of vectorization, mainly for supporting user-level pragmas [1] but also for DSLs.
Before we start adding metadata to the IR we need to define the semantics of the tags. "Parallel_for" is too general. We also want to control vectorization factor, unroll factor, cost model, etc.

Doug Gregor suggested to add the metadata to the branch instruction of the latch block in the loop.

My main concern is that your approach for vectorizing OpenCL is wrong. OpenCL was designed for SPMD/outer-loop vectorization and any good OpenCL vectorizer should be able to vectorize 100% of the workloads. The Loop Vectorizer vectorizes innermost loops only. It has a completely different cost model and legality checks. You also have no use for reduction variables, reverse iterators, etc. If all you are interested in is the widening of instructions then you can easily implement it.

- Nadav

[1] http://software.intel.com/en-us/articles/vectorization-with-the-intel-compilers-part-i

I am in favor of adding metadata to control different aspects of
vectorization, mainly for supporting user-level pragmas [1] but also for
DSLs. Before we start adding metadata to the IR we need to define the
semantics of the tags. "Parallel_for" is too general. We also want to control
vectorization factor, unroll factor, cost model, etc.

These are used to control *how* the loops are parallelized.
The generic "parallel_for" lets the compiler (try to) make the actual
parallelization decisions based on the target (aiming for performance
portability). So, both have their uses.

Doug Gregor suggested to add the metadata to the branch instruction of the
latch block in the loop.

OK that should work better. I'll look into it next week.
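
The check would then become something like this (an untested sketch, assuming
the metadata ends up only on the latch block's terminator;
Loop::getLoopLatch() returns the latch when there is exactly one):

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Instruction.h"

// Treat the loop as parallel if the branch terminating its latch
// block carries the 'parallel_for' metadata.
static bool isAnnotatedParallel(const llvm::Loop *L) {
  llvm::BasicBlock *Latch = L->getLoopLatch();
  if (!Latch)
    return false; // be conservative with multiple latches
  return Latch->getTerminator()->getMetadata("parallel_for") != 0;
}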

My main concern is that your approach for vectorizing OpenCL is wrong. OpenCL
was designed for SPMD/outer-loop vectorization and any good OpenCL vectorizer
should be able to vectorize 100% of the workloads. The Loop Vectorizer
vectorizes innermost loops only. It has a completely different cost model and
legality checks. You also have no use for reduction variables, reverse
iterators, etc. If all you are interested in is the widening of instructions
then you can easily implement it.

Sorry, I still don't see the problem in the "modular" approach vs. generating
vector instructions directly in pocl -- but then again, I'm not a vectorization
expert. All I'm really trying to do is to delegate the "widening of
instructions" and the related tasks to the loop vectorizer. If it doesn't
need all of the vectorizer's features, that should not be a problem AFAIU. I
think it's better for me to just play a bit with it and experience the
possible problems first-hand.

Pekka Jääskeläinen wrote:

My point is that I specifically think that you should try it. I'm curious
to see how what you come up with might apply to other use cases as well.

OK, attached is the first quick attempt towards this. I'm not
proposing committing this, but would like to get comments
to possibly move towards something committable.

It simply looks for a metadata named 'parallel_for' in any of the
instructions in the loop's header and assumes the loop is a parallel
one if such is found.

Aren't all loops in OpenCL parallel? Or are you planning to inline non-OpenCL code into your OpenCL code before running the vectorizer? If not, just have the vectorizer run as part of the pipeline you set up when producing IR from OpenCL code. That it would miscompile non-OpenCL code is irrelevant.
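
Running it from your own pipeline is only a couple of lines (a sketch against
the current C++ pass manager API; I have no idea what pocl's driver code
actually looks like):

#include "llvm/PassManager.h"
#include "llvm/Transforms/Vectorize.h"

// Sketch: add the loop vectorizer to a custom pass pipeline.
void addVectorizerPasses(llvm::PassManager &PM) {
  PM.add(llvm::createLoopVectorizePass());
}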

+ for (BasicBlock::iterator ii = header->begin();
+ ii != header->end(); ii++) {

http://llvm.org/docs/CodingStandards.html#don-t-evaluate-end-every-time-through-a-loop
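
I.e., something along these lines (just a sketch of the loop form the coding
standard asks for, hoisting the end() call):

for (BasicBlock::iterator ii = header->begin(), ie = header->end();
     ii != ie; ++ii) {
  // ... loop body unchanged ...
}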

Nick


Hi Nick,

Aren't all loops in OpenCL parallel?

The intra-kernel loops (what the OpenCL C programmer writes) are not by
default parallel. Only the implicit "work group loops" (that iterate
over the work items in the local work space for the regions between
barriers) are.

Or are you planning to inline non-OpenCL code into your OpenCL code before
running the vectorizer? If not, just have the vectorizer run as part of the
pipeline you set up when producing IR from OpenCL code. That it would
miscompile non-OpenCL code is irrelevant.

I (still) think a cleaner and more modularized approach is to simply add
parallel-loop awareness to the regular vectorizer. This should help
other parallel languages with parallel loop constructs, too.

The basic idea is to use a loop interchange-style optimization to convert
the work group function into a generic inner loop vectorization problem,
effectively doing outer-loop vectorization this way, as Nadav Rotem
suggested. Let's see how it goes.

+ for (BasicBlock::iterator ii = header->begin();
+ ii != header->end(); ii++) {

http://llvm.org/docs/CodingStandards.html#don-t-evaluate-end-every-time-through-a-loop

Thanks. I'll send an updated patch shortly in a separate
email thread.

BR,

Hi Pekka, hi Nadav,

I didn't find the time to read this thread until now, sorry for that.

I actually think you are both right :).
As for the current status, the loop vectorizer is only able to vectorize inner loops and (I think) does not handle function calls and memory operations well. This will prevent it from vectorizing a large group of OpenCL kernels, and certainly all "interesting", more complex ones.
However, in the long run, I think the only difference between WFV-like approaches and classic loop vectorization a la LoopVectorizer in an OpenCL context is the following:
WFV assumes that there is at least one outer loop that has increments of one, runs a number of iterations that is a multiple of the SIMD width, and that every iteration is independent (barriers can be handled by the OpenCL driver *after* WFV).

On the other hand, LoopVectorizer may not be aimed at covering all kinds of code inside the body and/or instead focus more on things not required by WFV, such as handling reductions and other kinds of loop-carried dependencies.

In any case, since our own OpenCL driver is more of a proof-of-concept implementation and not very robust, I'd be willing to give it a try to integrate the current libWFV into pocl. This should boost performance quite a bit for many kernels without too much effort ;). I just don't know (yet) where to start - can you give me a hint, Pekka?

Cheers,
Ralf

Hi Ralf,

As for the current status, the loop vectorizer is only able to vectorize
inner loops and (I think) does not handle function calls and memory
operations well. This will prevent it from vectorizing a large group of
OpenCL kernels, and certainly all "interesting", more complex ones.

Agreed -- but not being able to handle function calls/intrinsics is
not an OpenCL-specific limitation. Any vectorizable input suffers from
that. Also, an inner loop vectorizer might be able to handle outer loops
e.g. via loop interchange. I'm planning to look into that if time allows.

However, in the long run, I think the only difference between WFV-like
approaches and classic loop vectorization a la LoopVectorizer in an
OpenCL context is the following:
WFV assumes that there is at least one outer loop that has increments of
one, runs a number of iterations that is a multiple of the SIMD width, and
that every iteration is independent (barriers can be handled by the OpenCL
driver *after* WFV).

Yes, this is the case with the "wiloops" work group generation
method of pocl. The parallel outer loops are the max 3 dimensions of the
local space. The actual wg barrier calls are converted to no-ops (compiler
barriers) for the current targets.

On the other hand, LoopVectorizer may not be aimed at covering all kinds
of code inside the body and/or instead focus more on things not required
by WFV, such as handling reductions and other kinds of loop-carried
dependencies.

It is true that the feature set of the LoopVectorizer goes beyond the
"embarrassingly parallel loops" that the implicit WI loops are. However,
I don't see this as a show-stopper for trying to provide a modularized
approach to work group vectorization.

Moreover, parallelization-helping optimizations such as "loop masking" for
the diverging inner-loops (kernel loops) are more generally useful, and, IMHO
should be added to LLVM upstream (not to an OpenCL implementation only)
eventually as generic loop vectorization routines.

In any case, since our own OpenCL driver is more of a proof-of-concept
implementation and not very robust, I'd be willing to give it a try to
integrate the current libWFV into pocl. This should boost performance
quite a bit for many kernels without too much effort ;). I just don't
know (yet) where to start - can you give me a hint, Pekka?

I'm very glad to hear this! Luckily, the pocl code base has been modularized
to allow easily switching the "work group function generation method" which I
think your WFV work actually is.

Perhaps the detailed instructions on how to start are off-topic here, and
you might want to join the pocl-devel list (and #pocl) where the pocl
developers can give more hints. See http://pocl.sourceforge.net/discussion.html.

BR,

From: "Pekka Jääskeläinen" <pekka.jaaskelainen@tut.fi>
To: "Ralf Karrenberg" <Chareos@gmx.de>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Thursday, January 31, 2013 11:15:43 AM
Subject: Re: [LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

> Also, an inner loop vectorizer might be able to handle outer loops
> e.g. via loop interchange. I'm planning to look into that if time allows.

This is also on my TODO list. Let's collaborate when you have time.

> Moreover, parallelization-helping optimizations such as "loop masking" for
> the diverging inner-loops (kernel loops) are more generally useful, and, IMHO
> should be added to LLVM upstream (not to an OpenCL implementation only)
> eventually as generic loop vectorization routines.

I completely agree.

>> In any case, since our own OpenCL driver is more of a proof-of-concept
>> implementation and not very robust, I'd be willing to give it a try to
>> integrate the current libWFV into pocl. This should boost performance
>> quite a bit for many kernels without too much effort ;).

Ralf, Does this mean that you're close to releasing the new version?

Thanks again,
Hal

Hi Pekka,

On the other hand, LoopVectorizer may not be aimed at covering all kinds
of code inside the body and/or instead focus more on things not required
by WFV, such as handling reductions and other kinds of loop-carried
dependencies.

It is true that the feature set of the LoopVectorizer goes beyond the
"embarrassingly parallel loops" that the implicit WI loops are. However,
I don't see this as a show-stopper for trying to provide a modularized
approach to work group vectorization.

Moreover, parallelization-helping optimizations such as "loop masking" for
the diverging inner-loops (kernel loops) are more generally useful, and,
IMHO
should be added to LLVM upstream (not to an OpenCL implementation only)
eventually as generic loop vectorization routines.

Yes, I fully agree. I already told Nadav that he will immediately get access to my new implementation when he reaches that point to prevent him from re-implementing everything (or at least to have some code to refer to ;) ). The code is not released yet, but it is under LLVM license so there's no problem with that.

In any case, since our own OpenCL driver is more of a proof-of-concept
implementation and not very robust, I'd be willing to give it a try to
integrate the current libWFV into pocl. This should boost performance
quite a bit for many kernels without too much effort ;). I just don't
know (yet) where to start - can you give me a hint, Pekka?

I'm very glad to hear this! Luckily, the pocl code base has been
modularized
to allow easily switching the "work group function generation method"
which I
think your WFV work actually is.

Perhaps the detailed instructions on how to start are off-topic here, and
you might want to join the pocl-devel list (and #pocl) where the pocl
developers can give more hints. See
http://pocl.sourceforge.net/discussion.html.

I'll do this now :).

Cheers,
Ralf

Hi Hal,

>> In any case, since our own OpenCL driver is more of a proof-of-concept
>> implementation and not very robust, I'd be willing to give it a try to
>> integrate the current libWFV into pocl. This should boost performance
>> quite a bit for many kernels without too much effort ;).
>
> Ralf, Does this mean that you're close to releasing the new version?

It depends ;). The new version is already running in our OpenCL driver, which means that it is more or less at the same level as the old implementation now. However, the exploitation of the divergence analysis as described in our CC'12 paper is not fully implemented yet; I can't seem to find the time for that right now :(.
Anyway, if you guys are interested, I can give you access to the repository.

Best,
Ralf

From: "Ralf Karrenberg" <Chareos@gmx.de>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Pekka Jääskeläinen" <pekka.jaaskelainen@tut.fi>, "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>, "Nadav
Rotem" <nrotem@apple.com>
Sent: Friday, February 1, 2013 1:49:28 AM
Subject: Re: [LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

> Anyway, if you guys are interested, I can give you access to the
> repository.

I think that would be useful, thanks!

-Hal