[PATCH] parallel loop awareness to the LoopVectorizer

Hi,

Attached is a patch which uses a simple "parallel_loop" metadata attached
to the loop branch instruction in the loop latch for skipping cross-iteration
memory dependency checking in the LoopVectorizer. This was briefly discussed
in the email thread "LoopVectorizer in OpenCL C work group autovectorization".

It also converts the "min iteration count to vectorize" to a parameter so
this can be controlled from the command line.

Comments welcomed.

Thanks in advance,

llvm-3.3-loopvectorizer-parallel_for-metadata-detection.patch (2 KB)

Hi,

> Attached is a patch which uses a simple "parallel_loop" metadata attached
> to the loop branch instruction in the loop latch for skipping
> cross-iteration memory dependency checking in the LoopVectorizer. This
> was briefly discussed in the email thread "LoopVectorizer in OpenCL C
> work group autovectorization".

Can you provide a test case?

> It also converts the "min iteration count to vectorize" to a parameter so
> this can be controlled from the command line.

This seems to be a separate patch?

Tobi

> Attached is a patch which uses a simple "parallel_loop" metadata attached
> to the loop branch instruction in the loop latch for skipping
> cross-iteration memory dependency checking in the LoopVectorizer. This
> was briefly discussed in the email thread "LoopVectorizer in OpenCL C
> work group autovectorization".

I agree this is a good idea, but not sure about the implementation.

+ if (latch->getTerminator()->getMetadata("parallel_loop") != NULL) {

This seems an awfully specific check on a generic part of the code... Is
this metadata standard in any form? Is this OpenCL specific? Do all
OpenCL front-ends generate the same metadata in that way? Etc...

> It also converts the "min iteration count to vectorize" to a parameter so
> this can be controlled from the command line.

Is this really necessary? Do you have use cases where this would make sense?

I think you should send a test case with this patch, not separate.

cheers,
--renato

Hi Renato,

> This seems an awfully specific check on a generic part of the code...
True. Perhaps the check is better encapsulated, e.g., in the Loop class?
Or, if there's such thing as a loop-carried data dependency analyzer,
the correct place could be there, as a trivial "no deps" analysis.

> Is this metadata standard in any form? Is this OpenCL specific? Do all

This metadata is not standard in any form. Therefore the request
for comments. However, its meaning is generic, not OpenCL
specific at all. It specifies that the loop iterations can be
treated as independent, regardless of the memory operations the
body contains. Thus, the potential cross-iteration memory dependencies
can be considered a programming error.
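To make this concrete, here is an illustrative C loop (a made-up example,
not from the patch) whose iterations are independent by construction, e.g.
because OpenCL work-group semantics guarantee it, even though a compiler
cannot prove that from the code alone:

```c
/* Illustrative only: `scale` and `perm` are hypothetical names. The front
   end "knows" perm is a permutation, so no two iterations touch the same
   element of out; static dependency analysis cannot prove that. The
   proposed parallel_loop metadata would record exactly this kind of
   guarantee so the vectorizer can skip the cross-iteration check. */
void scale(float *out, const float *in, const int *perm, int n) {
    for (int i = 0; i < n; ++i)
        out[perm[i]] = in[perm[i]] * 2.0f;
}
```

Without the metadata, the vectorizer must assume the `out[perm[i]]`
accesses may overlap across iterations; with it, any actual overlap is
treated as a programming error.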

> OpenCL front-ends generate the same meta-data in that way? Etc...

I have no knowledge of other OpenCL implementations than
pocl as I haven't seen their code.

> It also converts the "min iteration count to vectorize" to a parameter so
> this can be controlled from the command line.

> Is this really necessary? Do you have use cases where this would make sense?

Where could a lower threshold be useful? At least with loops that have long
bodies, and with loops whose outer loop iterates the inner loop many
times.

In fact, shouldn't the default minimum be the minimum vector width of the
machine, with the cost estimation routine taking care of the actual
profitability estimate?

> I think you should send a test case with this patch, not separate.

As soon as there's a consensus on the metadata format and where
the check should reside, I'll prepare a proper patch with
a vectorizer test case.

Thanks for the comments so far,

Hi Pekka,

I am okay with this patch, assuming that you follow the review of Tobias and Renato and provide a separate patch for the min-iter-count and a few test cases.

I think that it would be a good idea to start a new thread and to discuss the best way to annotate loops in LLVM.

Thanks,
Nadav

From: "Nadav Rotem" <nrotem@apple.com>
To: "Pekka Jääskeläinen" <pekka.jaaskelainen@tut.fi>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Monday, January 28, 2013 10:45:36 AM
Subject: Re: [LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer

> Hi Pekka,
>
> I am okay with this patch, assuming that you follow the review of
> Tobias and Renato and provide a separate patch for the
> min-iter-count and a few test cases.
>
> I think that it would be a good idea to start a new thread and to
> discuss the best way to annotate loops in LLVM.

Is this sufficient to implement #pragma ivdep in clang?

-Hal

OK. Any opinions on the location of the isParallelLoop() check? Shall I
put it in Loop so it is more widely accessible, i.e. Loop->isParallel()?

I'd assume so. It looks as though Loop->isParallel() or similar should
account for both the OpenCL and #ivdep cases.

If we don't have #ivdep yet, this would be a good time to add it, at least
setting the attribute in Clang for both.

Nadav, you mentioned ivdep a month ago, did that get any traction?

cheers,
--renato

> Is this sufficient to implement #pragma ivdep in clang?

I think so.

I'm not completely sure of this:

"Note: The proven dependencies that prevent vectorization are not ignored,
only assumed dependencies are ignored."

Thus, there's a slight difference. It cannot be used to disable dependency
checking altogether (and just blame a sloppy programmer if there actually
are dependencies); it just converts an "unknown alias" to "no alias".
If there's a "yes" from the analyzer, it still prevents the vectorization.
So, it's a sort of softened, programmer-friendlier version of the semantics.

The vagueness comes from the fact that it depends on the intelligence of
the dependency analysis implementation whether a dependency can be "proven"
or not, doesn't it? Thus, #pragma ivdep with a non-existent loop dependence
analyzer is equivalent to the semantics of the proposed metadata.

Also, it's a bit unclear what the real difference is to #pragma parallel:

It similarly states: "However, if dependencies are proven, they are not
ignored." So conversely, if the compiler cannot prove a dependency for
some reason, they *are* ignored?
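The distinction can be sketched in plain C (an illustrative example, not
from the patch; both function names are made up):

```c
/* ivdep-style semantics: "assumed" (unprovable) dependencies are ignored,
   while "proven" dependencies still block vectorization. */

/* Here the dependency distance is the runtime value k, unknown at compile
   time: an assumed dependency, which ivdep lets the vectorizer ignore. */
void add_offset(float *a, int k, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = a[i + k] + 1.0f;
}

/* Here the dependency distance is the constant 1: a proven dependency.
   Under ivdep semantics it still prevents vectorization; under the
   proposed metadata, ignoring it would simply be a programming error. */
void shift(float *a, int n) {
    for (int i = 0; i < n - 1; ++i)
        a[i + 1] = a[i];
}
```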

OpenMP's 'omp for', on the other hand, can be used to mark a truly parallel
loop where this metadata could be used if one wants to parallelize those
loops using a finer-granularity mechanism than threads.

That said, I cannot think of a case where it would *harm* if the
dependency analyzer, if it can actually prove a dependency, serializes
the code.

Thus, the same metadata can be used in both cases, if one doesn't mind the
possible wasted compilation time spent on the unnecessary dependency
checking.

From: "Pekka Jääskeläinen" <pekka.jaaskelainen@tut.fi>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Nadav Rotem" <nrotem@apple.com>, "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Monday, January 28, 2013 11:36:21 AM
Subject: Re: [LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer

> Is this sufficient to implement #pragma ivdep in clang?

> I'm not completely sure of this:
>
> "Note: The proven dependencies that prevent vectorization are not ignored,
> only assumed dependencies are ignored."
>
> Thus, there's a slight difference. It cannot be used to disable dependency
> checking altogether (and just blame a sloppy programmer if there actually
> are dependencies), but it just converts the "unknown alias" to "no alias".
> If there's a "yes" from the analyzer it still prevents the vectorization.
> So, sort of a softened programmer-friendlier version of the semantics.
>
> The vagueness comes from that it depends on the intelligence of the
> dependency analysis implementation whether a dependency can be "proven"
> or not, doesn't it? Thus, #pragma ivdep with a non-existing loop
> dependence analyzer is equivalent to the semantics of the proposed
> metadata.

And the user has no way of knowing which dependencies are proven and which are assumed, right? It seems like the user just needs to assume that nothing is proven ;-) Nevertheless, based on this, we probably do need something with slightly weaker semantics for ivdep.

> Also, it's a bit unclear what is the real difference to the #pragma
> parallel:
>
> It similarly states: "However, if dependencies are proven, they are not
> ignored." So conversely, if the compiler cannot prove a dependency for
> some reason, they *are* ignored?

Interesting.

> OpenMP's 'omp for', on the other hand, can be used to mark a truly
> parallel loop where this metadata could be used if one wants to
> parallelize those loops using a finer-granularity mechanism than threads.

Agreed; we should make sure to incorporate this into the upcoming OpenMP support. The loops will be outlined, but the outlined pieces can then be marked with this 'parallel' metadata.

Thanks again,
Hal

From: "Pekka Jääskeläinen" <pekka.jaaskelainen@tut.fi>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Nadav Rotem" <nrotem@apple.com>, "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Monday, January 28, 2013 11:58:09 AM
Subject: Re: [LLVMdev] [PATCH] parallel loop awareness to the LoopVectorizer

> If there's a "yes" from the analyzer it still prevents the
> vectorization.
> So, sort of a softened programmer-friendlier version of the
> semantics.

> That said, I cannot think of a case where it would *harm* if the
> dependency analyzer, if it can actually prove a dependency, serializes
> the code.
>
> Thus, the same metadata can be used in both cases, if one doesn't care
> the possible wasted compilation time spent on the unnecessary
> dependency checking.

Agreed. In fact, it would not really be wasted, because what we *should* do in that case is warn the user about it.

-Hal

It sounds like a good idea to move the method into Loop.

Is there a naming scheme for metadata? I think llvm.loop.* would be helpful for loop-specific metadata. As for 'parallel', I think it is a little too generic. If ivdep are the semantics you're going for, I'd use that.

paul

Fine, except I prefer not to include 'v' in it. Vectorization
is merely one way to parallelize the loop.

How does llvm.loop.ignore_assumed_deps sound?

sounds good.

Sounds better!

Hi Pekka,

I was a little bit too fast in my last email. I would like to discuss the
test cases I want to see a little bit more. I am especially interested in
seeing that the metadata is robust in the presence of transformations.

Assuming we have something like:

# ignore assumed dependences.
for (i = 0; i < 4; i++) {
    tmp1 = A[3*i+1];
    tmp2 = A[3*i+2];
    tmp3 = tmp1 + tmp2;
    A[3*i] = tmp3;
}

Now I apply for whatever reason a partial reg2mem transformation.

float tmp3[1];

# ignore assumed dependences. // Still valid?
for (i = 0; i < 4; i++) {
    tmp1 = A[3*i+1];
    tmp2 = A[3*i+2];
    tmp3[0] = tmp1 + tmp2;
    A[3*i] = tmp3[0];
}

Is the metadata still valid now, or how do we ensure that the invalid metadata is removed?

I have the feeling it may be necessary to link the loop as well as the
accesses for which we can ignore dependences together, and mark the whole
construct as "this set of memory accesses does not have dependences within
the mentioned loop". Like this, the introduction of additional memory
accesses, which may very well introduce dependences that we need to
preserve, can be easily detected.

The parallelism check would now be: "In case the
llvm.loop.ignore_assumed_deps meta-data is found in a loop header _and_ all
memory accesses in the loop are referenced by this meta-data, the loop is
parallel. If there is a memory access that does not reference the
ignore_assumed_deps meta-data, we can not assume anything about the loop."
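A minimal sketch of that check, using a hypothetical data model (LLVM's
actual metadata representation differs; the names here are made up):

```c
#include <stdbool.h>

/* Hypothetical model: each memory access records which loop metadata node
   it references (NULL if none). */
struct MemAccess { const void *md_ref; };

/* The loop is parallel only if every access references the loop's
   ignore_assumed_deps metadata. An access introduced later, e.g. by a
   reg2mem-style transformation, lacks the reference and is detected. */
bool loop_is_parallel(const void *loop_md,
                      const struct MemAccess *accesses, int n) {
    for (int i = 0; i < n; ++i)
        if (accesses[i].md_ref != loop_md)
            return false;
    return true;
}
```

The point of the design is that the conservative answer falls out
automatically: a transformation that adds an access without knowing about
the metadata silently downgrades the loop to "not known parallel".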

Cheers,
Tobias

Hi Tobias,

> Is the meta data now still valid or how do we ensure the invalid meta
> data is removed?

It seems it's not valid anymore. Good catch. I was asking for these
transformation cases earlier; probably there are more we have not thought of yet.

> I have the feeling it may be necessary to link the loop as well as the
> accesses for which we can ignore dependences together and mark the whole
> construct as "this set of memory accesses does not have dependences within
> the mentioned loop". Like this, the introduction of additional memory
> accesses which may very well introduce dependences that we need to
> preserve, can be easily detected. The parallelism check would now be: "In
> case the llvm.loop.ignore_assumed_deps meta-data is found in a loop header
> _and_ all memory accesses in the loop are referenced by this meta-data,
> the loop is parallel. If there is a memory access that does not reference
> the ignore_assumed_deps meta-data, we can not assume anything about the
> loop."

Sounds reasonable, as it's too intrusive to start making all the passes
parallel loop-aware.

Also, I was thinking how to *retain* this data during unrolling. At least
in-order targets benefit from the information in their alias analysis to
produce more instruction scheduling freedom while scheduling code from the
unrolled parallel iterations. Also the BBVectorizer should benefit from this.

Annotating the memory operations in the iteration should cover this use case
too, if the metadata also includes the iteration id. Then at unroll time,
the metadata can be replicated and the id incremented. An alias analyzer
(something similar to what we have in pocl for unrolled/chained work items)
can then check for this metadata and return NO_ALIAS for accesses from
different iterations.

Going even further, maybe this same (or similar) format can be used to
retain restricted pointer (noalias) information across inlining. Lack of this
has been a bit of a nuisance for us in TCE.

In the restricted ptr case one wants to mark pointers with some kind of
context marker (e.g. a function name + pointer identification) which
communicates that the accesses through that pointer do not alias with
other pointers in the same function context.

The mem accesses in the parallel loop iteration could have a metadata
"llvm.mem.parallel_loop_iter 0 0" or similar where the first id is a
loop id, the latter the iteration id. Restricted pointer mem accesses
could use metadata "llvm.mem.restrict my_function my_pointer". A parallel
loop alias analyzer can use the parallel_loop_iter metadata (if loop id
equals but iteration id doesn't, return NO_ALIAS) and a restrict pointer
AA use the latter (if both of the queried pointers have it and it differs
for the pointer part only, return NO_ALIAS).
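A rough sketch of those two queries, using a hypothetical data model (the
metadata names come from this thread; nothing here is an existing LLVM
interface):

```c
#include <string.h>

enum AliasResult { MAY_ALIAS, NO_ALIAS };

/* "llvm.mem.parallel_loop_iter <loop id> <iteration id>" */
struct ParLoopIterMD { int present; int loop_id; int iter_id; };

/* Same loop, different iteration: the accesses cannot alias. */
enum AliasResult par_loop_query(struct ParLoopIterMD a,
                                struct ParLoopIterMD b) {
    if (a.present && b.present &&
        a.loop_id == b.loop_id && a.iter_id != b.iter_id)
        return NO_ALIAS;
    return MAY_ALIAS;
}

/* "llvm.mem.restrict <function> <pointer>" */
struct RestrictMD { int present; const char *func; const char *ptr; };

/* Same function context, different restricted pointers: no alias. */
enum AliasResult restrict_query(struct RestrictMD a, struct RestrictMD b) {
    if (a.present && b.present && strcmp(a.func, b.func) == 0 &&
        strcmp(a.ptr, b.ptr) != 0)
        return NO_ALIAS;
    return MAY_ALIAS;
}
```

Both queries are conservative in the same way: if either access lacks the
metadata, or the contexts don't match, they fall back to MAY_ALIAS.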

> Sounds reasonable, as it's too intrusive to start making all the passes
> parallel loop-aware.

+1. It will bloat IR a bit, but we can turn it on only when vectorization
is enabled, when it's likely that people won't be looking at IR anyway.

> In the restricted ptr case one wants to mark pointers with some kind of
> context marker (e.g. a function name + pointer identification) which
> communicates that the accesses through that pointer do not alias with
> other pointers in the same function context.

You may have issues with inlining, i.e. you'll have to teach it to change the
metadata to cope with the function changing, since inlining can expose many
forms of parallelism.

cheers,
--renato