Parallel Loop Metadata

Hi,

I am continuing the discussion about Parallel Loop Metadata from here: http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-February/059168.html and here: http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-February/058999.html

Pekka suggested that we add two kinds of metadata: llvm.loop.parallel (attached to each loop latch) and llvm.mem.parallel (attached to each memory instruction!). I think that the motivation for the first metadata is clear: it says that the loop is data-parallel. I can also see us adding additional metadata such as llvm.loop.unrollcnt to allow users to control the unroll count of loops using pragmas. That's fine. Pekka, can you think of transformations that may need to invalidate or take this metadata into consideration?
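
For concreteness, here is a minimal sketch of how the proposed annotations might look in IR. The syntax is modern and the function, values, and trip count are made up for illustration; only the two metadata names come from the proposal. Every memory access in the parallel loop carries llvm.mem.parallel and the latch branch carries llvm.loop.parallel, both pointing at a unique loop identifier node:

define void @f(ptr %A) {
entry:
  br label %for.body

for.body:
  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
  %p = getelementptr inbounds float, ptr %A, i64 %i
  ; each memory access in the parallel loop would carry llvm.mem.parallel,
  ; pointing at the loop's identifier node
  %v = load float, ptr %p, !llvm.mem.parallel !0
  %v2 = fadd float %v, 1.0
  store float %v2, ptr %p, !llvm.mem.parallel !0
  %i.next = add nuw nsw i64 %i, 1
  %exitcond = icmp eq i64 %i.next, 100
  ; the loop latch branch would carry llvm.loop.parallel
  br i1 %exitcond, label %exit, label %for.body, !llvm.loop.parallel !0

exit:
  ret void
}

!0 = distinct !{!0}   ; unique loop identifier both annotations refer to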

Regarding the second metadata that you proposed, I am a bit skeptical. I don't fully understand its semantics and I am not sure why we need it. And even if we do need it, I think that it would require changing too many passes.

It is very, very important to take the complexity of these features into account. In the past we rejected the parallel 'barrier' semantics because it required changing too many unrelated passes.

Nadav

Hi Nadav,

Pekka suggested that we add two kinds of metadata: llvm.loop.parallel (attached to each loop latch) and llvm.mem.parallel (attached to each memory instruction!). I think that the motivation for the first metadata is clear: it says that the loop is data-parallel. I can also see us adding additional metadata such as llvm.loop.unrollcnt to allow users to control the unroll count of loops using pragmas. That's fine. Pekka, can you think of transformations that may need to invalidate or take this metadata into consideration?

Any pass that introduces new non-parallel memory instructions to the loop because it thinks the loop is sequential and it's OK to do so. I do not know of any such pass other than the one pointed out earlier, reg2mem (if the variables inside the loop body reuse stack slots). E.g., inlining should be safe, and so should unrolling an inner loop inside a parallel loop.

Anyway, the fact that I do not know of more such passes does not mean there aren't any, especially when you consider that there are out-of-tree passes in external projects that use LLVM. Therefore, the "safety first" approach of annotating the memory instructions and falling back to sequential semantics if non-annotated memory instructions are found sounds sensible to me.
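
To illustrate the fallback, here is a hypothetical continuation of the earlier IR sketch after a parallel-loop-unaware, reg2mem-style pass demotes a value to a stack slot (everything except the two metadata names is assumed for illustration). The spill and reload it introduces carry no llvm.mem.parallel annotation and every iteration now touches the same slot, so a consumer that finds unannotated accesses must fall back to sequential semantics, even though llvm.loop.parallel is still attached:

define void @f_after_demotion(ptr %A) {
entry:
  %tmp = alloca float                  ; stack slot introduced by the pass
  br label %for.body

for.body:
  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
  %p = getelementptr inbounds float, ptr %A, i64 %i
  %v = load float, ptr %p, !llvm.mem.parallel !0
  %v2 = fadd float %v, 1.0
  ; spill/reload inserted by the parallel-loop-unaware pass; note the missing
  ; llvm.mem.parallel annotation and the conflict on %tmp across iterations
  store float %v2, ptr %tmp
  %v3 = load float, ptr %tmp
  store float %v3, ptr %p, !llvm.mem.parallel !0
  %i.next = add nuw nsw i64 %i, 1
  %exitcond = icmp eq i64 %i.next, 100
  br i1 %exitcond, label %exit, label %for.body, !llvm.loop.parallel !0

exit:
  ret void
}

!0 = distinct !{!0}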

Your loop unroll metadata example does not need this as it's not related
to the parallel semantics. That one works both for parallel and sequential
loops.

The other way to go is the "jump in the cold water" approach: assume the parallel loop metadata itself is something that must be respected by all passes or breakage might happen, which is a bit rough and not allowed according to the metadata guidelines. It effectively adds a new semantic construct, a parallel loop, to the LLVM IR. Thus, it's then something that all passes potentially *need* to know about in order not to accidentally break the code (by assuming it's a sequential loop and doing transformations that actually make it so).

Regarding the second metadata that you proposed, I am a bit skeptical. I don't fully understand its semantics and I am not sure why we need it. And even if we do need it, I think that it would require changing too many passes.

It's there exactly to avoid the *need* for the passes to know about
parallel loop semantics. If they do know about it, they can *optimize* by
retaining it as a parallel loop, but the fallback to a serial loop should
be safe for parallel-loop-unaware passes. E.g., inlining and unrolling
might want to update the metadata to still enable, e.g., the vectorizer to
consider it as a parallel loop. Definitely some kind of helper function
somewhere should do this to make it easy to add "parallel-loop-awareness"
to the passes.

It is very, very important to take the complexity of these features into account. In the past we rejected the parallel 'barrier' semantics because it required changing too many unrelated passes.

The additional llvm.mem.parallel metadata tries to avoid exactly this. Simply put, llvm.loop.parallel by itself is not legal metadata (it cannot be safely ignored) without the other.

That being said, if you know of a better way to guarantee this type of
"safe fallback", I'll be happy to implement it.

BR,

I suggest that we only add the 'llvm.loop.parallel' metadata and not llvm.mem.parallel. I believe that it should be the job of the consumer pass (e.g. the loop vectorizer) to scan the loop and detect parallelism violations. This is also the approach that we use when we optimize stack slots using lifetime markers. I understand that the consumer passes will have to be more conservative and miss some optimizations, but I still think that this is better than forcing different passes in the compiler to know about parallel metadata.

Hi Nadav,

I am not sure if I am able to follow your reasoning. How could the -loop-vectorizer detect parallelism violations? I had the feeling that we introduce the llvm.loop meta-data for the case where we want to inform the loop vectorizer that it can assume the absence of dependences even though it can not prove their absence statically. Do you possibly mean that the -loop-vectorizer should in some way detect if the llvm.loop.parallel metadata is still correct?

Suppose we go without the llvm.mem meta-data. How would the -loop-vectorizer detect that the test case in parallel-loops-after-reg2mem.ll (see Pekka's patch) is not parallel anymore, even though the llvm.loop.parallel metadata is present? The llvm.mem meta-data would give the necessary information that a new memory access was introduced that is not covered by the llvm.loop.parallel meta-data. However, without this additional information, I have a hard time seeing how you can verify whether the llvm.loop.parallel metadata is still up to date.

All the best,
Tobi

Hi Tobi,

Thanks for reviewing the proposal. I imagine that it may also affect your parallelization work in Polly.

I am not sure if I am able to follow your reasoning. How could the -loop-vectorizer detect parallelism violations? I had the feeling that we introduce the llvm.loop meta-data for the case where we want to inform the loop vectorizer that it can assume the absence of dependences even though it can not prove their absence statically. Do you possibly mean that the -loop-vectorizer should in some way detect if the llvm.loop.parallel metadata is still correct?

Yes, the loop vectorizer can detect the kind of violations of the "llvm.loop.parallel" metadata that we are worried about.

Suppose we go without the llvm.mem meta-data. How would the -loop-vectorizer detect that the test case in parallel-loops-after-reg2mem.ll (see Pekka's patch) is not parallel anymore, even though the llvm.loop.parallel metadata is present?

We already detect this case right now; it's really easy to do. The llvm.loop.parallel metadata should only provide information that is not easily available within this compilation unit. For example, the assumption that the input pointers don't overlap, or that dynamic indices are within a certain range that allows us to vectorize.

The llvm.mem meta-data would give the necessary information that a new memory access was introduced that is not covered by the llvm.loop.parallel meta-data. However, without this additional information, I have a hard time seeing how you can verify whether the llvm.loop.parallel metadata is still up to date.

Thanks,
Nadav

Hi Tobi,

Thanks for reviewing the proposal. I imagine that it may also affect your parallelization work in Polly.

Sure. I am interested in using it.

I am not sure if I am able to follow your reasoning. How could the -loop-vectorizer detect parallelism violations? I had the feeling that we introduce the llvm.loop meta-data for the case where we want to inform the loop vectorizer that it can assume the absence of dependences even though it can not prove their absence statically. Do you possibly mean that the -loop-vectorizer should in some way detect if the llvm.loop.parallel metadata is still correct?

Yes, the loop vectorizer can detect the kind of violations of the "llvm.loop.parallel" metadata that we are worried about.

OK, I see. Most of the references that -reg2mem introduces obviously destroy parallelism, and I can see that the loop vectorizer could detect these obvious violations and refuse to parallelize. However, the question remains whether the loop vectorizer can (and should) detect all possible violations. I am still concerned that this is in general not possible. Here is another piece of code (plus a transformation) where it is a lot harder (impossible?) to detect the violation:

// b is always bigger than 100
#parallel
for (int i = 0; i < 100; i++) {
S1: A[i % b] += i;
}

Depending on the value of 'b', the loop is either parallel or not. It is impossible for the -loop-vectorizer to reason about this, but with the additional information the user has, they can easily annotate the loop such that the loop vectorizer can optimize it.

Now assume some new instrumentation pass, which collects information and for this uses a buffer with 'Size' elements. The instrumentation pass just adds a couple of additional instructions, which do not change the sequential behavior of the program:

int *B = get_buffer();
int Size = get_buffer_size();

// b is always bigger than 100
#parallel
for (int i = 0; i < 100; i++) {
S1: A[i % b] += i;
S2: B[i % Size] += i;
}

However, depending on the value of Size, the parallel execution of
the updated loop may not be legal any more. Without further information we have to assume that the loop is not parallel anymore.

For this case, will we require the loop-vectorizer to detect the outdated metadata? In case we do, how could this work?

Or do we require the instrumentation pass to reason about the llvm.loop.parallel data and remove it if it gets invalidated?

Or can we just assume that such an instrumentation pass can or will never exist?

Cheers
Tobi

P.S.: I know that, due to the modulo, we will not be able to prove stride-one access, so vectorization may not be profitable. The example was written to demonstrate the legality issues; we can assume surrounding instructions which would make vectorization profitable.

Good day Nadav and Tobias,

Yes, the fundamental idea of this metadata was that it would be
specifically used to avoid the need for any loop-carried dependency
analysis.

We already detect this case right now; it's really easy to do. The llvm.loop.parallel metadata should only provide information that is not easily available within this compilation unit. For example, the assumption that the input pointers don't overlap, or that dynamic indices are within a certain range that allows us to vectorize.

I see your point, but then this goes back to the similar discussion we had
with #pragma ivdep.

The risky part in this approach is to define the set of safely
ignored dependency cases in such a way that it never ignores deps that
should not be ignored.

The problematic cases are the "unknown alias" cases. Some of them might originate from the original parallel loop (the reason why the programmer might have added the parallelism info in the first place), and some of them might come from parallel-loop-unaware passes and cannot be safely ignored if those passes have converted the loop back to a sequential one (added a loop-carried dependency).

I totally agree that annotating all the memory accesses in the parallel loops feels somewhat excessive (this wasn't done in my original patch), but so far I do not see a more robust way, and I prefer playing it safe.

BR,

Or do we require the instrumentation pass, to reason about the
llvm.loop.parallel data and remove it in case it gets invalidated?

It's against the metadata guidelines.

Or can we just assume that such an instrumentation pass can or will
never exist?

I think it's safe to say we can't.

One additional motivational point I want to make is that the llvm.mem metadata style might become useful for other things too, such as annotating a pointer with its original function context to preserve "restricted pointer" (noalias) information when function calls are inlined.

Another possibility would be for passes to be only-just parallel-metadata aware, in the sense that a pass which doesn't know enough to correctly preserve parallel metadata will not do anything if it sees any parallel markers (maybe trying to find the smallest region to which this "don't run yourself" applies, maybe not). In that way the metadata is guaranteed to remain correct, at the cost of missing out on some reorganisations that are done on non-parallel-metadata code. I don't know how well this would fit in with the general philosophy of LLVM passes, though.

(I'm also aware that we're coming at this from different directions: most
people are interested in auto-parallelisation, where missing a
parallelisation opportunity is just one of those unfortunate things, while
I have a personal interest in DSLs which try to present LLVM code with huge
"parallelise this" signs pointing at bits of it. It would be frustrating to
have carefully made sure the DSL twisted things into a parallelisable form
only to have parallelisation/vectorization "fall at the last hurdle".)

Cheers,
Dave

In this case, I'd prefer metadata on the variables that are assumed not to
alias, like the restrict keyword.

It seems to me that having metadata on the loop basic blocks, since it can be invalidated, will not help the vectorizer much more than specific annotations on specific values (which are harder to lose). I'm not saying we should annotate *all* memory instructions in a loop, just the ones that make sense, or that will help the vectorizer default to a sane value.

I'm not a big fan of basic block annotation, unless the BBs are created for very specific reasons and no pass is allowed to touch them (especially inliners).

cheers,
--renato

Hi Renato,

In this case, I'd prefer metadata on the variables that are assumed not
to alias, like the restrict keyword.

It seems to me that having metadata on the loop basic blocks, since they
can be invalidated, will not help that much with the vectorizer more
than specific annotation on specific values (which are harder to lose).
I'm not saying we should annotate *all* memory instructions on a loop,
just the ones that make sense, or will help the vectorizer default to a
sane value.

This is an interesting alternative! Do you mean that we would still add the llvm.mem.parallel_loop_access metadata, but only to such mem accesses that are assumed to be "hard or impossible to analyze" (i.e., hard to prove to be no-alias cases)? Then we'd forget about the "parallel loop metadata" as is.

Then we would rely on the regular loop carried dependency analyzer by
default, but let those (mem) annotations just *help* in the "tricky cases".

The llvm.mem.parallel_loop_access metadata would only communicate "this
instruction does not alias with any other similarly annotated instruction
from any other iteration in this loop".

Quickly thinking, this might work and might not lose the parallelism info too easily. Anyway, the info still has to be connected to a loop to avoid breakage due to inlining, multi-level loops, etc.

Summarizing, the new metadata would be:

llvm.loop:
Just to mark a loop (points to a unique id metadata).

llvm.mem.parallel_loop_access:
The above-mentioned new semantics, connected to the llvm.loop's id metadata.
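
As a rough sketch (assumed syntax; the function and values are illustrative, only the two metadata kinds above are from the proposal), the scheme could look like this in IR: a unique loop-id node is referenced by the latch branch via llvm.loop and by every access known to be free of loop-carried dependences via llvm.mem.parallel_loop_access:

define void @sum(ptr %A, ptr %B) {
entry:
  br label %for.body

for.body:
  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
  %pa = getelementptr inbounds float, ptr %A, i64 %i
  %pb = getelementptr inbounds float, ptr %B, i64 %i
  %a = load float, ptr %pa, !llvm.mem.parallel_loop_access !0
  %b = load float, ptr %pb, !llvm.mem.parallel_loop_access !0
  %s = fadd float %a, %b
  store float %s, ptr %pa, !llvm.mem.parallel_loop_access !0
  %i.next = add nuw nsw i64 %i, 1
  %exitcond = icmp eq i64 %i.next, 100
  br i1 %exitcond, label %exit, label %for.body, !llvm.loop !0

exit:
  ret void
}

!0 = distinct !{!0}   ; the loop identifier both kinds of metadata point to

An access that lacks the annotation simply falls outside the claim, so the dependency analyzer has to treat it in the usual conservative way.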

What do others think? Nadav?

How does this not require you to mark all the possible alias pairs in practice?

I.e., given memory instructions A, B, C, and D, what do you think makes the pair (A, B) hard to analyze (and thus requires you to mark A and B with this new metadata) that doesn't also make (A, C) hard to analyze? Is it not usually the case that it is *A* itself that is hard to analyze (because of some property of the memory access), rather than any particular pair?

I'd love to see example cases where the pair analysis is the
difficulty, rather than the access analysis of any single memory piece
being the difficulty.

I'm not completely sure what you mean, but is there really a difference between doing "pair analysis" across multiple iterations of the same instruction and doing it with any other instruction?

Say we have a loop with memory operations Aw, Br, Cw (w = write, r = read). It is unrolled here just for the sake of the example; the same applies to widening the instructions in a loop vectorizer, where similar dependency analysis is needed.

i1: Aw Br Cw
i2: Aw' Br' Cw'

Doing a legal code motion across the two iterations in such a way that you move, e.g., Aw' before Br, in parallel with Aw (e.g., to pack them into a vector instruction or statically schedule them in a VLIW) requires you to know that none of Aw, Br, or Cw accesses the same location as Aw'.

If we know these are parallel loop accesses, we know that Aw' doesn't alias with any of Aw, Br, or Cw, so we can perform the code motion with peace of mind and execute Aw and Aw' in parallel, e.g., in a vector instruction.

If we get some non-parallel-loop-annotated mem instructions injected in
the loop, it's up to the analyzer again to prove the code motion is legal.

For example, the reg2mem case (where Sw is produced by it and not
marked with the llvm.mem.parallel_loop_access MD).

i1: Aw, Sw, Br, Cw
i2: Aw', Sw', Br', Cw'

Cannot do the same code motion here unless the analyzer can prove Aw' and
Sw do not alias.

I'd love to see example cases where the pair analysis is the
difficulty, rather than the access analysis of any single memory piece
being the difficulty.

I'm not completely sure what you mean, but is there really a difference
between doing "pair analysis" across multiple iterations of the same
instruction than doing it with any other instruction?

Say, a loop with memory operations Aw, Br, Cw (w=write, r=read).
This is unrolled just for the sake of the example. The same applies to
widening the instructions in a loop vectorizer, similar dependency
analysis is needed.

i1: Aw Br Cw
i2: Aw' Br' Cw'

To do a legal code motion across the two iterations in such a way that
you move e.g., Aw' before Br, in parallel with Aw (e.g. to pack them to a
vector instruction or statically schedule in a VLIW) requires you to know
that none of Aw, Br nor Cw access the same location of Aw'.

If we know these are parallel loop accesses we know that Aw' doesn't
alias with any of Aw, Br or Cw, thus we can perform the code motion
with a peaceful mind, and execute Aw and Aw' in parallel e.g. in a
vector instruction.

If we get some non-parallel-loop-annotated mem instructions injected in
the loop, it's up to the analyzer again to prove the code motion is legal.

Yes, that's exactly my point.
You said "This is an interesting alternative! Do you mean that we
would still add
the llvm.mem.parallel_loop_access metadata, but only to such mem accesses
that are assumed to be "hard or impossible to analyze" (to prove to be no
alias cases)? Then we'd forget about the "parallel loop metadata" as is.

Then we would rely on the regular loop carried dependency analyzer by
default, but let those (mem) annotations just *help* in the "tricky cases".

The llvm.mem.parallel_loop_access metadata would only communicate "this
instruction does not alias with any other similarly annotated instruction
from any other iteration in this loop"."

I.e., we wouldn't need to annotate every memory access in the loop, only "tricky" ones. I don't see how this is possible, given the suggested semantics.

Let me give a motivating example myself, to explain what I mean.

Loop 1:

for ...
   ...
   load %w        <--- A
   ...
   store ..., %x  <--- B
   ...
   load %y        <--- C
   ...
   store ..., %z  <--- D

Assume that only %w and %z are actually complicated access functions
that we could not prove anything about without metadata.

In practice, we are going to ask about the pairs (D, C), (D, B), and (D, A).

If "llvm.mem.parallel_loop_access" means that, per the above, things annotated with it do not conflict, then in the above loop you still need to annotate all of A, B, C, and D with it to get any impact. It is *not* sufficient to just annotate D, or even D and A, because the problem is that *D* is hard to analyze and *A* is hard to analyze, not that (D, <something>) is hard to analyze. Without annotating everything, we are still going to fail to know anything about (D, C), even if we know something about (D, A).

So I don't see how you can annotate only *certain* memory accesses and get any value out of it, given the suggested semantics.

For example, the reg2mem case (where Sw is produced by it and not
marked with the llvm.mem.parallel_loop_access MD).

i1: Aw, Sw, Br, Cw
i2: Aw', Sw', Br', Cw'

Cannot do the same code motion here unless the analyzer can prove Aw' and
Sw do not alias.

Yes, I agree, which is why I'm trying to understand how you don't end up having to mark *every* memory instruction in the loop with this metadata in practice to get anywhere.

OK. Got what you mean now, thanks.

You are right: when the parallel loop information is added to the loop the first time, all of the memory instructions need to be annotated, so this doesn't actually reduce the amount of metadata needed after all :-I

However, it might help the fallback case. Instead of making
a single unannotated access ruin the whole parallelism data, the
remaining metadata can still be used to shortcut some checks.

E.g., in the example

i1: Aw, Sw, Br, Cw
i2: Aw', Sw', Br', Cw'

The MD can still be exploited to do a code motion of Aw' before Br. Vectorization might be ruined, but some ILP can still be exploited. To get A vectorized, the analyzer needs to prove !(Sw -> Aw'), but not the other pairs.

Nadav Rotem wrote:

> Hi Nadav,
>
>> Pekka suggested that we add two kind of metadata: llvm.loop.parallel
>> (attached to each loop latch) and llvm.mem.parallel (attached to each memory
>> instruction!). I think that the motivation for the first metadata is clear -
>> it says that the loop is data-parallel. I can also see us adding additional
>> metadata such as llvm.loop.unrollcnt to allow the users to control the unroll
>> count of loops using pragmas. That's fine. Pekka, can you think of
>> transformations that may need invalidate or take this metadata into
>> consideration ?
>
> Any pass that introduces new non-parallel memory instructions to the loop,
> because they think the loop is sequential and it's ok to do so. I do not know
> any other such pass than the one pointed out earlier, reg2mem (if the
> variables inside the loop body reuse stack slots). E.g., inlining
> should be safe. So should be unrolling an inner loop inside a parallel loop.
>

I suggest that we only add the 'llvm.loop.parallel' metadata and not llvm.mem.parallel.

Ok. A solution similar to Pekka's, using fewer annotations as you suggested, is
to reference in 'llvm.loop.parallel' all the datarefs to which the parallel
access semantics applies. When the list of datarefs in 'llvm.loop.parallel'
differs from the datarefs in the loop, the annotation is deemed invalid.

Sebastian

That sounds elegant and seems to solve the correctness problems.

Cheers,
Tobias

This is something that I can live with. Pekka, if you agree with this approach, can you write a formal proposal?

Thanks,
Nadav

There is no big difference here except that the memory instructions
would not need the metadata.

I do not think the abundance of metadata is really the main problem, but rather what to do with passes that do not know about parallel loops: how to minimize the changes needed for the most relevant passes to retain the information, and, more importantly, how to avoid code breakage when passes are unaware of its meaning.

Also, I tried something like this when I was looking for a nice way to couple the memory and the loop metadata. It seems it is not possible to directly refer to an Instruction from an MDNode, or is it? I got crashes when serializing the bitcode with MDNodes pointing to Instructions, and figured it's not meant to work that way around.

Thus, I created the loop identifier metadata "anchor" to which both metadata types refer.

Even if it were possible, retaining that information across inlining etc. means one has to update those Instruction references in the metadata to the new matching instructions at the inlined location, and so on. The maintenance burden of the data in several passes, which I thought was Nadav's actual concern, gets, AFAIU, even heavier. E.g., if you remove memory instructions (e.g., via DCE) you get dangling pointers in the metadata, or a null there, which renders the parallel info invalid (and might even cause crashes?); this doesn't happen in the current state of the patch.

All in all, if the amount of metadata is the biggest pain point, this is an improvement. Otherwise, I see it as more fragile, and it's also uncertain whether it can be implemented at all.

IMHO, the best alternative presented so far is to have both metadata types and to potentially exploit the mem.parallel_loop_access MD, even if there are non-annotated instructions in the loop, to *help* the dependency analysis. isParallel() would stay as it is now in the patch; it can be used as a shortcut check for a loop whose accesses all have the metadata (to skip the dependency checking entirely).

That sounds elegant and seems to solve the correctness problems.

There is no big difference here except that the memory instructions
would not need the metadata.

I do not think the abundance of metadata is really the main problem, but
what to do with passes that do not know about parallel loops. How to minimize the changes needed to make the most relevant passes retain the information,
and more importantly, how to avoid code breakage if they are unaware of
its meaning.

Also, I tried something like this when I tried to find a nice way to couple
the memory and the loop metadata. It seems not possible to directly refer to
an Instruction from MDNode, or is it? I got crashes when serializing the
bitcode with MDNodes pointing to Instructions and thought that it's not meant
to work that way around.

I think you can refer to the memory address values from within your function-local metadata.

Thus, I created the loop identifier metadata "anchor" to which the both
metadata types refer.
Even if it was possible, retaining that information across inlining etc. means
one has to update those Instruction references in the metadata to the new
matching ones in the inlined location, etc. The maintenance burden of the data
in several passes, which I thought was the actual concern of Nadav, gets AFAIU
even harder. E.g., if you remove memory instructions (e.g. DCE) you get
dangling pointers in the metadata or a null there which renders the parallel info invalid (and might even cause crashes?), which doesn't happen in the
current state of the patch.

How is this not a problem for !tbaa?

-Andy