parallel loop metadata simplification

Hi,

I've been working on clang codegen for #pragma ivdep and creating the llvm.mem.parallel_loop_access metadata seems quite difficult. The main problem is that there are so many places where loads and stores are created and all of them need to be changed when emitting a parallel loop. Note that creating llvm.loop.parallel is not a problem.

One option is to modify IRBuilder to enable attaching the metadata for loads and stores but that seems like a huge hack.

I'd like to reopen the discussion on requiring the llvm.mem.parallel_loop_access metadata. I understand the reason for the metadata is to protect against transformations that may introduce unsafe parallel memory accesses (the reg2mem example). I'm wondering if perhaps we can make the metadata more user-friendly by providing a single loop metadata which can be expanded into "safer" metadata as required. Specifically, a loop pass could be added that expands llvm.loop.parallel into llvm.loop.parallel_protected + llvm.mem.parallel_loop_access.
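
To make the idea concrete, the expansion I have in mind would go roughly like this in IR terms (a hand-written sketch, not clang output; llvm.loop.parallel_protected is just the placeholder name proposed above):

   ; what clang would emit: only the loop branch is annotated
   br i1 %cond, label %for.body, label %for.end, !llvm.loop.parallel !0

   ; what the expansion pass would produce: every memory access in the loop
   ; gets the access metadata, and the loop metadata is switched to the
   ; "protected" form
   %v = load i32* %p, !llvm.mem.parallel_loop_access !0
   store i32 %v, i32* %q, !llvm.mem.parallel_loop_access !0
   br i1 %cond, label %for.body, label %for.end, !llvm.loop.parallel_protected !0

   !0 = metadata !{ metadata !0 }   ; the loop identifier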

Perhaps there are simpler options I've overlooked?

paul

Hi Paul,

> I'd like to reopen the discussion on requiring the
> llvm.mem.parallel_loop_access metadata. I understand the reason for the
> metadata is to protect against transformations that may introduce unsafe
> parallel memory accesses (the reg2mem example). I'm wondering if perhaps we
> can make the metadata more user-friendly by providing a single loop metadata
> which can be expanded into "safer" metadata as required. Specifically, a loop
> pass could be added that expands llvm.loop.parallel into
> llvm.loop.parallel_protected + llvm.mem.parallel_loop_access.

Did I understand this correctly: llvm.loop.parallel would become
a variant that does not require llvm.mem.parallel_loop_access? Basically it
would be a placeholder for a later pass (run before any unsafe optimizations?)
that converts it to the "safe" metadata pair? Then you could generate the
loads/stores in Clang without worrying about the mem metadata and add the
safe metadata once and for all afterwards.

The problem I see with this is that I do not think there is any use for
"unsafe"/"unprotected" parallel metadata after Clang. Thus, the
"unprotected" metadata would be mere bookkeeping until the final IR is
generated by Clang (with the safe metadata). Maybe some other bookkeeping
mechanism would do? If that is not easily workable (I'm not a Clang expert),
then I'd propose a descriptive name for the new metadata, something like
llvm.loop.parallel_under_construction, that strongly states it is for this
purpose only and must not be relied on during optimizations (as it's not
safe to do so!).

Hi Pekka,

> Hi Paul,
>
> > I'd like to reopen the discussion on requiring the
> > llvm.mem.parallel_loop_access metadata. [...]
>
> Did I understand this correctly: llvm.loop.parallel would become
> a variant that does not require llvm.mem.parallel_loop_access? Basically it
> would be a placeholder for a later pass (run before any unsafe optimizations?)
> that converts it to the "safe" metadata pair? Then you could generate the
> loads/stores in Clang without worrying about the mem metadata and add the
> safe metadata once and for all afterwards.

Yes, that is correct.

> The problem I see with this is that I do not think there is any use for
> "unsafe"/"unprotected" parallel metadata after Clang. Thus, the
> "unprotected" metadata would be mere bookkeeping until the final IR is
> generated by Clang (with the safe metadata). Maybe some other bookkeeping
> mechanism would do? If that is not easily workable (I'm not a Clang expert),
> then I'd propose a descriptive name for the new metadata, something like
> llvm.loop.parallel_under_construction, that strongly states it is for this
> purpose only and must not be relied on during optimizations (as it's not
> safe to do so!).

Those are fair points and I can't say whether or not there are other potential users. Certainly clang could use custom loop metadata and the pass could live in its codegen.

Another option is to make Loop::isParallel aware of both safe and unsafe parallel metadata and not require any translation at all. In practice, passes like reg2mem are not run, so the concern goes away. Perhaps a single loop metadata is good enough for most uses?

paul

From: "Paul Redmond" <paul.redmond@intel.com>
To: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
Sent: Thursday, February 28, 2013 1:30:57 PM
Subject: [LLVMdev] parallel loop metadata simplification

> Hi,
>
> I've been working on clang codegen for #pragma ivdep and creating the
> llvm.mem.parallel_loop_access metadata seems quite difficult. The
> main problem is that there are so many places where loads and stores
> are created and all of them need to be changed when emitting a
> parallel loop. Note that creating llvm.loop.parallel is not a
> problem.
>
> One option is to modify IRBuilder to enable attaching the metadata
> for loads and stores but that seems like a huge hack.

Can you please explain why this is a bad option? To be honest, this is what I expected you to do. The IRBuilder sits on top of all of the loads and stores, and seems like the perfect place to keep state for something that affects "all" loads and stores in some code region. In addition, putting this in IRBuilder should also make it easier to use this feature from other frontends.

-Hal

I don't get this. Not require translation at all? Allow using the unsafe
metadata, which cannot be ignored, to skip dependency analysis? Again,
reg2mem is not the only concern, only an example of a pass that cannot
ignore parallel loop info without breaking it. LLVM is a project with
external users and out-of-tree passes which we are not even aware of.
I thought this was already agreed on. I understand the mem MD is a drag,
but I prefer it over miscompilations, and there hasn't been a better
proposal so far.

If it seems hard to attach the mem MD right away in Clang, I propose you use
llvm.loop.parallel to mark the loop branches during Clang codegen and make
sure the mem metadata is added to the loops at the end. You could accomplish
this, e.g., with a mode in your proposed metadata-refreshing pass that forces
it to also treat loops which have only the loop MD as parallel. This would
not even require another bookkeeping metadata type.

Hi Hal,

From: "Paul Redmond" <paul.redmond@intel.com>
To: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
Sent: Thursday, February 28, 2013 1:30:57 PM
Subject: [LLVMdev] parallel loop metadata simplification

Hi,

I've been working on clang codegen for #pragma ivdep and creating the
llvm.mem.parallel_loop_access metadata seems quite difficult. The
main problem is that there are so many places where loads and stores
are created and all of them need to be changed when emitting a
parallel loop. Note that creating llvm.loop.parallel is not a
problem.

One option is to modify IRBuilder to enable attaching the metadata
for loads and stores but that seems like a huge hack.

Can you please explain why this is a bad option? To be honest, this is what I expected you to do. The IRBuilder sits on top of all of the loads and stores, and seems like the perfect place to keep state for something that affects "all" loads and stores in some code region. In addition, putting this in IRBuilder should also make it easier to use this feature from other frontends.

One concern I have is that llvm.mem.parallel_loop_access can refer to nested loops. Is it possible for instructions in the same loop to require different metadata (one access is parallel in both the inner and outer loop, while another is parallel only in the inner loop)? This is a more general problem, I think, but the IRBuilder would only be useful for brute-force adding the same metadata to all instructions with mayReadOrWriteMemory().
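
For illustration, my understanding of how the nested case is expressed today is roughly the following (a hand-written sketch with made-up names, not clang output):

   ; an access that is parallel in both loops, and one that is parallel
   ; only in the inner loop
   %va = load i32* %pa, !llvm.mem.parallel_loop_access !2
   %vb = load i32* %pb, !llvm.mem.parallel_loop_access !0
   ...
   br i1 %c.inner, label %inner.body, label %inner.exit, !llvm.loop.parallel !0
   ...
   br i1 %c.outer, label %outer.body, label %outer.exit, !llvm.loop.parallel !1

   !0 = metadata !{ metadata !0 }               ; inner loop identifier
   !1 = metadata !{ metadata !1 }               ; outer loop identifier
   !2 = metadata !{ metadata !0, metadata !1 }  ; refers to both loops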

Thoughts?

paul

Hal, (and everyone who might care about post increment generation)...

I have an interesting question/observation. Consider this vector loop.

void vec_add_const(unsigned N, short __attribute__((aligned(16))) *A,
                   short __attribute__((aligned(16))) val)
{
  unsigned i, j;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      A[i*N+j] += val;
    }
  }
}

The innermost loop looks like this right before the DAG selection begins.

p.loop_body.us65:                                 ; preds = %p.loop_body.lr.ph.us78, %p.loop_body.us65
  %p_arrayidx.us69.phi = phi i16* [ %p_arrayidx.us69.gep, %p.loop_body.lr.ph.us78 ], [ %p_arrayidx.us69.inc, %p.loop_body.us65 ]
  %p.loopiv48.us66 = phi i32 [ 0, %p.loop_body.lr.ph.us78 ], [ %p.next_loopiv.us67, %p.loop_body.us65 ]
  %vector_ptr.us70 = bitcast i16* %p_arrayidx.us69.phi to <4 x i16>*
  %p.next_loopiv.us67 = add nsw i32 %p.loopiv48.us66, 4                  ; <<< IV
  %_p_vec_full.us71 = load <4 x i16>* %vector_ptr.us70, align 16         ; <<< Load
  %add5p_vec.us72 = add <4 x i16> %_p_vec_full.us71, %5
  store <4 x i16> %add5p_vec.us72, <4 x i16>* %vector_ptr.us70, align 16 ; <<< Store
  %p_arrayidx.us69.inc = getelementptr i16* %p_arrayidx.us69.phi, i32 4  ; <<< Common Ptr
  %11 = icmp slt i32 %p.next_loopiv.us67, %leftover_lb
  br i1 %11, label %p.loop_body.us65, label %p.loop_header38.preheader.us84

When it gets to the DAG combiner, CombineToPostIndexedLoadStore() recognizes two
opportunities for post-increment - the load and the store.
You can easily see that in this case you would want the store, not the load,
to get the post-increment, but since the DAG combiner simply scans
top-down, the opposite happens.

So here is the question - do you recognize this as a deficiency, and can
you see the same on PPC? The fix is trivial code-wise, but it would introduce
a general notion of a primitive cost function into the post-increment
candidate selection in the DAG combiner. If you recognize the issue, I will
post a patch with more details, but if I am missing the big picture here,
please advise.

Sergei

From: "Paul Redmond" <paul.redmond@intel.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
Sent: Friday, March 1, 2013 10:06:51 AM
Subject: Re: [LLVMdev] parallel loop metadata simplification

> [...]
>
> One concern I have is that llvm.mem.parallel_loop_access can refer to
> nested loops. Is it possible for instructions in the same loop to require
> different metadata (one access is parallel in both the inner and outer
> loop, while another is parallel only in the inner loop)? This is a more
> general problem, I think, but the IRBuilder would only be useful for
> brute-force adding the same metadata to all instructions with
> mayReadOrWriteMemory().

Well, it would be more complicated than that anyway because you specifically don't want to tag the loads and stores to the local alloca'd variable storage locations, only the "explicit" memory references. I think an extra boolean parameter to CreateLoad, etc. could take care of those? Nevertheless, IRBuilder could keep a stack of parallel loops to handle nested cases, right?

-Hal

From: "Paul Redmond" <paul.redmond@intel.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
Sent: Friday, March 1, 2013 10:06:51 AM
Subject: Re: [LLVMdev] parallel loop metadata simplification

Hi Hal,

From: "Paul Redmond" <paul.redmond@intel.com>
To: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
Sent: Thursday, February 28, 2013 1:30:57 PM
Subject: [LLVMdev] parallel loop metadata simplification

Hi,

I've been working on clang codegen for #pragma ivdep and creating
the
llvm.mem.parallel_loop_access metadata seems quite difficult. The
main problem is that there are so many places where loads and
stores
are created and all of them need to be changed when emitting a
parallel loop. Note that creating llvm.loop.parallel is not a
problem.

One option is to modify IRBuilder to enable attaching the metadata
for loads and stores but that seems like a huge hack.

Can you please explain why this is a bad option? To be honest, this
is what I expected you to do. The IRBuilder sits on top of all of
the loads and stores, and seems like the perfect place to keep
state for something that affects "all" loads and stores in some
code region. In addition, putting this in IRBuilder should also
make it easier to use this feature from other frontends.

One concern I have is that the llvm.mem.parallel_loop_access can
refer to nested loops. Is it possible that instructions in the same
loop require different metadatas (one is parallel in both inner and
outer loop and another is only parallel in the inner loop?) This is
a more general problem I think but the IRBuilder would only be
useful for brute force adding the same metadata to all instructions
with mayReadOrWriteMemory().

Well, it would be more complicated than that anyway because you specifically don't want to tag the loads and stores to the local alloca'd variable storage locations, only the "explicit" memory references. I think an extra boolean parameter to CreateLoad, etc. could take care of those? Nevertheless, IRBuilder could keep a stack of parallel loops to handle nested cases, right?

Hmm, I guess I'm missing something. It is my understanding that all loads and stores require the metadata (based on the implementation of isAnnotatedParallel). What you describe suggests that a loop goes from non-parallel to parallel after SROA?

paul

From: "Paul Redmond" <paul.redmond@intel.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
Sent: Friday, March 1, 2013 10:49:24 AM
Subject: Re: [LLVMdev] parallel loop metadata simplification

>> From: "Paul Redmond" <paul.redmond@intel.com>
>> To: "Hal Finkel" <hfinkel@anl.gov>
>> Cc: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
>> Sent: Friday, March 1, 2013 10:06:51 AM
>> Subject: Re: [LLVMdev] parallel loop metadata simplification
>>
>> Hi Hal,
>>
>>
>>>> From: "Paul Redmond" <paul.redmond@intel.com>
>>>> To: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
>>>> Sent: Thursday, February 28, 2013 1:30:57 PM
>>>> Subject: [LLVMdev] parallel loop metadata simplification
>>>>
>>>> Hi,
>>>>
>>>> I've been working on clang codegen for #pragma ivdep and
>>>> creating
>>>> the
>>>> llvm.mem.parallel_loop_access metadata seems quite difficult.
>>>> The
>>>> main problem is that there are so many places where loads and
>>>> stores
>>>> are created and all of them need to be changed when emitting a
>>>> parallel loop. Note that creating llvm.loop.parallel is not a
>>>> problem.
>>>>
>>>> One option is to modify IRBuilder to enable attaching the
>>>> metadata
>>>> for loads and stores but that seems like a huge hack.
>>>
>>> Can you please explain why this is a bad option? To be honest,
>>> this
>>> is what I expected you to do. The IRBuilder sits on top of all of
>>> the loads and stores, and seems like the perfect place to keep
>>> state for something that affects "all" loads and stores in some
>>> code region. In addition, putting this in IRBuilder should also
>>> make it easier to use this feature from other frontends.
>>
>> One concern I have is that the llvm.mem.parallel_loop_access can
>> refer to nested loops. Is it possible that instructions in the
>> same
>> loop require different metadatas (one is parallel in both inner
>> and
>> outer loop and another is only parallel in the inner loop?) This
>> is
>> a more general problem I think but the IRBuilder would only be
>> useful for brute force adding the same metadata to all
>> instructions
>> with mayReadOrWriteMemory().
>
> Well, it would be more complicated than that anyway because you
> specifically don't want to tag the loads and stores to the local
> alloca'd variable storage locations, only the "explicit" memory
> references. I think an extra boolean parameter to CreateLoad, etc.
> could take care of those? Nevertheless, IRBuilder could keep a
> stack of parallel loops to handle nested cases, right?
>

Hmm, I guess I'm missing something. It is my understanding that all
loads and stores require the metadata (and based on the
implementation of isAnnotatedParallel). What you describe suggests
that a loop goes from non-parallel to parallel after SROA?

Or after mem2reg; something like that, yes. I think that's right.

-Hal

From: "Hal Finkel" <hfinkel@anl.gov>
To: "Paul Redmond" <paul.redmond@intel.com>
Cc: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
Sent: Friday, March 1, 2013 11:13:06 AM
Subject: Re: [LLVMdev] parallel loop metadata simplification

> From: "Paul Redmond" <paul.redmond@intel.com>
> To: "Hal Finkel" <hfinkel@anl.gov>
> Cc: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
> Sent: Friday, March 1, 2013 10:49:24 AM
> Subject: Re: [LLVMdev] parallel loop metadata simplification
>
>
>
> >> From: "Paul Redmond" <paul.redmond@intel.com>
> >> To: "Hal Finkel" <hfinkel@anl.gov>
> >> Cc: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
> >> Sent: Friday, March 1, 2013 10:06:51 AM
> >> Subject: Re: [LLVMdev] parallel loop metadata simplification
> >>
> >> Hi Hal,
> >>
> >>
> >>>> From: "Paul Redmond" <paul.redmond@intel.com>
> >>>> To: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
> >>>> Sent: Thursday, February 28, 2013 1:30:57 PM
> >>>> Subject: [LLVMdev] parallel loop metadata simplification
> >>>>
> >>>> Hi,
> >>>>
> >>>> I've been working on clang codegen for #pragma ivdep and
> >>>> creating
> >>>> the
> >>>> llvm.mem.parallel_loop_access metadata seems quite difficult.
> >>>> The
> >>>> main problem is that there are so many places where loads and
> >>>> stores
> >>>> are created and all of them need to be changed when emitting a
> >>>> parallel loop. Note that creating llvm.loop.parallel is not a
> >>>> problem.
> >>>>
> >>>> One option is to modify IRBuilder to enable attaching the
> >>>> metadata
> >>>> for loads and stores but that seems like a huge hack.
> >>>
> >>> Can you please explain why this is a bad option? To be honest,
> >>> this
> >>> is what I expected you to do. The IRBuilder sits on top of all
> >>> of
> >>> the loads and stores, and seems like the perfect place to keep
> >>> state for something that affects "all" loads and stores in some
> >>> code region. In addition, putting this in IRBuilder should also
> >>> make it easier to use this feature from other frontends.
> >>
> >> One concern I have is that the llvm.mem.parallel_loop_access can
> >> refer to nested loops. Is it possible that instructions in the
> >> same
> >> loop require different metadatas (one is parallel in both inner
> >> and
> >> outer loop and another is only parallel in the inner loop?) This
> >> is
> >> a more general problem I think but the IRBuilder would only be
> >> useful for brute force adding the same metadata to all
> >> instructions
> >> with mayReadOrWriteMemory().
> >
> > Well, it would be more complicated than that anyway because you
> > specifically don't want to tag the loads and stores to the local
> > alloca'd variable storage locations, only the "explicit" memory
> > references. I think an extra boolean parameter to CreateLoad,
> > etc.
> > could take care of those? Nevertheless, IRBuilder could keep a
> > stack of parallel loops to handle nested cases, right?
> >
>
> Hmm, I guess I'm missing something. It is my understanding that all
> loads and stores require the metadata (and based on the
> implementation of isAnnotatedParallel). What you describe suggests
> that a loop goes from non-parallel to parallel after SROA?

Or after mem2reg; something like that, yes. I think that's right.

On second thought, to be more specific, it might depend on whether the alloca is inside or outside the loop.

-Hal

From: "Hal Finkel" <hfinkel@anl.gov>
To: "Paul Redmond" <paul.redmond@intel.com>
Cc: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
Sent: Friday, March 1, 2013 11:13:06 AM
Subject: Re: [LLVMdev] parallel loop metadata simplification

From: "Paul Redmond" <paul.redmond@intel.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
Sent: Friday, March 1, 2013 10:49:24 AM
Subject: Re: [LLVMdev] parallel loop metadata simplification

From: "Paul Redmond" <paul.redmond@intel.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
Sent: Friday, March 1, 2013 10:06:51 AM
Subject: Re: [LLVMdev] parallel loop metadata simplification

Hi Hal,

From: "Paul Redmond" <paul.redmond@intel.com>
To: "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
Sent: Thursday, February 28, 2013 1:30:57 PM
Subject: [LLVMdev] parallel loop metadata simplification

Hi,

I've been working on clang codegen for #pragma ivdep and
creating
the
llvm.mem.parallel_loop_access metadata seems quite difficult.
The
main problem is that there are so many places where loads and
stores
are created and all of them need to be changed when emitting a
parallel loop. Note that creating llvm.loop.parallel is not a
problem.

One option is to modify IRBuilder to enable attaching the
metadata
for loads and stores but that seems like a huge hack.

Can you please explain why this is a bad option? To be honest,
this
is what I expected you to do. The IRBuilder sits on top of all
of
the loads and stores, and seems like the perfect place to keep
state for something that affects "all" loads and stores in some
code region. In addition, putting this in IRBuilder should also
make it easier to use this feature from other frontends.

One concern I have is that the llvm.mem.parallel_loop_access can
refer to nested loops. Is it possible that instructions in the
same
loop require different metadatas (one is parallel in both inner
and
outer loop and another is only parallel in the inner loop?) This
is
a more general problem I think but the IRBuilder would only be
useful for brute force adding the same metadata to all
instructions
with mayReadOrWriteMemory().

Well, it would be more complicated than that anyway because you
specifically don't want to tag the loads and stores to the local
alloca'd variable storage locations, only the "explicit" memory
references. I think an extra boolean parameter to CreateLoad,
etc.
could take care of those? Nevertheless, IRBuilder could keep a
stack of parallel loops to handle nested cases, right?

Hmm, I guess I'm missing something. It is my understanding that all
loads and stores require the metadata (and based on the
implementation of isAnnotatedParallel). What you describe suggests
that a loop goes from non-parallel to parallel after SROA?

Or after mem2reg; something like that, yes. I think that's right.

On second thought, to be more specific, it might depend on whether the alloca is inside or outside the loop.

I have discovered that you can provide a custom inserter to IRBuilder (who knew!). This has basically solved all my problems and allowed me to generate the proper metadata with minimal changes to clang codegen. Currently it adds the metadata to all loads and stores, but I don't think this is a problem, and it can be refined later if necessary.
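
For anyone curious, the shape of it is roughly as follows -- a hand-written sketch of the idea rather than the actual patch, with made-up names:

   // Sketch only: an inserter that tags every memory-touching instruction
   // created while a parallel loop body is being emitted.
   #include "llvm/IR/IRBuilder.h"   // "llvm/IRBuilder.h" on slightly older trees

   class ParallelLoopInserter : public llvm::IRBuilderDefaultInserter<true> {
   public:
     llvm::MDNode *LoopID;   // set to the loop id while emitting a parallel body

     ParallelLoopInserter() : LoopID(0) {}

     void InsertHelper(llvm::Instruction *I, const llvm::Twine &Name,
                       llvm::BasicBlock *BB,
                       llvm::BasicBlock::iterator InsertPt) const {
       // Insert and name the instruction exactly as the default inserter would.
       llvm::IRBuilderDefaultInserter<true>::InsertHelper(I, Name, BB, InsertPt);
       // Then attach the access metadata to anything that touches memory.
       if (LoopID && I->mayReadOrWriteMemory())
         I->setMetadata("llvm.mem.parallel_loop_access", LoopID);
     }
   };

   // The builder type codegen would use in place of the default one.
   typedef llvm::IRBuilder<true, llvm::ConstantFolder, ParallelLoopInserter>
       ParallelLoopIRBuilder;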

paul

Good. I do not see why allocas should cause problems/special handling
here: if the programmer declares a loop "parallel" and still
writes/reads a stack object in a way that adds loop-carried dependencies,
it's a programmer error (the loop is simply not parallel).

From: "Sergei Larin" <slarin@codeaurora.org>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: llvmdev@cs.uiuc.edu
Sent: Friday, March 1, 2013 10:24:39 AM
Subject: Interesting post increment situation in DAG combiner

> Hal, (and everyone who might care about post increment generation)...

Sergei,

Perhaps this is a problem that I wish I had ;) -- PPC does not have post-increment loads and stores, only pre-increment ones. Nevertheless, I think the situation is similar...

For one thing, I recently committed an enhancement to pre-increment generation that causes later constant offsets to use the new incremented address instead of the old address. I thought that this would not be an issue for post-increment generation, but it seems that I was wrong; you'd have the same problem here: if you post-increment the load and not the store, then you might need an extra register to hold the original base address for the store. In this case you would not have a problem if you just chose to post-increment the store instead, but the general problem still exists.

Regarding the selection of which load/store to pre/post increment, I think that this is also a general issue. At least for pre-increment generation I've seen it make some odd choices.

In short, I do recognize the issue, and I'm curious to see your patch. If it can improve pre-increment selection as well, that would help me too.

Thanks again,
Hal

Hal,

  Here is my patch for the post inc case. I think it is symmetrically applicable to the pre-inc, but I have not tested it for that.
I think you can clearly see my intent here - I simply select the "latest" candidate when multiple are available.

   Who else might be interested in this?

Sergei

DagCombiner_PostInc.patch (3.36 KB)

From: "Sergei Larin" <slarin@codeaurora.org>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: llvmdev@cs.uiuc.edu
Sent: Friday, March 1, 2013 5:16:04 PM
Subject: RE: Interesting post increment situation in DAG combiner

> Hal,
>
>   Here is my patch for the post inc case. I think it is symmetrically
>   applicable to the pre-inc, but I have not tested it for that.
> I think you can clearly see my intent here - I simply select the
> "latest" candidate when multiple are available.

I'll look at it, hopefully in the next few days.

> Who else might be interested in this?

ARM perhaps.

Thanks again,
Hal

[...]

> I have discovered that you can provide a custom inserter to IRBuilder (who
> knew!). This has basically solved all my problems and allowed me to
> generate the proper metadata with minimal changes to clang codegen. [...]

Great that you found a good solution. I have one example which we may want to look into:

Given the following input (test.c):

  #pragma ivdep
  for (long i = 0; i < 100; i++) {
    long t = i + 2;
    A[i] = t;
  }

clang creates the following IR for the loop body:

   %1 = load i64* %i, align 8
   %add = add nsw i64 %1, 2
   store i64 %add, i64* %t, align 8
   %2 = load i64* %t, align 8
   %3 = load i64* %i, align 8
   %4 = load i64** %A.addr, align 8
   %arrayidx = getelementptr inbounds i64* %4, i64 %3
   store i64 %2, i64* %arrayidx, align 8
   br label %for.inc

What are the implications of pragma ivdep on this code? My intuition would be
that 't' is iteration-private and that using pragma ivdep for this loop is
correct. If the use of ivdep is correct, it seems necessary to _not_ annotate
the loads and stores from and to 't'. Only after 't' has been moved into a
register is the loop actually parallel at the IR level.
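
For reference, once mem2reg has promoted 'i' and 't', the body reduces to something roughly like this (hand-simplified, with the annotation written the way I would expect it to end up; not actual compiler output):

   for.body:
     ; the loads/stores of %t and %i are gone; the induction variable is now the SSA value %i.0
     %add = add nsw i64 %i.0, 2
     %arrayidx = getelementptr inbounds i64* %A, i64 %i.0
     store i64 %add, i64* %arrayidx, align 8, !llvm.mem.parallel_loop_access !0
     br label %for.inc

   !0 = metadata !{ metadata !0 }   ; the loop identifier, also attached to the latch branch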

Cheers,
Tobi

test.c (89 Bytes)

I didn't realize this is a problem in general, because in pocl we
explicitly "privatize" the OpenCL C kernel private variables in the
generated work-item loops. In this case 't' would not be a single scalar,
but either a per-iteration slot in an array with one element per iteration,
or a private vreg.

We have two cases for the kernel private variables in OpenCL C:

1) Variables accessed outside a single loop (private variables that span
multiple parallel regions). These need to be allocated as function-scope
per-iteration (work-item) arrays.
2) Temporary variables accessed only inside one loop. These can stay as
loop-private variables. There is no need to allocate stack space for all
the iterations at once, only for those executed in parallel.

It seems Hal was right that allocas need to be treated specially. But how do
we deal with this kind of case without excluding the parallel-safe alloca
cases? E.g., a programmer-written array (on the stack) that is accessed
safely from the parallel iterations, or scalars that are only read inside
the loop?

The general problem is that parallel loops, due to their differing
semantics, should be treated differently from standard serial C loops during
Clang codegen in order to properly avoid cases like this. E.g., the alloca in
this case should not simply be pushed outside the loop, because that converts
the parallel loop into a serial one. Instead, each iteration should have its
own private instance of the loop-body-scope temporary variables.

In conclusion, at this point it is not safe to just blindly annotate all
memory accesses with llvm.mem.parallel_loop_access. It seems quite easy to
produce broken code that way. The easy way forward is to skip marking
accesses to allocas altogether and hope mem2reg/SROA makes the loop parallel,
but unfortunately that serializes some of the valid parallel loop cases too.
An improved version would generate loop-scope (temporary) variables in a
parallel-loop-aware way.
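
To illustrate the kind of parallel-loop-aware handling I mean, here is Tobi's example rewritten with case-1 style per-iteration storage (a hand-written sketch of the idea, not something clang does today):

   /* original: 't' is a loop-body temporary */
   #pragma ivdep
   for (long i = 0; i < 100; i++) {
     long t = i + 2;
     A[i] = t;
   }

   /* privatized: one 't' slot per iteration, so annotating all the memory
      accesses in the body as parallel stays correct even before mem2reg runs */
   long t_priv[100];
   #pragma ivdep
   for (long i = 0; i < 100; i++) {
     t_priv[i] = i + 2;
     A[i] = t_priv[i];
   }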

BTW, I noted that Cilk Plus actually has two different parallel loop
constructs. Have you, Cilk Plus developers, thought about the parallel loop
code generation yet?

BR,

I attached three more examples. All are vectorized by icc without multi-versioning if pragma ivdep is given.

I think the metadata should only be attached to loads and stores that are inserted due to memory accesses in the source code, meaning they are due to an array or pointer access.

Cheers,
Tobi

ivdep_t_private.c (131 Bytes)

ivdep_t_non-private.c (130 Bytes)

ivdep_pointer.c (160 Bytes)

What about loop-scope arrays?

void foo(long *A, long b) {
     long i;

     #pragma ivdep
     for (i = 0; i < 100; i++) {
         long t[100];
         t[0] = i + 2;
         A[i] = A[i+b] + t[0];
     }
}

Clang places the alloca for t in the entry block, creating
a new race condition.

In your example, where you moved t outside the loop,
it's a programmer's mistake (icc might vectorize it, but the
results are undefined due to the dependency), but here I don't
think it is. The t array is supposed to be a loop-private variable,
with each parallel iteration referring to its own isolated instance.
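
In IR terms the problem looks roughly like this (a sketch from memory, not verbatim clang output):

   entry:
     %t = alloca [100 x i64], align 16   ; one instance for the whole function
     ...

   for.body:
     ; every iteration reads and writes element 0 of the *same* %t, so blindly
     ; tagging these accesses with llvm.mem.parallel_loop_access would claim a
     ; parallelism the generated code does not actually have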