Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR extension
In-Reply-To: <20170224221713.GA931@arch-linux-jd.home>

Ping.

PS.

   Are there actually people interested in this?

   We will continue working anyway but it might not make sense to put it
   on reviews and announce it on the ML if nobody cares.

I certainly care, and will also respond to the RFC, this week or next.

  -Hal

A quick update: we have been looking through all LLVM passes to identify the impact of "IR-region annotation" and its interaction issues with the rest of LoopOpt and ScalarOpt, e.g., interaction with vectorization when you have schedule(simd:guided:64), and what common properties the optimizer needs to know about IR-region annotations. We have our implementation working at O0, O1, O2, and O3. So far, the changes we made to the existing LLVM opt passes are < 200 LOC. Clang FE to executable works end-to-end, linked against our omp library, and we also updated our implementation to use a token/tag on the intrinsics based on feedback from Google (David) and Xilinx (Hongbin). I still owe Mehdi some answers.

One piece of feedback for the PIR RFC: putting "fork" into the loop body does not work for the loop vectorizer; it shuts down vectorization. This is the same issue as with MIT's scheme.

Thanks,
Xinmin

(adding the subject back to this thread)

I expect this to take many months, unless it is on the critical path of some core devs, because it seems to me to be a huge feature to consider and quite a time sink (which is why even people who care haven’t been able to invest enough time into this).

My impression is that this is major enough that it is unlikely to land in LLVM 5.0, and even before the next US dev meeting. Ideally we’d have enough discussions and review before then that we could have a BoF there to make sure we can move forward with the right representation. I hope the EuroLLVM meeting will be the opportunity for you to get some good feedback on this!

Just my 2 cents :)


Thanks Mehdi :)

I get your points and I generally agree. My intention was to get some
kind of feedback to determine whether (*) and, possibly, when people are
going to invest that time. Since I already have one answer in that
regard, all is good!

Cheers,
  Johannes

(*) The RFC and patches had basically gone uncommented for quite some time.

PS.
  Maybe we should have scheduled a BoF at EuroLLVM for this general
  topic. Are interested people coming?


Thanks for the update. I will look into the discussion thread again
soon. I am especially interested in the problem Mehdi pointed out
regarding the missed initializations of array elements. Did you comment
on that one yet?


Totally true: putting a fork into the loop body currently shuts down the
loop vectorizer. However, for now you can sequentialize the region and
introduce the llvm.parallel.loop annotations the vectorizer already
knows about. We (Simon Moll and I) are currently discussing how to teach
the RV vectorizer to handle fork-join parallel regions, in loops but
also in otherwise straight-line code. Once we make progress we'll let
you know!

Thanks!
  Johannes

".... the problem Mehdi pointed out regarding the missed initializations of array elements, did you comment on that one yet?"

What is the array-element initialization question? I don't remember this question; please refresh my memory. Thanks.
I thought Mehdi's question was more about what attributes are needed on these IR annotations for other LLVM passes to understand them and make sense of them.

Xinmin

I don't know who pointed it out first, but Mehdi made me aware of it at
CGO. I'll try to explain it briefly.

Given the following situation (in pseudo code):

  alloc A[100];
  parallel_for(i = 0; i < 100; i++)
    A[i] = f(i);

  acc = 1;
  for(i = 0; i < 100; i++)
    acc = acc * A[i];

Afaik, with your parallel regions there won't be a CFG loop for the
parallel initialization, right? Instead, there would be some intrinsics
that annotate the parallel region and then a single initialization. I
imagine it somewhat like this:

  token = llvm.parallel.for.start(0, 100, 1)
  i = llvm.parallel.for.iterator(token)
  A[i] = f(i)
  llvm.parallel.for.end(token)

Which would (under serial semantics) allow an optimization to remove the
second loop, as it uses uninitialized (undef) values. In other words, it
is a problem if the optimizer assumes only one A[i] was initialized and
the rest were not.

@Mehdi Does that reflect what you explained to some degree?


Yes, this is essentially the issue that Sanjoy pointed out on the mailing list (as well). We definitely need semantics that do not have this problem.

  -Hal

That was Sanjoy’s example; mine was different and related to defining the right abstract set of properties these intrinsics can have with respect to transformations that don’t have specific handling for them. The problems may be disjoint, or maybe my concern is just a generalization of the issue; not sure.

Best,

The IR-region annotation we proposed is as below; there is no @llvm.parallel.for.iterator(), and there is no change to the loop CFG.

   alloc A[100];
   %t = call token @llvm.region.entry()["parallel.for"()]
   for (i = 0; i < 100; i++) {
     A[i] = f(i);
   }
   @llvm.region.exit(%t)() ["end.parallel.for"()]

Xinmin


Yes, although as Sanjoy pointed out, the problem occurs when just using a parallel region and omp_get_thread_num().

  -Hal

I assume the case you are referring to is something like below, right?

  #pragma omp parallel num_threads(n)
  {
    #pragma omp critical
    {
      x = x + 1;
    }
  }

If that is the case, the programmer is already writing code that is not "serial equivalent".
Our representation for the parallelizer is

%t = @llvm.region.entry()["omp.parallel"(), "num_threads"(int32 %n)]
{
  __kmpc_critical()
  x = x + 1;
  __kmpc_end_critical()
}
@llvm.region.exit()["end.omp.parallel"()]

Xinmin


Yes, it is not parallel equivalent. However, we do need to make sure that the optimizer does not "know" that x was incremented only by one at the end of the region. It needs to know that the parallel region runs some non-trivial number of times. Another option, potentially, is to do this:

%t = @llvm.region.entry()["omp.parallel"(), "num_threads"(int32 %n), "whatever"(i32* %x)]
{
  __kmpc_critical()
  x = x + 1;
  __kmpc_end_critical()
}
@llvm.region.exit()["end.omp.parallel"(), "whatever"(i32* %x)]

The subtle part here is that we need to update these operand bundles as we inline things into the region body. This assumes that the region markers themselves aren't tagged as arbitrarily writing to memory (which I think we don't want to do if possible).

  -Hal

Are you expecting the op-bundle to prevent uses of x after the region from inferring that x has been incremented by exactly one? (I wouldn’t.) Can you clarify?

Thanks,

Right, the exact code we have is:

%t = @llvm.region.entry()["omp.parallel"(), "num_threads"(int32 %n), "shared"(i32* %x)]
{
  __kmpc_critical()
  x = x + 1;
  __kmpc_end_critical()
}
@llvm.region.exit()["end.omp.parallel"(), "shared"(i32* %x)]


Yes, because pointer-valued operand-bundle operands are assumed to be read-from/written-to by the call unless otherwise attributed. Besides, we need something to prevent SSA formation from promoting 'x' to a register within the region, and then hoisting the definition of x outside of the region. If x is a local variable (alloca) in the enclosing function then only passing its address to the region intrinsic will prevent this.

  -Hal

Oh, I assumed that x was already an SSA value (snippets mixing C and IR can be confusing sometimes).

Makes sense. Thanks.


Ping.

PS.

Are there actually people interested in this?

I’m definitely interested too. I will have some high-level comments after next week.

We will continue working anyway but it might not make sense to put it
on reviews and announce it on the ML if nobody cares.

I expect this to take many months, unless it is on the critical path of some core devs, because it seems to me to be a huge feature to consider and quite a time sink (which is why even people who care haven’t been able to invest enough time into this).

There are enough commercial and open-source projects working actively on this that I think the issue is reaching critical mass. A few of us — Hal, Jeff (Vetter), Johannes, TB Schardl, Xinmin, I, and a few others — have been discussing design ideas, building on lessons from some of these ongoing projects:

* Intel’s new OpenMP compiler
* ARES from Los Alamos and Oak Ridge
* PIR from Saarland
* TAPIR from MIT
* HPVM from Illinois

Our goal is to put together a design for how to add representations for parallel programs that can leverage LLVM capabilities and features without requiring excessive changes to the existing infrastructure. We’ll submit design docs, patches for review, and one or more prototype compilers built on the design.

We’re having a conference call this week or early next week, and a face-to-face meeting at the end of next week to work on the details.

My impression is that this is major enough that it is unlikely to land in LLVM 5.0,

That’s probably true.

and even before the next US dev meeting.

I’m hoping we’ll have info submitted for review well before that.

Ideally we’d have enough discussions and review before then that we could have a BoF there to make sure we can move forward with the right representation. I hope the EuroLLVM meeting will be the opportunity for you to get some good feedback on this!

+1 on both.

—Vikram

// Vikram S. Adve
// Professor, Department of Computer Science
// University of Illinois at Urbana-Champaign
// vadve@illinois.edu
// http://llvm.org