PPC64 patch from Intel's fourth cmake patch

From: "Carlo Bertolli" <cbertol@us.ibm.com>
To: "C. Bergström" <cbergstrom@pathscale.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "James H Cownie" <james.h.cownie@intel.com>, "Michael Wong"
<michaelw@ca.ibm.com>, openmp-dev@dcs-maillist2.engr.illinois.edu
Sent: Wednesday, August 6, 2014 12:59:42 PM
Subject: Re: [Openmp-dev] PPC64 patch from Intel's fourth cmake patch

Hi,

No apologies needed - I am glad that you highlighted these issues and
that you helped making the patch stronger.
Let me see what I can do about the imlementation of
kmp__invoke_microtask.

Thanks for looking into this. As you've discovered, clang-omp currently does not use the arbitrary-length-parameter-list aspect of the microtasks. I've cc'd Alexey here, and perhaps he can describe why. I suspect that using function parameters directly, instead of putting everything into structures, will be easier on the IPA (and more efficient -- although it is not clear how important that is relative to the dispatching overhead).

-Hal

As you've discovered, clang-omp currently does not use the arbitrary-length-parameter-list aspect of the microtasks. I've cc'd Alexey here, and perhaps he can describe why. I suspect that using function parameters directly, instead of putting everything into structures, will be easier on the IPA (and more efficient -- although it is not clear how important that is relative to the dispatching overhead).

There are 2 reasons to use structure.
1. All variables are captured by CapturedStmt into record and I just reused what we already had in the code base.
2. There were plans to modify implementation of __kmpc_fork_call() function and replace a list of references to shared variables by a single pointer to a record of references to shared variables (just like it is done in libgomp). Probably Jim Cownie can tell a little bit more about it.

Best regards,
Alexey Bataev

As you've discovered, clang-omp currently does not use the arbitrary-length-parameter-list aspect of the microtasks.

Indeed, and I don't expect it too. IMO the way the Intel compiler does it by passing a pointer argument for each
reference into the parent stack is unpleasant.

I have canvased internally to change it, but without success. ("We've done it like that for 15 years, why would we change?")

If you stick with a single argument to the outlined routine you should be able to eliminate a bunch of code in the rutime
for creating and copying the argument vector and (as a result) have no need to write some of that code in assembler.

As (I think) you're pointing out, you do need different code to handle disambiguating references so that
you can vectorize. If you have code something like this

float a[SIZE];
float b[SIZE];

// initialize b
#pragma omp parallel for
   for (int i=1; i<SIZE; i++)
      a[i] = b[i-1];

Once you outline you either have something like
void outlinedFunc (float * restrict a, float * restrict b)
    // compute local slice into low, high
    for (i=low; i<high; i++)
        a[i]=b[i-1];

or like this

struct localState {
    float a[SIZE];
    float b[SIZE];
};

void outlinedFunc( localState * args )
    // compute local slice into low, high
    for (i=low; i<high; i++)
        args->a[i]=args->b[i-1];

To vectorize you need to be able to prove to yourself that a and b don’t overlap here.
That doesn't seem too hard (fields in a struct can't overlap), but is certainly different from the other case.

-- Jim

James Cownie <james.h.cownie@intel.com>
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)
Tel: +44 117 9071438

From: "James H Cownie" <james.h.cownie@intel.com>
To: "Hal Finkel" <hfinkel@anl.gov>, "Carlo Bertolli" <cbertol@us.ibm.com>
Cc: "Michael Wong" <michaelw@ca.ibm.com>, openmp-dev@dcs-maillist2.engr.illinois.edu, "C. Bergström"
<cbergstrom@pathscale.com>, "Alexey Bataev" <a.bataev@gmx.com>
Sent: Wednesday, August 20, 2014 5:48:38 AM
Subject: RE: [Openmp-dev] PPC64 patch from Intel's fourth cmake patch

> As you've discovered, clang-omp currently does not use the
> arbitrary-length-parameter-list aspect of the microtasks.

Indeed, and I don't expect it too. IMO the way the Intel compiler
does it by passing a pointer argument for each
reference into the parent stack is unpleasant.

Okay, this is all fine. I recommend that we do the following:

1. In the runtime source code (and its documentation), clearly document that the arbitrary-parameter-list aspect of microtasks is deprecated, and what the maximum number of arguments is that we expect from "new" frontends.

2. As I recall, we now have a libffi dependency to support arbitrary-length parameter lists on ARM. Is this really needed? Maybe they can use a switch statement just as I did on PPC?

Thanks again,
Hal

I've got access to the target platforms and should be able to do validation with our compiler in the next 1-2 weeks. If you post a patch I'll test it.

Can someone write a testcase for this - either functional test and or even better would be a performance test which might be sensitive to this.

Thanks

1. In the runtime source code (and its documentation), clearly document that the arbitrary-parameter-list
aspect of microtasks is deprecated, and what the maximum number of arguments is that we expect from "new" frontends.

I had a feeling that I'd already written something like that, but maybe it's not sufficiently explicit
(since I can't find it myself now:-)). There is some discussion in the manual early on, but nothing
that is completely clear on the subject that I can find. (Though I'm sure I wrote something somewhere).

2. As I recall, we now have a libffi dependency to support arbitrary-length parameter lists on ARM.
Is this really needed? Maybe they can use a switch statement just as I did on PPC?

I think you're right, it’s probably not needed.

-- Jim

James Cownie <james.h.cownie@intel.com>
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)
Tel: +44 117 9071438