Automatic Insertion of OpenACC/OpenMP directives

Dear LLVMers,

    we have released a tool that uses LLVM to insert OpenACC or OpenMP
4.0 directives into programs. You can use the tool online at the Dawn
Project webpage (http://cuda.dcc.ufmg.br/dawn). Our tool, dawn-cc, analyzes the LLVM IR
to infer the sizes of memory chunks and to find dependences within
loops. After that, we use debug information to translate the low-level
results back into annotations that we insert into the C/C++ source.
For instance, if we take a program like the one below:

void saxpy(float a, float *x, float *y, int n) {
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

Then dawn-cc produces the code below:

void saxpy(float a, float *x, float *y, int n) {
  long long int AI1[6];
  AI1[0] = n - 1;
  AI1[1] = 4 * AI1[0];
  AI1[2] = AI1[1] + 4;
  AI1[3] = AI1[2] / 4;
  AI1[4] = (AI1[3] > 0);
  AI1[5] = (AI1[4] ? AI1[3] : 0);
  #pragma acc data pcopy(x[0:AI1[5]],y[0:AI1[5]])
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}
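
To make the generated size expression easier to read: for positive n it
evaluates to n, the number of elements of x and y that must be copied.
Here is a small hand-written sketch (not dawn-cc output) of the same
computation, assuming sizeof(float) == 4:

long long copy_extent(int n) {
  long long last  = n - 1;        // AI1[0]: last index accessed
  long long bytes = 4 * last + 4; // AI1[1], AI1[2]: size of the accessed span in bytes
  long long elems = bytes / 4;    // AI1[3]: element count
  return elems > 0 ? elems : 0;   // AI1[4], AI1[5]: clamp at zero
}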

I was wondering if we could add a link to dawn-cc on LLVM's
projects page (http://llvm.org/ProjectsWithLLVM/). There are a number
of papers that describe what dawn-cc does. The main publication is
this paper:

* Automatic Insertion of Copy Annotation in Data-Parallel Programs -
SBAC-PAD 2016

The array size inference analysis comes from this work:

* Runtime Pointer Disambiguation - OOPSLA 2015

The source code of dawn-cc, including all the static analyses, is available at:

* https://github.com/gleisonsdm/DawnCC-Compiler

And, as I've mentioned, you can try it through an online interface:

* cuda.dcc.ufmg.br/dawn

Feel free to report bugs, or send us questions.

Fernando

Hi,

That’s nice!

I’m wondering how you would handle the issue that retrofitting the pragma onto the original code from the optimized IR can be fragile. For example, if I use a different variable for indexing into the output array:

float saxpy(float a, float *x, float *y, int n) {
  int j = 0;

  for (int i = 0; i < n; ++i) {
    y[j] = a*x[i] + y[I];
    ++j;
  }
}

The optimized LLVM IR is likely identical to that of the original code, but simply adding the pragma wouldn’t be correct.
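
To make the point concrete, here is a hand-written sketch (my paraphrase, not actual compiler output) of what I would expect after induction-variable simplification; both the i-indexed and the j-indexed versions should reduce to roughly this form:

// Rough source-level equivalent of the optimized IR: the optimizer
// folds j into i, so the distinction between the two loops is lost.
for (int i = 0; i < n; ++i)
  y[i] = a * x[i] + y[i];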

Dear Mehdi,

    I've changed your example a little bit:

float saxpy(float a, float *x, float *y, int n) {
  int j = 0;
  for (int i = 0; i < n; ++i) {
    y[j] = a*x[i] + y[I]; // Change 'I' into 'j'?
    ++j;
  }
}

Once I replace 'I' with 'j', I get the code below. We are copying n
positions of both arrays, 'x' and 'y':

float saxpy(float a, float *x, float *y, int n) {
  int j = 0;

  long long int AI1[6];
  AI1[0] = n + -1;
  AI1[1] = 4 * AI1[0];
  AI1[2] = AI1[1] + 4;
  AI1[3] = AI1[2] / 4;
  AI1[4] = (AI1[3] > 0);
  AI1[5] = (AI1[4] ? AI1[3] : 0);
  #pragma acc data pcopy(x[0:AI1[5]],y[0:AI1[5]])
  #pragma acc kernels
  for (int i = 0; i < n; ++i) {
    y[j] = a * x[i] + y[j];
    ++j;
  }
}

Regards,

Fernando

I'm not familiar with OpenACC, but doesn't this still have a loop-carried dependence on j, and therefore isn't correctly parallelizable as written?

Jon

That was my original concern as well, but I had forgotten that OpenACC pragmas don't necessarily tell the compiler that the loop is parallel:

#pragma acc kernels

only tells the compiler to “try” to parallelize the loop if it can prove it safe, but:

#pragma acc parallel loop

bypasses the compiler's checks and forces parallelization.

The tool takes care of figuring out the sizes of the arrays, AFAIK (I haven’t read the paper yet to understand the novelty of the approach here).
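
To illustrate the difference as I understand the spec (a hand-written sketch, directive spellings worth double-checking):

// 'kernels': the compiler must still prove the iterations independent
// before it parallelizes the loop; otherwise it may keep it sequential.
#pragma acc kernels
for (int i = 0; i < n; ++i)
  y[i] = a * x[i] + y[i];

// 'parallel loop': the programmer asserts independence, and the
// compiler skips its own dependence checks.
#pragma acc parallel loop
for (int i = 0; i < n; ++i)
  y[i] = a * x[i] + y[i];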

Dear Jonathan and Mehdi,

    you are right. To extend the ability of the OpenACC-compliant
compiler, we use a technique called restrictification to disambiguate
pointers at runtime. So, if you use this program below as input in
http://cuda.dcc.ufmg.br/dawn:

float saxpy(float a, float *x, float *y, int n) {
  int j = 0;
  for (int i = 0; i < n; ++i) {
    y[j] = a*x[i] + y[j];
    ++j;
  }
}

    then the final program that you obtain is this one:

float saxpy(float a, float *x, float *y, int n) {
  int j = 0;
  long long int AI1[6];
  AI1[0] = n + -1;
  AI1[1] = 4 * AI1[0];
  AI1[2] = AI1[1] + 4;
  AI1[3] = AI1[2] / 4;
  AI1[4] = (AI1[3] > 0);
  AI1[5] = (AI1[4] ? AI1[3] : 0);
  char RST_AI1 = 0;
  RST_AI1 |= !((x + 0 > y + AI1[5]) || (y + 0 > x + AI1[5])); // Restrictification test!
  #pragma acc data pcopyin(x[0:AI1[5]]) pcopy(y[0:AI1[5]]) if(!RST_AI1)
  #pragma acc kernels if(!RST_AI1) // See the test being used here!
  for (int i = 0; i < n; ++i) {
    y[j] = a * x[i] + y[j];
    ++j;
  }
}

    The input kernel, saxpy, is, in the absence of aliasing, a map.
Thus, it is parallelizable (assuming no aliasing between arrays x and
y). So, how does dawn-cc handle aliasing? Our tool does it via a
restrictification check. In the code above, this check is encoded in
the variable RST_AI1: it is the result of a test that ensures that the
accessed ranges of arrays x and y do not overlap. LLVM itself uses some
form of restrictification to carry out vectorization. Our checks follow
the approach of this paper:

Pericles Rafael Alves, Fabian Gruber, Johannes Doerfert, Alexandros
Labrineas, Tobias Grosser, Fabrice Rastello and Fernando Magno Quintão
Pereira. Runtime Pointer Disambiguation. OOPSLA'2015.

    In this example, aliasing does not occur as long as x + n <= y
(or vice versa). Dawn-cc can create restrictification checks as long
as all the involved variables encode affine expressions of induction
variables.
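
    As a standalone illustration (my rewrite of the RST_AI1 test above,
not dawn-cc output), the runtime check is just a range-overlap test:

// The accessed ranges x[0..n) and y[0..n) are treated as possibly
// aliasing unless one starts strictly past the end of the other
// (the strict comparison is conservative for the back-to-back case).
int may_alias(const float *x, const float *y, long long n) {
  return !((x > y + n) || (y > x + n));
}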

Regards,

Fernando

I think I'm still not clearly getting my point across. Let me explain with a simpler example. Suppose this were the input:

void ascend1(int *y, int n) {
   int j = 0;
   for (int i = 0; i < n; ++i) {
     y[j] = j;
     j++;
   }
}

One would expect that the entries in the array y would contain: y[0] = 0, y[1] = 1, y[2] = 2, etc. However, if one were to annotate that with (in OpenMP, as I don't know the semantics of OpenACC):

void ascend2(int *y, int n) {
   int j = 0;
   #pragma omp parallel for
   for (int i = 0; i < n; ++i) {
     y[j] = j;
     j++;
   }
}

now there's a race, because the value of j in each iteration of the loop depends on the previous iteration's value. Now compare that to:

void ascend3(int *y, int n) {
   for (int i = 0; i < n; ++i) {
     y[i] = i;
   }
}

this one does not have the loop-carried dependence on the variable used for indexing, so it is indeed safe to parallelize. Mehdi's comment was that the LLVM IR for `ascend1()` and `ascend3()` will look almost the same (at least after optimization), so you'll need some way to recover the information telling you that `ascend1()`'s source is *not* parallelizable with the annotations, but that `ascend3()`'s *is*.

Jon

Oh, never mind... I just saw Mehdi's other reply explaining the semantics of the OpenACC pragmas being used.

Jon

It’s online now: http://llvm.org/ProjectsWithLLVM/#dawncc

Best,