Discussion of the vectorization pass and the OpenMP `simd` and `ordered simd` directives

Hi,

I would like to discuss the behavior of the OpenMP `simd` and `ordered simd` directives. I think current Clang may not produce the results that the OpenMP 5.0 standard specifies.

Let's start with a C++ example:


void func(float *a, float *b, float *c, float *d, int N) {
  #pragma omp simd
  for (int i = 0; i < N; i++) {
    d[i] = c[i] + 1.0;
    #pragma omp ordered simd
    a[i] = b[i] + 1.0;
  }
}

According to the OpenMP 5.0 standard, the expected behavior is equivalent to the following:


void func(float *a, float *b, float *c, float *d, int N) {
  // assuming a vector length of 4 and, for simplicity, N divisible by 4
  for (int i = 0; i < N; i += 4) {
    #pragma omp simd
    for (int j = i; j < i + 4; j++)
      d[j] = c[j] + 1.0; // vectorized
    for (int j = i; j < i + 4; j++)
      a[j] = b[j] + 1.0; // not vectorized, executed in iteration order
  }
}

It seems that current Clang and LLVM do not support it.

Without OpenMP enabled, Clang vectorizes the loop with a runtime memory check, as follows:


$ clang++ -O3 test.cpp -c -emit-llvm -S && cat test.ll
%scevgep = getelementptr float, float* %d, i64 %wide.trip.count
%scevgep22 = getelementptr float, float* %a, i64 %wide.trip.count
%scevgep25 = getelementptr float, float* %c, i64 %wide.trip.count
%scevgep28 = getelementptr float, float* %b, i64 %wide.trip.count
%bound030 = icmp ugt float* %scevgep25, %d
%bound131 = icmp ugt float* %scevgep, %c
%found.conflict32 = and i1 %bound030, %bound131
... fadd <4 x float> ...

With -fopenmp-simd enabled, Clang vectorizes the loop without a memory check. In this mode only the simd directive takes effect, while the ordered simd directive is ignored, so this result is expected.

$ clang++ -fopenmp-simd -O3 test.cpp -c -emit-llvm -S && cat test.ll

With -fopenmp enabled, both the simd and the ordered simd directives take effect. The Clang frontend generates the outlined function __captured_stmt(float** %a.addr, i32* %i3, float** %b.addr) with the alwaysinline attribute when the optimization level is greater than 0. The generated IR vectorizes the loop with a memory check, as follows:


$ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S && cat test.ll
%scevgep = getelementptr float, float* %d, i64 %wide.trip.count
%scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
%scevgep32 = getelementptr float, float* %c, i64 %wide.trip.count
%scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
%bound037 = icmp ugt float* %scevgep32, %d
%bound138 = icmp ugt float* %scevgep, %c
%found.conflict39 = and i1 %bound037, %bound138
... fadd <4 x float> ...

But the expected IR should be like the following:


%scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
%scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
%found.conflict...
... fadd <4 x float> ...

I have two questions here:

  1. Does the outlined function __captured_stmt(float** %a.addr, i32* %i3, float** %b.addr) with the alwaysinline attribute cause the memory check? And how?

  2. If my understanding from the above analysis is correct, should the codegen for the ordered simd directive be fixed to support the expected behavior? And should the memory-check emission (emitMemRuntimeChecks) also support checking only part of the loop body instead of the whole region inside the loop?
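
For reference, here is a rough C++ rendering of what the frontend appears to emit before optimization. This is my own reconstruction, not compiler output; only the __captured_stmt signature and the metadata placement are taken from the generated IR.

// Rough sketch only: the `ordered simd` body is outlined into __captured_stmt
// (marked alwaysinline); in the real IR both the store to d[i] and the call
// carry !llvm.access.group metadata.
static void __captured_stmt(float **a_addr, int *i_addr, float **b_addr) {
  float *a = *a_addr;
  float *b = *b_addr;
  int i = *i_addr;
  a[i] = b[i] + 1.0; // body of the ordered simd region
}

void func(float *a, float *b, float *c, float *d, int N) {
  for (int i = 0; i < N; i++) {
    d[i] = c[i] + 1.0;           // annotated with !llvm.access.group
    __captured_stmt(&a, &i, &b); // call also annotated, then always-inlined
  }
}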

Also, for the following test case, vectorization of both d[i] = c[i] + 1.0; and a[i] = a[i-1] + 1.0; is disabled.


void func(float *a, float *b, float *c, float *d, int N) {
  #pragma omp simd
  for (int i = 1; i < N; i++) {
    d[i] = c[i] + 1.0;
    #pragma omp ordered simd
    a[i] = a[i-1] + 1.0;
  }
}

The expected behavior is to vectorize the statement d[i] = c[i] + 1.0;.

I also tested icc and gcc; here are the results:


$ icc -v
icc version 2021.1
$ icc -qopenmp test.cpp -O3 -qopt-report -qopt-report-phase=vec -S && cat test.optrpt
LOOP BEGIN at test.cpp(3,3)
   remark #15531: Block of statements was serialized due to user request [ test.cpp(5,5) ]
   remark #15301: SIMD LOOP WAS VECTORIZED
LOOP END

$ g++ -v
gcc version 9.3.0 (GCC)
$ g++ test.cpp -fopenmp -fdump-tree-all -fdump-rtl-all -O3 -ftree-vectorize -S && cat test.s
fadd s0, s0, s1 // not vectorized
...
fadd s0, s0, s1 // not vectorized

GCC brackets the statement a[i] = a[i-1] + 1.0 with GOMP_SIMD_ORDERED_START and GOMP_SIMD_ORDERED_END markers in the ifcvt pass; the vect pass then uses them to block vectorization of that statement.

For the following test case:


void func(float *b, float *c, float *d, int N) {
  float a[N];
  for (int i = 0; i < N; i++)
    a[i] = 0;
  #pragma omp simd
  for (int i = 1; i < N; i++) {
    d[i] = c[i] + 1.0;
    #pragma omp ordered simd
    a[i] = a[i-1] + 1.0;
  }
}

The IR generated is as follows:


$ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S
%scevgep = getelementptr float, float* %d, i64 1
%3 = add nuw nsw i64 %wide.trip.count, 1
%scevgep41 = getelementptr float, float* %d, i64 %3
%scevgep43 = getelementptr float, float* %c, i64 1
%scevgep45 = getelementptr float, float* %c, i64 %3
%bound0 = icmp ult float* %scevgep, %scevgep45
%bound1 = icmp ult float* %scevgep43, %scevgep41
%found.conflict = and i1 %bound0, %bound1
%induction = fadd <4 x float> %.splat, <float 0.000000e+00, float 1.000000e+00, float 2.000000e+00, float 3.000000e+00>

The results for both the statement d[i] = c[i] + 1.0 and the statement a[i] = a[i-1] + 1.0 are unexpected. Vectorizing a[i] = a[i-1] + 1.0 is permitted by the simd directive, although it violates the definition of the ordered construct in the OpenMP 5.0 standard. But the memory check on the variables d and c should not be emitted, since the simd directive is there.
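
As a side note, and only as an illustrative comparison (not a suggested fix): the independence guarantee that the simd directive is meant to convey can be approximated with restrict-qualified parameters, and for a loop like this Clang then vectorizes without the runtime overlap check:

void func_restrict(float * __restrict c, float * __restrict d, int N) {
  // restrict asserts that c and d do not overlap, so no memory check is needed
  for (int i = 1; i < N; i++)
    d[i] = c[i] + 1.0;
}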

All the best,

Peixin

Hi Peixin,

First, I think CC'ing a lot of folks is not always the best strategy.
This is also a topic for openmp-dev and I'll reply there instead.

~ Johannes

Hi Peixin,

I think you are right that the code we generate is not correct.
The problem is not that a[i] is vectorized, the problem is that
we might vectorize it without a memory check (with O2 instead of
O3, see Compiler Explorer).

@Alexey, what was the intention of the outlined ordered region?
I'm not really sure how to handle this best but the access.group
on the call to the outlined region seems to be wrong as it implies
vectorization is sound while it isn't. WDYT?

~ Johannes

Peixin,

are you interested in trying to fix this?

As described, to make it at least sound we should not emit
the access.group metadata for the call to the outlined function.
That will not necessarily resolve your "problem", e.g., that the
vectorizer uses memory checks etc., but it will at least stop us
from generating wrong code.

~ Johannes

Hi Johannes,

First, I would like to thank you and Alexey for the explanations.

As described, to make it at least sound we should not emit the access.group metadata for the call to the outlined function.

I would like to make sure that I understand what code you want to generate. Not emitting the `access.group` metadata would cause none of the code inside the simd region to be vectorized. Again, for the following code:

void func(float *a, float *b, float *c, float *d, int N) {
   #pragma omp simd
   for (int i = 1; i < N; i++) {
     d[i] = c[i] + 1.0;
     #pragma omp ordered simd
     a[i] = a[i-1] + 1.0;
   }
}

I generated the IR from the Clang frontend and deleted the metadata by hand. It turns out this breaks the vectorization of the whole simd region. What I did is as follows:
$ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S -Xclang -disable-llvm-passes
(Change "call void @__captured_stmt(float** %a.addr, i32* %i3), !llvm.access.group !15" into "call void @__captured_stmt(float** %a.addr, i32* %i3)" by hand.)
$ clang++ -fopenmp -O3 test.ll -c -emit-llvm -S
warning: <unknown>:0:0: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
2 warnings generated.

Is this the IR you want to generate for now? I think the expected IR should vectorize the statement `d[i] = c[i] + 1.0` and serialize the statement `a[i] = a[i-1] + 1.0`. Do you want to delay that until the vectorization pass supports partial vectorization within one for loop?

Yes, that is the IR we want to generate until the loop vectorizer can be told to serialize some code.
At least as far as I know it cannot right now.

We could also investigate other solutions, e.g., unroll and do straight-line code vectorization based on access group metadata. Or split the loop. But everything will require more work.
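
For illustration, here is a hand-written source-level sketch of the loop-splitting idea for your example. Note this is only legal when the ordered statement does not interact with the rest of the loop body, which happens to hold here:

void func_split(float *a, float *c, float *d, int N) {
  #pragma omp simd
  for (int i = 1; i < N; i++)
    d[i] = c[i] + 1.0;      // can be vectorized
  for (int i = 1; i < N; i++)
    a[i] = a[i - 1] + 1.0;  // stays serial, as `ordered simd` requires
}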

~ Johannes

Hi Johannes,

I agree with you. I can try removing the metadata for the outlined function to correct the current wrong code generation. If you or Alexey would like to do it, please let me know.

All the best,
Peixin

Also, I thought about vectorizing this again.

I think what you can do to make it vectorize the code is to provide a scalarized function
that replicates the ordered code N times and which is then given to the vectorizer as the
"vector version" of the outlined function. That way you might need to add `noinline` to
the `ordered simd` outlined function but it should allow us to vectorize the code properly.
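
Sketched at the source level for your running example, with a vectorization factor of 4 assumed and the names chosen only for illustration:

// Outlined function, as emitted today but without the access group metadata.
static void ordered_body(float *a, int i) {
  a[i] = a[i - 1] + 1.0;
}

// Sequentialized "vector version" presented to the vectorizer: it replays the
// ordered body once per lane, in iteration order.
static void ordered_body_v4(float *a, int i) {
  ordered_body(a, i + 0);
  ordered_body(a, i + 1);
  ordered_body(a, i + 2);
  ordered_body(a, i + 3);
}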

Let me know if you are interested in that.

~ Johannes

Hi Johannes,

I don't think this is the right solution. The outlined function affects the vectorization of the simd region and prevents the simd region from being vectorized. I tried adding `noinline` to the `ordered simd` outlined function before, and it breaks the vectorization of the whole simd region.

I agree with you that one possible method is to replicate the ordered code multiple times, but I think this should be done in the vectorization pass.

In addition, one quick way to correct the current code generation is to remove the `alwaysinline` attribute and add the `noinline` attribute. I would prefer to dig into the vectorization pass to check how it stops vectorization of the whole for loop when there is an outlined function inside.

All the best,
Peixin

Peixin,

Do not add noinline, we went over this.

Here is what I suggested in my last email as high-level C code.
It works perfectly fine and is properly vectorized incl. the scalarized region:
Compiler Explorer

~ Johannes

I think we can emit only `declare simd` attributes for the ordered-region function; the inner loop (in the function) should be emitted by the `declare simd` vectorization pass.

If we get the declare simd vectorization pass to create a sequential version, sure.
I was thinking something along the lines of:

We create a sequentialized version of the ordered outlined function which computes multiple iterations.
We present that one to the vectorizer as the "vectorized" version of the outlined function. That way the
vectorizer can vectorize the loop, pick the "vectorized" version of the outlined region as replacement for
the original version, and we end up with vectorized code but a sequentialized ordered region.

So, we would need to create the outlined function, incl. body, as we do now but without the access group
metadata. In addition, the outlined "sequentialized vector version" has to be created by the frontend
(or someone else) and connected to the outlined function as if there was a vector version
(e.g., a vector library was connected).

So basically everything that is marked as "create" here: Compiler Explorer

~ Johannes.

Thanks for the high-level code. I do not think this method is a good one, for the following reasons:

1. Extra `getelementptr` and `load` instructions are generated because of the vector version of the outlined region. If we can serialize the ordered region, far fewer instructions are needed.

2. What if the outlined region cannot be vectorized, as in the following example?
void func(float *a, float *b, float *c, float *d, int N, int j) {
       #pragma omp simd
       for (int i = 1; i < N; i++) {
         d[i] = c[i] + 1.0;
         #pragma omp ordered simd
         if (i > j)
           a[i] = a[i-1] + 1.0;
       }
}

3. What if the outlined region calls a user-defined function?
void func(float *a, float *b, float *c, float *d, int N, int j) {
       #pragma omp simd
       for (int i = 1; i < N; i++) {
         d[i] = c[i] + 1.0;
         #pragma omp ordered simd
         a[i] = user_defined_func(a[i-1]);
       }
}

All the best,
Peixin

Thanks for the high-level code. I do not think this method is a good one, for the following reasons:

1. Extra `getelementptr` and `load` instructions are generated because of the vector version of the outlined region. If we can serialize the ordered region, far fewer instructions are needed.

I don't follow. My code did not serialize the outlined region, for that you need to do the actual work.
My code showed that the scheme works in general.

2. What if the outlined region cannot be vectorized, as in the following example?
void func(float *a, float *b, float *c, float *d, int N, int j) {
        #pragma omp simd
        for (int i = 1; i < N; i++) {
          d[i] = c[i] + 1.0;
          #pragma omp ordered simd
          if (i > j)
            a[i] = a[i-1] + 1.0;
        }
}

The outlined region should not be vectorized, that is the entire point, no?
Also, FWIW, the code above is as vectorizable as the original version you shared.

3. What if the outlined region calls a user-defined function?
void func(float *a, float *b, float *c, float *d, int N, int j) {
        #pragma omp simd
        for (int i = 1; i < N; i++) {
          d[i] = c[i] + 1.0;
          #pragma omp ordered simd
          a[i] = user_defined_func(a[i-1]);
        }
}

I don't think you understood what I am trying to tell you.

The outlined function would be the same as we have now.
The sequentialized "vector version" of it would look like this:

void seq_outlined(float *a, int i) {
   a[i + 0] = user_defined_func(a[i - 1 + 0]);
   a[i + 1] = user_defined_func(a[i - 1 + 1]);
   a[i + 2] = user_defined_func(a[i - 1 + 2]);
   a[i + 3] = user_defined_func(a[i - 1 + 3]);
}

I think you should take a look at how vector versions of functions work in IR and the vectorizer.
It seems there is a conceptual mismatch between what I'm trying to say and what you are arguing.

~ Johannes


Johannes, I agree with your solution in general. It is just that the frontend does not have information about the vectorization factor. We can emit a scalar version of the function and decorate it with `declare simd` attributes. We then need a pass in LLVM to generate vector variants of such functions.

Best regards,
Alexey Bataev

You're right.

We should use the `declare simd` scheme with an annotation that says it should actually be sequentialized instead. The generator in the vectorizer can then do the right thing.
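
Concretely, the scheme might look like this at the source level. This is a hypothetical sketch; in practice the frontend would attach the corresponding IR attributes, plus the proposed "sequentialize" annotation, rather than rely on a user-visible pragma:

// Scalar outlined body, decorated as a `declare simd` function.
#pragma omp declare simd uniform(a) linear(i)
void ordered_body(float *a, int i) {
  a[i] = a[i - 1] + 1.0;
}
// A vector-variant generation pass would then emit the "vector" variant for
// the chosen vectorization factor, but as a sequential per-lane replay rather
// than genuine SIMD code, so the ordered semantics are preserved.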

~ Johannes

Hi Johannes,

Sorry for the misunderstanding; I had not fully understood the `declare simd` directive. I think you are right.

I am currently working on codegen for the `threadprivate` directive in LLVM Flang; after that I should have time to investigate how to implement the code generation for `ordered simd` in the Clang frontend and in the vectorization or other passes. I will get back to you once I have finished the `threadprivate` work. If this issue is still open by then, I can take it up. If it has high priority for you or Alexey, please feel free to take it up if you have time.

All the best,
Peixin