Hi,
I would like to discuss the behavior of the OpenMP simd and ordered simd directives. I think the current Clang may not produce the results that the OpenMP 5.0 standard defines.
Let's start with a C++ example:
void func(float *a, float *b, float *c, float *d, int N) {
#pragma omp simd
  for (int i = 0; i < N; i++) {
    d[i] = c[i] + 1.0;
#pragma omp ordered simd
    a[i] = b[i] + 1.0;
  }
}
According to the OpenMP 5.0 standard, the expected transformation is roughly the following:
void func(float *a, float *b, float *c, float *d, int N) {
  for (int i = 0; i < N; i += 4) { // assume a vector length of 4; remainder handling omitted
#pragma omp simd
    for (int j = i; j < i + 4; j++)
      d[j] = c[j] + 1.0; // vectorized
    for (int j = i; j < i + 4; j++)
      a[j] = b[j] + 1.0; // not vectorized, executed in iteration order
  }
}
It seems that current Clang and LLVM do not support this.
Without OpenMP enabled, Clang vectorizes the loop with a runtime memory check (memcheck), as follows:
$ clang++ -O3 test.cpp -c -emit-llvm -S && cat test.ll
%scevgep = getelementptr float, float* %d, i64 %wide.trip.count
%scevgep22 = getelementptr float, float* %a, i64 %wide.trip.count
%scevgep25 = getelementptr float, float* %c, i64 %wide.trip.count
%scevgep28 = getelementptr float, float* %b, i64 %wide.trip.count
%bound030 = icmp ugt float* %scevgep25, %d
%bound131 = icmp ugt float* %scevgep, %c
%found.conflict32 = and i1 %bound030, %bound131
... fadd <4 x float> ...
With -fopenmp-simd enabled, Clang vectorizes the loop without a memcheck. This means that only the simd directive is honored, while the ordered simd directive is ignored. This result is as expected.
$ clang++ -fopenmp-simd -O3 test.cpp -c -emit-llvm -S && cat test.ll
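In this case, the interesting part of the generated IR is simply the vector body with no runtime alias checks, roughly like this (a sketch, not verbatim output):
; no %bound*/%found.conflict checks are emitted before the vector loop
%wide.load = load <4 x float>, <4 x float>* ...
%4 = fadd <4 x float> %wide.load, <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>
store <4 x float> %4, <4 x float>* ...
...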
With OpenMP enabled, both the simd and the ordered simd directives are enabled. The Clang frontend generates the outlined function captured_stmt(float** %a.addr, i32* %i3, float** %b.addr) with the AlwaysInline attribute when the optimization level is greater than 0. The generated IR vectorizes the loop with a memcheck, as follows:
$ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S && cat test.ll
%scevgep = getelementptr float, float* %d, i64 %wide.trip.count
%scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
%scevgep32 = getelementptr float, float* %c, i64 %wide.trip.count
%scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
%bound037 = icmp ugt float* %scevgep32, %d
%bound138 = icmp ugt float* %scevgep, %c
%found.conflict39 = and i1 %bound037, %bound138
... fadd <4 x float> ...
But the expected IR should be like the following:
%scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
%scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
%found.conflict...
... fadd <4 x float> ...
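For reference, the outlined region that the frontend emits for the ordered simd block looks roughly like this (a sketch with the body and attribute list abbreviated, not verbatim output):
; outlined body of the ordered simd region
define internal void @captured_stmt(float** %a.addr, i32* %i3, float** %b.addr) #1 {
  ; loads a, b, and i, then performs a[i] = b[i] + 1.0
  ...
}
attributes #1 = { alwaysinline ... }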
I have two questions here:
- Does the outlined function captured_stmt(float** %a.addr, i32* %i3, float** %b.addr) with the AlwaysInline attribute cause the memcheck? If so, how?
- If my understanding based on the above analysis is correct, should the codegen of the ordered simd directive be fixed to support the expected behavior? And should the memcheck function (emitMemRuntimeChecks) also support checking only part of the loop body instead of the whole region inside the loop?
Also, for the following test case, vectorization of both d[i] = c[i] + 1.0; and a[i] = a[i-1] + 1.0; is disabled.
void func(float *a, float *b, float *c, float *d, int N) {
#pragma omp simd
  for (int i = 1; i < N; i++) {
    d[i] = c[i] + 1.0;
#pragma omp ordered simd
    a[i] = a[i-1] + 1.0;
  }
}
What is expected is that the statement d[i] = c[i] + 1.0; is still vectorized.
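In other words, the expected transformation here would be roughly the following (a sketch, again assuming a vector length of 4 and ignoring the remainder iterations):
void func(float *a, float *b, float *c, float *d, int N) {
  for (int i = 1; i < N; i += 4) {
#pragma omp simd
    for (int j = i; j < i + 4; j++)
      d[j] = c[j] + 1.0; // vectorized; no runtime alias check needed, the simd directive asserts independence
    for (int j = i; j < i + 4; j++)
      a[j] = a[j-1] + 1.0; // ordered simd: executed in iteration order
  }
}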
I also tested icc and gcc; here are the results:
$ icc -v
icc version 2021.1
$ icc -qopenmp test.cpp -O3 -qopt-report -qopt-report-phase=vec -S && cat test.optrpt
LOOP BEGIN at test.cpp(3,3)
remark #15531: Block of statements was serialized due to user request [ test.cpp(5,5) ]
remark #15301: SIMD LOOP WAS VECTORIZED
LOOP END
$ g++ -v
gcc version 9.3.0 (GCC)
$ g++ test.cpp -fopenmp -fdump-tree-all -fdump-rtl-all -O3 -ftree-vectorize -S && cat test.s
fadd s0, s0, s1 // not vectorized
...
fadd s0, s0, s1 // not vectorized
// GCC inserts `GOMP_SIMD_ORDERED_START` and `GOMP_SIMD_ORDERED_END` before and after the statement `a[i] = a[i-1] + 1.0` in the ifcvt pass; the vect pass then uses them to block vectorization.
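Conceptually (this is pseudocode, not actual dump output), the ordered simd region is bracketed like this by the time vectorization runs:
GOMP_SIMD_ORDERED_START ();
a[i] = a[i-1] + 1.0;
GOMP_SIMD_ORDERED_END ();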
For the following test case:
void func(float *b, float *c, float *d, int N) {
  float a[N];
  for (int i = 0; i < N; i++)
    a[i] = 0;
#pragma omp simd
  for (int i = 1; i < N; i++) {
    d[i] = c[i] + 1.0;
#pragma omp ordered simd
    a[i] = a[i-1] + 1.0;
  }
}
The IR generated is as follows:
$ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S
%scevgep = getelementptr float, float* %d, i64 1
%3 = add nuw nsw i64 %wide.trip.count, 1
%scevgep41 = getelementptr float, float* %d, i64 %3
%scevgep43 = getelementptr float, float* %c, i64 1
%scevgep45 = getelementptr float, float* %c, i64 %3
%bound0 = icmp ult float* %scevgep, %scevgep45
%bound1 = icmp ult float* %scevgep43, %scevgep41
%found.conflict = and i1 %bound0, %bound1
%induction = fadd <4 x float> %.splat, <float 0.000000e+00, float 1.000000e+00, float 2.000000e+00, float 3.000000e+00>
The results for the statements d[i] = c[i] + 1.0 and a[i] = a[i-1] + 1.0 are both unexpected. It is safe to vectorize the statement a[i] = a[i-1] + 1.0 here, although doing so violates the definition of the ordered construct in the OpenMP 5.0 standard. And the memcheck on the variables d and c should not be needed, since the simd directive is there.
All the best,
Peixin