Hello everyone,
While writing an OpenMP + MPI code, I've triggered a debug assert in __kmp_task_start:
KMP_DEBUG_ASSERT(taskdata->td_flags.tasktype == TASK_EXPLICIT);
I attach a simplified reproducer that does not do anything special, along with additional info.
#include <mpi.h>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(int argc, char **argv)
{
    int TIMESTEPS = 10;
    int BLOCKS = 100;

    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int DATA;

    #pragma omp parallel
    #pragma omp single
    {
        for (int t = 0; t < TIMESTEPS; ++t) {
            for (int r = 0; r < nranks; ++r) {
                for (int b = 0; b < BLOCKS; ++b) {
                    #pragma omp task depend(in: DATA)
                    { }
                }
            }
            #pragma omp task depend(inout: DATA)
            { }
        }
        #pragma omp taskwait
    }

    MPI_Finalize();
}
llvm-project debug build, commit aafdeeade8d
MPICH Version: 3.3a2
MPICH Release date: Sun Nov 13 09:12:11 MST 2016
$ MPICH_CC=clang mpicc -fopenmp t1.c -o t1
$ for i in {1..100}; do mpiexec.hydra -n 4 ./t1; done
Hello again,
I've managed to remove MPI from the equation. It seems to be a race condition in the runtime.
int main(int argc, char **argv)
{
    int TIMESTEPS = 10;
    int BLOCKS = 100;
    int nranks = 4;
    int DATA;

    #pragma omp parallel
    #pragma omp single
    {
        for (int t = 0; t < TIMESTEPS; ++t) {
            for (int r = 0; r < nranks; ++r) {
                for (int b = 0; b < BLOCKS; ++b) {
                    #pragma omp task depend(in: DATA)
                    { }
                }
            }
            #pragma omp task depend(inout: DATA)
            { }
        }
        #pragma omp taskwait
    }
}
To build and run it, execute:
clang -fopenmp t1.c -o t1
for i in {1..5000}; do echo $i; OMP_NUM_THREADS=3 ./t1; done
Regards,
Raúl
Thanks for the reproducer! We might need to file a bug report for this one,
but maybe someone will pick it up from here; let's wait a little while.
I looked into this a bit, because I ran into the same problem with a blocked Cholesky factorization code this week. Thanks for providing this reproducer!
I think the bookkeeping of the task queues is broken, so that under certain conditions the head/tail markers are not updated correctly.
In addition to the assertion failure, I also see the runtime stalling regularly.
I'm not convinced that TCW_4 and friends have any effect in current builds. Therefore, I think the compiler might move the accesses to the head/tail counters out of the locked region.
For testing purposes, I added KMP_MB() after acquiring and before releasing the lock (kmp_tasking.diff). Or should this actually become part of the locking functions themselves (kmp_lock.diff)?
I did no performance tests of these changes, but the latter solution fixed my stalls as well as the spurious assertion violations.
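For illustration only, here is a minimal, self-contained sketch of the pattern described above: explicit fences placed right after acquiring and right before releasing the lock that guards a deque's head/tail counters. This is not the actual libomp code or the attached diffs; the deque_t type, deque_push(), and the use of __sync_synchronize() as a stand-in for KMP_MB() are assumptions made for this example. A pthread mutex already implies acquire/release ordering, so the fences here are redundant and only mark where KMP_MB() would sit relative to the lock calls.

#include <pthread.h>
#include <stdio.h>

#define DEQUE_SIZE 256

/* Simplified stand-in for a per-thread task deque; the volatile head/tail
   counters play the role of the head/tail markers discussed above. */
typedef struct {
    pthread_mutex_t lock;
    int tasks[DEQUE_SIZE];
    volatile unsigned head;
    volatile unsigned tail;
} deque_t;

/* Push a task while holding the lock. The __sync_synchronize() calls stand
   in for KMP_MB(): one right after acquiring the lock and one right before
   releasing it, so the head/tail accesses cannot drift out of the locked
   region. */
static int deque_push(deque_t *d, int task) {
    pthread_mutex_lock(&d->lock);
    __sync_synchronize();                 /* fence after lock acquisition */
    if (d->tail - d->head == DEQUE_SIZE) {
        __sync_synchronize();             /* fence before lock release */
        pthread_mutex_unlock(&d->lock);
        return 0;                         /* deque full */
    }
    d->tasks[d->tail % DEQUE_SIZE] = task;
    d->tail = d->tail + 1;
    __sync_synchronize();                 /* fence before lock release */
    pthread_mutex_unlock(&d->lock);
    return 1;
}

int main(void) {
    static deque_t d;                     /* zero-initialized: head == tail == 0 */
    pthread_mutex_init(&d.lock, NULL);
    for (int i = 0; i < 10; ++i)
        deque_push(&d, i);
    printf("head=%u tail=%u\n", d.head, d.tail);
    pthread_mutex_destroy(&d.lock);
    return 0;
}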
Best
Joachim
kmp_tasking.diff (2.75 KB)
kmp_lock.diff (862 Bytes)
I think I found the issue and posted a fix at:
https://reviews.llvm.org/D80480
- Joachim
Thanks! Can you also commit the reproducer as a test?
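For reference, here is a sketch of what such a test could look like, assuming the %libomp-compile-and-run lit substitution used by the existing tests under openmp/runtime/test; the file name and exact placement are hypothetical, and the body is just the reproducer from above.

// RUN: %libomp-compile-and-run
// Hypothetical test (e.g. tasking/omp_task_depend_flood.c) based on the
// reproducer above: many depend(in) tasks per timestep, followed by a
// single depend(inout) task, repeated for several timesteps.
#define TIMESTEPS 10
#define BLOCKS 100

int main(void) {
  int nranks = 4;
  int DATA = 0;
#pragma omp parallel
#pragma omp single
  {
    for (int t = 0; t < TIMESTEPS; ++t) {
      for (int r = 0; r < nranks; ++r) {
        for (int b = 0; b < BLOCKS; ++b) {
#pragma omp task depend(in : DATA)
          {}
        }
      }
#pragma omp task depend(inout : DATA)
      {}
    }
#pragma omp taskwait
  }
  return 0;
}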