I don’t know if this is the correct list to talk about this - I did not find a better place…
I am doing performance experiments with a few OpenMP implementations (IOMP, GOMP and our private impl.) and I am seeing a severe slowdown when I use IOMP (GOMP and others are performing well).
Really, the slowdown is huge. For one of the programs (plasma/dpotrf_taskdep -n 8192 -b 64 -i 1 -c) the serial version executes in ~28s and the parallel one executes in ~110s. I did some profiling and found that most of the time is being spent on synchronization barriers and dependence tracking (see attached image). Before digging deeper, I would like to know whether I am doing something wrong here:
My guess is that you are blocking rather than spinning. Using OMP_WAIT_POLICY=active doesn’t seem to be enough with Intel’s runtime to turn off all blocking. As I recall, there is a KMP_xxx flag that applies as well. You can grep the barrier code in the runtime or hope that someone from Intel responds to your inquiry.
This is strange, because when I compile with “clang-3.5 -fopenmp” the executable that is produced is parallel. I am sure of this because I can see the threads, and also because I can see the symbols used by the IOMP runtime in the binary.
$ clang -O3 -g -fopenmp toy13.cpp -o toy13 -lm
$ nm toy13 | grep kmpc
U __kmpc_cancel_barrier@@VERSION
U __kmpc_end_single@@VERSION
U __kmpc_fork_call@@VERSION
U __kmpc_omp_task_alloc@@VERSION
U __kmpc_omp_task_with_deps@@VERSION
U __kmpc_single@@VERSION
Indeed, I meant the officially released 3.5. Did you get your compiler from clang-omp.github? It’s probably outdated and can’t be used for reliable performance measurements.
I will update clang-omp.github home page to avoid further confusion.
OMP_WAIT_POLICY=ACTIVE is equivalent to KMP_LIBRARY=turnaround.
If KMP_LIBRARY=throughput, then each thread pauses / releases its timeslice in spin-wait loops.
If KMP_LIBRARY=turnaround, the threads only pause if they know that the machine is oversubscribed.
KMP_BLOCKTIME controls the blocking, not the pausing. The default value is 200 ms. If a thread spins for longer than 200 ms (actually, some value between 200-400 ms), then it goes to sleep.
If KMP_BLOCKTIME=0, then the thread does a single check to see if it can proceed, then immediately goes to sleep if it cannot.
If KMP_BLOCKTIME=infinite (which is implemented as 2^31-1), then the threads will never block at a barrier.
KMP_BLOCKTIME=infinite also disables a lot of checks (and corresponding cache misses) in the barrier spin-wait loop, and can result in a 2x-3x speedup over KMP_BLOCKTIME=2^31-2 on EPCC parallel, for a machine with many procs.
FYI
Maybe we should have mapped OMP_WAIT_POLICY=ACTIVE to KMP_BLOCKTIME=infinite, and not KMP_LIBRARY=turnaround. I don’t know…
> Indeed, I meant the officially released 3.5. Did you get your compiler from
> clang-omp.github? It's probably outdated and can't be used for reliable
> performance measurements.
> I will update clang-omp.github home page to avoid further confusion.
Yes, that was exactly what happened! Thanks Alexey/Andrey.
I have just downloaded clang-3.8 (trunk) and started some experiments,
however I see that clang is trying to link with "libgomp" (GNU) and not
"libomp" (Intel). Is this really the intended default behavior? How can I
tell clang to use the Intel OMP?
> Hello,
>
> I don't know if this is the correct list to talk about this - I did not find
> a better place..
>
> I am doing performance experiments with a few OpenMP implementations (IOMP,
> GOMP and our private impl.) and I am seeing a severe slowdown when I use
> IOMP (GOMP and others are performing well).
>
> The benchmarks I am using are these ones:
> http://kastors.gforge.inria.fr/#!index.md
That web page claims the benchmarks use parts of the OpenMP 4.0
specification.
"The KaStORS benchmark suite has been designed to evaluate the
implementation of the OpenMP dependent task paradigm, introduced as part of
the OpenMP 4.0 specification."
Currently openmp is only complete for the OpenMP 3.1 specification.
I am able to compile a few benchmarks that use task dependence annotations
(from OMP 4.0), but for those that specify the range of the memory
dependence I get a syntax error. So I should assume that this part is not
implemented, right? Is there a list of the OMP 4.0 items that are
currently supported?
BTW, the Clang version from GitHub was able to parse these annotations. Was
this dropped from the newer version?