Hi,
My name is Adhityaa and I’m an undergrad student. This is my first time getting involved in OpenMP development at LLVM. I have used the library a bunch of times before, but I haven’t been involved in the behind-the-scenes work before.
So I was just taking a look at https://bugs.llvm.org/show_bug.cgi?id=33543 and I tried solving it (it looked like a simple, reproducible bug that looked quite important). First I ran strace on the program to see exactly what was happening - I found that the threads without any tasks were issuing a whole bunch of sched_yield()
, making the CPU go to 100%.
Then I tried going into the code. I came up with a fairly simple patch that almost solved the issue:
— kmp_wait_release.h (revision 313888)
+++ kmp_wait_release.h (working copy)
@@ -188,6 +188,7 @@
// Main wait spin loop
while (flag->notdone_check()) {
int in_pool;
- int executed_tasks = 1;
kmp_task_team_t *task_team = NULL;
if (__kmp_tasking_mode != tskm_immediate_exec) {
task_team = this_thr->th.th_task_team;
@@ -200,10 +201,11 @@
disabled (KMP_TASKING=0). */
if (task_team != NULL) {
if (TCR_SYNC_4(task_team->tt.tt_active)) {
- if (KMP_TASKING_ENABLED(task_team))
- flag->execute_tasks(
- if (KMP_TASKING_ENABLED(task_team)) {
- executed_tasks = flag->execute_tasks(
this_thr, th_gtid, final_spin,
&tasks_completed USE_ITT_BUILD_ARG(itt_sync_obj), 0);
- }
else
this_thr->th.th_reap_state = KMP_SAFE_TO_REAP;
} else {
@@ -269,7 +271,7 @@
continue;
// Don’t suspend if there is a likelihood of new tasks being spawned.
- if ((task_team != NULL) && TCR_4(task_team->tt.tt_found_tasks))
- if ((task_team != NULL) && TCR_4(task_team->tt.tt_found_tasks) && executed_tasks)
continue;
#if KMP_USE_MONITOR
Alas, this led to a deadlock - the thread went into a futex wait and was never woke again. So I looked at how GCC did things and they issue a FUTEX_WAKE_PRIVATE for INT_MAX threads at the end of things. This should solve the problem, I think?
Anyway, it’s pretty late here and I’ve been at this for over 6-7 hours at a stretch, so I’m really tired. I was just wondering if I could get some help on how to proceed from here (and whether I’m even on the right track).
Thanks,
Adhityaa
You might find https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/556869 relevant.
I consider idling in the OpenMP runtime to be a sign of a badly behaved application, not a runtime bug in need of fixing. But in any case, OMP_WAIT_POLICY/KMP_BLOCKTIME exist to address that.
Best,
Jeff
You might find Software - Intel Communities
integrated-core/topic/556869 relevant.
Thanks for that link. I agree with the statement that time to solution is a
more important metric in comparison to CPU time. However, having a thread
run at 100% is detrimental to other unrelated processes in the same machine.
Is there anything else you want me to notice in that thread?
I consider idling in the OpenMP runtime to be a sign of a badly behaved
application, not a runtime bug in need of fixing. But in any case,
OMP_WAIT_POLICY/KMP_BLOCKTIME exist to address that.
OMP_WAIT_POLICY and KMP_BLOCKTIME indeed are designed to solve that -
however, I don't think the clang OpenMP runtime is respecting the
OMP_WAIT_POLICY value -- I set `export OMP_WAIT_POLICY=passive` and
compiled the example program with g++ and clang++. The g++ binary used
virtually no CPU while the clang++ binary had three processes at 100%
(three because OMP_NUM_THREADS=4 and only 1 is executing a task; the other
three are idle).
Is this intended? Are there any other parameters to tweak?
In any case, I think a default of "temporarily active for a while before
switching to passive" instead of "always active unless the user manually
overrides" is saner, no?
having a thread run at 100% is detrimental to other unrelated processes in the same machine.
You could limit the HW resources for your application (e.g. using KMP_HW_SUBSET environment variable), leaving the rest of machine for other processes. The OpenMP runtime library has default settings those give as much performance as possible, and these settings are in contrary with composability with other applications running in parallel on the same HW. You may need to manually adjust settings so that your application behave in composable way.
Is this intended? Are there any other parameters to tweak?
Yes, it is intended. If there are no tasks being executed and no tasks to execute, then the blocktime works for idle threads. If there are tasks then threads never go to sleep waiting for more tasks to be scheduled. That is usual tasking scenario – when one task can generate many more tasks. If we would let idle threads to go to sleep when some thread generates new tasks, that will slow down a lot of existing codes. So I don’t see how your highly unbalanced case can be improved by the library without hurting real applications with better work balance.
Regarding the wait policy, it is currently implemented so that threads call sched_yield for passive policy and spin more actively otherwise. Unfortunately modern kernels behave so that yielding threads are scheduled for execution more frequently in order to let all threads consume equal CPU time, regardless of possible work imbalance in the application. We may think of re-implementing the passive wait policy, or introducing new “very passive” policy, but this would need careful design and time for implementation.
Your current suggested change breaks the task stealing algorithm, because threads do multiple attempts to steal tasks from randomly chosen victim thread until success, and the change breaks this letting a thread go to sleep after the first stealing attempt. It would be possible to count number of tasks globally and let threads go to sleep if there are no tasks to execute. But I suspect this can cause significant slowdown of existing codes, so a lot of performance testing should be done first before applying something like this.
– Andrey
> having a thread run at 100% is detrimental to other unrelated processes
in the same machine.
You could limit the HW resources for your application (e.g. using
KMP_HW_SUBSET environment variable), leaving the rest of machine for other
processes. The OpenMP runtime library has default settings those give as
much performance as possible, and these settings are in contrary with
composability with other applications running in parallel on the same HW.
You may need to manually adjust settings so that your application behave in
composable way.
> Is this intended? Are there any other parameters to tweak?
Yes, it is intended. If there are no tasks being executed and no tasks to
execute, then the blocktime works for idle threads. If there are tasks
then threads never go to sleep waiting for more tasks to be scheduled.
That is usual tasking scenario – when one task can generate many more
tasks. If we would let idle threads to go to sleep when some thread
generates new tasks, that will slow down a lot of existing codes. So I
don’t see how your highly unbalanced case can be improved by the library
without hurting real applications with better work balance.
Regarding the wait policy, it is currently implemented so that threads
call sched_yield for passive policy and spin more actively otherwise.
Unfortunately modern kernels behave so that yielding threads are scheduled
for execution more frequently in order to let all threads consume equal CPU
time, regardless of possible work imbalance in the application. We may
think of re-implementing the passive wait policy, or introducing new “very
passive” policy, but this would need careful design and time for
implementation.
Just to clarify: would this new "very passive" (dormant would be a better
term IMO) policy be akin to GCC's passive (where they use zero resources
because of a futex wait)?
Your current suggested change breaks the task stealing algorithm, because
threads do multiple attempts to steal tasks from randomly chosen victim
thread until success, and the change breaks this letting a thread go to
sleep after the first stealing attempt. It would be possible to count
number of tasks globally and let threads go to sleep if there are no tasks
to execute. But I suspect this can cause significant slowdown of existing
codes, so a lot of performance testing should be done first before applying
something like this.
Does it have to be global? Making it global would mean a lot of cache
misses, right? Have a per-thread counter that's decremented every time it
can't steal a task, if it reaches zero reset to default value and go to
sleep. At any point in time, if we steal a task successfully, reset the
counter.
Adhityaa -
Indeed, I would like to see data for interesting tests, not code that doesn’t do anything and is pathologically load-imbalanced.
https://github.com/ParRes/Kernels contain some interesting tests that are very easy to run. See p2ptask, stenciltask, and transposetask in C1z, Cxx11, and/or FORTRAN subdirectories. The language should not matter, as they are designed to be as similar as possible. Configuring the build is just “ln -sf common/make.defs.${toolchain} common/make.defs” after appropriate changes to make.defs.${toolchain}.
What impact does “very passive” have on these programs? Feel free to contact me privately about experimental setup.
Jeff