OpenMP threads slow to start on Windows

Executive summary: I am on a Windows 10 system, running an OpenMP parallel for on short workloads, say 2-10 microseconds per thread. My problem is that the worker threads quite often take dozens to hundreds of microseconds to start, even when I run the process at high priority. It's as if the scheduler were nonpreemptive, but the docs say it is preemptive. Can someone explain? It's more a Windows question than an OpenMP question.

Detail:

My application operates on pairs of long vectors - say it adds them together to produce a vector result. Its rules state that it must completely finish with one pair before it can be given another. I would like to use multiple threads to speed things up. I am running Windows 10.

I created an OpenMP parallel for construct and divided the vector among all the threads of the team. All threads start, all threads run pretty fast, so the multithreading is effective.
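For concreteness, a minimal sketch of such a loop might look like the following. The function name `add_vectors`, the `double` element type, and the `schedule(static)` clause are assumptions for illustration, not the actual application code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: element-wise sum of two equal-length vectors.
// Compile with OpenMP enabled (e.g. /openmp on MSVC, -fopenmp on GCC/Clang);
// without it the pragma is ignored and the loop simply runs serially.
std::vector<double> add_vectors(const std::vector<double>& x,
                                const std::vector<double>& y) {
    assert(x.size() == y.size());
    std::vector<double> z(x.size());
    const int n = static_cast<int>(x.size());

    // schedule(static) hands each thread of the team one contiguous chunk,
    // so every thread gets roughly n / team_size elements.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        z[i] = x[i] + y[i];
    }
    return z;  // the implicit barrier means all threads finished before this
}
```

Because the result cannot be returned until the slowest thread passes the barrier at the end of the loop, a single late-starting worker delays the whole operation.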

But the speedup is slight, and the reason is that some of the time, one of the worker threads takes way longer than usual. I have instrumented the operation, and I see that sometimes the worker threads take a long time to start: the delay averages about 20 microseconds but can reach dozens of milliseconds, depending on system load. The master thread does not show this delay.
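One way to take this kind of measurement is sketched below. It is an illustration under assumptions, not the instrumentation actually used: the helper name `measure_start_delays_us` is hypothetical, and `std::chrono::steady_clock` is assumed to have fine enough resolution on the target system.

```cpp
#include <cassert>
#include <chrono>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: for each thread of the team, returns the time in
// microseconds between the master reaching the parallel region and that
// thread executing its first statement. -1 marks a thread that never ran.
std::vector<double> measure_start_delays_us() {
    using clock = std::chrono::steady_clock;

    int max_threads = 1;  // serial fallback when OpenMP is not enabled
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    std::vector<double> delay_us(static_cast<std::size_t>(max_threads), -1.0);

    const clock::time_point fork_time = clock::now();  // taken by the master
    #pragma omp parallel
    {
        const clock::time_point started = clock::now();  // first thing each thread does
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        assert(tid < max_threads);
        // Each thread writes only its own slot, so no synchronization is needed.
        delay_us[static_cast<std::size_t>(tid)] =
            std::chrono::duration<double, std::micro>(started - fork_time).count();
    }
    return delay_us;
}
```

Slot 0 belongs to the master thread, so comparing it with the other slots over many repetitions shows how much of the total time is worker startup rather than work.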

That makes me think that the scheduler is taking some time to start the worker threads. The master thread is already running, so it doesn't have to wait to be started.

But here is the nub of the question: raising the priority of the process doesn't make any difference. I can raise it to high priority or even realtime priority, and I still see that startup of the worker threads is often delayed. It looks like the Windows scheduler is not fully preemptive, and sometimes lets a lower-priority thread run when a higher-priority one is eligible. Can anyone confirm this?

I have verified that the worker threads are created with the default OS priority, namely the base priority of the class of the master process. This should be higher than the priority of any other running thread, I think. Or is it normal for there to be some thread with realtime priority that might be blocking my workers? I don't see one with Task Manager.
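A check along these lines could be done as in the sketch below, which asks each thread of the team for its relative priority using the Win32 calls `GetThreadPriority` and `GetCurrentThread`. The helper name `count_nondefault_priority_threads` is hypothetical, and this is only one possible way to verify it, not necessarily how it was verified here.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: counts the team threads whose relative priority is
// not THREAD_PRIORITY_NORMAL, i.e. threads that do not simply inherit the
// base priority of the process's priority class.
// Returns -1 on non-Windows builds, where the check does not apply.
int count_nondefault_priority_threads() {
#ifdef _WIN32
    int odd = 0;
    #pragma omp parallel reduction(+ : odd)
    {
        // GetCurrentThread() yields a pseudo-handle for the calling thread.
        if (GetThreadPriority(GetCurrentThread()) != THREAD_PRIORITY_NORMAL) {
            ++odd;
        }
    }
    assert(odd >= 0);
    return odd;
#else
    return -1;
#endif
}
```

A result of 0 after raising the process class (for example with `SetPriorityClass`) would indicate that every worker sits at the base priority of that class.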

I guess one last possibility is that the task switch might take 20-2000 usec. Is that plausible?

I have a 4-core system without hyperthreading.

Henry Rich