questions about loop scheduling

Hi everyone,

I’d like to ask a couple of questions that came up while I was working on performance tracing with various loop schedules.

  1. For a static/chunked schedule with a decreasing loop, I’ve observed something strange with __kmp(c)_for_static_init(). I checked the loop’s lower and upper iteration bounds, the chunk size, and the loop increment at the beginning and end of __kmp_for_static_init by enabling debug output (KMP_D_DEBUG), and they looked as if I had set up an increasing loop, even though the loop itself ran correctly as I expected.

To be more specific, let me give you a very simple example as follows:

#pragma omp parallel for schedule(static, 4)
for (int i = 31; i >= 0; i--) { LOOP_BODY };

In this case, thread 0 would get the chunks of iterations 31 to 28 and 15 to 12, thread 1 the chunks 27 to 24 and 11 to 8, and so forth. However, I saw that __kmp(c)_for_static_init for thread 0 reported (LB: 0, UB: 3, ST: 16), as if it were an increasing loop starting at 0; otherwise, there must be some way for the runtime to interpret a chunk with LB 0 as the loop iteration with index 31.

The reason I care about this is that I’d like to track each work chunk while scheduling chunks arbitrarily, and I was considering using LB (the lower bound) as a unique ID for each work chunk. In this case, however, I don’t know how to extract the correct bounds in terms of the original loop iteration numbers of a scheduled chunk. I am also looking into Clang’s OpenMP codegen to see what’s going on there, but I feel a bit lost.

Please help me find what I missed, and correct me if I got anything wrong.

  2. To measure the overhead of dynamic/chunked scheduling, I’ve added callbacks at the beginning and end of __kmp_dispatch_next_algorithm() in kmp_dispatch.cpp and measured the elapsed time in between. I was told dynamic scheduling could be pretty expensive, so I expected contention around a (conceptually) centralized work queue to become costly as the number of chunks and threads grows, as discussed in [REF1].

However, the overhead seems quite acceptable with a moderate number of threads (16-64) and chunks per thread (16-64), unless I try an extreme case such as a chunk size of 1. I took a quick look at the implementation, and this seemed to make sense: what we pay for dynamic scheduling, compared to static scheduling, is one atomic operation (test_then_inc_acq), and no centralized queue is actually materialized.

I’d just like to double-check that what I’ve observed makes sense to those who have contributed to performance tuning of the runtime implementation. Any comments from your experience would be greatly appreciated!

FYI, I am working with Clang 7 and LLVM OpenMP 7.0.

[REF1] OpenMP task scheduling strategies for multicore NUMA systems

  1. Not sure I got the question right, but if you want to get the real chunk bounds inside the OpenMP runtime library, I am afraid that is impossible: Clang calls into the runtime with normalized loop bounds, so the runtime does not know the actual values; it only knows the normalized iteration space.

  2. You are right, dynamic scheduling uses a single atomic operation per chunk. To assess whether it is expensive or not, try distributing 1B chunks, and you will find that the scheduling is pretty expensive compared to a static schedule, for example. When working with only a dozen chunks, the overhead will always be (relatively) small.