Iteration counts in work sharing

Hello,

I am looking into the __kmp_for_static_init4 function, which marks the start of work sharing. The function computes the upper and lower bounds and the stride to be used for the set of iterations to be executed by the current thread, starting from the statically scheduled loop described by the initial values of the bound, stride, increment, and chunk size. How is the record of iterations kept? Which function/file should I look into to track the iterations?

Thanks,

For a static schedule the runtime does not track iterations. The __kmpc_for_static_init_4 call (or similar, depending on the iteration variable type) is the only runtime call for a statically scheduled loop (not counting the trailing barrier, if any). After it, the compiler just loops over the obtained range of iterations without involving the runtime any more.
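For illustration, here is a rough sketch of what the compiler-emitted code looks like for a schedule(static) loop inside an existing parallel region. This is not verbatim compiler output: the local names are made up, the real entry points take ident_t* and kmp_int32 arguments, and the schedule value 34 is assumed to be kmp_sch_static (worth verifying against kmp.h).

/* Sketch of compiler-emitted code for
 *   #pragma omp for schedule(static)
 *   for (int i = 0; i < n; ++i) body(i);
 * inside an existing parallel region. */
extern void __kmpc_for_static_init_4(void *loc, int gtid, int schedtype,
                                     int *plastiter, int *plower, int *pupper,
                                     int *pstride, int incr, int chunk);
extern void __kmpc_for_static_fini(void *loc, int gtid);
extern void body(int i);

static void outlined_loop_body(void *loc, int gtid, int n) {
  int lower = 0, upper = n - 1, stride = 1, lastiter = 0;
  /* 34 is assumed to be kmp_sch_static; the runtime clips [lower, upper]
     to this thread's share of the iterations and returns it in place. */
  __kmpc_for_static_init_4(loc, gtid, /*schedtype=*/34, &lastiter,
                           &lower, &upper, &stride, /*incr=*/1, /*chunk=*/1);
  if (upper > n - 1)
    upper = n - 1;                     /* clamp to the loop's global bound */
  for (int i = lower; i <= upper; ++i) /* no further runtime calls here */
    body(i);
  __kmpc_for_static_fini(loc, gtid);   /* bookkeeping only, no iteration tracking */
}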

Regards,

Andrey

Thanks for the information.

For each static_init_4 call, there is a static_fini call. Is this specific to its static_init_4 call? Is there a pairing of calls that cannot be broken? If we look at the static_fini call, the function arguments are the same. It seems that if we have two static_init_4 function calls, we can interchange their static_fini calls. Does that make sense?

Suppose we have two loops, Loop-A and Loop-B (identical in terms of iterations, chunk, increment, etc.), one inside an omp for construct and the other outside it:

#pragma omp for
Loop-A

Loop-B

If I move the static_fini call beyond Loop-B in the IR, should Loop-B become part of the work-sharing environment, with its iterations distributed over the available threads in the same way as Loop-A's?

Thanks,

A couple of weeks ago Jim Cownie <jcownie@gmail.com> effectively answered your question.

Let me try to re-iterate:

For each static_init_4 call, there is a static_fini call. Is this specific to its static_init_4 call?

Yes.

Is there a pairing of calls that cannot be broken?

Yes, though I am not completely sure what you are asking about here. You can break the code any way you want, but why?

If we look at the static_fini call, the function arguments are the same.

Same as what? They are probably the same location info and global thread id as in the corresponding static_init call, if I understood your question correctly.

It seems that if we have two static_init_4 function calls, we can interchange their static_fini calls. Does that make sense?

I am not sure what your goal is here… All you can achieve by such broken code generation is to confuse an OMPT tool (with unspecified results), possibly break statistics gathering, and break the consistency checking of the OpenMP constructs - that is all the code of static_fini does.

If I move the static_fini call beyond Loop-B in the IR, should Loop-B become part of the work-sharing environment, with its iterations distributed over the available threads in the same way as Loop-A's?

All you can achieve here is, again, broken code generation, and as a result a confused OMPT tool with unspecified results, etc. The static_fini call does not affect parallelization in any way, so your second loop, Loop-B, will remain a serial loop and will be redundantly executed by all threads of the team, regardless of the location of the static_fini call. To make a loop work-sharing, it should be parallelized (e.g. by adding "#pragma omp for" in the source code, given there exists an enclosing parallel region).
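To make the point concrete, here is a small self-contained sketch (compile with -fopenmp; the output text is illustrative): Loop-A is work-shared, while Loop-B is executed in full by every thread, no matter where the compiler places the static_fini call.

#include <omp.h>
#include <stdio.h>

int main(void) {
  int n = 4;
  #pragma omp parallel
  {
    #pragma omp for               /* Loop-A: iterations are split across threads */
    for (int i = 0; i < n; ++i)
      printf("A: thread %d runs i=%d\n", omp_get_thread_num(), i);

    for (int i = 0; i < n; ++i)   /* Loop-B: no "omp for", so every thread runs
                                     all n iterations redundantly */
      printf("B: thread %d runs i=%d\n", omp_get_thread_num(), i);
  }
  return 0;
}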

As Jim suggested earlier, look at the code!

And please first describe the problem you want to solve, as opposed to trying to shuffle statements by guesswork.

Regards,

Andrey

My apologies, I don't think I received Jim's email.


If two #pragma omp for loops are compatible (i.e. they have the same parameter values in the runtime calls), I am trying to run both with a single pair of static_init and static_fini calls.

#pragma omp for
Loop-A

#pragma omp for
Loop-B

Loop-A and Loop-B are compatible ( same iterations and chunk size)

--> will produce

call __kmpc_static_init4A()
Loop-A
call __kmpc_static_finiA()

call __kmpc_static_init4B()
Loop-B
call __kmpc_static_finiB()

--> want to achieve

call __kmpc_static_init4A()
Loop-A
//call __kmpc_static_finiA()
// Remove these two calls and adjust their use values as necessary

//call __kmpc_static_init4B()
Loop-B
call __kmpc_static_finiB()

This will avoid recalculating the chunk size, etc.

Hope this helps!

It is still unclear what you want to achieve. OK, you can eliminate two library calls, but those are among the cheapest in the runtime library (there are no synchronizations, only a couple of integer math operations). You will break tools support and are unlikely to get any performance gain; in real applications there will definitely be no performance gain. So why do you want to remove the library calls, losing support for tools, without an observable gain?

BTW, you might consider eliminating all the static_init/fini calls, as it is not hard to query the number of threads and calculate the iteration distribution for statically scheduled loops without the runtime library's help. But it will be hard to add tools support for such an implementation, I think. So its value is unclear.
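For what it's worth, a minimal sketch of such a runtime-free distribution might look like the following (an even block split, assumed to match what the runtime does for schedule(static) with no chunk; compile with -fopenmp):

#include <omp.h>
#include <stdio.h>

int main(void) {
  int n = 10;
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    int nth = omp_get_num_threads();
    /* Block split: the first (n % nth) threads take one extra iteration. */
    int per = n / nth, rem = n % nth;
    int lo  = tid * per + (tid < rem ? tid : rem);
    int hi  = lo + per + (tid < rem ? 1 : 0);
    for (int i = lo; i < hi; ++i)   /* no static_init/fini calls involved */
      printf("thread %d runs i=%d\n", tid, i);
  }
  return 0;
}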

Regards,

Andrey