Are these now always going to be on by default? Can we add a build flag to disable them?
They are on by default in the build.pl (Makefile) build system. I left them out of the CMake build system because I figured people could add them in if they wanted to via CFLAGS, etc.
There isn't a really convenient way to add build flags to the build.pl system because it does not contain a configuration stage. So I honestly don't know what to do about it.
Also, sorry about not putting this one on Phabricator.
For one thing, this comment explaining what is going on should really be in the code too. Second, I don't understand why this is desirable; can you please explain?
OK, the comment can go in the code. Basically you are right. We want the threads to all be offset differently from page alignment. For example (assuming 64-byte cache lines):
Thread0 offset 0 from its base page
Thread1 offset 64 bytes from its base page
Thread2 offset 128 bytes from its base page
Thread3 offset 192 bytes from its base page
Thread4 ...
It is useful for Intel(R) Many Integrated Core Architecture (MIC), which runs lots of threads, but it had no negative effects on regular Linux builds either. The goal is to reduce cache conflicts and cache thrashing among threads whose local stack data would otherwise sit at the same offset relative to a page.
#pragma omp parallel
{
    // lots of threads here ~ 240 threads on MIC
    double update_me;
    for(...) {
        read update_me;
        ...
        write update_me;
    }
}
If update_me is offset by the same amount for each thread's stack, then a possible cache thrashing condition could occur. It's remote, but possible. And again, it only really affects MIC. Since it didn't hurt performance in our own testing, we left it in there for Linux.
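To make the offsets listed above concrete, here is a small sketch in C. The helper name stack_offset_in_page and the page size passed to it are assumptions for illustration only, not part of the runtime; the 64-byte step matches the cache-line example.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a runtime function): byte offset of thread
 * `gtid`'s stack data from page alignment, assuming every thread's stack
 * base is page aligned and each thread is shifted by `stkoffset` bytes
 * more than the previous one.
 */
static size_t stack_offset_in_page(int gtid, size_t stkoffset, size_t page_size)
{
    return ((size_t)gtid * stkoffset) % page_size;
}
```

Under these assumptions, with a 4096-byte page and 64-byte steps there are only 64 distinct offsets, so thread 64 would land on the same offset as thread 0.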
#if KMP_OS_LINUX || KMP_OS_FREEBSD
    if ( __kmp_stkoffset > 0 && gtid > 0 ) {
        padding = alloca( gtid * __kmp_stkoffset );
    }
#endif
This padding variable is dead, and I'd hope that the compiler removes it.
Yes, it sure does... I've now confirmed that. Somehow, the:
stack_size += gtid * __kmp_stkoffset;
status = pthread_attr_setstacksize( & thread_attr, stack_size );
lines trigger the offset. I've tested the offset values by reading out %rsp.
I believe what the code is trying to do is perform the offset with the alloca rather than the setstacksize() call.
The calling tree should look like:
1) [master thread] Master thread calls __kmp_create_worker()
2) [master thread] __kmp_create_worker() sets the stacksize and calls pthread_create()
3) pthread_create() is called with __kmp_launch_worker() as the function to perform
4) [worker thread] __kmp_launch_worker() performs alloca() to offset the current thread's stack (or it's supposed to)
5) [worker thread] enter loop waiting for work, but all worker threads' stack sizes are supposed to be identical even though the offset was performed.
In other words:
When thread 1 is pthread_created, its stacksize = __kmp_stksize + 1*64, then alloca(1*64) shortens the usable stack back to __kmp_stksize;
When thread 2 is pthread_created, its stacksize = __kmp_stksize + 2*64, then alloca(2*64) shortens the usable stack back to __kmp_stksize;
When thread 3 is pthread_created, its stacksize = __kmp_stksize + 3*64, then alloca(3*64) shortens the usable stack back to __kmp_stksize;
.... and so on.
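The bookkeeping above can be sketched in C roughly as follows. The function names are hypothetical and only mirror the description; the real runtime code may be organized differently. Only the master-thread side is shown.

```c
#include <pthread.h>
#include <stddef.h>

/* Stack size requested at creation: base size plus this thread's offset. */
static size_t requested_stack_size(size_t stksize, size_t stkoffset, int gtid)
{
    return stksize + (size_t)gtid * stkoffset;
}

/* Stack left for the worker once alloca() has consumed its offset. */
static size_t usable_stack_size(size_t requested, size_t stkoffset, int gtid)
{
    return requested - (size_t)gtid * stkoffset;
}

/*
 * Master-thread side (hypothetical helper): initialise the attributes with
 * the enlarged stack size. Returns 0 on success or a pthread error code.
 */
static int init_worker_attr(pthread_attr_t *attr, size_t stksize,
                            size_t stkoffset, int gtid)
{
    int status = pthread_attr_init(attr);
    if (status != 0)
        return status;
    return pthread_attr_setstacksize(
        attr, requested_stack_size(stksize, stkoffset, gtid));
}
```

The point of the two size helpers is that every worker ends up with the same usable stack only if the alloca() in the worker really executes.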
Obviously, we need to fix this.
What's the best way to force the alloca() to take place? Is there a common trick to do this?
-- Johnny
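One commonly used trick, sketched here under assumptions rather than as the final fix, is to store the alloca() result in a volatile-qualified pointer so the compiler has to keep the store and therefore the allocation. The names stkoffset, launch_worker and record_stack_address are hypothetical stand-ins, <alloca.h> assumes a glibc-style system, and the noinline attribute is a GCC/Clang extension used only to keep the demonstration honest.

```c
#include <alloca.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the runtime's __kmp_stkoffset setting. */
static size_t stkoffset = 64;

/* Address of a stack local seen by the most recent work callback. */
static volatile uintptr_t last_local_addr;

/* Example work function: records where its own stack data lives.
   noinline (GCC/Clang extension) keeps `marker` in its own frame. */
__attribute__((noinline)) static void record_stack_address(void)
{
    volatile char marker = 0;
    last_local_addr = (uintptr_t)&marker;
}

/* Sketch of a worker entry point: shift the stack by gtid * stkoffset
   bytes before doing any work. */
static void launch_worker(int gtid, void (*work)(void))
{
    /* volatile applies to the pointer itself, so the store of alloca()'s
       result is a side effect the compiler has to keep. */
    void *volatile padding = NULL;

    if (stkoffset > 0 && gtid > 0) {
        padding = alloca((size_t)gtid * stkoffset);
    }
    (void)padding; /* silence unused-variable warnings */

    work();
}
```

Calling launch_worker(0, record_stack_address) and then launch_worker(4, record_stack_address) should leave different addresses in last_local_addr, which is a quick way to check the offset without reading out %rsp by hand.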