Some smaller patches.

Hal,

I have some small patches here. I put the bigger ones on Phabricator. If you want all of these on Phabricator I’ll start doing that instead.

  1. security_flags.patch – added some flags for security on Linux and Mac link stages.

  2. stack_offset.patch – changes default stack offset for threads on non-Mac architectures to a CACHE_LINE. This puts threads at different offsets from a page during creation.

  3. omp_flush_fix.patch – removes unused varargs from #pragma omp flush api function.

– Johnny

omp_flush_fix.patch (1.64 KB)

security_flags.patch (3 KB)

stack_offset.patch (549 Bytes)

For the stuff on phab can you post links to the mailing list or cc me.
I'm not subscribed there.

Thanks

For the security patch - what's the actual wall clock measured performance hit?

From: "C Bergström" <cbergstrom@pathscale.com>
To: "Jonathan L Peyton" <jonathan.l.peyton@intel.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, openmp-commits@dcs-maillist2.engr.illinois.edu,
openmp-dev@dcs-maillist2.engr.illinois.edu
Sent: Friday, February 13, 2015 11:54:27 PM
Subject: Re: [Openmp-dev] Some smaller patches.

For the stuff on phab can you post links to the mailing list or cc
me.
I'm not subscribed there.

FWIW, I did not see the customary e-mail from Phabricator either (and I am subscribed to openmp-commits). Did you add openmp-commits to the list of subscribers?

-Hal

From: "Jonathan L Peyton" <jonathan.l.peyton@intel.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: openmp-dev@dcs-maillist2.engr.illinois.edu, openmp-commits@dcs-maillist2.engr.illinois.edu
Sent: Friday, February 13, 2015 6:02:53 PM
Subject: Some smaller patches.

Hal,

I have some small patches here. I put the bigger ones on Phabricator.
If you want all of these on Phabricator I’ll start doing that
instead.

1) security_flags.patch – added some flags for security on Linux and
Mac link stages.

Are these now going to be always on by default? Can we add a build flag to disable them?

P.S. Even though this patch is small, it is also a good candidate for Phabricator because it is hard, just from the patch itself, what build targets are being modified.

2) stack_offset.patch – changes default stack offset for threads on
non-Mac architectures to a CACHE_LINE. This puts threads at
different offsets from a page during creation.

For one thing, this comment explaining what is going on should really be in the code too. Second, I don't understand why this is desirable, can you please explain.

Also, some of the uses of this variable seem questionable (KMP_DEFAULT_STKOFFSET is used to set __kmp_stkoffset), and we have this in z_Linux_util.c:

#if KMP_OS_LINUX || KMP_OS_FREEBSD
    if ( __kmp_stkoffset > 0 && gtid > 0 ) {
        padding = alloca( gtid * __kmp_stkoffset );
    }
#endif

this padding variable is dead, and I'd hope that the compiler removes it.

And then, as you imply, this is used to adjust the thread's stack size:

            /* Set stack size for this thread now. */
            stack_size += gtid * __kmp_stkoffset;
...
                status = pthread_attr_setstacksize( & thread_attr, stack_size );

any while I understand that this might also affect the starting offset of subsequent thread's stacks, I don't see what would be true (I assume that the OS will create each thread's stack so that the start of the stack is really page aligned). Again, why you're multiplying by gtid is not explained. Are you trying to reduce false sharing?

In short, this needs more comments (at least).

3) omp_flush_fix.patch – removes unused varargs from #pragma omp
flush api function.

LGTM.

Thanks again,
Hal

For the security patch - what's the actual wall clock measured performance hit?

We haven't seen any performance issues during testing.

Thank you.

Regards,
Olga

Are these now going to be always on by default? Can we add a build flag to disable them?

They are on by default in the build.pl (Makefile) build system. I left them out of the CMake build system because I figured people could add them in if they wanted to via CFLAGS, etc.
There isn't a real convenient way to add build flags to the build.pl system because it does not contain a configuration stage. So I honestly don't know what to do about it.
Also, sorry about not putting this one on Phabricator.

For one thing, this comment explaining what is going on should really be in the code too. Second, I don't understand why this is desirable, can you please explain.

Ok to the comment going in the code. Basically you are right. We want the threads to all be offset differently from page alignment. For example (assuming 64 byte cache lines):
Thread0 offset 0 from its base page
Thread1 offset 64 bytes from its base page
Thread2 offset 128 bytes from its base page
Thread3 offset 192 bytes from its base page
Thread4 ...

It is useful for Intel(R) Many Integrated Core Architecture (lots of threads), but had no negative effects for regular Linux builds either. The goal is to help reduce cache conflicts/cache thrashing of local stack data of individual threads where the local stack data is offset by the same amount relative to a page for each thread.

#pragma omp parallel
{
   // lots of threads here ~ 240 threads on MIC
   double update_me;
   for(...) {
      read update_me;
       ...
      write update_me;
   }
}

If update_me is offset by the same amount for each thread's stack, then a possible cache thrashing condition could occur. It's remote, but possible. And again, it only really affects MIC. Since it didn't hurt performance in our own testing, we left it in there for Linux.

#if KMP_OS_LINUX || KMP_OS_FREEBSD
  if ( __kmp_stkoffset > 0 && gtid > 0 ) {
        padding = alloca( gtid * __kmp_stkoffset );
    }
#endif

this padding variable is dead, and I'd hope that the compiler removes it.

Yes, it sure does... I've now confirmed that. Somehow, the:
            stack_size += gtid * __kmp_stkoffset;
                status = pthread_attr_setstacksize( & thread_attr, stack_size );
lines triggers the offset. I've tested the offset values by reading out %rsp.

I believe what the code is trying to do is perform the offset with the alloca rather than the setstacksize() call.
The calling tree should look like:
1) [master thread] Master thread calls __kmp_create_worker()
2) [master thread] __kmp_create_worker() sets the stacksize and calls pthread_create()
3) pthread_create() is called with __kmp_launch_worker() as the function to perform
4) [worker thread] __kmp_launch_worker() performs alloca() to offset current thread's stack (or it's supposed to);
5) [worker thread] enter loop waiting for work, but all worker thread's stack sizes are supposed to be identical even though the offset was performed.
In other words:
  When thread 1 is pthread_created, its stacksize = __kmp_stksize + 1*64, then alloca(1*64) shortens the stacksize back to kmp_stksize;
  When thread 2 is pthread_created, its stacksize = __kmp_stksize + 2*64, then alloca(2*64) shortens the stacksize back to kmp_stksize;
  When thread 3 is pthread_created, its stacksize = __kmp_stksize + 3*64, then alloca(3*64) shortens the stacksize back to kmp_stksize;
.... and so on.

We need to fix this obviously.
Whats the best way to enforce the alloca() to take place? Is there a common trick to do this?

-- Johnny

From: "Jonathan L Peyton" <jonathan.l.peyton@intel.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: openmp-dev@dcs-maillist2.engr.illinois.edu, openmp-commits@dcs-maillist2.engr.illinois.edu
Sent: Tuesday, February 17, 2015 4:43:25 PM
Subject: RE: Some smaller patches.

> Are these now going to be always on by default? Can we add a build
> flag to disable them?
They are on by default in the build.pl (Makefile) build system. I
left them out of the CMake build system because I figured people
could add them in if they wanted to via CFLAGS, etc.
There isn't a real convenient way to add build flags to the build.pl
system because it does not contain a configuration stage. So I
honestly don't know what to do about it.
Also, sorry about not putting this one on Phabricator.

Okay, fair enough. We should add an option to the CMake build to make this easy. I suppose that, from the LLVM project perspective, we'll never really advocate using the build.pl system. We'll simply say it is there for the purpose of matching, to the extent possible, the builds that Intel distributes. So LGTM (and add some option to the cmake builds to match).

> For one thing, this comment explaining what is going on should
> really be in the code too. Second, I don't understand why this is
> desirable, can you please explain.
Ok to the comment going in the code. Basically you are right. We
want the threads to all be offset differently from page alignment.
For example (assuming 64 byte cache lines):
Thread0 offset 0 from its base page
Thread1 offset 64 bytes from its base page
Thread2 offset 128 bytes from its base page
Thread3 offset 192 bytes from its base page
Thread4 ...

It is useful for Intel(R) Many Integrated Core Architecture (lots of
threads), but had no negative effects for regular Linux builds
either. The goal is to help reduce cache conflicts/cache thrashing
of local stack data of individual threads where the local stack data
is offset by the same amount relative to a page for each thread.

#pragma omp parallel
{
   // lots of threads here ~ 240 threads on MIC
   double update_me;
   for(...) {
      read update_me;
       ...
      write update_me;
   }
}

If update_me is offset by the same amount for each thread's stack,
then a possible cache thrashing condition could occur. It's remote,
but possible. And again, it only really affects MIC. Since it
didn't hurt performance in our own testing, we left it in there for
Linux.

> #if KMP_OS_LINUX || KMP_OS_FREEBSD
> if ( __kmp_stkoffset > 0 && gtid > 0 ) {
> padding = alloca( gtid * __kmp_stkoffset );
> }
> #endif
>
> this padding variable is dead, and I'd hope that the compiler
> removes it.
Yes, it sure does... I've now confirmed that. Somehow, the:
            stack_size += gtid * __kmp_stkoffset;
                status = pthread_attr_setstacksize( & thread_attr,
                stack_size );
lines triggers the offset. I've tested the offset values by reading
out %rsp.

I believe what the code is trying to do is perform the offset with
the alloca rather than the setstacksize() call.
The calling tree should look like:
1) [master thread] Master thread calls __kmp_create_worker()
2) [master thread] __kmp_create_worker() sets the stacksize and calls
pthread_create()
3) pthread_create() is called with __kmp_launch_worker() as the
function to perform
4) [worker thread] __kmp_launch_worker() performs alloca() to offset
current thread's stack (or it's supposed to);
5) [worker thread] enter loop waiting for work, but all worker
thread's stack sizes are supposed to be identical even though the
offset was performed.
In other words:
  When thread 1 is pthread_created, its stacksize = __kmp_stksize +
  1*64, then alloca(1*64) shortens the stacksize back to
  kmp_stksize;
  When thread 2 is pthread_created, its stacksize = __kmp_stksize +
  2*64, then alloca(2*64) shortens the stacksize back to
  kmp_stksize;
  When thread 3 is pthread_created, its stacksize = __kmp_stksize +
  3*64, then alloca(3*64) shortens the stacksize back to
  kmp_stksize;
.... and so on.

Ah, okay. Now I understand. This really needs to be well explained by the comments in the code -- the text from this e-mail would be great :wink:

We need to fix this obviously.
Whats the best way to enforce the alloca() to take place? Is there a
common trick to do this?

Yes, you pass the pointer to some external function. Since the compiler cannot prove that the buffer is unused, it must keep it.

-Hal

We need to fix this obviously.
Whats the best way to enforce the alloca() to take place? Is there a
common trick to do this?

Yes, you pass the pointer to some external function. Since the compiler cannot prove that the buffer
is unused, it must keep it.

I generally cheat worse than that and use something like this

// Ensure that there is a reference to the variable passed in which the compiler cannot discard.
static void reference(void * value)
{
    __asm__ volatile ("# reference"::"r"(value));
}

though it relies on GCC style inline asm, so won't work when compiling for Windows.
OTOH, it is immune to -ipo compilation which could, potentially, remove the call (and thus the reference).

Another possibility (maybe the best), would be to

1) declare "padding" as int volatile *
2) cast the alloca result to (int volatile *)
3) do a *padding = 0;

The compiler should never remove a store into a volatile type, so it can't remove the reference to the alloca-ed space, and therefore can't decide the space is dead. (At least, that's the theory).

-- Jim

James Cownie <james.h.cownie@intel.com>
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)
Tel: +44 117 9071438