initial clang-omp/openmp benchmarking

I’ve done some initial benchmarking of openmp performance using the
clang compiler from our fink llvm34-3.4.1-0e packaging, which has the
current openmp trunk svn built against llvm/compiler-rt/clang 3.4.1
with a backport of the current clang-omp from github applied. Compiling
and running the heated_plate_openmp.c demo code with the
heated_plate_gcc.sh shell script turned up some interesting results.
The demo code is run at one, two and four OMP processes. Ratioing these
timings to the one-OMP-process timing shows the following on a 16-core
MacPro on darwin13…

1:1.90:3.31 for FSF gcc 4.8.3

1:1.90:3.30 for FSF gcc 4.9.0

1:1.99:3.71 for clang 3.4.1 with openmp and merged clang-omp

This compares to the results on a 24-core Fedora 15 linux box…

1:1.99:3.92 for FSF gcc 4.6.3

1:1.99:3.93 for FSF gcc 4.8 branch svn
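
For reference, the kernel being timed is a Jacobi-style relaxation sweep. The actual heated_plate_openmp.c and its driver script are attached to PR 61333 rather than reproduced here; the sketch below only shows the general shape, and the grid size and tolerance are made up.

/*
 * Hypothetical Jacobi sweep in the style of heated_plate_openmp.c.
 * Build: gcc -fopenmp -O2 jacobi_sketch.c -lm
 * Run:   OMP_NUM_THREADS=1 ./a.out   (then 2 and 4 for the ratios above)
 * Note:  reduction(max:...) needs OpenMP 3.1 (gcc >= 4.7, clang-omp).
 */
#include <math.h>
#include <omp.h>
#include <stdio.h>

#define N 500

static double u[N][N], w[N][N];

int main(void)
{
    int i, j;
    double diff = 1.0, wtime;

    for (i = 0; i < N; i++)          /* hot left, right and bottom edges */
        u[i][0] = u[i][N-1] = u[N-1][i] = 100.0;

    wtime = omp_get_wtime();
    while (diff > 0.001) {
        diff = 0.0;
        /* Average the four neighbours, tracking the largest change. */
        #pragma omp parallel for private(j) reduction(max:diff)
        for (i = 1; i < N - 1; i++)
            for (j = 1; j < N - 1; j++) {
                w[i][j] = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]) / 4.0;
                if (fabs(w[i][j] - u[i][j]) > diff)
                    diff = fabs(w[i][j] - u[i][j]);
            }
        /* Copy the new interior back for the next sweep. */
        #pragma omp parallel for private(j)
        for (i = 1; i < N - 1; i++)
            for (j = 1; j < N - 1; j++)
                u[i][j] = w[i][j];
    }
    printf("wall clock time = %f s\n", omp_get_wtime() - wtime);
    return 0;
}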

I’ve filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61333 on the
reduced performance of gomp on darwin compared to iomp5 on darwin and
gomp on linux. Their response was that gomp’s use of pthread_mutex calls
rather than futexes on darwin was the cause, and that we should be using linux.
While the results for iomp5 are far better on darwin than those for
gomp on darwin, we still are lagging behind the performance of gomp using
futex on linux. FYI, the heated_plate_openmp.c and heated_plate_gcc.sh
are attached to PR 61333.
Jack

Without looking at the benchmark code (do you have a link to it?), given the description it sounds as if it is omp_lock_t intensive.

If so, you can explicitly use FUTEX locks with libiomp5.so on Linux by setting the environment variable KMP_LOCK_KIND=futex.

You may also want to play with KMP_LOCK_KIND=tas, which uses a “test and test-and-set” lock.
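
To make concrete where those settings bite, here is a made-up omp_lock_t-bound loop (the counter and trip count are illustrative, not taken from the benchmark); with the Intel/LLVM runtime the lock implementation is selected at launch via the environment variable:

/*
 * Made-up lock-bound kernel showing where KMP_LOCK_KIND takes effect.
 * With the Intel/LLVM runtime, e.g.:
 *   KMP_LOCK_KIND=futex ./a.out   (Linux only)
 *   KMP_LOCK_KIND=tas   ./a.out   (test and test-and-set)
 */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_lock_t lock;
    long counter = 0;

    omp_init_lock(&lock);
    #pragma omp parallel
    {
        int i;
        for (i = 0; i < 100000; i++) {
            omp_set_lock(&lock);   /* implementation picked by the runtime */
            counter++;
            omp_unset_lock(&lock);
        }
    }
    omp_destroy_lock(&lock);

    printf("counter = %ld\n", counter);   /* = num_threads * 100000 */
    return 0;
}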

The choice of a default lock implementation is not trivial, since some lock benchmarks (such as that in EPCC) are for heavily contended locks, whereas many codes have lightly contended locks and benefit from simpler (and unfair) lock implementations.

– Jim

James Cownie james.h.cownie@intel.com
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)

Tel: +44 117 9071438

James,
The files I used (with the shell script adjusted for the compiler of course) are attached to the gcc
bugzilla at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61333. As for using futex, I am focused on
using darwin so that isn’t really an option.
Jack
PS Attached is an archive with a set of openmp open source demos that I found. There might be a
better example in there for benchmarking locks than heated_plate_openmp.c and heated_plate_gcc.sh.

more_openmp_code.tar.bz2 (56 KB)

Sorry, I read your description of the problem

“While the results for iomp5 are far better on darwin than those for gomp on darwin, we still are lagging behind the performance of gomp using futex on linux.”

as being that libiomp5.so was underperforming on Linux because we’re not using futex there, so I was explaining how we could do that.

I now grok that what you’re saying is that you’d like to see performance on Darwin (without futexes) that is faster than on Linux (with futexes).

So, I suggest trying the TAS lock (KMP_LOCK_KIND=tas on Darwin).
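
For illustration, this is the general shape of a “test and test-and-set” lock in C11 atomics; it is the textbook idiom, not the libiomp5 implementation:

/*
 * Generic TTAS lock: spin on a plain read so the cache line stays in
 * shared state, and only attempt the atomic exchange once the lock
 * looks free. Illustrative only (C11 atomics).
 */
#include <stdatomic.h>

static atomic_int ttas_word;   /* 0 = free, 1 = held */

static void ttas_lock(void)
{
    for (;;) {
        /* "test": local spin, no bus traffic while the lock is held */
        while (atomic_load_explicit(&ttas_word, memory_order_relaxed))
            ;
        /* "test-and-set": one atomic exchange when it looks free */
        if (!atomic_exchange_explicit(&ttas_word, 1, memory_order_acquire))
            return;
    }
}

static void ttas_unlock(void)
{
    atomic_store_explicit(&ttas_word, 0, memory_order_release);
}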

Depending on what you think OpenMP is used for, though, locks may be irrelevant. If you look at the latest SPECOMP codes, there are none that use locks (down from the previous version that had a couple).

In HPC locks should be rare and heavily contended locks absent completely. (Because if there are heavily contended locks in a significant part of the code, it won’t perform well anyway, so it doesn’t qualify for the “HPC” name. :-))

– Jim

James Cownie james.h.cownie@intel.com
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)

Tel: +44 117 9071438

I now grok that what you’re saying is that you’d like to see performance
on Darwin (without futexes) that is faster than on Linux (with futexes).

Not necessarily faster on darwin but at least equal in performance to
linux. FYI, I posted the raw timings for this test case on both darwin and
linux…

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61333#c13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61333#c14

A cursory examination suggests that the ratios of the one- to four-OMP-process
timings are pretty much identical between the clang-omp and FSF gcc
compilers on linux, but the darwin ratios for clang-omp are about 5% lower
than those for futex on linux, and worse still for gomp on darwin.

I don’t really understand what problem you are complaining about.

Your numbers show clang-omp as the fastest implementation in all directly comparable cases. That doesn’t seem like something we want to change!

– Jim

James Cownie james.h.cownie@intel.com
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)

Tel: +44 117 9071438

I think the complaint is this: on Darwin, the scaling to 4 "processes" is
worse than on Linux.

However, the reason is stated already: Linux provides a *very* fast futex
implementation. Darwin either doesn't provide it or iomp doesn't use it.

If Darwin provides a fast futex interface, then iomp should use it. That's
a useful request. I don't know enough about Darwin to help investigate
whether the OS has a futex interface exposed to userland.

If Darwin doesn't provide a futex interface, there is literally nothing we
can do about that. You aren't going to match the scalability of a
kernel-supported futex with something in userspace.
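
For the curious, here is a sketch of the Linux idiom under discussion (the classic three-state futex mutex; illustrative only, not libgomp’s actual implementation):

/*
 * Three-state futex mutex (0 = free, 1 = held, 2 = held with waiters).
 * The point: a contended waiter sleeps in the kernel via FUTEX_WAIT and
 * frees its HW thread, instead of spinning in userspace. Darwin has no
 * equivalent syscall. Linux-only, error handling omitted.
 */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int futex_word;

static void futex_mutex_lock(void)
{
    int c = 0;
    if (atomic_compare_exchange_strong(&futex_word, &c, 1))
        return;                       /* fast path: uncontended acquire */
    if (c != 2)
        c = atomic_exchange(&futex_word, 2);
    while (c != 0) {                  /* slow path: sleep until woken */
        syscall(SYS_futex, &futex_word, FUTEX_WAIT, 2, NULL, NULL, 0);
        c = atomic_exchange(&futex_word, 2);
    }
}

static void futex_mutex_unlock(void)
{
    if (atomic_exchange(&futex_word, 0) == 2)   /* anyone waiting? */
        syscall(SYS_futex, &futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}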

Anyways, I do agree that micro-optimizing mutex performance for something
like openmp seems somewhat less important…

I think the complaint is this: on Darwin, the scaling to 4 “processes” is worse than on Linux.

Four threads is small. The OpenMP runtime is tested for scaling in the 200+ thread range on Xeon Phi, and on big-iron servers. We measure the scaling of a variety of more interesting things there (such as SpecOMP).

Futexes are fast, but then so are our spin-locks. The difference is what happens when the lock is contended (whether you enter the kernel or not, and therefore allow the kernel to schedule something else onto the same HW thread). That should make little difference in this case, since the machine is not over-subscribed.

If Darwin provides a fast futex interface, then iomp should use it.

Darwin does not provide it, so we can’t use it :-).

I’d guess that the issue here is more likely related to affinity choices made by the operating system (whether it chooses to place threads as hyper-threads on the same core, as threads in the same socket, or across sockets) than to details of the locking. I believe that Darwin has no specific support that would let us control that, either…

– Jim

James Cownie james.h.cownie@intel.com
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)

Tel: +44 117 9071438

Darwin has a very weak notion of "affinity hints":

https://developer.apple.com/library/mac/releasenotes/Performance/RN-AffinityAPI/

But it's so dumbed down (only a concept of distinct affinity "tags"
based solely on L2 cache sharing) that it's pretty useless. I did some
microbenchmarks with it to simulate an OpenMP workload with pinning,
and as far as I'm able to tell, the Darwin kernel just ignores those
hints and does whatever it pleases.
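
For reference, setting one of those tags looks roughly like this; this sketch just follows the release notes above, with error checking omitted:

/*
 * Darwin affinity-tag hint, per the release notes above: threads that
 * share a non-zero tag are hinted to share an L2 cache. There is no way
 * to name a specific core, and the kernel may ignore the hint entirely.
 * Error checking omitted.
 */
#include <mach/mach.h>
#include <mach/thread_policy.h>
#include <pthread.h>

static void set_affinity_tag(int tag)
{
    thread_affinity_policy_data_t policy = { tag };

    thread_policy_set(pthread_mach_thread_np(pthread_self()),
                      THREAD_AFFINITY_POLICY,
                      (thread_policy_t)&policy,
                      THREAD_AFFINITY_POLICY_COUNT);
}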

Steven,
Have you filed a radar bug report with Apple on this? There is always the remote possibility that this issue could be addressed in a future OS release.
Jack

No, I haven't. I'm pretty sure Apple's stance is that they don't
*want* people to affinitize processes because they believe people
would abuse it. Comments like this seem to indicate that to me,
anyway:

https://github.com/opensource-apple/xnu/blob/10.9/osfmk/kern/sched_prim.c#L1720

They used to make it possible to affinitize to CPUs via a framework
that came with Xcode called CHUD, but they never made an Intel 64-bit
version of it, and now it's gone altogether.

Never hurts to ask them via radar, especially if you can point at specific instances where the openmp support would benefit in performance from that feature.

I'm pretty sure Apple's stance is that they don't *want* people to affinitize
processes because they believe people would abuse it.

They are right, people get it wrong, *but* it can make a 2x performance difference when used right,
and sharp knives are useful tools while blunt ones (or no knife at all) are not.

I think it's an issue of target market, though. OpenMP is mostly used in HPC; there
people are pushing for the last 5% performance on a machine where their code is all
that is running. In that environment having fine-grained control and telling the OS
what to do makes sense. But, that's not where Apple is at all...
(There are no Apple machines in the Top500, whereas 482 of them run Linux).

(And, yes, I grant you, it may make sense for some applications on the Mac Pro, though
even those seem only to be single-socket machines so affinity is less critical).

-- Jim

James Cownie <james.h.cownie@intel.com>
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)
Tel: +44 117 9071438