[libc++] Working on the parallel STL algorithms

(I apologize for not including the thread history properly - I was not on the list until recently.)

Once LLVM OpenMP can handle nested parallelism and a few other advanced things properly, all of this might be fun. (We can go down a big list if anyone wants to digress.)
This is why I said we might consider using taskloop :wink: There are other ways of handling nesting as well (colleagues of mine work on one: http://www.bolt-omp.org/), but we should probably start a separate thread on OpenMP and nesting to discuss this aspect of things.

OpenMP tasks are the recommended way to solve the OpenMP nesting/composition problem, but implementations are not ready for this yet. For example, Clang 4.0 generates incorrect code for one of my simple tests of taskloop. However, if the PSTL relies on taskloop, that should have the nice side effect of ensuring that OpenMP implementations support it well.
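
For reference, here is a minimal sketch of the taskloop pattern under discussion; this is illustrative code, not from any PSTL implementation. The reason taskloop composes better than nested parallel-for is that it partitions the loop into tasks executed by the current team rather than spawning a new team of threads.

#include <cstddef>
#include <vector>

// Illustrative taskloop-based parallel loop: one thread (single)
// creates the tasks, and the whole team executes them.
void scale(std::vector<double>& v, double a) {
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp taskloop
    for (std::size_t i = 0; i < v.size(); ++i) {
      v[i] *= a;
    }
  }
}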

For what it’s worth, Intel 18 implements PSTL using TBB instead of OpenMP to solve the composition problem (the compiler uses the OpenMP simd pragma for vectorization). I have apples-to-apples performance tests with PSTL, TBB, OpenMP, etc. if anyone is interested (it’s probably appropriate for out-of-band discussion).
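
To make the interface concrete, here is roughly what the PSTL calls in question look like. This is a minimal sketch using the C++17 std::execution spelling; Intel's 18 beta, if I recall correctly, spells the policies pstl::execution instead. The par_unseq policy permits both threading (which Intel 18 maps to TBB) and vectorization (the OpenMP simd pragma).

#include <algorithm>
#include <execution>
#include <vector>

int main() {
  std::vector<double> x(1000, 1.0), y(1000);
  // par_unseq allows the implementation to both thread and
  // vectorize this loop; how it does so is a backend detail.
  std::transform(std::execution::par_unseq,
                 x.begin(), x.end(), y.begin(),
                 [](double v) { return 2.0 * v; });
  return 0;
}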

Jeff

Hello Jeff,

Would you mind sharing the results, please?

Thank you,
Serge.

Hi Serge,

The code I’m using for these evaluations is https://github.com/ParRes/Kernels. These tests are oriented towards numerical simulation and only use std::for_each and std::transform.
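
Since the kernels iterate over 2D grids, the natural question is how a grid loop maps onto std::for_each. One common pattern, shown below as a hypothetical sketch rather than the actual PRK code, is to run for_each over a materialized index range:

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
  const int n = 1000;
  std::vector<double> grid(n * n, 0.0);
  // Materialize the outer-loop indices, then parallelize over
  // them; each invocation of the body handles one row.
  std::vector<int> rows(n);
  std::iota(rows.begin(), rows.end(), 0);
  std::for_each(std::execution::par, rows.begin(), rows.end(),
                [&](int i) {
                  for (int j = 0; j < n; ++j) {
                    grid[i * n + j] = static_cast<double>(i + j);
                  }
                });
  return 0;
}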

I gathered data last week, but I need to run new tests since I added and changed quite a bit this week. I’ve included documentation at the bottom of this message describing how others can perform experiments, since I will not have time to produce results until next week.

A few caveats:

  • The PRK project is designed to be language-agnostic and to contain high-quality reference implementations. The C++ implementations are closely aligned with those in Python, Fortran 2008, etc., which means they are less idiomatic than they could be.
  • The code is not tuned for any particular architecture, although heavy testing on Intel Haswell processors may mean unconscious bias towards that architecture. In most cases, we thread the outer loop and SIMD-ize the inner loop, because the results with _Pragma("omp for simd collapse(2)") are usually worse and run into more compiler bugs (see the sketch after this list).
  • Loop blocking helps stencil and transpose, but is not implemented uniformly. I am working to fix this.
  • The p2p kernel is difficult to implement in some models, because it lacks traditional data parallelism. I discourage drawing conclusions from the implementations for now.
  • I am not an expert C++ programmer. However, my colleagues who are experts in TBB and PSTL reviewed that code and I’ve implemented their suggestions.
  • Most of the PRK code has been written by Intel employees, but it is a research project. The PRKs are not intended to be hardware benchmarks, nor does the presence of an implementation in a given programming model constitute an endorsement of its use (e.g. CUDA).
  • This is all a work-in-progress. All of the C++ code is new in the last month and will continue to evolve rapidly for at least another month.
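
Here is a minimal sketch of the threading/SIMD split mentioned in the second bullet above; it is illustrative code in the shape of a stencil, not the PRK stencil kernel itself:

#include <cstddef>
#include <vector>

// Thread the outer loop, SIMD-ize the inner loop: the pattern
// that usually beats collapsing both loops into one construct.
void smooth(std::vector<double>& out,
            const std::vector<double>& in, std::size_t n) {
  #pragma omp parallel for
  for (std::size_t i = 1; i < n - 1; ++i) {
    #pragma omp simd
    for (std::size_t j = 1; j < n - 1; ++j) {
      out[i * n + j] = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j]
                             + in[i * n + j - 1] + in[i * n + j + 1]);
    }
  }
}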

Please create GitHub issues or email me privately to suggest improvements. Contributions via pull request are also welcome.

Jeff

Grab the code via git or wget:

git clone https://github.com/ParRes/Kernels.git
wget https://github.com/ParRes/Kernels/archive/master.zip

We do not use a fancy build system. The critical file is ./common/make.defs, for which we provide examples for all supported toolchains.

Assuming you want to test Clang, start with the LLVM example and modify it appropriately. The Intel and GCC examples are up-to-date, but others may be stale. Many of the example options come from my Mac+Homebrew setup.

cd common && cp make.defs.llvm make.defs
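
The edits usually amount to pointing the compiler variables at your toolchain. The fragment below is a hypothetical illustration; the variable names are assumptions, so confirm them against the make.defs.llvm template in your checkout:

# Illustrative make.defs fragment (variable names are
# assumptions; check make.defs.llvm for the real ones).
CXX=clang++ -std=c++1z
OPENMPFLAG=-fopenmp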

We have scripts to build all of the dependencies for Travis in ./travis, which you can adapt as appropriate.

If you want to test GCC parallel STL, you need 7.2+ because of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81221.
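
For context, GCC does not ship the C++17 execution policies at this point, so "GCC parallel STL" here refers (as far as I know) to libstdc++'s parallel mode. Below is a minimal sketch of one way to invoke it directly; whether the PRK build uses this spelling or the blanket -D_GLIBCXX_PARALLEL switch, check the Makefile.

// Compile with: g++ -fopenmp example.cc
#include <parallel/algorithm>  // libstdc++ parallel mode
#include <vector>

int main() {
  std::vector<double> v(1000, 1.0);
  // Explicitly parallel overload; defining _GLIBCXX_PARALLEL
  // instead redirects plain std::for_each calls to this backend.
  __gnu_parallel::for_each(v.begin(), v.end(),
                           [](double& x) { x *= 2.0; });
  return 0;
}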

I use https://github.com/jeffhammond/HPCInfo/blob/master/buildscripts/gcc-git.sh to build GCC on Linux.

If you want to test Intel 18 PSTL in beta, it’s freely available from https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2018-beta

The code of interest to you is in ./Cxx11. Some of the tests do not build because of compiler bugs, so ignore errors for simplicity.

cd ./Cxx11 && make -k

All of the programs try to be self-documenting when provided no arguments. Below is an example.

When the documentation isn’t perfect, the source should be clear.

Cxx11 $ ./stencil-vector-pstl
Parallel Research Kernels version 2.16
C++17/Parallel STL Stencil execution on 2D grid
Usage: <# iterations> <grid size> [<star/grid> <radius>]

10 iterations and a matrix of rank 1000 is a good starting point, but that is too few iterations on a noisy system, and matrices that fit in cache give artificially good results.

Cxx11 $ ./stencil-vector-pstl 10 1000
Parallel Research Kernels version 2.16
C++17/Parallel STL Stencil execution on 2D grid
Number of iterations = 10
Grid size = 1000
Type of stencil = star
Radius of stencil = 2
Solution validates
Rate (MFlops/s): 3147.89 Avg time (s): 0.0059876

Programs will not print performance data if the results are not correct. Our correctness testing isn’t perfect, but we are able to find many compiler/runtime bugs with it.

This is a more interesting test, showing the additional options.

$ ./stencil-vector-tbb 30 8000 200 grid 4
Parallel Research Kernels version 2.16
C++11/TBB Stencil execution on 2D grid
Number of iterations = 30
Grid size = 8000
Tile size = 200
Type of stencil = grid
Radius of stencil = 4
Solution validates
Rate (MFlops/s): 11218 Avg time (s): 0.928072