Parallel STL

Hello friends,

I have spent some time looking at the mailing list archives and git logs for the parallel STL. I’m not clear on what state it is in, since oneAPI/TBB seems to be production ready and comes with a parallel STL. Also, it appears that GCC has shipped a PSTL based on the same code that Clang is using.

I was wondering if someone could clarify for me what state the PSTL is in, and if there is some work needed to help get it over the finish line I may be able to help. I’m very interested in using it in our production software, so I’m a motivated helper. :slight_smile:

Thank you for your time,
-={C}=-

Hi,

Long story short, the PSTL is pretty much ready to be shipped with LLVM. I did the integration between it and libc++, and it all worked last time I checked. I think the next step would be to change whatever LLVM scripts are used to create releases to also install the PSTL, which is the part I haven't had time to look into yet.

That being said, the PSTL will then default to using the Serial backend, which isn't very useful. We could decide to ship a different backend if we wanted, however I think what makes sense is to use a backend specific to the platform we're running on instead of adding a dependency to LLVM.

Louis

Okay, that makes sense. I can see how you might want to use Grand Central Dispatch on macOS, and the Windows system thread pool on Windows. I’m not really sure what that means for Linux, though. Other than maybe pthreads, which is not great.

Is there any documentation on what is needed to create a backend? Or are there perhaps already plans in motion? I don’t want to step on any toes, but I would love to have a usable pstl on macOS and Linux for the next LLVM release. We use libc++ on Linux as well as macOS. Depending on what’s involved, I might be able to contribute a backend for those two platforms.

  • Mikhail, who wrote most of the PSTL

You’re not stepping on any toes, far from it. If we have backends with satisfactory performance and we’re confident about ABI stability, I don’t see a reason why we wouldn’t ship the PSTL as soon as we have them. One big issue with shipping it so far has been that the only backends are the serial one (not great to ship that) and one that relies on an external dependency (TBB).

Mikhail might be able to provide documentation; we should check it into the PSTL repository. I meant to write such documentation when I wrote the serial backend, but never got around to producing something good enough to check in. You can see the minimal API needed to implement a backend in pstl/include/pstl/internal/parallel_backend_serial.h. It’s the serial backend, which tries to be as trivial as possible.

Are you familiar with contributing to libc++? If so, contributing to the PSTL works basically the same way: just send a Phabricator review and I’ll review it. We can also chat on Slack in the Cpplang workspace, where I can give some guidance – look for “ldionne”.

Cheers,
Louis

Fantastic. I will study the serial backend and see what I can do!

Hi Christopher,

One good way to contribute, I think, is to develop an OpenMP-based parallel backend. LLVM already supports OpenMP, so it resolves the dependency problem Louis mentioned. While it’s arguably not the best default engine in the long term, there is certainly some demand for it, and the GCC community is also interested in it. Moreover, Mikhail and the team at Intel, in collaboration with Thomas (CC’d) from GCC, already developed a basic prototype: https://reviews.llvm.org/D70530, but further work was postponed. If you are interested in continuing it, you are more than welcome, and we will help with guidance and feedback.

Regards,

  • Alexey

Hi Christopher,

Please find attached the Parallel Backend API documentation.

Don’t hesitate to ask questions about the API or the OpenMP backend prototype mentioned above.

Parallel_backend_API.docx (19.1 KB)

I am currently working on a new backend for GCC which is neither TBB nor OpenMP, and which will support both the PSTL and (presumably) C++23 Executors.

It would be really awesome if all of this could be done on Phabricator, so we can coordinate the various efforts.

Louis

That’s very exciting to hear! I was thinking of taking up the stalled OpenMP implementation, but if you are a good way along then it may not make any sense to do both.

On the other hand, if you feel it might be a year or more before you have something production ready, it might make sense to try and finish up the OpenMP version. What do you think?

It is worth having a detailed discussion of what is meant by the OpenMP version. If one maps exec::par onto omp parallel for, nested loops will be transformed into the OpenMP nested-parallelism anti-pattern (or one has to check omp_in_parallel and generate two code paths every time). One of the reasons TBB is a better backend is that it composes with itself better than OpenMP parallel for does. OpenMP tasking should compose better, but it does not perform as well as omp parallel for in the simple cases. (I have performance data comparing all of these.)

If supporting nested parallel loops is a non-goal for PSTL, then my comments can be ignored.

GCD is likely a good backend on macOS, especially since Apple Clang doesn’t support OpenMP.

Jeff

It will definitely be more than a year out for any sort of production-ready thread pool implementation (it needs to cover more than just the PSTL use case).

Tom.

Christopher Nelson writes:

Louis Dionne writes:

It would be really awesome if all of this could be done on Phabricator, so we can coordinate the various efforts.

I'm not sure how that would jibe with the GNU-ness of the license for my thread pool work.

I am looking over the OpenMP code, and it does seem to handle the nested parallel region problem correctly.

What is the simple case that omp-parallel-for works better on? Perhaps that can be detected and preferred in these cases? I am by no means an OpenMP expert.

I can see that the TBB folks have done a lot of work to make it easier to take TBB as a dependency. For us there is still some hesitancy around embracing a large dependency like that, for a variety of reasons. I have a number of business problems that can be drastically improved with even basic parallelization. So while it would be great to eke out every erg of efficiency, having something that allows our developers to trivially use all the cores in the machine in even basic scenarios would be a huge help. :slight_smile: And it would be nice if it were standardized, and came with our compiler… :slight_smile:

Christopher,

I’m very much with you on the dependencies question. I think the PSTL we end up shipping with libc++ needs to be free of third-party dependencies. If you’re interested, I would very much encourage you to work on an OpenMP backend to the PSTL and I can help review it.

It looks like Mikhail already has an implementation here: https://reviews.llvm.org/D70530. Maybe we just need to give it a small push to make it ready?

Louis

Sounds good to me! I’ve been looking it over. It looks fairly complete. I’ll see what I can do to finish it up.

Yes, indeed OpenMP tasking is used so it will compose well. Below are some measurements I took a while ago (2018 - https://www.ixpug.org/images/docs/KAUST_Workshop_2018/IXPUG_Invited2_Hammond.pdf). This compares a bunch of different C++ parallel loop abstractions (https://github.com/ParRes/Kernels/tree/default/Cxx11 contains all the code).

I suppose I should measure again but some of the performance limitations of OpenMP tasking are baked into the semantics and cannot be improved upon by a compliant implementation.

Jeff

I downloaded the OpenMP parallel patch and modified the parallel backend header and the CMake files. Then I tried to build the project with:

cmake -G Ninja -DLLVM_ENABLE_PROJECTS="libcxx;libcxxabi;pstl" -DLIBCXX_ENABLE_PARALLEL_ALGORITHMS=ON -DPSTL_PARALLEL_BACKEND=omp ../llvm

This seems to turn on all the right stuff, but I get a surprising error at the end that I’m not sure what to do with:

CMake Error at /home/csnelson/projects/llvm-project/pstl/CMakeLists.txt:34 (add_library):
  INTERFACE_LIBRARY targets may only have whitelisted properties. The
  property "OUTPUT_NAME" is not allowed.

Any guidance would be helpful here. A full dump of the cmake output follows:

-- clang project is disabled
-- clang-tools-extra project is disabled
-- compiler-rt project is disabled
-- debuginfo-tests project is disabled
-- libc project is disabled
-- libclc project is disabled
-- libcxx project is enabled
-- libcxxabi project is enabled
-- libunwind project is disabled
-- lld project is disabled
-- lldb project is disabled
-- mlir project is disabled
-- openmp project is disabled
-- parallel-libs project is disabled
-- polly project is disabled
-- pstl project is enabled
-- flang project is disabled
-- Could NOT find LibXml2 (missing: LIBXML2_LIBRARY LIBXML2_INCLUDE_DIR)
-- Native target architecture is X86
-- Threads enabled.
-- Doxygen disabled.
-- Go bindings enabled.
-- Ninja version: 1.10.0
-- Could NOT find OCaml (missing: OCAMLFIND OCAML_VERSION OCAML_STDLIB_PATH)
-- OCaml bindings disabled.
-- LLVM host triple: x86_64-unknown-linux-gnu
-- LLVM default target triple: x86_64-unknown-linux-gnu
-- Building with -fPIC
-- Constructing LLVMBuild project information
-- Targeting AArch64
-- Targeting AMDGPU
-- Targeting ARM
-- Targeting AVR
-- Targeting BPF
-- Targeting Hexagon
-- Targeting Lanai
-- Targeting Mips
-- Targeting MSP430
-- Targeting NVPTX
-- Targeting PowerPC
-- Targeting RISCV
-- Targeting Sparc
-- Targeting SystemZ
-- Targeting WebAssembly
-- Targeting X86
-- Targeting XCore
-- Parallel STL uses the omp backend
-- Libc++abi will be using libc++ includes from /home/csnelson/projects/llvm-project/libcxxabi/../libcxx/include
-- Looking for cxxabi.h in /home/csnelson/projects/llvm-project/libcxx/../libcxxabi/include
-- Looking for cxxabi.h in /home/csnelson/projects/llvm-project/libcxx/../libcxxabi/include - found
-- Looking for __cxxabi_config.h in /home/csnelson/projects/llvm-project/libcxx/../libcxxabi/include
-- Looking for __cxxabi_config.h in /home/csnelson/projects/llvm-project/libcxx/../libcxxabi/include - found
-- Registering Bye as a pass plugin (static build: OFF)
-- Failed to find LLVM FileCheck
-- Version: 0.0.0
-- Performing Test HAVE_THREAD_SAFETY_ATTRIBUTES -- failed to compile
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
-- Performing Test HAVE_POSIX_REGEX -- success
-- Performing Test HAVE_STEADY_CLOCK -- success
-- Configuring done
CMake Error at /home/csnelson/projects/llvm-project/pstl/CMakeLists.txt:34 (add_library):
  INTERFACE_LIBRARY targets may only have whitelisted properties. The
  property "OUTPUT_NAME" is not allowed.
-- Generating done
CMake Generate step failed. Build files cannot be regenerated correctly.

Hello,

I have followed the advice about taking over the review above, and have gotten to a place where I’m working on getting the existing code to compile cleanly. A few functions were not implemented, so I have forwarded them to the serial backend for now. Just to get compilation working.

I have a few questions:

  1. I notice that neither the TBB backend nor the existing OpenMP backend code evaluates the execution policy to decide what to do. I may have misunderstood Louis Dionne, but it appears that the “sequential” mode is not handled at all if the user requests it. That seems wrong, so I must be missing something. I also notice that the execution policies are not enums; they are objects. However, when I try to overload on them in order to specialize for sequential, I get a compile error saying that the types are not fully defined. What is the design expectation for handling the different execution policies?

  2. The existing code refers to a type:

using _CombinerType = __pstl::__internal::_Combiner<_Value, _Reduction>;

_CombinerType __result{__identity, &__reduction};

However, this type does not exist in __pstl::__internal, at least so far as I can tell. Also, the D70530 code dump does not contain a definition of that object. Has this migrated? Should I provide my own implementation of it?

  3. I have tried to implement a very, very simple function:
template <class _ExecutionPolicy, typename _F1, typename _F2>
void __parallel_invoke(_ExecutionPolicy &&, _F1 &&__f1, _F2 &&__f2) {
    if (omp_in_parallel()) {
        _PSTL_PRAGMA(omp sections) {
            _PSTL_PRAGMA(omp section)
            std::forward<_F1>(__f1)();
            _PSTL_PRAGMA(omp section)
            std::forward<_F2>(__f2)();
        }
    } else {
        _PSTL_PRAGMA(omp parallel)
        _PSTL_PRAGMA(omp sections) {
            _PSTL_PRAGMA(omp section)
            std::forward<_F1>(__f1)();
            _PSTL_PRAGMA(omp section)
            std::forward<_F2>(__f2)();
        }
    }
}

Does this look sane? I have just started reading through the OpenMP documentation. This looks like it could be correct, but there is also the “omp task” directive, and it’s not clear which of these is superior in this case. Also, this seems awfully repetitive. Is this just OpenMP?

Thanks!

Hi Christopher,

Briefly, about the Parallel STL design, and execution policy handling in particular:

The Parallel STL design is based on a patterns-and-bricks approach and has a compile-time dispatching mechanism based on overload resolution over a pair of type tags: is_parallel and is_vector. The combinations of these tags give the four execution policies: seq, par, unseq, and par_unseq. A parallel backend doesn’t handle the passed execution policy; that parameter may be useful for some special backends, but it doesn’t matter for the OpenMP backend. (See include/pstl/internal/parallel_backend.h for more details about OpenMP backend dispatching.)

In other words, the implementation of each PSTL algorithm is based on two patterns: a parallel one (chosen when the is_parallel tag is set) and a serial one; the is_vector tag then selects which brick the pattern calls.

Each parallel pattern may call either the serial brick or the vector (unsequenced) brick; this yields the par and par_unseq policy implementations.

Each serial pattern may likewise call either the serial brick or the vector (unsequenced) brick; this yields the seq and unseq policy implementations.

Yes, we forgot to add the definition of _Combiner to this review. In that prototype it was moved to a utility file and another namespace, but that doesn’t matter. Currently you can find _Combiner in https://github.com/llvm/llvm-project/blob/master/pstl/include/pstl/internal/unseq_backend_simd.h

In the omp_in_parallel case, you should use the task API instead of sections to avoid oversubscription. A task doesn’t create a new thread; it is added to a task pool and may be executed by the first free thread from the thread pool.

In the else branch, I think it would be preferable to use the task API as well, for better workload balance.

P.S. Adding @Pavlov, Evgeniy, who wrote the OpenMP backend prototype.

Best regards,

Mikhail Dvorskiy