Parallel STL

In the “if (omp_in_parallel())” part, OpenMP worksharing constructs such as sections cannot be used. The problem is that worksharing constructs cannot be nested, and it’s quite likely that the outer parallel region already runs one. Therefore, tasks are the only option.

In the “else” part, the parallel region is created by PSTL, so worksharing can be used; tasks can be used as well.

If the same approach (tasks) is used in both cases, repetition can be avoided if the relevant code is separated into a function.

Thank you for all the advice. I have made a second pass on this same function. Does this seem to be a more idiomatic approach for OpenMP?

//------------------------------------------------------------------------
// parallel_invoke
//------------------------------------------------------------------------

template <typename _F1, typename _F2>
void
__parallel_invoke_body(_F1&& __f1, _F2&& __f2)
{
    _PSTL_PRAGMA(omp task) { std::forward<_F1>(__f1)(); }
    _PSTL_PRAGMA(omp task) { std::forward<_F2>(__f2)(); }
}

template <class _ExecutionPolicy, typename _F1, typename _F2>
void
__parallel_invoke(_ExecutionPolicy&&, _F1&& __f1, _F2&& __f2)
{
    if (omp_in_parallel())
    {
        __parallel_invoke_body(std::forward<_F1>(__f1), std::forward<_F2>(__f2));
    }
    else
    {
        _PSTL_PRAGMA(omp parallel)
        _PSTL_PRAGMA(omp single)
        __parallel_invoke_body(std::forward<_F1>(__f1), std::forward<_F2>(__f2));
    }
}

Thank you so much for this. I have studied more of the implementation and I think I now see how it performs dispatch based on the template arguments down to the backend. Everything now compiles and runs, although some of the tests are failing. I’m currently trying to understand the failing tests in order to debug them. I do appreciate your time and help!

It looks right to me.

Regards,

  • Alexey

Hello friends,

I have been working on the OpenMP backend for the parallel STL, and most of the tests are passing. However, among the failures is the “is_partitioned” test. I have rewritten the __parallel_reduce backend function to be simpler to understand in an attempt to understand what is failing (code is below.)

I also rewrote it as a serial function that splits the iteration range in two and then calls __reduction() on each half of the range being passed in. The result I get from the serial execution as compared to the result I get from the parallel execution is different.

I have verified that the parallel execution tasks are run, and that their results match what each serial execution would be if I ran them that way.

I am wondering if there is something wrong with the way OpenMP is running the reducer here? Perhaps it is injecting a value into the computation that is unexpected for this algorithm? Does anything jump out at anyone as being suspicious?

Thank you again for your time and assistance!

template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
_Value
__parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
                       _RealBody __real_body, _Reduction __reduction)
{
    std::size_t __item_count = __last - __first;
    std::size_t __head_items = (__item_count / __default_chunk_size) * __default_chunk_size;

    // We should encapsulate a result value and a reduction operator since we
    // cannot use a lambda in OpenMP UDR.
    using _CombinerType = __pstl::__internal::_Combiner<_Value, _Reduction>;
    _CombinerType __result{__identity, &__reduction};
    _PSTL_PRAGMA_DECLARE_REDUCTION(__combiner, _CombinerType)

    // To avoid over-subscription we use taskloop for the nested parallelism
    //_PSTL_PRAGMA(omp taskloop reduction(__combiner : __result))
    for (std::size_t __i = 0; __i < __item_count; __i += __default_chunk_size)
    {
        auto __begin = __first + __i;
        auto __end = __i < __head_items ? __begin + __default_chunk_size : __last;
        __result.__value = __real_body(__begin, __end, __identity);
    }

    return __result.__value;
}    

Hi Christopher,

Yes, “is_partitioned” algo implementation is based on a reduction parallel pattern.

And it looks that a binary operation (combiner) is not commutative.

In general, “reduction” algorithm requires a commutative binary operation. And OpenMP reduction requires that.

For TBB backend it works because TBB parallel reduction algorithm doesn’t require a commutative binary operation.

We (me or Evgeniy) will check that hypo and inform you.

Best regards,

Mikhail Dvorskiy

Thank you. I wondered if you had an update on this. I’ve done some further looking, and I think that is correct. I’ve tried to find example implementations of performing reductions with openmp that don’t require a commutative operator. It seems like rewriting the is_partioned algorithm to provide a commutative operator might be a larger / undesirable change.

Do you have any guidance on manually writing a task loop in openmp that performs the reduction without requiring commutativity?

Thanks!

-={C}=-

Hello again,

I was able to rewrite the parallel_reduce function in a way that works without using OpenMP’s reducer. I have a couple of questions:

  1. I use a vector to gather the intervening results for later reduction. Is there any problem depending on vector here?
  2. I can see that it might make sense to build a taskloop for the actual reduction if the number of chunks is quite large. Is that something that I should look into more?

The code is below. Please let me know if you have any questions or concerns.

//------------------------------------------------------------------------
// parallel_reduce
//------------------------------------------------------------------------

template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
_Value
__parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
                       _RealBody __real_body, _Reduction __reduction)
{
    std::size_t __n_chunks{0}, __chunk_size{0}, __first_chunk_size{0};
    __chunk_partitioner(__first, __last, __n_chunks, __chunk_size, __first_chunk_size);

    std::vector<_Value> __values(__n_chunks);

    // To avoid over-subscription we use taskloop for the nested parallelism
    _PSTL_PRAGMA(omp taskloop shared(__values))
    for (std::size_t __chunk = 0; __chunk < __n_chunks; ++__chunk)
    {
        auto __this_chunk_size = __chunk == 0 ? __first_chunk_size : __chunk_size;
        auto __index = __chunk == 0 ? 0 : (__chunk * __chunk_size) + (__first_chunk_size - __chunk_size);
        auto __begin = __first + __index;
        auto __end = __begin + __this_chunk_size;
        __values[__chunk] = __real_body(__begin, __end, __identity);
    }

    auto __result = __values.front();
    for (auto p = __values.begin() + 1; p != __values.end(); ++p)
    {
        __result = __reduction(__result, *p);
    }

    return __result;
}

Hi Christopher,

I’ve double check the code of __pattern_is_partitioned (which is based on the reduction parallel pattern). Yes, a binary operation is not commutative. So, my hypo was right.

Generally speaking the writing “manually” reduction pattern w/o OpenMP s reducer is not good approach due to it may be not effective. Indeed, if we consider your example – the second loop (for) combines the results in serial mode, and std::vector brings additional overheads…

Once more, OpenMP reduction requires commutative binary operation and it is right. With PSTL design perspective an algorithm pattern should not rely on a fact that a parallel reduction pattern (which is provided by a parallel backend) support a non-commutative binary operation. So, it is an issue of __pattern_is_partitioned and we will fix it. So while I would suggest don’t worry about that.

Best regards,

Mikhail Dvorskiy

Great, thanks!

I appreciate it!

Hi Christopher,

Please find the attachment - a patch which fixes the mentioned problem with is partitioned algo. I’ve not yet promoted it into LLVM repo, but you can check it now with OpenMP backend.

Best regards,

Mikhail Dvorskiy

is_partitioned_commutative_bi_op.patch (203 KB)

Hi Mikhail,

I tried to apply the patch today, but it doesn’t apply cleanly. May I ask what version of the repo I should use? After I apply it (skipping a hunk that doesn’t apply) I get some compile errors:

/home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1780:13: error: unknown type name ‘_ReduceRes’
_ReduceRes __table[] = {__broken, __broken, __broken, __broken, __broken, __all_true,
^
/home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1783:13: error: use of ‘_ReduceType’ with tag type that does not match previous declaration
struct _ReduceType
^
/home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1770:18: note: previous use is here
enum _ReduceType
^
/home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1802:54: error: use of undeclared identifier ‘__val1’; did you mean ‘__value’?
return _ReduceType{__broken, __val1.__pos};
^~~~~~
__value
/home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1799:73: note: ‘__value’ declared here
_ReduceType __value) → _ReduceType {

I was able to puzzle out the patch, and it seems to compile now. I’ll rewrite the reducer and see how it goes. Thanks!

Hi Mikhail,

I used the patch, but the unit tests are failing. Maybe I applied the patch badly, or perhaps I did something wrong in the reducer. I rewrote the parallel reduce function as:

template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
_Value
__parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
                       _RealBody __real_body, _Reduction __reduction)
{
    std::size_t __n_chunks{0}, __chunk_size{0}, __first_chunk_size{0};
    __omp_backend::__chunk_partitioner(__first, __last, __n_chunks, __chunk_size, __first_chunk_size);

    using _CombinerType = __pstl::__internal::_Combiner<_Value, _Reduction>;
    _CombinerType __result{__identity, &__reduction};
    _PSTL_PRAGMA_DECLARE_REDUCTION(__combiner, _CombinerType)

    // To avoid over-subscription we use taskloop for the nested parallelism
    _PSTL_PRAGMA(omp taskloop reduction(__combiner : __result))
    for (std::size_t __chunk = 0; __chunk < __n_chunks; ++__chunk)
    {
        auto __this_chunk_size = __chunk == 0 ? __first_chunk_size : __chunk_size;
        auto __index = __chunk == 0 ? 0 : (__chunk * __chunk_size) + (__first_chunk_size - __chunk_size);
        auto __begin = __first + __index;
        auto __end = __begin + __this_chunk_size;
        __result.__value = __real_body(__begin, __end, __identity);
    }

    return __result.__value;
}

Hi Christopher,

I will try to investigate what is wrong…

Could you please send me a your merged version of algorithm_impl.h? I’ll check it as well.

Best regards,

Mikhail Dvorskiy

It is attached. Thank you!

algorithm_impl.h (171 KB)