is_partitioned algorithm

Hi friends,

I have rewritten the __parallel_reduce function for OpenMP to avoid using the reduction() clause. I think that OpenMP is not feeding the output values to the reduction in the way that one might expect; I am tracking that problem in a different thread. In any case, running the loop without the reduction clause and then performing an explicit reduction at the end gets much farther.
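For clarity, here is roughly the shape of what I mean. The name and signature are mine, not the real __parallel_reduce: each chunk writes its partial result into its own slot, and a sequential pass afterwards combines the partials with the reducer, instead of relying on an OpenMP reduction() clause.

```cpp
#include <cstddef>
#include <vector>

// Sketch only: parallel loop fills independent per-chunk slots (no
// reduction clause), then a serial loop combines them left to right.
template <typename _Value, typename _ChunkBody, typename _Reducer>
_Value
explicit_chunk_reduce(std::size_t __n, std::size_t __chunk, _Value __init,
                      _ChunkBody __body, _Reducer __reduce)
{
    std::size_t __n_chunks = (__n + __chunk - 1) / __chunk;
    std::vector<_Value> __partials(__n_chunks);

    // Each iteration touches only its own slot, so no clause is needed.
#pragma omp parallel for
    for (std::ptrdiff_t __i = 0; __i < (std::ptrdiff_t)__n_chunks; ++__i)
    {
        std::size_t __first = (std::size_t)__i * __chunk;
        std::size_t __last = __first + __chunk < __n ? __first + __chunk : __n;
        __partials[__i] = __body(__first, __last); // brick on [__first, __last)
    }

    // Explicit, ordered reduction over the per-chunk results.
    _Value __result = __init;
    for (std::size_t __i = 0; __i < __n_chunks; ++__i)
        __result = __reduce(__result, __partials[__i]);
    return __result;
}
```

The key property is that the combine order is deterministic, which makes the chunk-sensitivity below easier to reason about.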

In this thread, I would like to talk about the is_partitioned algorithm. Briefly, I can see that there is a main body which performs some processing and then returns a result which is a value in a lookup table.

The brick body for is_partitioned will return one of:

enum _ReduceType
{
    __not_init = -1,
    __broken,
    __all_true,
    __all_false,
    __true_false
};

In the case I am trying to debug there are 521 elements. The simple, naive chunk partitioner I wrote breaks this up into two chunks: 0-511 and 512-521.
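For concreteness, here is roughly what I mean by the naive partitioner. This is my own reconstruction, reading the chunk bounds as half-open ranges:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical reconstruction of the naive chunk partitioner: split
// [0, n) into fixed-size chunks of at most `chunk` elements, so 521
// elements with chunk = 512 become [0, 512) and [512, 521).
std::vector<std::pair<std::size_t, std::size_t>>
naive_chunks(std::size_t n, std::size_t chunk)
{
    std::vector<std::pair<std::size_t, std::size_t>> out;
    for (std::size_t first = 0; first < n; first += chunk)
        out.emplace_back(first, first + chunk < n ? first + chunk : n);
    return out;
}
```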

The first chunk comes back as “2” (__all_true) and the second as “3” (__all_false), if I am reading the enum right. The reducer is then called as __reduction(2, 3), which performs the following table lookup:

new_state = table[left_state * 4 + right_state]

In this case, it looks like it results in table entry 11:

_ReduceType __table[] = {
    __broken, __broken,   __broken,     __broken,
    __broken, __all_true, __true_false, __true_false,
    __broken, __broken,   __all_false,  __broken,
    __broken, __broken,   __true_false, __broken};

Which turns out to be __broken. However, when I run the brick body on the entire range at once, it results in “3”. This seems to indicate that the algorithm is very sensitive to how the range is chunked. I looked at the TBB implementation and it seems to create a blocked range with a grainsize of 1. That seems wrong, since the scheduling overhead would be enormous, so clearly there is something going on that I don’t understand.
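To double-check the arithmetic, here is a standalone reproduction of the lookup. The enum completion is my assumption: the declaration above only shows __not_init, and I am assuming the remaining enumerators follow in the order __broken, __all_true, __all_false, __true_false, which is consistent with __reduction(2, 3) landing on entry 11.

```cpp
// Standalone reproduction of the reducer's table lookup.
enum _ReduceType
{
    __not_init = -1,
    __broken,     // 0
    __all_true,   // 1
    __all_false,  // 2
    __true_false  // 3
};

_ReduceType __table[] = {
    __broken, __broken,   __broken,     __broken,
    __broken, __all_true, __true_false, __true_false,
    __broken, __broken,   __all_false,  __broken,
    __broken, __broken,   __true_false, __broken};

_ReduceType
__reduction(_ReduceType __left, _ReduceType __right)
{
    // Row = left chunk's state, column = right chunk's state.
    return __table[__left * 4 + __right];
}
```

With these values, combining states 2 and 3 indexes entry 2 * 4 + 3 = 11, which is indeed __broken.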

However, since the reduction of two arbitrary chunks of a range appears to give an answer that is different from the reduction of a symmetric split, it seems like that may be a requirement. Is that true? Does __parallel_reduce need to split the range evenly in order to perform correctly?