Dear all,

Yesterday, while looking at a customer’s code, I observed a relatively simple case where Polly is unable to detect a parallelizable outer loop. Consider two versions of a Fortran routine:

subroutine filter(H, szh, X, szx, Y, szy)

  implicit none
  integer(kind=IKIND), intent(in) :: szh, szx, szy
  real(kind=RKIND), intent(in) :: H(szh), X(szx)
  real(kind=RKIND), intent(out) :: Y(szy)
  integer(kind=IKIND) :: i, j, i1, i2
  integer, parameter :: rkind = RKIND

  do i = 1, szy
    Y(i) = 0.0_rkind
    do j = 1, szh
      Y(i) = Y(i) + X(i + j - 1) * H(j)
    enddo
  enddo

end subroutine filter

subroutine filter(H, szh, X, szx, Y, szy)

  implicit none
  integer(kind=IKIND), intent(in) :: szh, szx, szy
  real(kind=RKIND), intent(in) :: H(szh), X(szx)
  real(kind=RKIND), intent(out) :: Y(szy)
  integer(kind=IKIND) :: i, j, i1, i2
  integer, parameter :: rkind = RKIND

  do i = 1, szx - szh
    Y(i) = 0.0_rkind
    i1 = i
    i2 = i + szh - 1
    Y(i) = Y(i) + sum(X(i1:i2) * H)
  enddo

end subroutine filter

The only difference between the two versions is how the inner reduction loop is implemented: in the first case it is explicit, while in the second case it uses the sum intrinsic, which DragonEgg lowers to plain code. For the first version Polly successfully detects the parallel outer loop; for the second it cannot.

Now let’s see how the IR looks upon arrival at the beginning of Polly’s pass manager in KernelGen: 1.ll and 2.ll (attached). As far as I can see, the only difference between the two IRs is that in 1.ll the memory location holding the accumulated element is updated after each fma in the inner loop, while in 2.ll the accumulator is updated only once, after the end of the inner loop, producing a PHI node (or reg2mem allocas, with Polly’s preopt passes). Interestingly, if you try to optimize 1.ll further, it turns into 2.ll as well.

After some attempts to modify Polly’s behavior, I concluded that it cannot tolerate the presence of a non-IV PHI node. Polly attempts to produce independent basic blocks, gathering dependent instructions into a single block and allocating scratch variables. Allocas make the situation even worse, because they are always placed in the first block of the function, completely killing the potential parallelism.

Do you see any possible workarounds for this case? Maybe the allocas could be placed inside the inner loop, to give it a chance to be parallel?

Thanks,

- D.

1.ll (2.38 KB)

2.ll (2.07 KB)