RFC: Loop Versioning for unit stride

The idea here is to improve the performance of index calculations for 1D and 2D arrays (as those are far more common than higher-dimensional arrays). It also improves the likelihood that the loop vectorizers in MLIR and LLVM decide to use vector operations.

When a subroutine is passed an array of unknown size, as in Example 1 below, the compiler has to use the array descriptor to determine the stride.

This pass, which works on the generated MLIR code, detects arguments that have unknown size in the first dimension, and checks whether any loops use such an argument. If so, the loop is duplicated and wrapped in, essentially: if (stride(arg) == sizeof(elementtype)) new_loop else old_loop;

The new loop will then use a 1D array as an alias for the 1D or 2D array, which now has a fixed stride that is known when translating to LLVM IR and machine code. This allows better code in the loop, and at least sometimes also allows the loop to be vectorized. A conceptual Fortran sketch of the transformation is shown after Example 1 below.

In the SPEC 2017 roms_r benchmark, I’m seeing about a 6% improvement. There’s still room for more in this benchmark, but it’s definitely a step in the right direction. In an artificial benchmark, the improvement is more along the lines of 30-40%.

There’s a previous discussion on this here:

There is also a GitHub issue here:

A prototype of the code is available here:
https://reviews.llvm.org/D141306
(the build on Debian fails due to some formatting in CLOptions.td)

Example 1:

module func3

contains

subroutine func(a, b, n)
  real*8  :: a(:, :), b(:, :)
  integer :: n
  integer :: i,j

  do i=1, n
     do j=1,n
        a(j,i) = a(j,i) + b(j,i)
     end do
  end do
end subroutine func
end module func3
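
Conceptually, the versioned code the pass generates for Example 1 is equivalent to the hand-written Fortran below. This is only a sketch: the real transformation happens on the FIR/MLIR representation, the guard there compares the descriptor stride of the innermost dimension against the element size rather than testing full contiguity, and the helper subroutine fast with its lda/ldb arguments is a name invented here purely for illustration.

subroutine func_versioned(a, b, n)
  real*8  :: a(:, :), b(:, :)
  integer :: n
  integer :: i, j

  ! Run-time guard, essentially:
  ! if (stride(arg) == sizeof(elementtype)) new_loop else old_loop
  if (is_contiguous(a) .and. is_contiguous(b)) then
     ! Fast path: re-expose the data with an explicit leading
     ! dimension, so the first-dimension stride is a compile-time 1.
     call fast(a, b, n, size(a, 1), size(b, 1))
  else
     ! Slow path: the original loop, with strides taken from the
     ! array descriptors.
     do i = 1, n
        do j = 1, n
           a(j, i) = a(j, i) + b(j, i)
        end do
     end do
  end if

contains

  subroutine fast(a1, b1, n, lda, ldb)
    integer :: n, lda, ldb
    real*8  :: a1(lda, *), b1(ldb, *)
    integer :: i, j

    ! The index calculation is now a plain linear expression, which
    ! makes the loop a much better candidate for the MLIR and LLVM
    ! vectorizers.
    do i = 1, n
       do j = 1, n
          a1(j, i) = a1(j, i) + b1(j, i)
       end do
    end do
  end subroutine fast
end subroutine func_versioned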

Using the following code:

program bm
  use module, only: func1d

  implicit none
  integer, parameter :: size = 4000
  real*8 :: aa(size)
  real*8 :: bb(size)
  real*8 :: cc(size * 2)

  aa = 1
  bb = 2
  cc = 3

  call do_bench("a + b", aa, bb, size)
  call do_bench("a + c", aa, cc, size)
  call do_bench("a + c(2:)", aa, cc(2:), size)
  call do_bench("a + c(::2)", aa, cc(::2), size)
  
contains
  subroutine do_bench(msg, aa, bb, size)
    character(*) :: msg
    integer, parameter :: loops = 250000
    integer:: size
    real*8 :: aa(1:)
    real*8 :: bb(1:)
    
    real*8 :: time, time_start, time_end
    integer::i

    real*8 :: expect
    real*8 :: check

    expect = aa(1) + loops * bb(1)

    call CPU_TIME(time_start)
    do i = 1, loops
       call func1d(aa, bb, size)
    end do
    call CPU_TIME(time_end)
    time = time_end - time_start
    check = checksum(aa, size)
    if (check .ne. expect * size) then
       print *, msg, ": Checksum mismatch for ", check, " Expect:", size * expect
    end if
    print "(A12, F8.5, A2)", msg, time, " s"
  end subroutine do_bench
    
  function checksum(arr, size)
    integer ::size
    real*8  :: arr(1:)
    integer :: i
    real*8 :: checksum
    real*8 :: sum
    sum = 0
    do i = 1, size
       sum = sum + arr(i)
    end do
    checksum = sum
  end function checksum
    
end program bm

With module.f90 containing this:

module module
contains
  subroutine func1d(a, b, n)
    real*8  :: a(:), b(:)
    integer :: n
    integer :: i

    do i = 1, n
       a(i) = a(i) + b(i)
    end do
  end subroutine func1d
end module module

I get the following results when compiling with -Ofast -fno-loop-versioning:

$ ./bm1d-noloop 
       a + b 0.61709 s
       a + c 0.58340 s
   a + c(2:) 0.58243 s
  a + c(::2) 0.63377 s

And with loop versioning enabled (just -Ofast, leaving out -fno-loop-versioning):

$ ./bm1d-loop 
       a + b 0.37326 s
       a + c 0.36045 s
   a + c(2:) 0.36667 s
  a + c(::2) 0.63324 s

That’s roughly a 38% speed-up. As expected, the a + c(::2) case does not improve: that call passes a non-unit-stride array section, so the run-time stride check selects the original loop.

The gain goes down with larger array sizes, because the L1 cache hit ratio goes down.


Did some more benchmarking.

First, I made a 2D version of the benchmark above. On my x86 machine, I get a similar 38% improvement. On an Arm processor, it’s close to a 50% improvement, and the same holds for the 1D benchmark above. (The a + c(2:) case is more like 40%; I haven’t looked into the details, but I suspect the unaligned vector accesses are the problem.)

I also ran SPEC 2017 benchmarks, with the following results.

On x86:
roms_r 4.9% faster
(I only ran this one benchmark on my x86 desktop machine.)

On AArch64:

Benchmark      Change
roms_r         6.1% faster
exchange2_r    no change
cam4_r         no change
fotonik3d_r    no change
bwaves_r       no change
wrf_r          no change
cactuBSSN_r    no change

This has now been submitted and is available for anyone to try out.
It is enabled at the O3 and Ofast optimisation levels, when -fversion-loops-for-stride is given, or when running the pass directly with fir-opt --loop-versioning.
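
For example (the flags and the pass name are as described above; the driver binary name flang-new and the file names are assumptions here):

$ flang-new -Ofast test.f90                           # pass enabled by default at -O3/-Ofast
$ flang-new -O2 -fversion-loops-for-stride test.f90   # opt in at lower optimisation levels
$ fir-opt --loop-versioning test.fir                  # run just this pass on FIR input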


Subsequent patches and fixes provided further benefits in the roms_r SPEC 2017 benchmark. In total, roms_r is up 36% with the versioning patches:
https://reviews.llvm.org/D151140
https://reviews.llvm.org/D151880
