Area for improvement

I noticed that fourinarow is one of the programs in which LLVM is much slower than GCC, so I decided to take a look and see why that is so. The program has many loops that look like this:

#define ROWS 6
#define COLS 7

void init_board(char b[COLS][ROWS+1])
{
  int i,j;

  for (i=0;i<COLS;i++)
    for (j=0;j<ROWS;j++)
      b[i][j]='.';

  for (i=0;i<COLS;i++)
    b[i][ROWS]=0;
}

This generates the following X86 code:

.text
.align 16
.globl init_board
.type init_board, @function
init_board:
subl $4, %esp
movl %esi, (%esp)
movl 8(%esp), %eax
movl $0, %ecx
.LBBinit_board_1: # loopexit.1
imull $7, %ecx, %edx
movl %eax, %esi
addl %edx, %esi
movb $46, (%esi)
imull $7, %ecx, %edx
movl %eax, %esi
addl %edx, %esi
leal 1(%esi), %edx
movb $46, (%edx)
imull $7, %ecx, %edx
movl %eax, %esi
addl %edx, %esi
leal 2(%esi), %edx
movb $46, (%edx)
imull $7, %ecx, %edx
movl %eax, %esi
addl %edx, %esi
leal 3(%esi), %edx
movb $46, (%edx)
imull $7, %ecx, %edx
movl %eax, %esi
addl %edx, %esi
leal 4(%esi), %edx
movb $46, (%edx)
imull $7, %ecx, %edx
movl %eax, %esi
addl %edx, %esi
leal 5(%esi), %edx
movb $46, (%edx)
incl %ecx
cmpl $7, %ecx
jne .LBBinit_board_1 # loopexit.1
.LBBinit_board_2: # return
movb $0, 6(%eax)
movb $0, 13(%eax)
movb $0, 20(%eax)
movb $0, 27(%eax)
movb $0, 34(%eax)
movb $0, 41(%eax)
movb $0, 48(%eax)
movl (%esp), %esi
addl $4, %esp
ret

The code generated by GCC is much faster. LLVM does unroll the loop, which GCC does not do, but the addressing code is a constant repetition of these three instructions:

imull $7, %ecx, %edx
movl %eax, %esi
addl %edx, %esi

All but the first occurrence do nothing but put the same address into %esi that was already there to begin with, and do it with an expensive multiply at that. GCC correctly creates a suitable induction variable and strength reduces the multiplication down to addition.

It turns out that using the new selection dag code generator, the outer loop is unrolled also and nothing is left but a bunch of movb instructions. Much, much faster. But still, the optimization passes shouldn't be leaving so much garbage around for instruction selection to clean up. At first, I thought all that was needed was another round of basic block optimizations after the loop was unrolled, until I saw the actual LLVM bytecodes:

void %init_board([7 x sbyte]* %b) {
entry:
br label %loopexit.1

loopexit.1: ; preds = %loopexit.1, %entry
%indvar14 = phi uint [ 0, %entry ], [ %indvar.next15, %loopexit.1 ] ; <uint> [#uses=7]
%tmp.10 = getelementptr [7 x sbyte]* %b, uint %indvar14, int 0 ; <sbyte*> [#uses=1]
store sbyte 46, sbyte* %tmp.10
%tmp.10.1 = getelementptr [7 x sbyte]* %b, uint %indvar14, int 1 ; <sbyte*> [#uses=1]
store sbyte 46, sbyte* %tmp.10.1
%tmp.10.2 = getelementptr [7 x sbyte]* %b, uint %indvar14, int 2 ; <sbyte*> [#uses=1]
store sbyte 46, sbyte* %tmp.10.2
%tmp.10.3 = getelementptr [7 x sbyte]* %b, uint %indvar14, int 3 ; <sbyte*> [#uses=1]
store sbyte 46, sbyte* %tmp.10.3
%tmp.10.4 = getelementptr [7 x sbyte]* %b, uint %indvar14, int 4 ; <sbyte*> [#uses=1]
store sbyte 46, sbyte* %tmp.10.4
%tmp.10.5 = getelementptr [7 x sbyte]* %b, uint %indvar14, int 5 ; <sbyte*> [#uses=1]
store sbyte 46, sbyte* %tmp.10.5
%indvar.next15 = add uint %indvar14, 1 ; <uint> [#uses=2]
%exitcond16 = seteq uint %indvar.next15, 7 ; <bool> [#uses=1]
br bool %exitcond16, label %return, label %loopexit.1

return: ; preds = %loopexit.1
%tmp.19 = getelementptr [7 x sbyte]* %b, int 0, int 6 ; <sbyte*> [#uses=1]
store sbyte 0, sbyte* %tmp.19
%tmp.19.1 = getelementptr [7 x sbyte]* %b, int 1, int 6 ; <sbyte*> [#uses=1]
store sbyte 0, sbyte* %tmp.19.1
%tmp.19.2 = getelementptr [7 x sbyte]* %b, int 2, int 6 ; <sbyte*> [#uses=1]
store sbyte 0, sbyte* %tmp.19.2
%tmp.19.3 = getelementptr [7 x sbyte]* %b, int 3, int 6 ; <sbyte*> [#uses=1]
store sbyte 0, sbyte* %tmp.19.3
%tmp.19.4 = getelementptr [7 x sbyte]* %b, int 4, int 6 ; <sbyte*> [#uses=1]
store sbyte 0, sbyte* %tmp.19.4
%tmp.19.5 = getelementptr [7 x sbyte]* %b, int 5, int 6 ; <sbyte*> [#uses=1]
store sbyte 0, sbyte* %tmp.19.5
%tmp.19.6 = getelementptr [7 x sbyte]* %b, int 6, int 6 ; <sbyte*> [#uses=1]
store sbyte 0, sbyte* %tmp.19.6
ret void
}

Now the problem is obvious. A two-dimensional array access is being performed by a single instruction. The arithmetic needed to address the element is implicit, and therefore inaccessible to optimizations. The redundant calculations cannot be eliminated, nor can strength reduction be performed. getelementptr needs to be broken down into its constituent operations at some point.
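To make that concrete, here is a rough C-level sketch (just an illustration with a made-up helper, not compiler output) of the arithmetic every one of those stores performs; the i*7 multiply is buried inside each getelementptr, so it is recomputed for all six unrolled stores:

  /* Sketch only: the address computation hidden in each b[i][j] access. */
  char *addr_of(char b[7][7], int i, int j)
  {
    return &b[0][0] + i*7 + j;   /* the i*7 multiply is redone for every store unless exposed */
  }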

Or am I missing something? Is this breaking down and reapplying optimizations part of the selection dag process? I am somewhat surprised that using it affected what loops got unrolled.

When I increased COLS to the point where the loop could no longer be unrolled, the selection dag code generator generated effectively the same code as the default X86 code generator. Lots of redundant imul/movl/addl sequences. It can’t clean it up either. Only unrolling all nested loops permits it to be optimized away, regardless of code generator.

Jeff Cohen wrote:

Jeff,

This is an important find. I started looking at fourinarow and other
programs that don't do well compared to GCC. I think there are probably
only a couple things remaining for LLVM to smoke GCC on all tests
(floating point is another one). I know that Chris has been looking for
good analyses like the one you've provided.

Could you put this into a bugzilla for us? That way it definitely gets
on the "to do" list.

Thanks,

Reid.

Now the problem is obvious. A two-dimensional array access is being performed by a single instruction. The arithmetic needed to address the element is implicit, and therefore inaccessible to optimizations. The redundant calculations cannot be eliminated, nor can strength reduction be performed. getelementptr needs to be broken down into its constituent operations at some point.

Jeff,

This is exactly right -- any multi-dimensional array indexing operations need to be decomposed into a sequence of single-index operations so that they can be exposed to GCSE and LICM. This transformation obscures the indexing behavior of the code, so the right place to do this is within each back-end, on LLVM code just before instruction selection.

There is a pass called DecomposeMultiDimRefs (which seems to have been incorrectly moved to lib/Target/SparcV9/) which does this transformation. It's used by the SparcV9 back end before instr. selection but is not specific to Sparc. Running this, followed by LICM and GCSE should address this problem.
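As a rough C-level illustration of what that decomposition buys (a sketch using the array shape from the example above, not the pass's literal output), the two-index access becomes a row step that depends only on i, which LICM can hoist out of the inner loop, plus a column step that stays inside:

  void init_board_decomposed(char b[7][7])
  {
    int i, j;
    for (i = 0; i < 7; i++) {
      char *row = b[i];      /* single-index step: select row i; depends only on i, so LICM can hoist it */
      for (j = 0; j < 6; j++)
        row[j] = '.';        /* single-index step: select column j within the row */
    }
  }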

--Vikram

I figured getelementptr exists as it does to facilitate data flow analysis, but it does need to be broken down before instruction selection. It's not just the missed optimization opportunities: leaving it intact also pushes a huge amount of complexity into the instruction selectors, since each one has to cope with getelementptr's full generality itself. Breaking it down would also take care of many of the FIXMEs in LoopStrengthReduce.

Vikram Adve wrote:

I figured getelementptr exists as it does to facilitate data flow analysis, but it does need to be broken down before instruction selection. It's not just the missed optimization opportunities: leaving it intact also pushes a huge amount of complexity into the instruction selectors, since each one has to cope with getelementptr's full generality itself. Breaking it down would also take care of many of the FIXMEs in LoopStrengthReduce.

Yes, I agree. Code-generating multi-dim. array index operations becomes significantly simpler after this transformation.

One caveat though is that you may want to keep as many successive constant indexes in a GEP together because they can all be folded into a single offset (i.e., a single operand to a load/store), e.g.

  getelementptr <Ty*> %p, int 0, int %i, int 3, int 5, int %j, int %k

probably should be broken down into:

  %p1 = gep %p, int 0, int %i
  %p2 = gep %p1, int 3, int 5, int %j
  %p3 = gep %p2, int %k

or something like that.

--Vikram

I figured getelementptr exists as it does to facilitate data flow analysis, but it does need to be broken down before instruction selection. It's not just the missed optimization opportunities: leaving it intact also pushes a huge amount of complexity into the instruction selectors, since each one has to cope with getelementptr's full generality itself. Breaking it down would also take care of many of the FIXMEs in LoopStrengthReduce.

There are two issues here. One is an important one that should be addressed on the LLVM level: we currently do not have a loop strength reduction pass. Loop strength reduction is an important optimization that turns things like this:

   for (i = ...; ; ++i)
     A[i] = ...

into:

   for (p = &A[0]; ... ; ++p)
     *p = ...

This transforms multiplies in addressing arithmetic (e.g. &A + i*sizeof(A[0])) into additions (e.g. p = p + sizeof(A[0])). Nate worked on a patch to do this a while back, but I'm not sure where it stands.

The second issue is that we need to do redundancy elimination and hoisting on operations that are only exposed once instruction selection is performed. Getelementptr expansion is just ONE particular case of this, but it can happen with any instructions (including the use of large integer (or any FP) constants on RISC machines, addressing globals with PIC code, handling extended/truncated operations (for RISC machines with a single integer size), etc.). Note that hacks like "LowerMultiDimRefs" and the sparc Preselection pass sorta-kinda work around SOME of this, but they really are just as bad as the problem: they solve some cases and make other cases worse. The only way to make preselection or lowermultidimrefs work is to duplicate all of the knowledge of how the instruction selector will select the code (e.g. the advice about allowing constant indices to be grouped together).

This is *exactly* the reason why the new instruction selection framework was introduced. It already handles local CSE trivially, and will be extended to do GCSE, LICM, and eventually PDCE as time permits. Combined with loop strength reduction, these problems should go completely away. Also, with the new instruction selection frameworks, targets don't have to handle GEP instructions at all, so the "complexity" argument above doesn't apply.

If you're interested in working on this, I would suggest starting with the loop strength reduction code, as I've heard it's pretty close to working.

-Chris


I noticed that fourinarow is one of the programs in which LLVM is much slower than GCC, so I decided to take a look and see why that is so. The program has many loops that look like this:

  #define ROWS 6
  #define COLS 7

  void init_board(char b[COLS][ROWS+1])
  {
    int i,j;

    for (i=0;i<COLS;i++)
      for (j=0;j<ROWS;j++)
        b[i][j]='.';

    for (i=0;i<COLS;i++)
      b[i][ROWS]=0;
  }

This generates the following X86 code:
      imull $7, %ecx, %edx
      movl %eax, %esi
      addl %edx, %esi
      movb $46, (%esi)
      imull $7, %ecx, %edx
      movl %eax, %esi
      addl %edx, %esi
      leal 1(%esi), %edx

... (many many copies of this, see the end of the email for full output) ...

The code generated by GCC is much faster. LLVM does unroll the loop, which GCC does not do, but the addressing code is a constant repetition of these three instructions:

      imull $7, %ecx, %edx
      movl %eax, %esi
      addl %edx, %esi

Yup, this is an issue of simple "peephole expansion" of the getelementptr instruction.

All but the first occurrence do nothing but put the same address into %esi that was already there to begin with, and do it with an expensive multiply at that. GCC correctly creates a suitable induction variable and strength reduces the multiplication down to addition.

This is called loop strength reduction.

It turns out that using the new selection dag code generator, the outer loop is unrolled also and nothing is left but a bunch of movb instructions. Much, much faster.

For the record, instead of the ugly code above, the new isel generates:

init_board:
         mov %EAX, DWORD PTR [%ESP + 4]
         mov %ECX, 0
.LBBinit_board_1: # loopexit.1
         imul %EDX, %ECX, 7
         mov BYTE PTR [%EAX + %EDX], 46
         mov BYTE PTR [%EAX + %EDX + 1], 46
         mov BYTE PTR [%EAX + %EDX + 2], 46
         mov BYTE PTR [%EAX + %EDX + 3], 46
         mov BYTE PTR [%EAX + %EDX + 4], 46
         mov BYTE PTR [%EAX + %EDX + 5], 46
         inc %ECX
         cmp %ECX, 7
         jne .LBBinit_board_1 # loopexit.1
.LBBinit_board_2: # return
         mov BYTE PTR [%EAX + 6], 0
         mov BYTE PTR [%EAX + 13], 0
         mov BYTE PTR [%EAX + 20], 0
         mov BYTE PTR [%EAX + 27], 0
         mov BYTE PTR [%EAX + 34], 0
         mov BYTE PTR [%EAX + 41], 0
         mov BYTE PTR [%EAX + 48], 0
         ret

This code is almost "perfect" with the exception of the multiply by 7 inside of the loop. Loop strength reduction would transform this into an add like GCC produces.
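For reference, here is a C-level sketch (an assumption about the shape loop strength reduction should aim for, not actual output) in which the i*7 multiply becomes a row pointer bumped by 7 each iteration, essentially the addition that GCC's strength reduction produces:

  void init_board_reduced(char b[7][7])
  {
    char *row = &b[0][0];
    int i, j;
    for (i = 0; i < 7; i++, row += 7)   /* an add of 7 replaces the imul by 7 */
      for (j = 0; j < 6; j++)
        row[j] = '.';
    for (i = 0; i < 7; i++)
      b[i][6] = 0;                      /* second loop as in the original */
  }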

But still, the optimization passes shouldn't be leaving so much garbage around for instruction selection to clean up. At first, I thought all that was needed was another round of basic block optimizations after the loop was unrolled, until I saw the actual LLVM bytecodes:

There are two issues here. First, LLVM code is NOT designed for trivial peephole code generation. This is *possible*, but not going to give you good quality code (see above :wink: ). The new isel uses an extremely light-weight DAG-based optimizer that both simplifies instruction selectors and optimizes the code before selection. LLVM code is meant to be a host to mid-level and interprocedural optimizations, and I think it does that job well.

Now the problem is obvious. A two-dimensional array access is being performed by a single instruction. The arithmetic needed to address the element is implicit, and therefore inaccessible to optimizations. The redundant calculations cannot be eliminated, nor can strength reduction be performed. getelementptr needs to be broken down into its constituent operations at some point.

Actually strength reduction can be trivially done with getelementptr instructions, that was part of their design. Part of the problem is that we have an induction variable canonicalization pass which UNDOES strength reduction to make loop analysis simpler, but we don't have strength reduction to redo it. Thus if you compile a loop like this (complex double chosen to show a size that X86 can scale "for free"):

#include <complex.h>
void test(unsigned N, complex double *P) {
   for (; N != 0; --N)
     *P++ = 0;
}

We compile it to this LLVM code:

void %test(uint %N, "complex long double"* %P) {
entry:
         %tmp.15 = seteq uint %N, 0 ; <bool> [#uses=1]
         br bool %tmp.15, label %return, label %no_exit

no_exit: ; preds = %no_exit, %entry
         %indvar = phi uint [ %indvar.next, %no_exit ], [ 0, %entry ] ; <uint> [#uses=3]
         %tmp.4 = getelementptr "complex long double"* %P, uint %indvar, uint 0 ; <double*> [#uses=1]
         store double 0.000000e+00, double* %tmp.4
         %tmp.5 = getelementptr "complex long double"* %P, uint %indvar, uint 1 ; <double*> [#uses=1]
         store double 0.000000e+00, double* %tmp.5
         %indvar.next = add uint %indvar, 1 ; <uint> [#uses=2]
         %exitcond = seteq uint %indvar.next, %N ; <bool> [#uses=1]
         br bool %exitcond, label %return, label %no_exit

return: ; preds = %no_exit, %entry
         ret void
}

Note the accesses to P are using P[indvar].{real|imag}, not incrementing P. Compile this to X86, and you get this loop (using the simple instruction selector):

.LBBtest_1: # no_exit
         mov %ESI, %EDX
         shl %ESI, 4 --> scale by sizeof(complex double)
         mov %EDI, %ECX
         add %EDI, %ESI
         mov DWORD PTR [%EDI], 0
         mov DWORD PTR [%EDI + 4], 0
         mov %ESI, %EDX
         shl %ESI, 4 --> scale by sizeof(complex double)
         mov %EDI, %ECX
         add %EDI, %ESI
         lea %ESI, DWORD PTR [%EDI + 8]
         mov DWORD PTR [%ESI], 0
         mov DWORD PTR [%ESI + 4], 0
         inc %EDX
         cmp %EDX, %EAX
         jne .LBBtest_1 # no_exit

Now, if you disable the -indvars pass, you'll get this loop instead (note the pointer increment of P is retained):

void %test(uint %N, "complex long double"* %P) {
entry:
         %tmp.15 = seteq uint %N, 0 ; <bool> [#uses=1]
         br bool %tmp.15, label %return, label %no_exit
no_exit: ; preds = %no_exit, %entry
         %P_addr.0.0 = phi "complex long double"* [ %inc, %no_exit ], [ %P, %entry ] ; <"complex long double"*> [#uses=3]
         %N_addr.0.0 = phi uint [ %dec, %no_exit ], [ %N, %entry ] ; <uint> [#uses=1]
         %inc = getelementptr "complex long double"* %P_addr.0.0, int 1 ; <"complex long double"*> [#uses=1]
         %tmp.4 = getelementptr "complex long double"* %P_addr.0.0, int 0, uint 0 ; <double*> [#uses=1]
         store double 0.000000e+00, double* %tmp.4
         %tmp.5 = getelementptr "complex long double"* %P_addr.0.0, int 0, uint 1 ; <double*> [#uses=1]
         store double 0.000000e+00, double* %tmp.5
         %dec = add uint %N_addr.0.0, 4294967295 ; <uint> [#uses=2]
         %tmp.1 = seteq uint %dec, 0 ; <bool> [#uses=1]
         br bool %tmp.1, label %return, label %no_exit
return: ; preds = %no_exit, %entry
         ret void
}

which codegens to this loop (simple isel again):

.LBBtest_1: # no_exit
         lea %EDX, DWORD PTR [%ECX + 16]
         mov DWORD PTR [%ECX], 0
         mov DWORD PTR [%ECX + 4], 0
         mov DWORD PTR [%ECX + 8], 0
         mov DWORD PTR [%ECX + 12], 0
         dec %EAX
         test %EAX, %EAX
         mov %ECX, %EDX
         jne .LBBtest_1 # no_exit

I think you'll agree that this loop is much nicer, though there is some weird register shuffling going on between ECX and EDX. This is the code we would generate if we did loop strength reduction on the LLVM level. Note that on X86, this isn't a HUGE problem in all cases: it can scale by 2, 4 and 8 for "free". The problem is much worse on RISC targets.

Or am I missing something? Is this breaking down and reapplying optimizations part of the selection dag process? I am somewhat surprised that using it affected what loops got unrolled.

I'm not sure what you mean. The LLVM code input to the selection dag isel is exactly the same as that input to the simple isel, and it does no loop unrolling. The full code produced by the simple isel (for Jeff's example) is attached below for reference. Contrast this to the pattern isel output at the top.

init_board:
         sub %ESP, 4
         mov DWORD PTR [%ESP], %ESI
         mov %EAX, DWORD PTR [%ESP + 8]
         mov %ECX, 0
.LBBinit_board_1: # loopexit.1
         imul %EDX, %ECX, 7
         mov %ESI, %EAX
         add %ESI, %EDX
         mov BYTE PTR [%ESI], 46
         imul %EDX, %ECX, 7
         mov %ESI, %EAX
         add %ESI, %EDX
         lea %EDX, DWORD PTR [%ESI + 1]
         mov BYTE PTR [%EDX], 46
         imul %EDX, %ECX, 7
         mov %ESI, %EAX
         add %ESI, %EDX
         lea %EDX, DWORD PTR [%ESI + 2]
         mov BYTE PTR [%EDX], 46
         imul %EDX, %ECX, 7
         mov %ESI, %EAX
         add %ESI, %EDX
         lea %EDX, DWORD PTR [%ESI + 3]
         mov BYTE PTR [%EDX], 46
         imul %EDX, %ECX, 7
         mov %ESI, %EAX
         add %ESI, %EDX
         lea %EDX, DWORD PTR [%ESI + 4]
         mov BYTE PTR [%EDX], 46
         imul %EDX, %ECX, 7
         mov %ESI, %EAX
         add %ESI, %EDX
         lea %EDX, DWORD PTR [%ESI + 5]
         mov BYTE PTR [%EDX], 46
         inc %ECX
         cmp %ECX, 7
         jne .LBBinit_board_1 # loopexit.1
.LBBinit_board_2: # return
         mov BYTE PTR [%EAX + 6], 0
         mov BYTE PTR [%EAX + 13], 0
         mov BYTE PTR [%EAX + 20], 0
         mov BYTE PTR [%EAX + 27], 0
         mov BYTE PTR [%EAX + 34], 0
         mov BYTE PTR [%EAX + 41], 0
         mov BYTE PTR [%EAX + 48], 0
         mov %ESI, DWORD PTR [%ESP]
         add %ESP, 4
   ret

-Chris

Sorry, I thought I was running selection dag isel but I screwed up when trying out the really big array. You're right, it does clean it up except for the multiplication.

So LoopStrengthReduce is not ready for prime time and doesn't actually get used?

I might consider whipping it into shape. Does it still have to handle getelementptr in its full generality?

Chris Lattner wrote:

Sorry, I thought I was running selection dag isel but I screwed up when trying out the really big array. You're right, it does clean it up except for the multiplication.

So LoopStrengthReduce is not ready for prime time and doesn't actually get used?

I don't know what the status of it is. You could try it out, and see what it does. :slight_smile: I would suggest adding it to gccas immediately after the SCCP pass.

I might consider whipping it into shape. Does it still have to handle getelementptr in its full generality?

What do you mean?

-Chris


Chris Lattner wrote:

Sorry, I thought I was running selection dag isel but I screwed up when trying out the really big array. You're right, it does clean it up except for the multiplication.

So LoopStrengthReduce is not ready for prime time and doesn't actually get used?

I don't know what the status of it is. You could try it out, and see what it does. :slight_smile: I would suggest adding it to gccas immediately after the SCCP pass.

I know it has a lot of FIXMEs in it.

I might consider whipping it into shape. Does it still have to handle getelementptr in its full generality?

What do you mean?

-Chris

Consider the bytecode I originally posted for the nested for loop. If the getelementptrs haven't been replaced with something simpler, LSR will have to take them apart itself. The multiplication cannot be replaced with addition without the two-dimensional access being replaced with one-dimensional access plus some pointer arithmetic. I would guess that they will not have been decomposed before LSR is run, correct?

It also appears the current LSR reduces each GEP in isolation from all the others. That clearly is bad for this example, as all the GEPs in this loop do the same thing with the induction variable. Creating a separate pointer for each one only makes things worse. Ordinarily the addressing is explicit so the redundancy would have been removed and the LSR algorithm doesn't have to worry about it, but here it does. The bottom line is that it has to strength reduce operations that are implicit, so it first has to determine what those operations are and detect common subexpressions among them itself (possibly across multiple basic blocks). It doesn't appear to handle explicit multiplications at all (probably doesn't add much value to do so).
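To sketch the difference in C (illustrative only, with made-up names): reducing each GEP on its own gives one stepping pointer per store, whereas recognizing the shared b + i*7 part first gives a single pointer with the constant column offsets folded into the stores:

  /* Each GEP reduced in isolation: one induction pointer per store. */
  void per_gep(char b[7][7])
  {
    char *p0 = &b[0][0], *p1 = &b[0][1], *p2 = &b[0][2];   /* ...and so on for the rest */
    int i;
    for (i = 0; i < 7; i++) {
      *p0 = '.'; *p1 = '.'; *p2 = '.';
      p0 += 7; p1 += 7; p2 += 7;
    }
  }

  /* Common b + i*7 part shared first: one pointer, constant column offsets. */
  void shared(char b[7][7])
  {
    char *row = &b[0][0];
    int i;
    for (i = 0; i < 7; i++, row += 7) {
      row[0] = '.'; row[1] = '.'; row[2] = '.';             /* ...through row[5] */
    }
  }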

It's a challenge :slight_smile:

Sorry, I thought I was running selection dag isel but I screwed up when trying out the really big array. You're right, it does clean it up except for the multiplication.

So LoopStrengthReduce is not ready for prime time and doesn't actually get used?

I don't know what the status of it is. You could try it out, and see what it does. :slight_smile: I would suggest adding it to gccas immediately after the SCCP pass.

I know it has a lot of FIXMEs in it.

Okay, fair enough. I think these are due to unimplemented features. I think that Nate (wisely) started simple, and decided to handle the complex cases later.

I might consider whipping it into shape. Does it still have to handle getelementptr in its full generality?

What do you mean?

-Chris

Consider the bytecode I originally posted for the nested for loop. If the getelementptrs haven't been replaced with something simpler, LSR will have to take them apart itself. The multiplication cannot be replaced with addition without the two-dimensional access being replaced with one-dimensional access plus some pointer arithmetic. I would guess that they will not have been decomposed before LSR is run, correct?

No, this is exactly what LSR should do in this case.

It also appears the current LSR reduces each GEP in isolation from all the others. That clearly is bad for this example, as all the GEPs in this loop do the same thing with the induction variable. Creating a separate pointer for each one only makes things worse. Ordinarily the addressing is explicit so the redundancy would have been removed and the LSR algorithm doesn't have to worry about it, but here it does. The bottom line is that it has to strength reduce operations that are implicit, so it first has to determine what those operations are and detect common subexpressions among them itself (possibly across multiple basic blocks).

Yes, if you mean something like this:

   for (i)
     for (j)
       A[i][j].X = 0;
       A[i][j].Y = 0;

It should be sufficient to reduce this to (just considering the inner loop):

   for (i)
     p = &A[i]; // this would be further reduced later.
     for (j)
       p[j].X = 0;
       p[j].Y = 0;

(this needs to do the auto-cse of p into the two GEPs). Further strength reduction should simplify this to:

   for (i)
     p = &A[i]; // this would be further reduced later.
     for (j)
       p->X = 0;
       p->Y = 0;
       ++p;

... which is what we want.

It doesn't appear to handle explicit multiplications at all (probably doesn't add much value to do so).

True, this can be added later. Other operations, including div and rem, can be strength reduced as well.

It's a challenge :slight_smile:

:slight_smile: I would suggest continuing Nate's approach. Start with the most trivial of simple loops (e.g. for (i) A[i] = 0; ), and work up from there, adding complexity as necessary.

The idea should be to always have a pass that works, but continue to add new capabilities to it.

-Chris

Well, what about this?

  for (i)
    for (j)
      A[i][j].X = 0;
      if (func(j))
        A[i][j].Y = 0;
      else
        A[i][j].Z = 0;

This is CSE across multiple basic blocks, i.e. GCSE.

Or even:

  for (i)
    for (j)
      A[i][j].X = 0;
      if (func(j))
        A[i-1][j].Y = 0;
      else
        A[i+1][j].Z = 0;

Much easier if the GEPs have been decomposed first and the usual global opts already performed :slight_smile:

Chris Lattner wrote:

The second issue is that we need to do redundancy elimination and hoisting on operations that are only exposed once instruction selection is performed. Getelementptr expansion is just ONE particular case of this, but it can happen with any instructions (including the use of large integer (or any FP) constants on RISC machines, addressing globals with PIC code, handling extended/truncated operations (for RISC machines with a single integer size), etc. Note that hacks like "LowerMultiDimRefs" and the sparc Preselection pass sorta-kinda work around SOME of this,

Actually, they are capable of working around much of this, and most such optimizations are actually quite architecture-independent, e.g., LowerMultiDimRefs, which is a major optimization for array-intensive languages.

I don't claim such choices are the right long-term solution but the right long-term solution is exactly what you're implying below - a new optimization framework within the back-end, working on its own representation, etc. This has taken 4 years and is still under development. This is why what you call "hacks" are sometimes the right choice in the short term.

The only way to make preselection or lowermultidimrefs work is to duplicate all of the knowledge of how the instruction selector will select the code (e.g. the advice about allowing constant indices to be grouped together).

This is why you need a separate, low-level optimization framework - the kind you were describing.

--Vikram
http://www.cs.uiuc.edu/~vadve
http://llvm.cs.uiuc.edu/

Incidentally, for those interested in the big picture, the issues raised by Jeff and Chris reflect a fundamental trade-off between the so-called "low-level" model of compilation (like GCC and I believe IBM's XL compiler family) and the mid-level model we chose for LLVM. In GCC and other low-level systems, code is lowered to expose all the detailed operations needed on a specific target early in the compilation process, so that they are all exposed to optimization. Of course, doing this in a single IR capable of supporting multiple targets can lead to horrendously complex representations. The mid-level model does not expose architecture-specific operations in the IR, but this means that a separate back-end optimization framework is needed, in addition to the primary mid-level, architecture-neutral framework.

My point is that this is a fundamental trade-off made very early on in the LLVM design, and I think it was the right one. It greatly simplifies the higher-level optimizations, which are by far the most complex, and can even make more optimizations possible (if the mid-level IR is able to expose types and reference patterns that are obscured in the low-level IR).

--Vikram
http://www.cs.uiuc.edu/~vadve
http://llvm.cs.uiuc.edu/

Vikram S. Adve wrote:

The only way to make preselection or lowermultidimrefs work is to duplicate all of the knowledge of how the instruction selector will select the code (e.g. the advice about allowing constant indices to be grouped together).

This is why you need a separate, low-level optimization framework - the kind you were describing.

--Vikram
Vikram S. Adve
http://llvm.cs.uiuc.edu/

Sounds reasonable to me. It seems to me it would be best if LSR was part of the low-level optimization framework, otherwise it would be forced to duplicate too much work on its own.

Also, some of what LSR needs to decide is architecture-dependent. For example, it may not want to strength reduce a multiplication by a small power of two, as this is handled by addressing modes on some architectures.
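A small example of that target dependence (the comments reflect the x86 scaling rule mentioned earlier in the thread, not measured output): an index scaled by 8 folds into an x86 addressing mode, so reducing it to a pointer increment buys little, while the scale-by-7 from fourinarow cannot be folded and really does want to become an add:

  void zero8(double *A, int n)    /* A[i] scales i by 8: folds into [base + index*8] on x86 */
  {
    int i;
    for (i = 0; i < n; i++)
      A[i] = 0.0;
  }

  void zero7(char b[][7], int n)  /* b[i][0] scales i by 7: needs an imul unless reduced to an add */
  {
    int i;
    for (i = 0; i < n; i++)
      b[i][0] = 0;
  }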

There is still probably a need for a mid-level LSR that handles the explicit multiplications, divisions, etc...

I agree completely. The important part is to decide what makes the most sense to do on LLVM and what makes the most sense to do in the backend. Unfortunately, it is too easy just to assume that everything should be done on LLVM, because (like you say above), LLVM is not meant to be the host for target-specific stuff. :slight_smile:

-Chris

Vikram S. Adve wrote:

The only way to make preselection or lowermultidimrefs work is to duplicate all of the knowledge of how the instruction selector will select the code (e.g. the advice about allowing constant indices to be grouped together).

This is why you need a separate, low-level optimization framework - the kind you were describing.

--Vikram
Vikram S. Adve
http://llvm.cs.uiuc.edu/

Sounds reasonable to me. It seems to me it would be best if LSR was part of the low-level optimization framework, otherwise it would be forced to duplicate too much work on its own.

Also, some of what LSR needs to decide is architecture-dependent. For example, it may not want to strength reduce a multiplication by a small power of two, as this is handled by addressing modes on some architectures.

I agree. I would say that the only part of this problem (array indexing) that is nearly always important is decomposing multidimensional array refs into individual index operations so that LICM in particular can hoist each component to the appropriate loop level. Everything else, including GCSE and strength reduction, is somewhat sensitive to target parameters like register set size and addressing modes.

--Vikram

The second issue is that we need to do redundancy elimination and hoisting on operations that are only exposed once instruction selection is performed. Getelementptr expansion is just ONE particular case of this, but it can happen with any instructions (including the use of large integer (or any FP) constants on RISC machines, addressing globals with PIC code, handling extended/truncated operations (for RISC machines with a single integer size), etc. Note that hacks like "LowerMultiDimRefs" and the sparc Preselection pass sorta-kinda work around SOME of this,

Actually, they are capable of working around much of this, and most such optimizations are actually quite architecture-independent, e.g., LowerMultiDimRefs, which is a major optimization for array-intensive languages.

I obviously agree that high-quality codegen of getelementptr instructions is very important! I have two primary problems with LowerMultiDimRefs:

1. We don't want the code generator munching on the LLVM code as it
    generates code. The code generator should only read the LLVM code.
    Right now if we do things like a "quick JIT" followed by a "good code
    JIT" (when we find that a piece of code is hot), the second JIT gets
    different LLVM code than the first one had as input. We are still not
    quite to the point where this is the case, but we are getting there.
2. The way LowerMultiDimRefs works is by decimating GEP instructions into
    pieces, then relying on GCSE and LICM to clean them up. The problem
    with this is that GCSE and LICM are not the trivial passes that you'd
    like them to be for a simple cleanup job like this. In particular,
    GCSE requires a value numbering implementation, LICM requires
    loop info and alias analysis.

Finally, while the *code* for MultiDimRefs isn't target-specific, the decisions it makes *are*. As a trivial example, it should decompose the following GEP on most RISC machines:

   t2 = GEP P, 1000000, X
   load t2

Assuming the immediate doesn't fit into the immediate field of an
instruction, this GEP will lower to multiple instructions, hypothetically like this (assuming X indexes into a byte array):

   t1 = P + 900000
   load t1 + X + 100000

In this case, the addition of 900000 could be loop invariant, a common subexpression, etc.
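A quick C sketch of that point (hypothetical code reusing the constants from the example above): once the large constant part is an ordinary add, it is visibly loop-invariant and LICM can pull it out of a loop that only varies X:

  void touch(char *P, int n)
  {
    char *t1 = P + 900000;        /* loop-invariant once exposed as a plain add; can be hoisted */
    int X;
    for (X = 0; X < n; X++)
      t1[X + 100000] = 0;         /* only the varying part stays inside the loop */
  }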

I don't claim such choices are the right long-term solution but the right long-term solution is exactly what you're implying below - a new optimization framework within the back-end, working on its own representation, etc. This has taken 4 years and is still under development. This is why what you call "hacks" are sometimes the right choice in the short term.

I can appreciate that the LowerMultiDimRefs pass made a *lot* of sense 4 years ago, but I'm concerned about moving forward. Now we have more of the infrastructure that is needed to move beyond it. Also, the new instruction selection framework is only a couple months old, not 4 years. :slight_smile:

-Chris

Sounds reasonable to me. It seems to me it would be best if LSR was part of the low-level optimization framework, otherwise it would be forced to duplicate too much work on its own.

There is still probably a need for a mid-level LSR that handles the explicit multiplications, divisions, etc...

You're right: strength reduction is the #1 gray area between the LLVM level and the code generator level. Doing it in the backend has several nice advantages (e.g. handling explicit multiplies and gep instructions is the same analysis, you have a better understanding of register pressure, etc), but doing it at the LLVM level also has advantages (higher level information is available, the xform is faster and cheaper to do, etc). Currently we also have the infrastructure available to do the transformation on the LLVM level, but don't really have it available in the code generator.

Also, some of what LSR needs to decide is architecture-dependent. For example, it may not want to strength reduce a multiplication by a small power of two, as this is handled by addressing modes on some architectures.

You're right. However, we can choose to expose information about target parameters through the Target* interfaces that llvm->llvm passes can use as well, so at least this aspect is not a killer issue.

-Chris