Partial loop unrolling

Hi,

PS: It is a generic question related to partial loop unrolling, and nothing specific to LLVM.

As far as partial loop unrolling is concerned, I could see following three different possibilities. Assume that unroll factor is 3.

Original loop:
for (i = 0; i < 10; i++)
{
do_foo(i);
}

  1. First possibility
    i = 0;
    do_foo(i++);
    do_foo(i++);
    do_foo(i++);
    for (; i < 10; i++)

{
do_foo(i);
}

  1. Second possibility

for (i = 0; i < 7; i++)
{
do_foo(i);
}

do_foo(i++);
do_foo(i++);
do_foo(i++);

  1. Third possibility

for (i = 0; i < 10;)
{
do_foo(i++);
do_foo(i++);
do_foo(i++);
}

My questions are:

a. In case, if we get any performance improvement due to partial loop unrolling, are all the three possibilities give almost same performance improvement?

b. If answer to question ‘a’ is ‘no’, then which one of these three possibilities is ideal for generic partial unrolling implementation? and in general, which one of these is implemented in production compilers? and in particular, which one of these is implemented in LLVM compiler?

Hello Mahesha,

Well, if I had to guess, I would say the third option is the best one. You see, loop unrolling’s main goal (as I see it) is not to generate a better code, but to expose the potential of the code to other kinds of optimizations within the compiler* – increasing the size of the code in the process. If you consider that, the first two options ended up increasing the size of the program, but any new simplifications or optimizations to the code will be applied to a part of it that runs only once – notice that the code inside the loop is, potentially, the one that executes most times, but the only difference between the original code and those ones is found outside the loop.

  • There is, in fact, another interesting function for unrolling: if the upper limit of the loop is known during compilation-time and it is a very small value, it could be interesting to substitute the whole loop for all the necessary calls to do_foo – this way, you can remove all the loop-related code from your program (conditions, variables, operations over the induction variable, etc), and still have new opportunities to optimize the code.

Hope I could help,

This is also the case for re-rolling, where the loop is unrolled in
the "wrong way", and re-rolling, than unrolling will expose
paralellism.

but there are others:

3. join all loads and stores into a block and hide the delays in
between the loop cycles.

Reading a large block of data is almost as efficient as reading a
small one, so a totally rolled loop will have read-op-write while a
partially unrolled loop will have read-op-op-op-write, saving two
reads and two writes every three.

4. Align vectorized loops

Vectorization will often partially unroll the loop (like your 1 and 3
examples) to make the loop aligned to the memory constraints (ex, if
the pointer or if the loop count is unaligned, etc).

AFAIK, all vectorizing compilers, including LLVM, do all of them. It
depends more on what you want and how's your original loop than
generic goodness.

cheers,
--renato

As far as partial loop unrolling is concerned, I could see following
three different possibilities. Assume that unroll factor is 3.

Original loop:
       for (i = 0; i < 10; i++)
       {
            do_foo(i);
       }

1. First possibility
       i = 0;
       do_foo(i++);
       for (; i < 10; i++)
       {
            do_foo(i);
       }

This is really peeling, not unrolling.

2. Second possibility
       for (i = 0; i < 7; i++)
       {
            do_foo(i);
       }
       do_foo(i++);

Same as above.

3. Third possibility
       for (i = 0; i < 10;)
       {
            do_foo(i++);
       }

This one is unrolled, but incorrectly.

-Krzysztof

As far as partial loop unrolling is concerned, I could see following
three different possibilities. Assume that unroll factor is 3.

Original loop:
       for (i = 0; i < 10; i++)
       {
            do_foo(i);
       }

1. First possibility
       i = 0;
       do_foo(i++);
       do_foo(i++);
       do_foo(i++);
       for (; i < 10; i++)
       {
            do_foo(i);
       }

This is really peeling, not unrolling.

2. Second possibility

       for (i = 0; i < 7; i++)
       {
            do_foo(i);
       }
       do_foo(i++);
       do_foo(i++);
       do_foo(i++);

Same as above.

I understand. I must agree that, I am newbee when it comes to compiler
optimization techniques and nomenclature

3. Third possibility

       for (i = 0; i < 10;)
       {
            do_foo(i++);
            do_foo(i++);
            do_foo(i++);
       }

This one is unrolled, but incorrectly.

I think I understood the mistake here. In fact, I noticed it just after
shooting my initial mail. One of the correct ways may be below one.

3. Third possibility
       for (i = 0; i < 10;)
       {
            do_foo(i++);
            do_foo(i++);
            do_foo(i++);
            do_foo(i++);
            do_foo(i++);
       }

Yes, this one works.

You can, in fact, unroll this loop by a factor of 3 as well. The unrolled loop will cover 9 iterations of the original loop, and you will have a single iteration left that needs to be executed separately. You have two basic choices there:

   do_foo(0);
   for (i = 1; i < 10; i += 3) {
     do_foo(i);
     do_foo(i+1);
     do_foo(i+2);
   }

or

   for (i = 0; i < 9; i += 3) {
     do_foo(i);
     do_foo(i+1);
     do_foo(i+2);
   }
   do_foo(9);

In a general case you will have a few iterations left (up to 2 with the unroll factor of 3), and then you would want to create a "residue loop", which will execute them. The placement of that loop (before or after the unrolled loop) doesn't really matter that much in the absence of other factors.

Lastly, the version unrolled by 3 would also work if you add extra code to the loop:

   for (i = 0; i < 10;) {
     do_foo(i++);
     if (i >= 10) break;
     do_foo(i++);
     if (i >= 10) break;
     do_foo(i++);
     if (i >= 10) break;
   }

This isn't very common though.

-Krzysztof