Tight overlapping loops and performance

I was playing around in x86 assembly the other day, looking at ways to optimize my cooperative multitasking system. Currently, it uses a ‘timeout’ counter that is decremented each time through a loop, letting me stop the loop and go to the next cooperative thread if the loop runs a little long.
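Roughly, the pattern looks like the sketch below in C (just an illustration; the helper names here are placeholders, not my real code):

extern int  work_remaining(void);        /* placeholder */
extern void do_one_unit_of_work(void);   /* placeholder */
extern void yield_to_next_thread(void);  /* placeholder */

void run_slice(void)
{
    int timeout = 2000;                  /* per-slice budget */
    while (work_remaining()) {
        do_one_unit_of_work();
        if (--timeout == 0) {            /* the loop ran a little long */
            yield_to_next_thread();      /* hand off cooperatively */
            timeout = 2000;              /* fresh budget when we resume */
        }
    }
}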

The asm has two overlapping loops:

With gcc -O3 (both 4.2 and 4.4) we match 1.0s. The LLVM version, after running it through
opt -std-compile-opts, is around 1.7s.

Hmm, on my computer, I get around 2.5 seconds with both gcc -O3 and
llvm-gcc -O3 (using llvm-gcc from svn). Not sure what you're doing
differently; I wouldn't be surprised if it's sensitive to the version
of LLVM.

Should I be looking at any particular optimization passes that aren't in
-std-compile-opts to match the gcc speeds?

First, try looking at the generated code... the code LLVM generates is
probably not what you're expecting. I'm getting the following for the
main loop:

.LBB1_1: # loopto
  cmpl $1, %eax
  leal -1(%eax), %eax
  cmove %edx, %eax
  incl %ecx
  cmpl $999999999, %ecx
  jne .LBB1_1 # loopto

LLVM is optimizing your oddly nested loops into a single loop which
does some extra computation to keep track of the timeout variable.
Since you'd normally be doing something non-trivial in the timeout
portion of the loop, the results you're getting with this contrived
testcase are irrelevant to your actual issue.
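In C terms, the loop above is doing roughly this (a sketch of the shape only, not the optimizer's literal output; the reset value is whatever was loaded into %edx):

int fused_loop(int reset_value)
{
    int timeout = reset_value;
    int counter = 0;
    do {
        /* the cmpl $1 / leal -1 / cmove sequence: decrement the timeout,
           or reload it when it is about to hit zero */
        timeout = (timeout == 1) ? reset_value : timeout - 1;
        counter++;
    } while (counter != 999999999);
    return timeout;
}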

In general, you'll probably get better results from LLVM with properly
nested loops; LLVM's loop optimizers don't know how to deal with
overlapping loops. I'd suggest writing it more like the following:

int timeout = 2000;
int loopcond;
do {
    timeoutwork();
    do {
        timeout--;
        loopcond = computationresult();
    } while (loopcond && timeout);
} while (loopcond);

-Eli

Date: Mon, 2 Mar 2009 13:41:45 -0800
From: eli.friedman@gmail.com
To: llvmdev@cs.uiuc.edu
Subject: Re: [LLVMdev] Tight overlapping loops and performance

Hmm, on my computer, I get around 2.5 seconds with both gcc -O3 and
llvm-gcc -O3 (using llvm-gcc from svn). Not sure what you’re doing
differently; I wouldn’t be surprised if it’s sensitive to the version
of LLVM.

For which version of gcc? I should mention I’m on OS X and using the LLVM SVN.

First, try looking at the generated code… the code LLVM generates is
probably not what you’re expecting. I’m getting the following for the
main loop:

I was seeing the same thing, but wasn't sure what to make of it. It looks like values are being swapped into and out of memory rather than being held in registers. That's why I was asking about other optimization passes; at first glance -mem2reg looked like a good candidate, but I didn't notice any improvement using it blindly.

int timeout = 2000;
int loopcond;
do {
    timeoutwork();
    do {
        timeout--;
        loopcond = computationresult();
    } while (loopcond && timeout);
} while (loopcond);

My current implementation uses something very similar, but the difference between this example and mine is that the branch checking 'timeout' is taken in the majority case, whereas in mine it isn't. It can be checked separately for less cost, assuming the variables stay in registers.
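Sketched in C with gotos (reusing timeoutwork() and computationresult() from your example), the shape I mean is roughly:

extern void timeoutwork(void);
extern int  computationresult(void);

void run(void)
{
    int timeout = 2000;
    goto loop;                   /* enter the hot loop directly */
reset:
    timeoutwork();               /* only reached when the budget runs out */
    timeout = 2000;
loop:
    if (--timeout == 0)
        goto reset;              /* timeout check: not taken in the common case */
    if (computationresult())
        goto loop;               /* main loop branch: taken in the common case */
}

That way the branch on 'timeout' falls through on almost every iteration instead of sitting on the taken back-edge.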

Jonathan

My current implementation uses something very similar, but the difference between this example and mine is that the branch checking 'timeout' is taken in the majority case, whereas in mine it isn't. It can be checked separately for less cost, assuming the variables stay in registers.

Perhaps I shouldn't be so quick to stick my foot in my mouth. All my knowledge of asm is from pretty old machines; I don't know anything about modern pipelines. Having said that, I'm definitely willing to learn.

Jonathan

For which version of gcc? I should mention I'm on OS X and using the LLVM
SVN.

gcc 4.3. It's also possible this is processor-sensitive.

First, try looking at the generated code... the code LLVM generates is
probably not what you're expecting. I'm getting the following for the
main loop:

I was seeing the same thing, but wasn't sure what to make of it. It looks
like values are being swapped into and out of memory rather than being held
in registers.

You're misreading the asm... nothing is touching memory. (BTW, "leal
-1(%eax), %eax" isn't a memory operation; it's just subtracting one
from %eax.) You might want to try reading the LLVM IR (which you can
generate with llvm-gcc -S -emit-llvm); it tends to be easier to read.

My current implementation uses something very similar, but the difference
between this example and mine is that the branch checking 'timeout' is
taken in the majority case, whereas in mine it isn't. It can be checked
separately for less cost, assuming the variables stay in registers.

A taken and non-taken branch have roughly the same cost on any
remotely recent x86 processor.

-Eli

You're misreading the asm... nothing is touching memory. (BTW, "leal
-1(%eax), %eax" isn't a memory operation; it's just subtracting one
from %eax.) You might want to try reading the LLVM IR (which you can
generate with llvm-gcc -S -emit-llvm); it tends to be easier to read.

I tried that, but I'm still learning LLVM. Seeing indvar, phi nodes, tail
calls on printfs, and nounwinds had me more confused than the asm.

A taken and non-taken branch have roughly the same cost on any
remotely recent x86 processor.

I was wondering if that might be the case.

The crux of the example still seems intact. From LLVM SVN, converted to asm via llc:

                .text
        .align 4,0x90
        .globl _main
_main:
        subl $12, %esp
        movl $1999, %eax
        xorl %ecx, %ecx
        movl $1999, %edx
        .align 4,0x90
LBB1_1: ## loopto
        cmpl $1, %eax
        leal -1(%eax), %eax
        cmove %edx, %eax
        incl %ecx
        cmpl $999999999, %ecx
        jne LBB1_1 ## loopto
LBB1_2: ## bb1
        movl %eax, 4(%esp)
        movl $LC, (%esp)
        call _printf
        xorl %eax, %eax
        addl $12, %esp
        ret
        .section __TEXT,__cstring,cstring_literals
LC: ## LC
        .asciz "Timeout: %i\n"

        .subsections_via_symbols

Rewriting the loops to use decl instead of cmove/incl might seem like more work, but it appears to be faster:

        .text
        .align 4,0x90
        .globl _main
_main:
        subl $12, %esp
        movl $2000, %eax
        movl $1000000000, %ecx
        .align 4,0x90
LBB1_3:
        movl $2000, %eax
LBB1_1: ## loopto
        decl %eax
        jz LBB1_3
        decl %ecx
        jnz LBB1_1 ## loopto
LBB1_2: ## bb1
        movl %eax, 4(%esp)
        movl $LC, (%esp)
        call _printf
        xorl %eax, %eax
        addl $12, %esp
        ret
        .section __TEXT,__cstring,cstring_literals
LC: ## LC
        .asciz "Timeout: %i\n"

        .subsections_via_symbols

The first example is 1.7s, the second is 1.0s. That's on my dual-core OS X box. I also have a 2-processor quad-core Xeon box running Linux, and it gives very similar results.

Jonathan

Have you tried putting something non-trivial (like asm("nop;")) where
you'd put the code that runs on the timeout?
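For example, taking the nested-loop sketch from earlier and dropping a nop into the timeout slot (the asm("nop") is only a stand-in for whatever that path will really do):

extern int computationresult(void);

void testcase(void)
{
    int timeout = 2000;
    int loopcond;
    do {
        asm("nop");                      /* stand-in for real timeout work */
        do {
            timeout--;
            loopcond = computationresult();
        } while (loopcond && timeout);
    } while (loopcond);
}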

-Eli

You're misreading the asm... nothing is touching memory. (BTW, "leal
-1(%eax), %eax" isn't a memory operation; it's just subtracting one
from %eax.) You might want to try reading the LLVM IR (which you can
generate with llvm-gcc -S -emit-llvm); it tends to be easier to read.

I tried that, but I'm still learning LLVM. Seeing indvar, phi nodes, tail
calls on printfs, and nounwinds had me more confused than the asm.

A taken and non-taken branch have roughly the same cost on any
remotely recent x86 processor.

I was wondering if that might be the case.

The crux of the example still seems intact. From LLVM SVN, converted to asm via llc:

       .align 4,0x90
LBB1_1: ## loopto
       cmpl $1, %eax
       leal -1(%eax), %eax
       cmove %edx, %eax
       incl %ecx
       cmpl $999999999, %ecx
       jne LBB1_1 ## loopto

LBB1_1: ## loopto
       decl %eax
       jz LBB1_3
       decl %ecx
       jnz LBB1_1 ## loopto

The main issue is that incl updates the EFLAGS condition code register, but llvm x86 isn't taking advantage of that. This is a known issue; hopefully someone will find the time to implement it before 2.6.

The second issue is that the leal -1 can be turned (back) into a decl. Combined with the optimization described above, that would eliminate the first cmpl.

Feel free to file a bugzilla for this. Hopefully this will be fixed in the not-too-distant future.

Thanks,

Evan

Have you tried putting something non-trivial (like asm("nop;")) where
you'd put the code that runs on the timeout?

-Eli

Using an asm("nop") does fix the llvm output, which makes it sound like a bug. At least in my expectation, a trivial loop should be faster than a non-trivial one.

The main issue is that incl updates the EFLAGS condition code register,
but llvm x86 isn't taking advantage of that. This is a known issue;
hopefully someone will find the time to implement it before 2.6.

The second issue is that the leal -1 can be turned (back) into a decl.
Combined with the optimization described above, that would eliminate
the first cmpl.

Feel free to file a bugzilla for this. Hopefully this will be fixed in
the not-too-distant future.

Thanks,

Evan

Will do. Thanks.

Jonathan