Why is the same code much slower in the JIT than as a separate executable?

I ran the same simple Fibonacci code under the JIT (lli) and as a native executable. With argument 45, the JIT run takes 11.3 sec and the executable takes 7.5 sec.
Why is there such a difference?

Yuri

-------- fib.ll --------
; ModuleID = 'all.bc'

@.str = private constant [12 x i8] c"fib(%i)=%i\0A\00", align 1 ; <[12 x i8]*> [#uses=1]

define i32 @fib(i32 %AnArg) {
EntryBlock:
%cond = icmp sle i32 %AnArg, 2 ; <i1> [#uses=1]
br i1 %cond, label %return, label %recurse

return: ; preds = %EntryBlock
ret i32 1

recurse: ; preds = %EntryBlock
%arg = sub i32 %AnArg, 1 ; <i32> [#uses=1]
%fibx1 = tail call i32 @fib(i32 %arg) ; <i32> [#uses=1]
%arg1 = sub i32 %AnArg, 2 ; <i32> [#uses=1]
%fibx2 = tail call i32 @fib(i32 %arg1) ; <i32> [#uses=1]
%addresult = add i32 %fibx1, %fibx2 ; <i32> [#uses=1]
ret i32 %addresult
}

define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
entry:
%0 = getelementptr inbounds i8** %argv, i32 1 ; <i8**> [#uses=1]
%1 = load i8** %0, align 4 ; <i8*> [#uses=1]
%2 = tail call i32 @atoi(i8* %1) nounwind ; <i32> [#uses=2]
%3 = tail call i32 @fib(i32 %2) nounwind ; <i32> [#uses=1]
%4 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([12 x i8]* @.str, i32 0, i32 0), i32 %2, i32 %3) nounwind ; <i32> [#uses=0]
ret i32 undef
}

declare i32 @atoi(i8* nocapture) nounwind readonly

declare i32 @printf(i8* nocapture, ...) nounwind
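
For readers less familiar with LLVM IR, the @fib function above corresponds roughly to the C below. This is a sketch reconstructed from the IR, not the original source; the fib_checked wrapper and its overflow guard are additions for illustration and do not appear in fib.ll.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors @fib from fib.ll: returns 1 for n <= 2, otherwise recurses twice. */
int32_t fib(int32_t n)
{
    if (n <= 2)
        return 1;
    return fib(n - 1) + fib(n - 2);
}

/* Illustrative wrapper (not in the IR): fib(47) no longer fits in int32_t,
   so reject arguments above 46 before starting the recursion. */
int32_t fib_checked(int32_t n)
{
    assert(n <= 46);
    return fib(n);
}
```

The @main in the IR does the equivalent of calling atoi on argv[1], passing the result to fib, and printing both with printf("fib(%i)=%i\n", ...).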

-------- run-jit shell script --------
llvm-as fib.ll && \
time lli -O3 fib.bc 45

-------- run-exe shell script --------
llvm-as fib.ll && \
llc -O3 fib.bc -o fib.s && \
as fib.s -o fib.o && \
gcc -o fib fib.o && \
time ./fib 45

How long does it take for llc to compile it?
Remember that the JIT includes code generation time.

Best regards,
--Edwin

Török Edwin wrote:

How long does it take for llc to compile it?
Remember that the JIT includes code generation time

llc takes almost no time (0.00 user, as measured by time); the code is tiny.

Yuri

Are you using 2.6 or 2.7, 32-bit or 64-bit?

With 2.7 on x86-64 I get:

lli:
real 0m9.564s
user 0m9.557s
sys 0m0.004s

a.out:
real 0m12.105s
user 0m12.029s
sys 0m0.008s

So the JIT is actually faster here.

With a 32-bit build, I get this with a.out:
real 0m15.052s
user 0m14.977s
sys 0m0.004s

And this with the JIT (Release-Asserts):
real 0m17.963s
user 0m17.581s
sys 0m0.004s

Best regards,
--Edwin

Török Edwin wrote:

Are you using 2.6 or 2.7, 32-bit or 64-bit?
  
I use 2.7 on i386. lli has debug asserts enabled, but I guess this shouldn't matter for JIT code speed.

jit: 11.32 real
exe: 7.64 user

Both have -O3 option. Speed should be the same.

Yuri

Török Edwin wrote:

Are you using 2.6 or 2.7, 32-bit or 64-bit?
  
I use 2.7 on i386. lli has debug

Try a release build (ENABLE_OPTIMIZED=1 DISABLE_ASSERTIONS=1).

Yuri <yuri@tsoft.com> writes:

Török Edwin wrote:

Are you using 2.6 or 2.7, 32-bit or 64-bit?
  
I use 2.7 on i386. lli has debug asserts enabled, but I guess this
shouldn't matter for JIT code speed.

jit: 11.32 real
exe: 7.64 user

Both have -O3 option. Speed should be the same.

With

time lli -O3 fib.bc 45

you are measuring the time lli takes to optimize the LLVM code,
generate the native code and, finally, execute it. Add the debug
asserts on top of this, and it is not surprising that lli ends up being
quite a bit slower than directly executing the native code.

As Török suggests, using a Release build with asserts off will make a
difference on lli's speed.
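
One way to keep optimization and code generation out of the measurement is to time only the call itself from inside the program. The following is a minimal sketch in C, assuming a hypothetical helper named cpu_seconds_for_fib; it uses the standard clock() function, which reports CPU time and so is closer to the "user" figure than to "real".

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Same recursion as @fib in fib.ll; used here only as the workload. */
static int32_t fib(int32_t n)
{
    return n <= 2 ? 1 : fib(n - 1) + fib(n - 2);
}

/* Hypothetical helper: CPU seconds spent inside fib(n) alone.
   Anything that happens before the call (process startup, JIT
   compilation of the caller) is outside the measured interval. */
double cpu_seconds_for_fib(int32_t n, int32_t *result)
{
    assert(result != NULL);
    clock_t start = clock();
    *result = fib(n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

If the number reported this way still differs between the two setups, the difference lies in the generated code rather than in startup or compilation cost.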

Yuri <yuri@tsoft.com> writes:

With

time lli -O3 fib.bc 45

you are measuring the time lli takes to optimize the LLVM code,
generate the native code and, finally, execute it. Add the debug
asserts on top of this, and it is not surprising that lli ends up being
quite a bit slower than directly executing the native code.

You can see that this is not true by running 'time lli -O3 fib.bc 4',
which compiles exactly the same code. The compiler still does the same
work, and the whole run takes 0.00 user seconds.

You are right. The code is too small for compilation to have any impact
on the total required time.

Try passing -print-machineinstrs to lli and compare the output with the
assembly generated by llc.
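
To put the gap in per-call terms, it helps to count how many times the naive recursion calls fib. The sketch below is an illustration in C, with hypothetical helper names, and assumes the 11.32 and 7.64 second figures quoted earlier in the thread are comparable, even though one was reported as "real" and the other as "user".

```c
#include <assert.h>
#include <stdint.h>

/* Number of calls the naive recursion makes for argument n:
   calls(n) = 1 for n <= 2, otherwise 1 + calls(n-1) + calls(n-2). */
uint64_t fib_call_count(int n)
{
    assert(n >= 0);
    uint64_t prev = 1, cur = 1;          /* calls(1) and calls(2) */
    for (int i = 3; i <= n; ++i) {
        uint64_t next = 1 + cur + prev;
        prev = cur;
        cur = next;
    }
    return cur;
}

/* Extra time per call in nanoseconds, given two totals in seconds. */
double extra_ns_per_call(double slow_s, double fast_s, uint64_t calls)
{
    return (slow_s - fast_s) * 1e9 / (double)calls;
}
```

For argument 45 this gives 2269806339 calls, so extra_ns_per_call(11.32, 7.64, fib_call_count(45)) comes to roughly 1.6 ns per call. A difference that small on every call is consistent with the two paths emitting slightly different machine code for fib, which is what comparing the -print-machineinstrs output against the llc assembly would show.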