Very slow performance of lli on x86

Hi all,

I am trying to compare the performance of gcc, llvm-gcc, clang and lli (with JIT) on x86. I have attached the performance comparison spreadsheet as well as the source I used for these tests. I ran this code for 10000 iterations, and the execution times are as follows.

For the -O3 results, refer to the attachment.
clang (-O0)

real 0m10.247s
user 0m2.644s
sys 0m5.949s

llvm-gcc(-O0)

real 0m11.324s
user 0m2.478s
sys 0m6.000s

gcc(-O0)

real 0m10.963s
user 0m2.365s
sys 0m5.953s

llvm-jit
I used clang-cc -O0 -emit-llvm-bc to emit LLVM bitcode, passed the bitcode to the opt tool, linked all the bitcode files into a single bitcode file using llvm-ld, ran that file with lli, and observed the following output:

real 6m33.786s
user 5m12.612s
sys 1m1.205s

Why is lli taking such a long time to execute this particular piece of code?

Thanks and Regards,
Prasanth J

generic_replica.c (61.6 KB)

dacc.c (813 Bytes)

xacc.c (1.63 KB)

llvm comaprisons.xls (100 KB)



Something's wrong on your machine. I did the same thing (but using llvm-gcc for the .ll files). Using a debug build of the current ToT I got this:

[ghostwheel:~/Desktop] echristo% time ~/builds/build-llvm-64bit/Debug/bin/lli foo.bc.bc
0.210u 0.010s 0:00.22 100.0% 0+0k 0+0io 0pf+0w

That's a 64-bit build, but you'll notice the time difference. That said, I'm guessing something is missing, since it takes almost no time to execute. Step-by-step directions for what you did might help.

-eric

He is probably using the interpreter on a debug build.

Evan

Hi all,

LLVM is built without debug enabled. Also, I am not forcing lli to use interpreter mode, so I don't think the reason is a debug build or interpreter mode.

step 1:
compiled the 3 files (generic_replica.c, xacc.c and dacc.c) with clang-cc to LLVM bitcode files using -emit-llvm-bc and the (-O0/-O3) options
step 2:
the bitcode obtained from step 1 (generic_replica.bc, xacc.bc and dacc.bc) is passed to the opt tool using the (-O0/-O3) options
step 3:
the optimized bitcode obtained from step 2 (generic_replica.opt.bc, xacc.opt.bc and dacc.opt.bc) is combined into a single bitcode file (monolith.bc) using the llvm-ld tool
step 4:
ran monolith.bc for 10000 iterations using the lli tool and measured the time.
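
In command form, the steps look roughly like this (a sketch only; the exact optimization level is whichever of -O0/-O3 is being tested, and output file names are taken from the step descriptions above):

clang-cc -O0 -emit-llvm-bc generic_replica.c -o generic_replica.bc
clang-cc -O0 -emit-llvm-bc xacc.c -o xacc.bc
clang-cc -O0 -emit-llvm-bc dacc.c -o dacc.bc
opt -O3 generic_replica.bc -o generic_replica.opt.bc    # likewise for xacc.bc and dacc.bc
llvm-ld generic_replica.opt.bc xacc.opt.bc dacc.opt.bc -o monolith
time lli monolith.bc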

I also tried using llvm-gcc to emit the bitcode in step 1 but got almost the same output. As my entire setup is at the office, I can't attach my makefile today; I will attach the entire setup tomorrow once I get back to the office, along with the configuration options I used for compiling LLVM. Let me know if I am wrong anywhere.

Thanks & Regards,
Prasanth J

Sorry, I forgot to mention one thing: I downloaded the x86 binaries of llvm+clang and llvm-gcc from the LLVM download site. I hope that is not a debug build.

Prasanth J

Prasanth J <j.prasanth.j@gmail.com>
writes:

[snip]

So if I understand you correctly, you built executables with llvm-gcc and clang and ran them 10000 times, taking about 10 seconds. Then you generated some .bc files, combined and optimized them, and invoked lli 10000 times with the resulting .bc file.

lli needs to generate native code from the .bc file each time you invoke it, so it is not a fair comparison unless you are testing lli's native code generation speed.

So if your program executes fast (<1 ms) when compiled with llvm-gcc but has a moderately large (a few KB) .bc file, that could explain why lli seems slow.

If the .bc file is small then, for some unknown reason, lli may be using the interpreter instead of generating and running native code.

Which operating system do you use? How large is the .bc file you pass to lli? What's the output of running your .bc file with the -stats command-line option passed to lli? Is there any difference if you also pass lli the -force-interpreter option?
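
Concretely, something along these lines (a sketch; substitute the name of your linked bitcode file):

ls -l monolith.bc                          # how large is the linked bitcode?
time lli -stats monolith.bc                # -stats prints internal statistics on exit
time lli -force-interpreter monolith.bc    # force the interpreter for comparison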

Granted, I'm not up on using bitcode files, but I don't believe a debug build affects whether or not the JIT is used (non-interpretive mode). Ignoring other debug-build effects on the efficiency of the JITted code, it would be interesting if you could also measure the time to JIT without actually executing the 10000 iterations. I don't believe this would explain the time scale shown, but it should have some effect. To my mind, the reported time scale also implies interpretive mode, which you might be able to force to see whether it is the culprit. I'll help test when you supply the build (makefiles).
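
(One crude way to approximate that, as a sketch: time the static code generator on the same bitcode. llc is not the JIT, so this only gives a rough proxy for the code-generation cost.)

time llc monolith.bc -o monolith.s    # static codegen on the linked bitcode, run once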

Garrison

How are you doing this?

-eric

Hi all,

I have attached the complete test suite. It has separate directories for gcc, llvm-gcc, clang and lli-clang. The source code, makefile and run script (containing the number of times the program should execute) for each case are available inside each directory.
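
The run script is essentially a loop of the following shape (a simplified sketch; the attached script is authoritative, and the gcc/llvm-gcc/clang variants run the compiled binary instead of lli):

#!/bin/sh
# invoke the workload 10000 times; the whole script is wrapped in `time`
i=0
while [ $i -lt 10000 ]; do
  lli monolith.bc > /dev/null
  i=$((i+1))
done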

FOLLOWING ARE THE STATISTICS WHILE USING LLI FOR A SINGLE ITERATION

generic_asm.tgz (61.3 KB)

Prasanth J <j.prasanth.j@gmail.com>
writes:

[snip]

FOLLOWING ARE THE STATISTICS WHILE USING LLI FOR A SINGLE ITERATION

[snip]

real 0m0.043s

0.043 * 10000 = 430 seconds

[snip]

Even for a single iteration, the time taken for execution is pretty high compared to gcc, llvm-gcc and clang.
What should be the expected behavior when using lli? As per my understanding, since lli does runtime optimizations it should be faster than clang and llvm-gcc. Am I right?

As explained in a previous message, lli translates the LLVM bitcode to native code on each run. This is not comparable to gcc or clang, which create an executable that you then run separately.

I suspect that there are further reasons not to take your test code as a meaningful benchmark (it seems to execute too fast, i.e. you are actually testing process-creation overhead for the clang, llvm-gcc and gcc cases).

You can expect that as the test code grows, you will see larger differences between code executed with lli and code that was compiled into an executable file. Not because the code generated by lli is bad, but because lli needs to generate the code on each run.
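
If the goal is to compare the quality of the generated code rather than the per-run code-generation cost, one option (a sketch using only standard LLVM tools, not part of the original setup) is to compile the linked bitcode to a native executable once and then time that binary, just like the gcc/llvm-gcc/clang cases:

llc monolith.bc -o monolith.s    # code generation happens once, ahead of time
gcc monolith.s -o monolith       # assemble and link into a native executable
time ./monolith                  # per-run cost is now execution only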

[snip]