Slow jitter.

While compiling some sources, translating from my compiler's IR to LLVM
IR using the C++ API takes 2.5 seconds. If the resulting LLVM module is
dumped as LLVM assembler, the file is 240,000 lines long. Generating
LLVM code is fast.

However, generating the native code is quite slow: 33 seconds. I force
native code generation by calling ExecutionEngine::getPointerToFunction
for each function in the module.
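
Roughly like this (a sketch; EE is the ExecutionEngine and module is
the llvm::Module):

  // Force eager native codegen of every defined function,
  // skipping external declarations.
  for (llvm::Module::iterator F = module->begin(), E = module->end();
       F != E; ++F)
    if (!F->isDeclaration())
      EE->getPointerToFunction(&*F);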

This is on x86/Windows/MinGW. The only pass is TargetData, so no fancy
optimizations.

I don't think that a static compiler (llvm-gcc, for instance) needs so
much time to generate unoptimized native code for a similarly sized
module. Is there something special about the JIT that makes it so slow?

The JIT uses the entire code generator, which uses O(N^2) algorithms in some cases. If you care about compile time, I'd strongly suggest using the "local" register allocator and the "-fast" mode. This is what we do for -O0 compiles and it is much, much faster than the defaults. However, you get worse-performing code out of the compiler.
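
With llc that's something like this (exact flag spellings vary a bit
between releases):

  llc -regalloc=local -pre-RA-sched=fast code.bc -o code.s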

-Chris

For comparison, how long does it take to write the whole thing out as
native assembler? What optimization level are you using for code
generation?

-Eli

Chris Lattner <clattner@apple.com> writes:

The JIT uses the entire code generator, which uses O(N^2) algorithms
in some cases. If you care about compile time, I'd strongly suggest
using the "local" register allocator and the "-fast" mode. This is
what we do for -O0 compiles and it is much, much faster than the
defaults.

Okay, I'll do that if some day I figure out how to pass those options
to the JIT :-)

However, you get worse-performing code out of the compiler.

This affects the quality of register allocation and instruction
selection, but optimization passes (inlining, mem2reg, etc.) are still
effective, aren't they?

Thanks.

Eli Friedman <eli.friedman@gmail.com> writes:

[snip]

For comparison, how long does it take to write the whole thing out as
native assembler?

What kind of metric is this? How are string manipulation and I/O a
better indication than the number of LLVM assembly lines generated, or
the ratio of LLVM IR generation time to native code generation time?

What optimization level are you using for code
generation?

As explained in the original post, there are no optimizations
whatsoever.

After reading Chris' message, my only hope is to disable the non-linear
stuff and still get decent native code.

I wanted the comparison to check whether the issue is just "codegen is
slow", or more specifically that JIT codegen is slow. You seem to be
under the impression that it will be significantly slower, but I don't
think it's self-evident. (The output of "time llc dumpedmodule.bc"
would be sufficient.)

-Eli

Óscar Fuentes wrote:

Okay, I'll do that if some day I figure out how to pass those options
to the JIT :-)

Well, the -fast option is easy to get:

  // Create the execution engine with codegen optimizations disabled.
  MP = new ExistingModuleProvider(module);
  std::string error;
  JIT = ExecutionEngine::create(MP, false, &error, llvm::CodeGenOpt::None);

Albert

Albert Graef <Dr.Graef@t-online.de> writes:

Óscar Fuentes wrote:

Okay, I'll do that if some day I figure out how to pass those options
to the JIT :-)

Well, the -fast option is easy to get:

  // Create the execution engine with codegen optimizations disabled.
  MP = new ExistingModuleProvider(module);
  std::string error;
  JIT = ExecutionEngine::create(MP, false, &error, llvm::CodeGenOpt::None);

Thanks, Albert.

With this change the time used by code generation goes down from 33
seconds to 26.5.

Eli Friedman <eli.friedman@gmail.com> writes:

[snip]

I wanted the comparison to check whether the issue is just "codegen is
slow", or more specifically that JIT codegen is slow. You seem to be
under the impression that it will be significantly slower, but I don't
think it's self-evident. (The output of "time llc dumpedmodule.bc"
would be sufficient.)

Sorry, Eli. I misread your message as suggesting that I measure the
time required to dump the module as LLVM assembler.

llc needs 45 seconds. This is far worse than the 33 seconds used by the
JIT. Maybe llc is running optimizations; my JIT has no optimizations
enabled.

Yup, llc -O0 takes 37.5 seconds.

llc -pre-RA-sched=fast -regalloc=local takes 26 seconds. Much better,
but still slow IMO. The question is whether this avoids the non-linear
algorithms and whether the generated code is fast enough to justify
using LLVM. I'll do some experimentation.

The generated assembly file is 290K lines for unadorned llc and 616K
lines for -pre-RA-sched=fast -regalloc=local. This does not inspire much
hope :-)

Is this a Release or a Release-Asserts build? You could check how much
time it takes with a Release-Asserts build.

Also, if you use -time-passes with llc, it should show which pass in
llc takes so much time.
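
For example, assuming the dumped module is in code.bc:

  llc -O0 -time-passes code.bc -o code.s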

Best regards,
--Edwin

Óscar Fuentes wrote:

With this change the time used by code generation goes down from 33
seconds to 26.5.

... and that's probably not worth it because of the loss of code
quality. In Pure I always use llvm::CodeGenOpt::Aggressive, although
there's a preprocessor symbol to select llvm::CodeGenOpt::None at
compile time.

I also found that in Pure the lion's share of compilation time is spent
in the JIT (and I have a bunch of optimization passes enabled, which
don't add much to the total compilation time). That's why I always let
the JIT do its lazy compilation thing, which is quite sensible in an
interactive, interpreter-like environment. If people want to get rid of
the JIT latency, they have the option of compiling their Pure scripts to
native executables. This approach works very well for me.

Just my 0.02c.

Albert

Hello Török.

Török Edwin <edwintorok@gmail.com> writes:

[snip]

Is this a Release or a Release-Asserts build? You could check how much
time it takes with a Release-Asserts build.

Assertions are disabled.

Also, if you use -time-passes with llc, it should show which pass in
llc takes so much time.

These are the three main culprits for llc -O0:

   ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
  10.9531 ( 30.0%) 0.4687 ( 58.8%) 11.4218 ( 30.6%) 11.5468 ( 30.6%) X86 DAG->DAG Instruction Selection
  10.2500 ( 28.0%) 0.0156 ( 1.9%) 10.2656 ( 27.5%) 10.2500 ( 27.2%) Live Variable Analysis
   4.8593 ( 13.3%) 0.0000 ( 0.0%) 4.8593 ( 13.0%) 4.8593 ( 12.9%) Linear Scan Register Allocator

And these are for -pre-RA-sched=fast -regalloc=simple -O0 code.bc:

  10.7187 ( 45.4%) 0.4375 ( 60.8%) 11.1562 ( 45.8%) 11.1718 ( 45.4%) X86 DAG->DAG Instruction Selection
   7.4687 ( 31.6%) 0.0156 ( 2.1%) 7.4843 ( 30.7%) 7.5312 ( 30.6%) Simple Register Allocator
   1.9531 ( 8.2%) 0.1406 ( 19.5%) 2.0937 ( 8.6%) 2.1093 ( 8.5%) X86 Intel-Style Assembly Printer

I suppose we can't get rid of instruction selection :-)

Another important flag for testing llc time is llc -asm-verbose=false.
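
For example (file name illustrative):

  llc -O0 -asm-verbose=false code.bc -o code.s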

Dan

Hello Dan.

Dan Gohman <gohman@apple.com> writes:

Albert Graef <Dr.Graef@t-online.de> writes:

[snip]

... and that's probably not worth it because of the loss of code
quality.

Agreed. The JIT with default options produces slightly slower code than
a brain-dead alternative backend my compiler has. I expect that once
optimization passes are added to the JIT, its code will become
"industrial grade". That's the justification for adding LLVM support,
but if application startup needs several minutes...

In Pure I always use llvm::CodeGenOpt::Aggressive, although
there's a preprocessor symbol to select llvm::CodeGenOpt::None at
compile time.

I also found that in Pure the lion's share of compilation time is spent
in the JIT (and I have a bunch of optimization passes enabled, which
don't add much to the total compilation time). That's why I always let
the JIT do its lazy compilation thing, which is quite sensible in an
interactive, interpreter-like environment. If people want to get rid of
the JIT latency, they have the option of compiling their Pure scripts to
native executables. This approach works very well for me.

Sadly, I cannot produce executables (at least for a large part of the
application's code), and freezing the application for several seconds
here and there while the JIT does its stuff is not an option either, so
I'm forced to JIT all the code on startup.

Pass -fast-isel to speed up instruction selection.

Dan, I think that this should be made "non-hidden" and updated (from llc --help):

   -fast-isel - Enable the experimental "fast" instruction selector

-Chris

It's turned on by -O0. And I guess it's not so "experimental" at this point :). It hasn't been tuned for a wide variety of applications yet, though.

An interesting option to add is -fast-isel-verbose, which prints out LLVM instructions that aren't going down the fast path. If there's something that shows up a lot, it may be worthwhile looking into why the front-end is using it, or looking into adding support for that instruction to the fast path.
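
For example (file name illustrative; -O0 already enables the fast
instruction selector):

  llc -O0 -fast-isel-verbose code.bc -o code.s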

LLVM has made progress in this area, but there's more to be done.

Dan

Dan Gohman <gohman@apple.com> writes:

[snip]

An interesting option to add is -fast-isel-verbose, which prints out
LLVM instructions that aren't going down the fast path. If there's
something that shows up a lot, it may be worthwhile looking into why
the front-end is using it, or looking into adding support for that
instruction to the fast path.

From a bytecode file that disassembles into a 240K-line LLVM assembly
file, -fast-isel-verbose outputs ~7600 missed instructions.

There are lots of loads/stores of boolean values (i1), bitcasts, and
calls.

store i1 : 1802 occurrences (23%)
load i1* : 1076 occurrences (13%)
call : 2590 occurrences (34%)
bitcast : 1848 occurrences (24%)

Almost all the calls have a void return value and use the sret attribute.
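
That is, calls of this shape (names and types made up for
illustration):

  %tmp = alloca %struct.T
  call void @f(%struct.T* sret %tmp, i32 %x)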

I can send the bytecode file to anyone interested.

Albert Graef <Dr.Graef@t-online.de> writes:

[snip]

... and that's probably not worth it because of the loss of code
quality.

Agreed. The JIT with default options produces slightly slower code than
a brain-dead alternative backend my compiler has. I expect that once
optimization passes are added to the JIT, its code will become
"industrial grade". That's the justification for adding LLVM support,
but if application startup needs several minutes...

The LLVM JIT does not run optimizations. If that's important for your application, you need to add the optimization passes before passing the bitcode to the JIT. Of course those optimization passes take time, but they may significantly speed up codegen.
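
A minimal sketch of that setup (the pass selection is just an example):

  // Run some IR optimizations over the module before handing it to
  // the JIT. (Headers: llvm/PassManager.h, llvm/Target/TargetData.h,
  // llvm/Transforms/IPO.h, llvm/Transforms/Scalar.h.)
  llvm::PassManager PM;
  PM.add(new llvm::TargetData(module));               // layout info
  PM.add(llvm::createFunctionInliningPass());         // inlining
  PM.add(llvm::createPromoteMemoryToRegisterPass());  // mem2reg
  PM.run(*module);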

Evan

Dan Gohman <gohman@apple.com> writes:

[snip]

An interesting option to add is -fast-isel-verbose, which prints out
LLVM instructions that aren't going down the fast path. If there's
something that shows up a lot, it may be worthwhile looking into why
the front-end is using it, or looking into adding support for that
instruction to the fast path.

From a bytecode file that disassembles into a 240K-line LLVM assembly
file, -fast-isel-verbose outputs ~7600 missed instructions.

There are lots of loads/stores of boolean values (i1), bitcasts, and
calls.

store i1 : 1802 occurrences (23%)
load i1* : 1076 occurrences (13%)

I've added fast-path support for loads and stores of i1 now.

call : 2590 occurrences (34%)

The fast-path doesn't currently support sret (which you mention below).

bitcast : 1848 occurrences (24%)

For bitcasts, it depends on the specific types involved.

Dan