GSoC 2011: Fast JIT Code Generation for x86-64

Hi All,

I'd like to propose a fast path through code generation for x86-64 in
the JIT execution engine as part of 2011's Google Summer of Code
program.

While the LLVM-JIT is very popular as a first try at jitting
programming languages, projects have abandoned the LLVM-JIT when
disappointed with the overall runtime. The problem is that the
benefit of faster execution is traded for longer compile time -- which
only pays off for code that is executed frequently. One solution to
this problem is an adaptive compilation scheme with separate compile
strategies for cold and hot code.

The aim of my project is to create the fast path for code that is
compiled the first time in such an adaptive compilation scheme: a code
generator that produces unoptimized code in a very short time.

I plan to implement a two-pass (almost) linear code generator
specifically for x86-64 that

- performs analyses (e.g. live-range analysis) on LLVM-IR in the
   first pass and

- then generates x86-64 instructions directly from IR in a second
   pass that writes to the executable memory (e.g. in
   X86CodeEmitter.cpp),

circumventing the more expensive backend passes.
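
To make the intended shape concrete, here is a minimal sketch. The
X86Emitter interface is made up for illustration (it is not an existing
LLVM class), and the last-use scan is deliberately naive -- it ignores
values that stay live across loop back edges, which a real live-range
analysis has to handle:

#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Hypothetical emitter interface -- not an existing LLVM class.
struct X86Emitter {
  void bindLabel(const BasicBlock *BB);  // record code offset of BB
  void emit(const Instruction &I);       // splat the bytes for one IR inst
  void freeRegisterFor(const Value *V);  // recycle the register holding V
  void fixupBranches();                  // patch recorded branch sites
};

// Pass 1: remember the last use of every value, so pass 2 knows when a
// register can be recycled. (Naive: wrong for loop-carried values.)
static DenseMap<const Value *, const Instruction *>
computeLastUses(const Function &F) {
  DenseMap<const Value *, const Instruction *> LastUse;
  for (const BasicBlock &BB : F)
    for (const Instruction &I : BB)
      for (const Value *Op : I.operands())
        LastUse[Op] = &I;  // later visits overwrite earlier ones
  return LastUse;
}

// Pass 2: walk the IR once more, emitting x86-64 bytes directly and
// freeing registers at the recorded last uses; branch targets are
// patched at the end, once all block offsets are known.
void emitFunction(const Function &F, X86Emitter &E) {
  DenseMap<const Value *, const Instruction *> LastUse = computeLastUses(F);
  for (const BasicBlock &BB : F) {
    E.bindLabel(&BB);
    for (const Instruction &I : BB) {
      E.emit(I);
      for (const Value *Op : I.operands())
        if (LastUse.lookup(Op) == &I)
          E.freeRegisterFor(Op);
    }
  }
  E.fixupBranches();
}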

This code generator can then be part of an adaptive compilation
framework within LLVM (a GSoC proposal in this direction is currently
discussed on the llvm-dev mailing list[1]), or from outside of LLVM --
the latter being my main motivation.

I currently work on generating fast cycle-accurate simulators[2]. For
this, our institute has implemented a two-part adaptive compilation
scheme using the LLVM-JIT. Although most optimizations are turned off
already and the FastISel instruction selector is used, the "fast" path
for first-time code generation is still the bottleneck of the
simulators. This is largely due to the SelectionDAG instruction
selection process, hence the motivation for a simpler, two-pass code
generator.

As for my personal details, I'm a PhD student at Vienna University of
Technology (TU Wien) with a strong background in compiler theory,
acquired in a wide variety of undergraduate- and graduate-level
courses.

I appreciate any suggestions and would be very excited if someone is
interested in mentoring this.

Please note that I'm offline until April 4, so I cannot respond before
next Tuesday.

- Viktor Pavlu

Hi Viktor,

I think this is a great idea overall! This problem is something that has almost turned me away from LLVM several times now.

I’m by no means an influential member of the community (and hence have no real say in GSoC projects), but I do have a few comments.

I plan to implement a two-pass (almost) linear code generator
specifically for x86-64 that

  • performs analyses (e.g. live-range analysis) on LLVM-IR in the
    first pass and

I assume this is for collecting information for register allocation? For fast code generation, I would go with a local, bottom-up, linear register allocator, which shouldn’t require an explicit live-range analysis pass. It only needs to know liveness information within a single block (mostly), which should be easier and faster to compute on-demand instead of in an analysis pass.
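
Roughly what I have in mind, as a sketch -- the RegisterFile and emit()
helpers are made up, only the IR walking is real LLVM API:

#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

struct RegisterFile {             // hypothetical allocator state
  bool holds(const Value *V);
  void assign(Value *V);          // may spill the farthest-next-use value
  void release(const Value *V);
  void spillLiveOut();
};
void emit(const Instruction &I, RegisterFile &Regs);  // hypothetical

void allocateBlock(BasicBlock &BB, RegisterFile &Regs) {
  // Backward scan of just this block: the first use seen from the bottom
  // is the last use in program order, i.e. where the value dies locally.
  DenseMap<const Value *, const Instruction *> DiesAt;
  for (auto It = BB.rbegin(); It != BB.rend(); ++It)
    for (const Value *Op : It->operands())
      DiesAt.try_emplace(Op, &*It);  // keep the first (latest) entry

  // Forward scan: hand out registers on demand, recycle them at the
  // recorded death points, spill when the register file is full.
  for (Instruction &I : BB) {
    for (Value *Op : I.operands())
      if (!Regs.holds(Op))
        Regs.assign(Op);
    emit(I, Regs);
    for (Value *Op : I.operands())
      if (DiesAt.lookup(Op) == &I)
        Regs.release(Op);
  }
  Regs.spillLiveOut();  // conservative across block boundaries
}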

  • then generates x86-64 instructions directly from IR in a second
    pass that writes to the executable memory (e.g. in
    X86CodeEmitter.cpp),

It sounds as if you are intending to mostly hand-write the code generation part. If this is the case, I would suggest that it would be significantly more valuable to generate it from the *.td files instead. That way, it should be a lot easier to port to other architectures.

Sincerely,
Joshua

Viktor Pavlu <vpavlu@gmail.com> writes:

[snip]

While the LLVM-JIT is very popular as a first try at jitting
programming languages, projects have abandoned the LLVM-JIT when
disappointed with the overall runtime.

Hear, hear.

The problem is that the benefit of faster execution is traded for
longer compile time -- which only pays off for code that is executed
frequently. One solution to this problem is an adaptive compilation
scheme with separate compile strategies for cold and hot code.

[snip]

I currently work on generating fast cycle-accurate simulators[2]. For
this, our institute has implemented a two-part adaptive compilation
scheme using the LLVM-JIT. Although most optimizations are turned off
already and the FastISel instruction selector is used, the "fast" path
for first-time code generation is still the bottleneck of the
simulators. This is largely due to the SelectionDAG instruction
selection process, hence the motivation for a simpler, two-pass code
generator.

Well, anything that makes the JIT usable for those of us compiling
medium-sized code (on the order of hundreds of KB to a few MB of
generated native code) is greatly appreciated.

As a means of improving runtime performance, my compiler supports both
the LLVM JIT and a dumb X86 assembler generator that performs only very
simple optimizations and has some hard constraints. The latter runs in a
fraction of the time and performs very similarly to, or better than, the
LLVM JIT (without LLVM's optimization passes). So I'm pretty sure that
it is possible to dramatically reduce the time required by the JIT
without a severe impact on performance.

[snip]

This is effectively what fastisel was created for - there are just IR
constructs that don't go through that path. The idea is that fastisel
will get most of the IR and everything that'd be really hard we just
punt to the DAG. I imagine running more things through fastisel would
help.

That won't help the slow register allocation problem though - even
the fast allocator is pretty slow. I haven't seen what your plan
is for register allocation - or were you planning on just using a few
registers in defined ways?

Also, X86CodeEmitter.cpp is going away to be replaced with the MC
emitters.

-eric

[...] Although most optimizations are turned off
already and the FastISel instruction selector is used, the "fast" path
for first-time code generation is still the bottleneck [...]

This is effectively what fastisel was created for - there are just IR
constructs that don't go through that path. The idea is that fastisel
will get most of the IR and everything that'd be really hard we just
punt to the DAG. I imagine running more things through fastisel would
help.

To me, increasing the coverage of FastISel seemed more involved than
directly emitting opcodes to memory, and less promising as a way to
reduce overhead.

That won't help the slow register allocation problem though - even
the fast allocator is pretty slow. I haven't seen what your plan
is for register allocation - or were you planning on just using a few
registers in defined ways?

My first idea was to implement a linear scan allocator integrated
into the code generation pass.

Also, X86CodeEmitter.cpp is going away to be replaced with the MC
emitters.

Yes, I remember reading about this on the mailing list.
With our simulator generators we are still living in 2.2/2.6 land,
but we will change that.

X86CodeEmitter was only meant to indicate that in my intended fast
path there is nothing in between the LLVM-IR passes and the final
emission of the code, i.e. an LLVM-IR pass that produces x86-64.

- Viktor

[...] Although most optimizations are turned off
already and the FastISel instruction selector is used, the "fast" path
for first-time code generation is still the bottleneck [...]

This is effectively what fastisel was created for - there are just IR
constructs that don't go through that path. The idea is that fastisel
will get most of the IR and everything that'd be really hard we just
punt to the DAG. I imagine running more things through fastisel would
help.

To me, increasing the coverage of FastISel seemed more involved than
directly emitting opcodes to memory, and less promising as a way to
reduce overhead.

That seems extremely unlikely. You'd be effectively re-implementing both fast-isel and the MC binary emitter layers, and it sounds like a new register allocator as well.

What Eric is suggesting is instead locating which IR constructs are not being handled by fast-isel and are causing problems (i.e., are being frequently encountered in your code-base) and implementing fast-isel handling for them. That will remove the selectiondag overhead that you've identified as the primary compile-time problem.

-Jim

An alternative that would expand LLVM's capabilities would be to write an
interpreter for the LLVM IR itself. A well-written interpretation framework
could be used by the compiler as well.

- Jan

[...] Although most optimizations are turned off
already and the FastISel instruction selector is used, the "fast" path
for first-time code generation is still the bottleneck [...]

This is effectively what fastisel was created for - there are just IR
constructs that don't go through that path. The idea is that fastisel
will get most of the IR and everything that'd be really hard we just
punt to the DAG. I imagine running more things through fastisel would
help.

To me, increasing the coverage of FastISel seemed more involved than
directly emitting opcodes to memory, and less promising as a way to
reduce overhead.

Then you're not quite understanding what fast-isel does. The idea
behind fast-isel is that common code that can easily be splatted out
as what is effectively assembly is handled that way. If you're seeing
a lot of time in the dag instruction selection then the code you're
putting through fast-isel isn't getting all the way through and
fast-isel is punting to selection dag.

If this is happening a lot you'll probably want to change the IR that
you're generating if possible. If you'd like to see the constructs
that fast-isel is punting to selection dag on, then there are options
to get it to be more verbose, or even abort.
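
(For reference, I believe those are the -fast-isel-verbose and
-fast-isel-abort options, though I may be misremembering the exact
spellings.)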

That won't help the slow register allocation problem though - even
the fast allocator is pretty slow. I haven't seen what your plan
is for register allocation - or were you planning on just using a few
registers in defined ways?

My first idea was to implement a linear scan allocator integrated
into the code generation pass.

You may want to look at the fast register allocator then.

Also, X86CodeEmitter.cpp is going away to be replaced with the MC
emitters.

Yes, I remember reading about this on the mailing list.
With our simulator generators we are still living in 2.2/2.6 land,
but we will change that.

X86CodeEmitter was only meant to indicate that in my intended fast
path there is nothing in between the LLVM-IR passes and the final
emission of the code, i.e. an LLVM-IR pass that produces x86-64.

Effectively what you're talking about then is a pass that rewrites
fast-isel to use MC instead of machine instructions. Also, one of
the advantages of fast-isel is that there's very little that has to
go through the hand-coded parts - a great deal is autogenerated out
of the .td files. This is something that any replacement should do
as well.

-eric

Jim Grosbach <grosbach@apple.com> writes:

To me, increasing the coverage of FastISel seemed more involved than
directly emitting opcodes to memory, and less promising as a way to
reduce overhead.

That seems extremely unlikely. You'd be effectively re-implementing
both fast-isel and the MC binary emitter layers, and it sounds like a
new register allocator as well.

What Eric is suggesting is instead locating which IR constructs are
not being handled by fast-isel and are causing problems (i.e., are
being frequently encountered in your code-base) and implementing
fast-isel handling for them. That will remove the selectiondag
overhead that you've identified as the primary compile-time problem.

At some point in the past someone was kind enough to add fast-isel for
some instructions frequently emitted by my compiler, hoping that that
would speed up JITting. The results were disappointing (negligible,
IIRC). Either fast-isel does not make much of a difference or the main
inefficiency is elsewhere.

Bug number?

Seriously, if you haven't looked at how fast-isel works then you need to.
It is, in theory, almost no different than what he's planning on doing.
You may still be falling into selection dag. If you're not then some
investigation may be enlightening.

-eric

Hi Viktor,

Jim Grosbach <grosbach@apple.com> writes:

To me, increasing the coverage of FastISel seemed more involved than
directly emitting opcodes to memory, and less promising as a way to
reduce overhead.

That seems extremely unlikely. You’d be effectively re-implementing
both fast-isel and the MC binary emitter layers, and it sounds like a
new register allocator as well.

What Eric is suggesting is instead locating which IR constructs are
not being handled by fast-isel and are causing problems (i.e., are
being frequently encountered in your code-base) and implementing
fast-isel handling for them. That will remove the selectiondag
overhead that you’ve identified as the primary compile-time problem.

At some point in the past someone was kind enough to add fast-isel for
some instructions frequently emitted by my compiler, hoping that that
would speed up JITting. The results were disappointing (negligible,
IIRC). Either fast-isel does not make much of a difference or the main
inefficiency is elsewhere.

fast-isel discussion aside, I think the real speed killer for a dynamic binary translator (or other users of the JIT which invoke it many times on small pieces of code) is the constant overhead the JIT incurs for every source ISA BB (each BB gets mapped to an LLVM Function).

[1] cites a constant overhead of 10 ms per BB. I just did some simple measurements with callgrind, running lli on a simple .ll file that contains only a main function which returns immediately. With -regalloc=fast and -fast-isel and an -O2 compiled lli we spend about 725000 instructions in getPointerToFunction(). Clearly, that's quite some constant overhead, and I doubt that we can get it down by two orders of magnitude, so what about this:

The old qemu JIT used an extremely simple and fast approach which performed surprisingly well: Having chunks of precompiled machine code (from C sources) for the individual IR instructions which at runtime get glued together and patched as necessary.

The idea would be to use the same approach to generate machine code from LLVM IR, e.g. having chunks of LLVM MC instructions for the individual LLVM IR instructions (ideally describing the mapping with TableGen), gluing them together with no dynamic register allocation and no scheduling.
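
To illustrate just the runtime half of the idea (all names here are made up, and real chunks would of course live in executable memory): gluing is little more than a memcpy per IR instruction plus patching of operand fields at offsets known from the chunk description.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// One precompiled machine-code chunk per IR opcode, produced ahead of
// time; ImmOffset records where a 32-bit operand field sits in the chunk.
struct Chunk {
  const uint8_t *Bytes;
  size_t Size;
  size_t ImmOffset;
};

class ChunkBuffer {
  std::vector<uint8_t> Buf;  // stand-in for executable memory
public:
  // Append one chunk and patch in the concrete 32-bit operand.
  void glue(const Chunk &C, uint32_t Operand) {
    size_t Base = Buf.size();
    Buf.insert(Buf.end(), C.Bytes, C.Bytes + C.Size);
    std::memcpy(&Buf[Base + C.ImmOffset], &Operand, sizeof(Operand));
  }
};

Translating a function then reduces to one table lookup (IR opcode to chunk) and one glue() call per instruction -- no selection, no scheduling.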

I’d be willing to mentor such a project, let me know if you’re interested.

Regards,

Tilmann

[1] http://www.iaeng.org/publication/IMECS2011/IMECS2011_pp212-216.pdf

*nod* If we were going to do that, I'd be up for replacing the scheme in fast-isel with that. I just don't see how the general method is going to be any different. Either you have to cover every bit of the IR, or you don't and you punt to a scheme like the DAG which has to cover everything.

That said, the compilation itself could be way slow as you mentioned with the binary translation. I'd be really curious where the overhead is. I'd like to get that down for sure.

-eric

Hi Viktor,

> Jim Grosbach <grosbach@apple.com> writes:
>
> >> To me, increasing the coverage of FastISel seemed more involved than
> >> directly emitting opcodes to memory, and less promising as a way to
> >> reduce overhead.
> >
> > That seems extremely unlikely. You'd be effectively re-implementing
> > both fast-isel and the MC binary emitter layers, and it sounds like a
> > new register allocator as well.

it should be possible to leverage some of the existing infrastructure.
in particular, as Tilmann said, the MC binary emitter.

> At some point in the past someone was kind enough to add fast-isel for
> some instructions frequently emitted by my compiler, hoping that that
> would speed up JITting. The results were disappointing (negligible,
> IIRC). Either fast-isel does not make much of a difference or the main
> inefficiency is elsewhere.

going through machine-level IR is certainly one of those inefficiencies.
we should try to do code generation in two passes: one over the IR
generating binary code, the second fixing up relocations on the binary
code. instruction selection and register allocation should be performed
in one go.

everyone in need of something more sophisticated can fall back to the
regular backend flow.
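
the second pass can then be tiny. a sketch, with all names made up:

#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Pass 1 emits bytes and records every branch/call site whose target
// offset is not known yet; pass 2 patches them once all offsets are.
struct Reloc {
  size_t Site;         // offset of the 4-byte rel32 field to patch
  const void *Target;  // the block/symbol the field must refer to
};

void fixupRelocations(std::vector<uint8_t> &Code,
                      const std::vector<Reloc> &Relocs,
                      const std::unordered_map<const void *, size_t> &Offsets) {
  for (const Reloc &R : Relocs) {
    // x86-64 rel32 is relative to the end of the displacement field.
    int32_t Rel = int32_t(Offsets.at(R.Target)) - int32_t(R.Site + 4);
    std::memcpy(&Code[R.Site], &Rel, sizeof(Rel));
  }
}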

fast-isel discussion aside, I think the real speed killer for a dynamic
binary translator (or other users of the JIT which invoke it many times on
small pieces of code) is the constant overhead the JIT incurs for every
source ISA BB (each BB gets mapped to an LLVM Function).

[1] cites a constant overhead of 10 ms per BB. I just did some simple
measurements with callgrind, running lli on a simple .ll file that contains
only a main function which returns immediately. With -regalloc=fast and
-fast-isel and an -O2 compiled lli we spend about 725000 instructions in
getPointerToFunction(). Clearly, that's quite some constant overhead, and I
doubt that we can get it down by two orders of magnitude, so what about
this:

i fully agree. i did some measurements (quite a while ago) and the
backend part was always dominating, even when rather heavy high-level
optimizations were enabled.

in short, the JIT is not a JIT. it is merely a regular backend emitting
instructions directly to memory.
(that's neat, but does not deliver the compile time people hope for)

The old qemu JIT used an extremely simple and fast approach which performed
surprisingly well: Having chunks of precompiled machine code (from C
sources) for the individual IR instructions which at runtime get glued
together and patched as necessary.

The idea would be to use the same approach to generate machine code from
LLVM IR, e.g. having chunks of LLVM MC instructions for the individual LLVM
IR instructions (ideally describing the mapping with TableGen), gluing them
together with no dynamic register allocation and no scheduling.

that is the way to go for me.

i just see one large obstacle: how do we handle lowering?

lowering is needed to realize unsupported operations, ABI conventions, ...
most of this is now done in C++ code that is highly dependent on the DAG.

i would suggest doing lowering on the linear LLVM IR, handling all relevant
constructs there so that machine code can be generated directly from the
IR. the DAG-based lowering could then be completely eliminated (or stripped
down to a form of lowering/ABI/pre-isel optimization).

I'd be willing to mentor such a project, let me know if you're interested.

i think this would be an important step to make the LLVM JIT more attractive.

bye,
Florian

Thanks for all the replies!

I wanted to closely follow what the CACAO VM[1] backend has done with
success for a long time: for every CACAO IR instruction, there is a
sequence of x86 instructions that gets written directly to executable
memory. In CACAO, registers are used while available, then everything
is spilled. Relocations are resolved and patched in a second pass.

It seems this is similar to what Tilmann refers to in the old qemu JIT:

The old qemu JIT used an extremely simple and fast approach which performed
surprisingly well: Having chunks of precompiled machine code (from C
sources) for the individual IR instructions which at runtime get glued
together and patched as necessary.

The idea would be to use the same approach to generate machine code from
LLVM IR, e.g. having chunks of LLVM MC instructions for the individual LLVM
IR instructions (ideally describing the mapping with TableGen), gluing them
together with no dynamic register allocation and no scheduling.

I'd be willing to mentor such a project, let me know if you're interested.

So yes, I would be interested.

Only recently has CACAO started to get a register allocator to improve
the quality of the generated code.
I wanted to include this in my first stab at the project, but leaving
register allocation for future work, or even to the regular backend, is
fine with me too.

- Viktor

[1]: CACAO VM
http://www.cacaovm.org/

Hi Viktor,

Thanks for all the replies!

I wanted to closely follow what the CACAO VM[1] backend has done with
success for a long time: for every CACAO IR instruction, there is a
sequence of x86 instructions that gets written directly to executable
memory. In CACAO, registers are used while available, then everything
is spilled. Relocations are resolved and patched in a second pass.

Sounds good!

I think what we want here is something as generic as possible, e.g. it would be interesting to investigate whether it is possible to generate parameterizable chunks of machine code directly from LLVM IR instructions using the existing codegen infrastructure (when the LLVM JIT itself is built, like qemu does when invoking a C compiler on the C sources which implement the individual IR instructions). Leveraging as much as possible of the TableGen instruction descriptions and the MC infrastructure should also be an explicit goal.

I’d be willing to mentor such a project, let me know if you’re interested.

So yes, I would be interested.

Awesome!

So I guess the next step is to come up with a detailed proposal and submit it :-)

Only recently has CACAO started to get a register allocator to improve
the quality of the generated code.
I wanted to include this in my first stab at the project, but leaving
register allocation for future work, or even to the regular backend, is
fine with me too.

Regards,

Tilmann