MCJIT Runtime Performance

Hi All,

We recently upgraded a number of applications from LLVM 3.5.2 (old JIT) to LLVM 3.7.1 (MCJIT).

We made the minimum changes needed for the switch (no changes to the IR generated or the IR optimizations applied).

The resulting code passes all tests (8000+).

However, the runtime performance dropped significantly: 30% to 40% for all applications.

The applications I am talking about optimize airline rosters and pairings. LLVM is used for compiling high level business rules to efficient machine code.

A typical optimization run takes 6 to 8 hours, so a 30% to 40% reduction in speed has real impact (it means we can't upgrade from 3.5.2).

We have triple-checked and reviewed the changes we made from the old JIT to MCJIT. We also tried different ways to optimize the IR.

However, all results indicate that the performance drop happens in the (black box) IR-to-machine-code stage.

So my question is: is the runtime performance reduction known/expected for MCJIT vs. the old JIT? Or might we be doing something wrong?

If you need more information in order to understand the issue, please let us know and we will provide more details.

Thanks
Morten

Yes, unfortunately, this is very much known. Over in the julia project, we’ve recently gone through this and taken the hit (after doing some work to fix the very extreme corner cases that we were hitting). We’re not entirely sure why the slowdown is this noticeable, but at least in our case, profiling didn’t reveal any remaining low-hanging fruit responsible for it. One thing you can potentially try, if you haven’t yet, is to enable fast ISel and see if that brings you closer to the old runtimes.

From: "Keno Fischer via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Morten Brodersen" <Morten.Brodersen@constrainttec.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Thursday, February 4, 2016 6:05:29 PM
Subject: Re: [llvm-dev] MCJIT Runtime Performance


And maybe the register allocator? Are you using the greedy one or the linear one? Are there any other MI-level optimizations running?

-Hal

These are some pretty extreme slowdowns. The legacy JIT shared the code generator with MCJIT, and as far as I’m aware there were really only three main differences:

  1. The legacy JIT used a custom instruction encoder, whereas MCJIT uses MC.
  2. (Related to 1) MCJIT needs to perform runtime linking of the object files produced by MC.
  3. MCJIT does not compile lazily (though it sounds like that’s not an issue here?)

Keno - did you ever look at the codegen pipeline construction for the legacy JIT vs MCJIT? Are we choosing different passes?

Morten - Can you share any test cases that demonstrate the slowdown? I’d love to take a look at this.

Cheers,
Lang.

We are using the same IR passes. We did not look at the backend passes other than fast ISel, because I didn’t realize we had a choice there. Do we? In our profiling, nothing MCJIT-specific (relocations, etc.) takes a significant amount of time. As far as we could tell, most of the slowdown was in ISel, with a couple of additional percent in various IR passes.

Hi Keno,

… I didn’t realize we had a choice there…

You do, though I don’t think the dials and levers have been plumbed up to the interface. I’m happy to take a look at doing that. I’d be very happy for clients to have more options here.

Cheers,
Lang.
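Until those dials are exposed, one workaround some JIT clients use is to set the backend's internal cl::opt flags programmatically. A sketch, with the caveat that these are internal flags, not a stable API, and the flag names can change between releases:

```cpp
#include "llvm/Support/CommandLine.h"

// Sketch: backend knobs such as the register allocator are registered as
// cl::opt flags, so a host application can set them as if they had been
// passed on the command line. Call once, early, before creating the engine.
static void configureBackendFlags() {
  const char *Args[] = {"myapp",             // dummy program name
                        "-regalloc=greedy"}; // or "basic", "fast", "pbqp"
  llvm::cl::ParseCommandLineOptions(sizeof(Args) / sizeof(Args[0]), Args);
}
```

Whether a given flag is actually honored on the JIT codegen path is worth verifying, e.g. by also passing "-debug-pass=Structure" and inspecting the pass list.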

Hi Keno,

Thanks for the fast ISel suggestion.

Here are the results (for a small but representative run):

LLVM 3.5.2 (old JIT): 4m44s

LLVM 3.7.1 (MCJIT) no fast ISel: 7m31s

LLVM 3.7.1 (MCJIT) fast ISel: 7m39s

So not much of a difference, unfortunately.

Hi Morten,

Here are the results (for a small but representative run):

That suggests an optimization quality issue, rather than compile-time overhead. That’s good news - I’d take it as a good sign that the MC and linking overhead aren’t a big deal either, and if we can configure the CodeGen pipeline properly we can get the performance back to the same level as the legacy JIT.

Cheers,
Lang.

Hi Lang,

That suggests an optimization quality issue, rather than compile-time overhead

Yes, that makes sense. The long-running applications (6+ hours) JIT the rules once (taking a few seconds) and then run the generated machine code for hours, with no additional JIT’ing.

if we can configure the CodeGen pipeline properly we can get the performance back to the same level as the legacy JIT.

Sounds great. Happy to help with whatever is needed.

Speaking of which:

We generate low-overhead profiling code as part of the generated IR. We use it to identify performance bottlenecks in the higher-level (before-IR) optimizing stages.

So I think it would be possible for me to identify a function that runs much slower in 3.7.1 than in 3.5.2, and extract the IR.

Would that help?

Cheers
Morten

From: "Morten Brodersen via llvm-dev" <llvm-dev@lists.llvm.org>
To: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Thursday, February 4, 2016 9:21:58 PM
Subject: Re: [llvm-dev] MCJIT Runtime Performance


It seems quite likely to help. Please do.

-Hal

Hi Hal,

We are using the default register allocator. I assume the greedy one is the default?

As for other target machine optimizations:

I have tried:

llvm::TargetMachine* tm = ...;

tm->setOptLevel(llvm::CodeGenOpt::Aggressive);

And it doesn't make much of a difference.

And also:

tm->setFastISel(true);

(previous email).

Is there anything else I can try?

Can you build the code with llc? Try it with the large code model. I think that is the default for MCJIT and can be less efficient.

Cheers,
Rafael
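A sketch of Rafael's experiment, assuming the IR has been dumped to a file (the file names here are made up, and this needs llc from the same LLVM version on PATH):

```shell
# Compile the extracted IR once with each code model and compare.
llc -O3 -code-model=small rules.ll -o rules-small.s
llc -O3 -code-model=large rules.ll -o rules-large.s

# On x86-64 the large model materializes 64-bit addresses with movabs and
# uses indirect calls, which is one place its overhead shows up:
grep -c movabs rules-small.s rules-large.s
```

Timing a benchmark linked against each .s file should show whether the code model accounts for the regression.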

OK. I will ask the optimization guys to extract a good example from the production code.

From: "Morten Brodersen via llvm-dev" <llvm-dev@lists.llvm.org>
To: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Thursday, February 4, 2016 9:26:51 PM
Subject: Re: [llvm-dev] MCJIT Runtime Performance


From your previous e-mail, it seems like this is a case of too little optimization, not too much, right?

Are you creating a TargetTransformInfo object for your target?

CodeGenPasses->add(
          createTargetTransformInfoWrapperPass(TM->getTargetIRAnalysis()));

I assume you're dominated by integer computation, not floating-point, is that correct?

-Hal
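For context, here is roughly what that wiring looks like with the 3.7-era legacy pass manager (a sketch; the function and variable names are assumed, not from the original code):

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Target/TargetMachine.h"

// Sketch: register the target's cost model with the pass manager.
// Without it, passes that query TargetTransformInfo (vectorizers,
// unrolling, inlining heuristics) fall back to conservative defaults.
void setupCodeGenPasses(llvm::TargetMachine *TM,
                        llvm::legacy::PassManager &CodeGenPasses) {
  CodeGenPasses.add(llvm::createTargetTransformInfoWrapperPass(
      TM->getTargetIRAnalysis()));
}
```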

Hi Lang,

MCJIT does not compile lazily (though it sounds like that’s not an issue here?)

That is not an issue here, since the code JITs once (a few seconds) and then runs the generated machine code for hours.

Morten - Can you share any test cases that demonstrate the slowdown? I’d love to take a look at this.

The code is massive, so that is not practical. However, I will try to extract an example function that demonstrates the difference (as per my previous email).

Hi Rafael,

Not easily (llc).

Is there a way to make MCJIT not use the large code model when JIT'ing?

Cheers
Morten

I think Davide started adding support for the small code model.

Cheers,
Rafael
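For reference, the code model can be requested through EngineBuilder when the engine is constructed. A sketch, under the assumption that small-model support is available in the MCJIT runtime (which is what Davide's work addresses); the `makeEngine` wrapper is hypothetical:

```cpp
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/ExecutionEngine/MCJIT.h"
#include "llvm/Support/CodeGen.h"
#include <memory>

// Sketch: ask MCJIT for the small code model instead of its default.
// Note the small model also requires the memory manager to place code
// and data within +/-2GB of each other, so the MemoryManager in use
// has to cooperate.
llvm::ExecutionEngine *makeEngine(std::unique_ptr<llvm::Module> module) {
  return llvm::EngineBuilder(std::move(module))
      .setEngineKind(llvm::EngineKind::JIT)
      .setOptLevel(llvm::CodeGenOpt::Aggressive)
      .setCodeModel(llvm::CodeModel::Small)
      .create();
}
```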

Actually, reading over all of this again, I realize I may have made the wrong statement. The runtime regressions we see in julia are actually regressions in how long LLVM itself takes to do the compilation (but since it happens at run time in the JIT case, I think of it as a regression in our running time). We have only noticed occasional regressions in the performance of the generated code (which we are in the process of fixing). Which kind of regression are you talking about, time taken by LLVM or time taken by the LLVM-generated code?

Hi Keno,

I am talking about runtime: the performance of the generated machine code, not the time it takes to lower the IR to machine code.

We typically JIT only once (taking a few seconds) and then run the generated machine code for hours, so the JIT time (IR → machine code) doesn’t impact us.

Cheers
Morten

I agree with Lang and Keno here. This is both unexpected and very interesting. Given the differences in defaults between the two, I would have expected the new JIT to have better runtime performance but longer compile times. That you are seeing the opposite implies there is something very wrong, and I’m very interested in helping figure out what it is.