LLVM 3.3 JIT code speed

Hi,

The LLVM IR code emitted by our DSL (optimized with -O3-style IR ==> IR passes) runs slower when executed with the LLVM 3.3 JIT, compared to what we had with LLVM 3.1. What could be the reason?
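
To be concrete about what "-O3-style IR ==> IR passes" means, here is a minimal sketch of that kind of pipeline built with PassManagerBuilder (illustrative only; the passes we actually run and the helper name may differ):

    #include "llvm/IR/Module.h"
    #include "llvm/PassManager.h"
    #include "llvm/Transforms/IPO.h"
    #include "llvm/Transforms/IPO/PassManagerBuilder.h"

    // Illustrative -O3-like IR optimization: populate function and module
    // pass managers with PassManagerBuilder and run them over the module
    // before it is handed to the JIT.
    static void runO3LikePasses(llvm::Module* module) {
        llvm::PassManagerBuilder pmb;
        pmb.OptLevel = 3;
        pmb.Inliner = llvm::createFunctionInliningPass(275);  // roughly the -O3 inline threshold

        llvm::FunctionPassManager fpm(module);
        llvm::PassManager mpm;
        pmb.populateFunctionPassManager(fpm);
        pmb.populateModulePassManager(mpm);

        fpm.doInitialization();
        for (llvm::Module::iterator f = module->begin(); f != module->end(); ++f) {
            fpm.run(*f);
        }
        fpm.doFinalization();
        mpm.run(*module);
    }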

I tried to play with TargetOptions without any success…

Here is the kind of code we use to allocate the JIT:

    EngineBuilder builder(fResult->fModule);
    builder.setOptLevel(CodeGenOpt::Aggressive);
    builder.setEngineKind(EngineKind::JIT);
    builder.setUseMCJIT(true);
    builder.setCodeModel(CodeModel::JITDefault);
    builder.setMCPU(llvm::sys::getHostCPUName());
    
    TargetOptions targetOptions;
    targetOptions.NoFramePointerElim = true;
    targetOptions.LessPreciseFPMADOption = true;
    targetOptions.UnsafeFPMath = true;
    targetOptions.NoInfsFPMath = true;
    targetOptions.NoNaNsFPMath = true;
    targetOptions.GuaranteedTailCallOpt = true;

    builder.setTargetOptions(targetOptions);
    
    TargetMachine* tm = builder.selectTarget();
    
    fJIT = builder.create(tm);
    if (!fJIT) {
        return false;
    }
    ….

Any idea?

Thanks.

Stéphane Letz

It's hard to say much without seeing the specific IR and the code
generated from that IR.

-Eli

Our language can do either:

1) DSL ==> C/C++ ===> clang/gcc ===> exec code

or

2) DSL ==> LLVM IR ===> (optimisation passes) ==> LLVM IR ==> LLVM JIT ==> exec code

1) and 2) were running at the same speed with LLVM 3.1, but 2) is now slower with LLVM 3.3.

I compared the LLVM IR generated by chain 2) *after* the optimization passes with the one generated by chain 1) using clang -emit-llvm -O3 on the pure C input. The two are the same. So my conclusion was that either the way we are activating the JIT is no longer correct in 3.3, or we are missing new steps that have to be done for the JIT?

Stéphane Letz

I understand you to mean that you have isolated the actual execution time as your point of comparison, as opposed to including runtime loading and so on. Is this correct?

One thing that changed between 3.1 and 3.3 is that MCJIT no longer compiles the module during the engine creation process but instead waits until either a function pointer is requested or finalizeObject is called. I would guess that you have taken that into account in your measurement technique, but it seemed worth mentioning.
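
For example, a minimal sketch (reusing the fJIT and fResult->fModule names from your snippet, with a hypothetical DSL entry point called "compute") that forces code generation up front so that only the calls through the function pointer fall in the timed region:

    #include "llvm/ExecutionEngine/ExecutionEngine.h"
    #include "llvm/IR/Function.h"
    #include "llvm/IR/Module.h"

    typedef void (*ComputeFn)(float* in, float* out, int count);  // hypothetical signature

    ComputeFn prepareCompute(llvm::ExecutionEngine* jit, llvm::Module* module) {
        llvm::Function* f = module->getFunction("compute");  // hypothetical function name
        jit->finalizeObject();                                // MCJIT generates code here in 3.3
        return reinterpret_cast<ComputeFn>(jit->getPointerToFunction(f));
    }

    // Usage: obtain the pointer once, outside the timed region, then time
    // only the repeated calls:
    //   ComputeFn compute = prepareCompute(fJIT, fResult->fModule);
    //   /* start timer */ compute(in, out, n); /* ... */ /* stop timer */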

What architecture/OS are you testing?

With LLVM 3.3 you can register a JIT event listener (using ExecutionEngine::RegisterJITEventListener) that MCJIT will call with a copy of the actual object image that gets generated. You could then write that image to a file as a basis for comparing the generated code. You can find a reference implementation of the interface in lib/ExecutionEngine/IntelJITEvents/IntelJITEventListener.cpp.
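
For example, a minimal sketch of such a listener written against the 3.3 headers (the output file name is arbitrary and the exact stream flags may need adjusting for your tree):

    #include "llvm/ExecutionEngine/ExecutionEngine.h"
    #include "llvm/ExecutionEngine/JITEventListener.h"
    #include "llvm/ExecutionEngine/ObjectImage.h"
    #include "llvm/Support/raw_ostream.h"
    #include <string>

    // Dumps each emitted object image to a file so it can be disassembled
    // (e.g. with otool or objdump) and compared with the clang-compiled code.
    class ObjectDumpListener : public llvm::JITEventListener {
    public:
        virtual void NotifyObjectEmitted(const llvm::ObjectImage& Obj) {
            std::string err;
            llvm::raw_fd_ostream out("jit_object.o", err, llvm::raw_fd_ostream::F_Binary);
            if (err.empty()) {
                out.write(Obj.getData().data(), Obj.getData().size());
            }
        }
    };

    // Register the listener before the first function pointer is requested:
    //   static ObjectDumpListener listener;
    //   fJIT->RegisterJITEventListener(&listener);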

-Andy

> I understand you to mean that you have isolated the actual execution time as your point of comparison, as opposed to including runtime loading and so on. Is this correct?

We are testing actual execution time, yes: the time spent in a given JIT-compiled function.

> One thing that changed between 3.1 and 3.3 is that MCJIT no longer compiles the module during the engine creation process but instead waits until either a function pointer is requested or finalizeObject is called. I would guess that you have taken that into account in your measurement technique, but it seemed worth mentioning.

OK, so I guess our testing is then correct since we are testing actual execution time of the function pointer.

> What architecture/OS are you testing?

64-bit OS X (10.8.4)

> With LLVM 3.3 you can register a JIT event listener (using ExecutionEngine::RegisterJITEventListener) that MCJIT will call with a copy of the actual object image that gets generated. You could then write that image to a file as a basis for comparing the generated code. You can find a reference implementation of the interface in lib/ExecutionEngine/IntelJITEvents/IntelJITEventListener.cpp.

Thanks I'll have a look.

> -Andy

Stéphane

And since the 1) DSL ==> C/C++ ===> clang/gcc -O3 ===> exec code chain has the "correct" speed, there is no reason the JIT-based one should be slower, right?

So I still guess something is wrong in the way we use the JIT, and/or possibly some LTO issue?

Stéphane

Hi,

> And since the 1) DSL ==> C/C++ ===> clang/gcc -O3 ===> exec code chain has the "correct" speed, there is no reason the JIT-based one should be slower, right?
>
> So I still guess something is wrong in the way we use the JIT, and/or possibly some LTO issue?

When you say "slower" wrt 3.1 on LLVM and the same speed for clang, could you put some rough time numbers on things for some fixed test code for your DSL? Obviously they won't have an absolute meaning, but the order of magnitude relative to the normal execution times might help guide ideas about what it could be.

Cheers,
Dave

About 20% slower with the LLVM 3.3 JIT compared to clang 3.3, clang 3.1, and the LLVM 3.1 JIT.

Stéphane