Using LLVM to serialize object state -- and performance

I have a legacy C++ application that constructs a tree of C++ objects (an iterator tree to implement a query language). I am trying to use LLVM to "serialize" the state of this tree to disk for later loading and execution (or "compile" it to disk, if you prefer).

Each of the C++ iterator objects now has a codegen() member function that adds to the LLVM code of an llvm::Function. The LLVM code generated is a sequence of instructions to set up the arguments for, and call the constructor of, each C++ object. (I am using C "thunks" that give the LLVM code a C API through which to call C++ class constructors.) Hence, all the LLVM code, taken together into a single "reconstitute" function, is mostly a sequence of "call" instructions with a few "store" and "getelementptr" instructions here and there -- fairly straightforward LLVM code.
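For a binary node, for example, the generated code boils down to a single call to the corresponding constructor thunk. A sketch of what such a codegen() might emit, using the thunk naming I describe later in this thread (the helper's signature and the IRBuilder plumbing here are illustrative, not my actual code; newer LLVM returns a FunctionCallee from getOrInsertFunction, older versions a Constant*):

  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Module.h"

  // Illustrative only: emit "call T_BinaryNode_new_2Pv(left, right)",
  // treating all object pointers as i8*.
  llvm::Value* emitBinaryNodeCall( llvm::IRBuilder<> &B, llvm::Module &M,
                                   llvm::Value *Left, llvm::Value *Right ) {
    llvm::Type *VoidPtr = B.getInt8PtrTy();
    llvm::FunctionCallee Thunk = M.getOrInsertFunction(
        "T_BinaryNode_new_2Pv",
        llvm::FunctionType::get( VoidPtr, { VoidPtr, VoidPtr }, false ) );
    return B.CreateCall( Thunk, { Left, Right } );
  }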

I then write out the LLVM IR code to disk and, at some later time, read it back in with ParseIR(), do getPointerToFunction(), execute that function, and the C++ iterator tree has been reconstituted.
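In outline, the round trip looks something like this (a sketch against the legacy JIT API I'm using, with modern spellings -- parseIRFile rather than ParseIR -- and a made-up file name; with MCJIT you'd call finalizeObject() and getFunctionAddress() instead):

  #include "llvm/ExecutionEngine/ExecutionEngine.h"
  #include "llvm/IRReader/IRReader.h"
  #include "llvm/Support/FileSystem.h"
  #include "llvm/Support/SourceMgr.h"
  #include "llvm/Support/raw_ostream.h"

  // Write side: dump the module's IR to disk.
  void writeIR( llvm::Module &M ) {
    std::error_code EC;
    llvm::raw_fd_ostream OS( "tree.ll", EC, llvm::sys::fs::OF_None );
    M.print( OS, nullptr );
  }

  // Read side: parse the IR, JIT it, and run the "reconstitute" function.
  void reloadAndRun( llvm::LLVMContext &Ctx ) {
    llvm::SMDiagnostic Err;
    std::unique_ptr<llvm::Module> M = llvm::parseIRFile( "tree.ll", Err, Ctx );
    llvm::ExecutionEngine *EE = llvm::EngineBuilder( std::move( M ) ).create();
    llvm::Function *F = EE->FindFunctionNamed( "reconstitute" );
    auto Fn = reinterpret_cast<void (*)()>( EE->getPointerToFunction( F ) );
    Fn();  // rebuilds the C++ iterator tree
  }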

This all works, but the JIT compile step is *slow*. For a sequence of about 8000 LLVM instructions (most of which are "call"), it takes several seconds to execute.

It occurred to me that I don't really want JIT compiling. I really want to compile the LLVM code to machine code and write that to disk so that when I read it back, I can just run it. The "reconstitute" function is only ever run once per query invocation, so there's no benefit from JIT compiling it since it will never be run a second or subsequent time.

Questions:

* Is what I'm doing with LLVM a "reasonable" thing to do with LLVM?
* If so, how can I speed it up? By generating machine code? If so, how?

I've looked at the source for llc, but that apparently only generates assembly source code, not object code.

- Paul

I'm not sure I have a clear picture of what you're JIT'ing. If any of the JIT'ed functions call other JIT'ed functions, it may be difficult to find all the dependencies of a function and recreate them correctly on a subsequent load. Even if the JIT'ed functions only call non-JIT'ed functions, I think you'd need some confidence that the addresses of the called functions weren't being moved.

It's possible that what you're considering would work, but I don't think it's a scenario that the JIT intends to support.

It would be possible, however, to use the MCJIT engine and cache its results. It requires some modifications to the MCJIT engine but nothing major (I know because my team has a patch in the works to do this, but it's blocked by some other things at the moment). MCJIT generates complete object images and then uses RuntimeDyld to load them. If you had a hook to save the generated object, you could use RuntimeDyld directly to load it later. There are other ways to generate the object image (i.e. without MCJIT), but I'm not sure it would be easier.

You basically just need to grab the Buffer that MCJIT::emitObject() has after it calls PM.run() and Buffer->flush() but before it passes it to Dyld.loadObject(). If you prefer, you could copy what MCJIT does and move it somewhere in your own code. There's not a lot to it.
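If your LLVM is recent enough to have the ObjectCache interface and ExecutionEngine::setObjectCache(), that is essentially this hook already packaged up, and a disk-backed cache is only a few lines. A sketch (the exact signatures and the OF_None/F_None flag spelling vary across LLVM versions, and the file name is made up):

  #include "llvm/ExecutionEngine/ObjectCache.h"
  #include "llvm/Support/FileSystem.h"
  #include "llvm/Support/MemoryBuffer.h"
  #include "llvm/Support/raw_ostream.h"

  // Persist MCJIT's finished object image to disk; reload it on later runs.
  class DiskObjectCache : public llvm::ObjectCache {
    std::string Path;  // e.g. "reconstitute.o"
  public:
    explicit DiskObjectCache( std::string P ) : Path( std::move( P ) ) { }

    // Called by MCJIT with the generated object image.
    void notifyObjectCompiled( const llvm::Module *M,
                               llvm::MemoryBufferRef Obj ) override {
      std::error_code EC;
      llvm::raw_fd_ostream OS( Path, EC, llvm::sys::fs::OF_None );
      if ( !EC )
        OS << Obj.getBuffer();
    }

    // Called by MCJIT before compiling; returning a buffer skips codegen.
    std::unique_ptr<llvm::MemoryBuffer>
    getObject( const llvm::Module *M ) override {
      if ( auto Buf = llvm::MemoryBuffer::getFile( Path ) )
        return std::move( *Buf );
      return nullptr;  // cache miss -> MCJIT compiles as usual
    }
  };
  // Install with: EE->setObjectCache( &MyCache );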

-Andy

Hi Paul,

I had an additional thought with regard to the performance issue you are seeing.

As I understand it, you are generating a large number of functions that call other functions. If the functions being called are externals from the perspective of the JIT'ed code -- functions that need to be resolved against static code within the running executable -- that's probably where the slowdown is occurring. Whenever the JIT engine (either the legacy JIT or MCJIT) needs to resolve an external function, it calls JITMemoryManager::getPointerToNamedFunction to resolve the function address.

The default JITMemoryManager implementation uses sys::DynamicLibrary::SearchForAddressOfSymbol to find the function. If you know all of the names and addresses of the functions that will need to be resolved, you can provide a custom memory manager implementation to optimize this external function resolution.
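A sketch of what that could look like, written against the MCJIT-era SectionMemoryManager (there, getSymbolAddress is the override point; with the legacy JIT the analogous override is JITMemoryManager::getPointerToNamedFunction -- the class and method names below are otherwise mine, not LLVM's):

  #include "llvm/ExecutionEngine/SectionMemoryManager.h"
  #include <map>
  #include <string>

  // Resolve the constructor thunks from a prebuilt table instead of
  // searching the process's loaded libraries on every look-up.
  class ThunkResolver : public llvm::SectionMemoryManager {
    std::map<std::string, uint64_t> Thunks;  // name -> address, built at startup
  public:
    void addThunk( const std::string &Name, void *Addr ) {
      Thunks[ Name ] = reinterpret_cast<uint64_t>( Addr );
    }
    uint64_t getSymbolAddress( const std::string &Name ) override {
      auto It = Thunks.find( Name );
      if ( It != Thunks.end() )
        return It->second;  // fast path: table hit
      return llvm::SectionMemoryManager::getSymbolAddress( Name );
    }
  };
  // Install via EngineBuilder::setMCJITMemoryManager(...).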

-Andy

Thanks for responding. Sorry for the delay in my reply, but I was dealing with Hurricane Sandy. Anyway....

My software build produces libmylib.so. The JIT'd function only calls external C functions in libmylib.so and not other JIT'd functions. The C functions are simple thunks to call constructors. For example, given:

  class BinaryNode : public Node {
  public:
    BinaryNode( Node *left, Node *right );
    // ...
  };

there exists a C thunk:

  void* T_BinaryNode_new_2Pv( void *left, void *right ) {
    return new BinaryNode( (Node*)left, (Node*)right );
  }

The JIT'd function is just a sequence of such calls to thunks to build up an object tree. The idea is to generate LLVM code, write it out to disk, and terminate the current program's process; then, at some later time, start a new process for the program, read the previously generated LLVM code back in from disk, and call the JIT'd function, which reconstitutes the state of the tree just as it was.

Elsewhere in my code, I keep a set of llvm::Function*'s, one for each thunk. For each function, I use ExecutionEngine::addGlobalMapping() to bind the Function* (looked up via Module::getFunction()) to the actual thunk. Oddly, on Mac OS X, I only have to do this when my program is creating the LLVM code; on Linux, I also have to do it when my program is reading the LLVM code back in and trying to execute it.
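Concretely, each binding looks something like this (a sketch; error handling elided):

  #include "llvm/ExecutionEngine/ExecutionEngine.h"
  #include "llvm/IR/Module.h"

  extern "C" void* T_BinaryNode_new_2Pv( void*, void* );

  // Bind the thunk's Function* (from the module) to its real address.
  void bindThunk( llvm::ExecutionEngine &EE, llvm::Module &M ) {
    if ( llvm::Function *F = M.getFunction( "T_BinaryNode_new_2Pv" ) )
      EE.addGlobalMapping( F, reinterpret_cast<void*>( &T_BinaryNode_new_2Pv ) );
  }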

Hopefully, I've explained this better.

You then later wrote:

> The default JITMemoryManager implementation uses sys::DynamicLibrary::SearchForAddressOfSymbol to find the function. If you know all of the names and addresses of the functions that will need to be resolved, you can provide a custom memory manager implementation to optimize this external function resolution.

Based on my clarification, is this still the best course of action?

- Paul

Hi Paul,

I think you may have gone beyond what I understand of how the legacy JIT code works. It looks like the call to addGlobalMapping should short-circuit the named-function look-up that I described, but I can't account for why it behaves differently on Mac vs. Linux.

I still don't understand how the external pointers persist between writing and reading, but it sounds like you have that worked out somehow.

Are you writing LLVM IR to disk or machine code?

If I'm not being helpful, feel free to give up on trying to explain things to me.

-Andy

> I think you may have gone beyond what I understand of how the legacy JIT code works. It looks like the call to addGlobalMapping should short-circuit the named-function look-up that I described ...

Well, I first look for the function by name and, if I don't find it, I then call addGlobalMapping().

> Are you writing LLVM IR to disk or machine code?

Currently IR. How can I write machine code?

- Paul

OK, I think it's starting to make sense. You probably don't want to write machine code. That's why I was confused about pointer continuity. It wouldn't work with machine code.

It might be worth stepping through the look-up by name in a debugger to see what's happening. I think it's possible that that look-up is slow.

-Andy

Well, I first look for the function by name and, if I don't find it, I then call addGlobalMapping(). But that's not where the time is going. Here:

  https://dl.dropbox.com/u/46791180/callgraph.pdf

is a call graph generated by KCachegrind. I still don't understand all the numbers (and this PDF seems not to include commas where it should), but if you look at the bottom two ovals of the left fork, "Schedule..." is called 16K times and "setHeightToAtLeas..." is called 37K times. In the right fork, "RAGreed..." is called 35K times.

Those are far too many calls to *anything* for a simple sequence of "call" LLVM instructions. Something seems horribly wrong.

- Paul

Hi Paul,

This is definitely outside the area where I know the particulars of what's going on. However, one idea that might be worth trying is setting the JIT optimization level to 'CodeGenOpt::None'. This should trigger the use of the FastISel instruction selector. Normally, you wouldn't want that for anything other than generating debug code, but since your routines are just making calls, it might work for you.
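For example, something along these lines when creating the engine (a sketch; EngineBuilder takes a raw Module* in older LLVM rather than a unique_ptr):

  #include "llvm/ExecutionEngine/ExecutionEngine.h"

  // Create the JIT with optimization off so FastISel is used.
  llvm::ExecutionEngine *EE =
      llvm::EngineBuilder( std::move( Mod ) )
          .setOptLevel( llvm::CodeGenOpt::None )
          .create();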

-Andy

Switching to CodeGenOpt::None reduced the execution time from 5.74s to 0.84s. Then, just by tweaking things randomly, I found that changing to CodeModel::Small reduced it further to 0.22s.

We have some old, ugly, pure C++ code that we're trying to replace (both because it's ugly and because it's slow). Its execution time is about 0.089s, so that's the time to beat.

Hence, I'd like to reduce the 0.22s time even further to below 0.089s. Any ideas?

- Paul

I've been profiling more; see <https://dl.dropbox.com/u/46791180/perf.png>.

One thing I'm a bit confused about is why I see a FunctionPassManager there. I use a FunctionPassManager at the end of LLVM IR code generation, write the IR to disk, then read it back later.

Why is another FunctionPassManager apparently being used during the JIT'ing of the IR code? And how do I control what passes are given to that FunctionPassManager?

The function that's being JIT'd has to be executed only once, so, ideally, I want to find a sweet spot between speed of JIT'ing and speed of the generated machine code.

- Paul

> Why is another FunctionPassManager apparently being used during the JIT'ing of the IR code?

Because code generation consists of a series of passes. See lib/CodeGen/LLVMTargetMachine.cpp and lib/CodeGen/Passes.cpp for more information.

> And how do I control what passes are given to that FunctionPassManager?

You should not. There are some options, though, such as the optimization level inside TargetMachine / TargetPassConfig.

The passes run are determined by TargetMachine::addPassesToEmitMachineCode (or addPassesToEmitMC in the case of MCJIT), which is called from the JIT constructor. You can step through that to see where the passes are coming from, or you can create a custom target machine instance to control it.

-Andy

> The passes run are determined by TargetMachine::addPassesToEmitMachineCode (or addPassesToEmitMC in the case of MCJIT), which is called from the JIT constructor. You can step through that to see where the passes are coming from, or you can create a custom target machine instance to control it.

Assuming I were to create a TargetMachine, are there any small examples of controlling it that I could look at?

What relationship (if any) does the FunctionPassManager I create explicitly for LLVM IR code generation have to the implicitly created FunctionPassManager that's used during JIT native-code generation?

No relationship.
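If it helps, creating your own TargetMachine looks roughly like this (a sketch: the triple/CPU/feature choices are placeholders, and header and enum spellings move around between LLVM versions -- e.g. TargetRegistry.h lives under llvm/Support/ in older trees):

  #include "llvm/MC/TargetRegistry.h"
  #include "llvm/Support/Host.h"
  #include "llvm/Support/TargetSelect.h"
  #include "llvm/Target/TargetMachine.h"
  #include "llvm/Target/TargetOptions.h"

  // Build a TargetMachine by hand so you control the code model and
  // optimization level; hand it to the JIT via EngineBuilder::create(TM).
  llvm::TargetMachine* makeTargetMachine() {
    llvm::InitializeNativeTarget();
    llvm::InitializeNativeTargetAsmPrinter();
    std::string Triple = llvm::sys::getDefaultTargetTriple();
    std::string Err;
    const llvm::Target *T = llvm::TargetRegistry::lookupTarget( Triple, Err );
    if ( !T )
      return nullptr;
    llvm::TargetOptions Opts;
    return T->createTargetMachine( Triple, /*CPU=*/"", /*Features=*/"", Opts,
                                   llvm::Reloc::PIC_, llvm::CodeModel::Small,
                                   llvm::CodeGenOpt::None );
  }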