OCaml bindings to LLVM

I'm having another play with LLVM using the OCaml bindings for a forthcoming
OCaml Journal article and I have a couple of remarks:

Firstly, I noticed that the execution engine is very slow, taking
milliseconds to call a JIT compiled function. Is this an inherent overhead,
am I calling it incorrectly, or is this something that can be optimized in
the OCaml bindings?

Secondly, I happened to notice that JIT compiled code executed on the fly does
not read from the stdin of the host OCaml program although it can write to
stdout. Is this a bug?

Many thanks,

What is the signature of the function you are calling?

When calling a generated function via runFunction, the JIT handles some
common signatures, but if it doesn't recognize the function signature it
falls back on generating a stub function on the fly. This generation is
fairly expensive and is probably the overhead you are seeing. There should
be little more inherent overhead than the cost of a function call if the
stub path isn't being taken.

The simple solution (aside from fixing the JIT) is to change your signature
to match one of the ones the JIT special-cases (see JIT::runFunction). A
nullary one with arguments passed in globals works fine, if thread safety
isn't a concern.
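
For illustration, a minimal sketch of that shape using the OCaml bindings
(this assumes the context-based, pre-opaque-pointer API, and all names here
are illustrative, not from the bindings):

  let ctx = Llvm.global_context ()
  let m = Llvm.create_module ctx "jit"
  let i32 = Llvm.i32_type ctx

  (* Argument and result live in globals rather than being passed. *)
  let arg = Llvm.define_global "arg" (Llvm.const_int i32 0) m
  let res = Llvm.define_global "res" (Llvm.const_int i32 0) m

  (* A nullary entry point: load the argument, do the work, store the
     result.  runFunction should recognize this signature directly. *)
  let entry =
    Llvm.define_function "entry"
      (Llvm.function_type (Llvm.void_type ctx) [||]) m

  let () =
    let b = Llvm.builder_at_end ctx (Llvm.entry_block entry) in
    let x = Llvm.build_load arg "x" b in
    (* Stand-in for the real work. *)
    let y = Llvm.build_add x (Llvm.const_int i32 1) "y" b in
    ignore (Llvm.build_store y res b);
    ignore (Llvm.build_ret_void b)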

- Daniel

> Firstly, I noticed that the execution engine is very slow, taking
> milliseconds to call a JIT compiled function. Is this an inherent
> overhead, am I calling it incorrectly, or is this something that can be
> optimized in the OCaml bindings?

> What is the signature of the function you are calling?

  unit -> unit

So I am passing zero arguments and returning void.

> When calling a generated function via runFunction, the JIT handles some
> common signatures, but if it doesn't recognize the function signature it
> falls back on generating a stub function on the fly. This generation is
> fairly expensive and is probably the overhead you are seeing. There
> should be little more inherent overhead than the cost of a function call
> if the stub path isn't being taken.
>
> The simple solution (aside from fixing the JIT) is to change your
> signature to match one of the ones the JIT special-cases (see
> JIT::runFunction). A nullary one with arguments passed in globals works
> fine, if thread safety isn't a concern.

I see. Looking at JIT::runFunction, passing one dummy int32 argument should
do the trick.
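
Concretely, something like this sketch (against the GenericValue-based
bindings of this era; the names are illustrative):

  open Llvm_executionengine

  let ctx = Llvm.global_context ()
  let m = Llvm.create_module ctx "jit"
  let i32 = Llvm.i32_type ctx

  (* Give the function one ignored i32 parameter so that JIT::runFunction
     recognizes the signature and skips the on-the-fly stub. *)
  let f =
    Llvm.define_function "f"
      (Llvm.function_type (Llvm.void_type ctx) [| i32 |]) m

  let () =
    let b = Llvm.builder_at_end ctx (Llvm.entry_block f) in
    (* ... real body here, ignoring the parameter ... *)
    ignore (Llvm.build_ret_void b)

  let call ee =
    ignore (ExecutionEngine.run_function f
              [| GenericValue.of_int i32 0 |] ee)   (* dummy argument *)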

I'll see if I can write something a little cleverer on the OCaml side for
run-time compiled stubs, either sharing them via partial application or
memoizing them for reuse.
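
A sketch of the memoization half (compile_stub here is hypothetical,
standing in for whatever generates a stub for a given signature):

  (* Memoize run-time compiled stubs so each signature is compiled only
     once; subsequent calls reuse the cached stub. *)
  let memoize_stubs compile_stub =
    let cache = Hashtbl.create 16 in
    fun signature ->
      try Hashtbl.find cache signature
      with Not_found ->
        let stub = compile_stub signature in
        Hashtbl.add cache signature stub;
        stub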

> I'm having another play with LLVM using the OCaml bindings for a
> forthcoming OCaml Journal article and I have a couple of remarks:
>
> Firstly, I noticed that the execution engine is very slow, taking
> milliseconds to call a JIT compiled function. Is this an inherent
> overhead, am I calling it incorrectly, or is this something that can be
> optimized in the OCaml bindings?

The high-level calling convention using GenericValue is going to be very
slow relative to a native function call. This is true in C++, but even more
so in OCaml, which must cons up a bunch of objects on the heap for each
call. To get the best performance, you want to avoid fine-grained calls
into JIT'd code, e.g. by iterating over inputs inside the JIT instead of
outside.
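
To sketch the difference in call granularity (process_all is a hypothetical
JIT'd entry point that loops internally over data the module holds in
globals):

  open Llvm_executionengine

  let i32 = Llvm.i32_type (Llvm.global_context ())

  (* Slow: one run_function round trip, and fresh GenericValues, per
     element. *)
  let map_outside f inputs ee =
    Array.map
      (fun x ->
         GenericValue.as_int
           (ExecutionEngine.run_function f
              [| GenericValue.of_int i32 x |] ee))
      inputs

  (* Faster: a single crossing; 'process_all' iterates inside the JIT'd
     code. *)
  let map_inside process_all n ee =
    ignore (ExecutionEngine.run_function process_all
              [| GenericValue.of_int i32 n |] ee)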

If you want to improve the performance of the GenericValue-based interface,
I'd suggest first minimizing the number and overhead of allocations in your
OCaml code, then looking at the bindings themselves:

- If GenericValues can't be reused, add bindings to allow mutating them. Reuse the same 'n' instances for each call into JIT code. Yucky imperative data structures to the rescue.

- Write bindings for a heap-allocated array of GenericValues and wrap that
in a custom block instead of heap-allocating each GenericValue
individually. Of course, such an array must be mutable. More imperative
data structures!

- Try using placement new to initialize GenericValues inside OCaml blocks
instead of new'ing them up on the C++ heap as is presently done. This would
be outside the bounds of standard C++, so it could fail. It would also
require circumventing the C bindings, since they cannot expose the C++
GenericValue class as a struct.

- Use OCaml variants for inputs (type generic_value = Pointer of 'a | Int
of bits * value | ...) and convert those to a stack-based
SmallVector<GenericValue>. This avoids finalizers on the OCaml blocks. It
doesn't work symmetrically for outputs, though, and it likewise involves
going around the C bindings.
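
To sketch that last variant-based idea (everything here is hypothetical,
including the external's C stub, which would do the conversion and dispatch
in a single crossing):

  (* Inputs as ordinary OCaml variants; the C side converts them once per
     call into a stack-based SmallVector<GenericValue>, so no finalized
     OCaml blocks are involved. *)
  type generic_value =
    | Pointer of nativeint        (* raw address *)
    | Int of int * nativeint      (* bit width, value *)
    | Float of float
    | Double of float

  external run_function' :
    Llvm.llvalue -> generic_value array ->
    Llvm_executionengine.ExecutionEngine.t -> generic_value
    = "llvm_ee_run_function_variants"   (* hypothetical C stub *)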

But realize that a GenericValue-based interface will always be slow relative to a native call. If you have a specific performance goal though, you may be able to cheaply eliminate 'enough' overhead for your needs without much work. All of the above are relatively simple (should be doable in a day, modulo patch review).

For the very best performance, you really want to call the JIT'd function
directly, e.g.

     let nf = native_function name m

where native_function has type string -> Llvm.llmodule -> 'a and nf has
some functional type, like int -> int -> int.

However, this is subject to the quirks and complexities of the OCaml FFI
(e.g., overflow arguments passed in a global array on x86, and a totally
nonstandard calling convention).

- If you know in advance the signatures of the functions you're going to
call, you can write shims in C (similar to those in llvm_ocaml.c) that add
very little overhead. These wouldn't really be of any use to anyone else,
though. (A sketch of the OCaml side follows this list.)

- If not, you can generate the shims at runtime using LLVM (even inlining
them into the callee), but you will have to reimplement OCaml's FFI macros
for unwrapping values and tracking stack roots. This would take
considerably more effort to implement (especially portably), but it would
be a substantial improvement to the bindings if the helpers were
incorporated therein.
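
For the first option, the OCaml side of such a shim is just an external
declaration (all names here are hypothetical; the C stub would unbox the
arguments, jump through the JIT'd code pointer, and box the result):

  (* Specialized to int -> int -> int. *)
  external jit_call_ii : nativeint -> int -> int -> int
    = "jit_call_ii_stub"   (* hypothetical C stub *)

  (* 'fptr' would be the JIT'd function's address, obtained via
     getPointerToFunction on the C++ side. *)
  let make_int2 (fptr : nativeint) : int -> int -> int =
    jit_call_ii fptr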

> Secondly, I happened to notice that JIT compiled code executed on the fly
> does not read from the stdin of the host OCaml program although it can
> write to stdout. Is this a bug?

This has nothing to do with LLVM.

- Gordon

Unless tens of thousands of allocations are made for every call, I do not
believe that explains the performance discrepancy I quantified. A millisecond
is a long time in this context.

Does it spawn or fork a process?