llvmcpy: yet another Python binding for LLVM

Hi, I wrote yet another [1,2] Python binding for LLVM! I'm doing this
because llvmlite has some serious limitations: 1) it cannot parse an
existing IR, only create new modules [3], 2) it keeps its own
representation of the IR (which is less memory efficient than the LLVM
one), and 3) each llvmlite version supports a single LLVM version.

Since I need to load modules of hundreds of MiB, this was kind of a
problem.
So I've come up with a "Python API generator". Basically it uses CFFI
[4] to parse the LLVM-C API headers and automatically generate (using
some heuristics) a Pythonic API, with classes, properties and the like.

I've quickly tested it with LLVM 3.4, 3.8 and 3.9 and, for its
simplicity, it does a good job. It also supports multiple LLVM
installations (it uses the first llvm-config found in PATH).
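
To illustrate the "first llvm-config in PATH" rule, here is a minimal
sketch using only the Python standard library. The helper names
(`find_llvm_config`, `query_llvm_config`) are made up for this example
and are not necessarily how llvmcpy does it; `--version` and
`--includedir` are standard llvm-config flags.

```python
import shutil
import subprocess


def find_llvm_config(path=None):
    """Return the first llvm-config found in PATH (or in `path`), or None."""
    return shutil.which("llvm-config", path=path)


def query_llvm_config(llvm_config, *flags):
    """Run llvm-config with the given flags and return its stripped output."""
    result = subprocess.run(
        [llvm_config, *flags], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    llvm_config = find_llvm_config()
    if llvm_config is None:
        print("no llvm-config found in PATH")
    else:
        # The version and include directory are what a binding generator
        # would need in order to pick the matching LLVM-C headers.
        print(query_llvm_config(llvm_config, "--version"))
        print(query_llvm_config(llvm_config, "--includedir"))
```

Putting a different LLVM installation earlier in PATH is then enough to
switch versions.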

I'd be happy to have some feedback; give it a look:

https://rev.ng/llvmcpy

Using something like CFFI to autogenerate bindings is definitely a good approach to this problem. It'll produce bindings which aren't entirely idiomatic for Python, but they'll at least be reasonably likely to remain in sync. This also has the nice property that new additions to the C API get picked up without manual work; this should serve to incentivize contributions in this area.

You mention in your readme that you had to slightly modify the LLVM C headers to get this approach to work. Can you point out a couple of example changes? Maybe these are things we should consider taking upstream.

I'm not familiar with the details of CFFI. Are the bindings it generates for a particular set of headers specific to the machine they're generated on? Or could the resulting bindings be published and reused directly? If so, hosting a set of bindings for previous releases would be a useful service.

Philip

> You mention in your readme that you had to slightly modify the LLVM C
> headers to get this approach to work. Can you point out a couple of
> example changes? Maybe these are things we should consider taking
> upstream.

Take a look at the `clean_include_file` function:

Basically CFFI doesn't handle enum entries whose value is computed
through an expression, and in the LLVM-C API we sometimes have 1 << 8.
Static inline functions are not handled either (CFFI only handles
function prototypes), so I have to strip them away.
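
As a rough illustration of these two clean-ups, here is a sketch in
plain Python. It is not the actual `clean_include_file` code: the
function names, the regular expressions and the example header are all
made up, and it assumes well-formed input with no braces inside strings
or comments.

```python
import re

# "NAME = 1 << 8" style enum values: integer literal, shift, integer literal.
SHIFT_VALUE = re.compile(r"=\s*(\d+)\s*<<\s*(\d+)")
STATIC_INLINE = re.compile(r"\bstatic\s+inline\b")


def fold_shift_values(source):
    """Replace '= A << B' with the computed integer value."""
    return SHIFT_VALUE.sub(
        lambda m: "= " + str(int(m.group(1)) << int(m.group(2))), source
    )


def strip_static_inline(source):
    """Drop 'static inline' function definitions by matching their braces."""
    pieces = []
    pos = 0
    while True:
        match = STATIC_INLINE.search(source, pos)
        if match is None:
            pieces.append(source[pos:])
            return "".join(pieces)
        pieces.append(source[pos:match.start()])
        # Walk from the opening brace of the body to its matching close.
        i = source.index("{", match.end())
        depth = 0
        while True:
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
                if depth == 0:
                    break
            i += 1
        pos = i + 1


def clean_header(source):
    """Apply both clean-ups to a header given as a string."""
    return strip_static_inline(fold_shift_values(source))


# Hypothetical header fragment, only for demonstration.
EXAMPLE_HEADER = """
typedef enum {
  ExampleFlagA = 1 << 0,
  ExampleFlagB = 1 << 8
} ExampleFlags;

static inline int example_helper(int x) {
  if (x) { return 1; }
  return 0;
}

int example_prototype(void *value);
"""

print(clean_header(EXAMPLE_HEADER))
```

The result keeps the enum (now with literal values) and the prototype,
which is the kind of input a declaration-only parser can digest.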

I'm not sure it'll ever be possible to handle the LLVM-C API headers
with no modifications, and given that one explicit aim is to support
older versions of LLVM I'd have to keep that code anyway.

It would be nice, however, if that code didn't have to grow in the
future (e.g., because of more sophisticated expressions as enum values).

One thing I like about the C API is the consistency in function naming,
like having LLVMGetSomething/LLVMSetSomething pairs,
LLVMCountSomethings/LLVMGetSomethings pairs and
LLVMGetFirstSomething/LLVMGetNextSomething pairs.
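
To show why this consistency matters for a generator, here is a small
sketch that groups function names into candidate properties (Get/Set
pairs) and iterators (GetFirst/GetNext pairs). The `classify` helper and
the resulting Python names are illustrative assumptions, not
necessarily what llvmcpy emits; the LLVM function names in the example
are real LLVM-C functions.

```python
import re


def snake_case(name):
    """Convert CamelCase to snake_case, e.g. 'BasicBlock' -> 'basic_block'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def classify(function_names):
    """Return (properties, iterators) inferred from the naming patterns."""
    getters, setters, firsts, nexts = set(), set(), set(), set()
    for name in function_names:
        stem = name[len("LLVM"):] if name.startswith("LLVM") else name
        # Check the longer prefixes first: 'GetFirst' also starts with 'Get'.
        if stem.startswith("GetFirst"):
            firsts.add(stem[len("GetFirst"):])
        elif stem.startswith("GetNext"):
            nexts.add(stem[len("GetNext"):])
        elif stem.startswith("Get"):
            getters.add(stem[len("Get"):])
        elif stem.startswith("Set"):
            setters.add(stem[len("Set"):])
    # Only complete pairs are turned into properties or iterators.
    properties = sorted(snake_case(s) for s in getters & setters)
    iterators = sorted("iter_" + snake_case(s) + "s" for s in firsts & nexts)
    return properties, iterators


names = [
    "LLVMGetLinkage", "LLVMSetLinkage",
    "LLVMGetValueName", "LLVMSetValueName",
    "LLVMGetFirstFunction", "LLVMGetNextFunction",
    "LLVMGetFirstBasicBlock", "LLVMGetNextBasicBlock",
]
properties, iterators = classify(names)
print(properties)  # ['linkage', 'value_name']
print(iterators)   # ['iter_basic_blocks', 'iter_functions']
```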

What I'd need is the ability to know the names of the arguments, which
CFFI doesn't provide. That would allow me to set up slightly more
robust heuristics. For instance, I'm currently treating a pointer
argument followed by an integer as a pointer to an array plus its
size. That's fine in current versions of LLVM, but it's not very
robust. The same goes for error messages: having the argument names
would help. But this is more of a CFFI issue.
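
A sketch of that heuristic, reading it as "a pointer argument
immediately followed by an integer argument is an array plus its
size". The helper name, the list of integer types and the
string-based type matching are assumptions made for this example; the
argument types are those of the real LLVMFunctionType function.

```python
# C integer types accepted as a size argument (an assumed, partial list).
INTEGER_TYPES = {"unsigned", "unsigned int", "int", "size_t", "uint64_t"}


def find_array_arguments(arg_types):
    """Return (pointer_index, size_index) pairs for pointer-then-integer args."""
    pairs = []
    for i in range(len(arg_types) - 1):
        is_pointer = arg_types[i].rstrip().endswith("*")
        is_integer = arg_types[i + 1].strip() in INTEGER_TYPES
        if is_pointer and is_integer:
            pairs.append((i, i + 1))
    return pairs


# LLVMTypeRef LLVMFunctionType(LLVMTypeRef ReturnType,
#                              LLVMTypeRef *ParamTypes,
#                              unsigned ParamCount,
#                              LLVMBool IsVarArg);
function_type_args = ["LLVMTypeRef", "LLVMTypeRef *", "unsigned", "LLVMBool"]
print(find_array_arguments(function_type_args))  # [(1, 2)]
```

Since only types are available, any pointer that merely happens to be
followed by an unrelated integer would be misclassified, which is where
argument names (e.g. a "Count" or "Num" prefix) would help.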

> I'm not familiar with the details of CFFI. Are the bindings it
> generates for a particular set of headers specific to the machine
> they're generated on? Or could the resulting bindings be published
> and reused directly? If so, hosting a set of bindings for previous
> releases would be a useful service.

I'm not entirely sure they're portable across OS/architectures. What
would be the use case? It takes a moment to generate the bindings but
it's something the module will lazily do for you only once.