Prototype of an LLVM IR => C compiler ("C backend")

Hi everyone,

I wrote a small proof-of-concept “C backend” (see later for why the quotes are there). It allows compiling LLVM IR into C. It works on top of the Emscripten asm.js backend, basically doing an AST-level transform on the asm.js output and turning that into C, so the process is

C/C++ => LLVM IR => asm.js => C

Hence the quotes: it is not currently written as an LLVM backend. However, if there is interest, this quick hack could be turned into one. Emscripten’s current backend, which emits asm.js, could be refactored to emit either asm.js or C. All that is needed is to allow customization of the output code in the areas where asm.js and C look different, and the differences are almost all purely superficial.

The goal of the project was to check feasibility. While the Emscripten backend emits asm.js, which closely parallels C, the output has a particular form that might in theory prevent efficient compilation to C. For example, all memory accesses go through a single large array. The good news is that it looks like those issues are not showstoppers, and performance is quite good as well:

benchmark     x slower than original
copy          1.03
corrections   1.00
fannkuch      1.17
fasta         0.81
memops        1.00
primes        1.01
skinning      1.06
box2d         1.04
zlib          1.19

Numbers are how much slower the C backend output is, when compiled natively, compared to the original source also compiled natively. So 1.03 means the C backend output is 3% slower, etc.
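
To spell out how the numbers above read, here is a minimal sketch in C. It assumes each ratio is a running-time quotient (generated C time divided by original time), which the post implies but does not state outright. The function names and the timings in the usage note are hypothetical and are not taken from the benchmark harness.

```c
#include <assert.h>

/* Ratio of the generated C program's running time to the original's.
 * Assumption: both timings are in the same unit, measured on natively
 * compiled builds of the same benchmark. */
double slowdown_ratio(double generated_seconds, double original_seconds) {
    assert(original_seconds > 0.0);
    return generated_seconds / original_seconds;
}

/* Turn a ratio into a signed percentage:
 * positive means slower than the original, negative means faster. */
double percent_slower(double ratio) {
    return (ratio - 1.0) * 100.0;
}
```

With made-up timings of 2.06 s for the generated C and 2.00 s for the original, `slowdown_ratio` gives 1.03, and `percent_slower(1.03)` is about 3, matching the "3% slower" reading. A ratio below 1, such as fasta's 0.81, comes out negative, meaning faster than the original.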

I’m not sure why fasta becomes 19% faster; that is quite puzzling (I verified the output is correct, so it isn’t just running a different code path). The other results show slowdowns between 0% and 19%, with an average of something like 10%. So the C++ -> LLVM IR -> asm.js -> C route preserves performance very close to the original, and in theory this could allow things like compiling C++11 code to C, which can then run on platforms without C++11 support (like the Xbox).

Some limitations:

  • asm.js is a 32-bit arch, so the output is 32-bit. It compiles with -m32 on 64-bit systems though.
  • Emscripten’s output uses the Emscripten system headers, portable libc, etc., and those are used in the output as well. It connects to the native libc for actual printf and stuff like that, but handles almost all libc stuff itself. This could be changed though.
  • The generated C code is standalone; it can’t be linked with other C or C++ code.
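
To make the single-array memory model and the 32-bit point more concrete, here is a rough sketch of how asm.js-style code can look once written as C. This is not actual output of tools/c_backend.py. The names `HEAP8`, `HEAP_SIZE`, `load_i32`, `store_i32` and `sum_i32` are all hypothetical, and the bounds checks are added only for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical heap size; all program memory lives in this one array. */
#define HEAP_SIZE 65536u

static uint8_t HEAP8[HEAP_SIZE];

/* A "pointer" is a 32-bit offset into HEAP8 rather than a native pointer.
 * The asserts are illustrative checks, not something asm.js itself does. */
int32_t load_i32(uint32_t ptr) {
    assert(ptr % 4 == 0 && ptr <= HEAP_SIZE - 4);
    int32_t value;
    memcpy(&value, &HEAP8[ptr], sizeof value);
    return value;
}

void store_i32(uint32_t ptr, int32_t value) {
    assert(ptr % 4 == 0 && ptr <= HEAP_SIZE - 4);
    memcpy(&HEAP8[ptr], &value, sizeof value);
}

/* Roughly what `int sum(int *a, int n)` could turn into:
 * the int* argument becomes a 32-bit offset into the heap. */
int32_t sum_i32(uint32_t a, int32_t n) {
    int32_t total = 0;
    for (int32_t i = 0; i < n; i++) {
        total += load_i32(a + 4u * (uint32_t)i);
    }
    return total;
}
```

For instance, after storing 1, 2 and 3 at offsets 16, 20 and 24 with `store_i32`, calling `sum_i32(16, 3)` adds them up through the heap array. A native C compiler can still optimize such loads and stores well, which fits the benchmark ratios staying close to 1.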

The Emscripten LLVM backend is still a work in progress and not upstream, but if there is interest in a C backend based on it, we could work towards that and hopefully upstream both at some point. What the proof of concept shows so far is that overall this compilation approach works well enough to be the basis for a C backend with decent performance.

Code is in emscripten’s ‘c_backend’ branch, https://github.com/kripken/emscripten/tree/c_backend (see tools/c_backend.py)

Thoughts and feedback welcome. I hope this is interesting to some people.

- Alon

I’m really pleased to know this functionality exists. My use case for LLVM ==> C involves “compiling” pre-trained machine learning algorithms into single C source files. For this scenario, I don’t expect much slowdown due to the intermediate JS representation, and was anticipating doing myself what you’ve done now that there is no official LLVM C backend. I hope this functionality gets supported by Emscripten regardless of whether or not it’s upstreamed to LLVM!
Thanks,

Josh