Tool for run-time code generation?

Using C++ code, I would like to generate code at run-time (the same way .NET code can use dynamic methods or compiled expressions) in order to obtain very high performance code (to read binary data records whose formats are only known at run-time.) I need to target x86 (Win32) and ARM (WinCE).

Can LLVM be used for this purpose, or would something else work better? Are there any open-source projects that have done this, that I could look to as an example?

David Piepgrass, E.I.T.

Software Developer


David Piepgrass <dpiepgrass@mentoreng.com> writes:

Using C++ code, I would like to generate code at run-time (the same
way .NET code can use dynamic methods or compiled expressions) in
order to obtain very high performance code (to read binary data
records whose formats are only known at run-time.)

I need to target x86 (Win32) and ARM (WinCE).

Can LLVM be used for this purpose, or would something else work
better?

LLVM has a JIT for this purpose. You generate LLVM IR code (a sort of
generic assembler) and it produces optimized native code ready to be
executed.

x86-win32 is fine; I don't think that's the case for arm-wince.

Are there any open-source projects that have done this, that I
could look to as an example?

Here is a list of some of the projects that use LLVM:

http://www.llvm.org/Users.html

The LLVM tutorial is worth reading too:

http://www.llvm.org/docs/tutorial/

LLVM has a JIT for this purpose. You generate LLVM IR code (a sort of
generic assembler) and it produces optimized native code ready to be
executed.

x86-win32 is fine; I don't think that's the case for arm-wince.

What's wrong with running LLVM on ARM? It's supposed to support ARM as a target, and since it's written in C it should theoretically compile for ARM. CMake doesn't support Visual Studio 9 for Smart Devices, so I would probably have to go to quite a bit of trouble to create a project file. Still, if I did so, shouldn't it theoretically work?

I happen to be using LLVM for just this reason. I process large volumes of data records with schemas that are only known at runtime and/or can change dynamically as various transforms are applied to such records at various stages.

To this end, I auto-generate C99 syntax at run time, parse it using clang, do some AST transformations, compile using the LLVM JIT, and then execute within the same (C++) process. As a point of comparison, I've done similar things with Java bytecode, and while the JVM approach had a [much] easier learning curve and better documentation, it is hard to complain about the level of control you get with LLVM. It is like having a WISS (When I Say So) dynamic compiler producing -O3-level native code. In my case, I target x86-64 and had only minor trouble supporting the same toolchain on Linux and Darwin.
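As a concrete (and entirely hypothetical) illustration of the first stage of such a pipeline: a field descriptor for a runtime-known schema is turned into C99 source text, which clang would then parse and the JIT compile. The `Field` type, names, and record layout below are invented for the sketch:

```cpp
#include <sstream>
#include <string>

// Hypothetical field descriptor for a runtime-known record schema.
struct Field {
    std::string name;   // column name
    std::string ctype;  // C99 type, e.g. "double"
    unsigned offset;    // byte offset within the flat record
};

// Emit a C99 accessor for one field of a flat binary record.
// The generated text would be handed to clang and JIT-compiled.
std::string emitAccessor(const Field& f) {
    std::ostringstream os;
    os << f.ctype << " get_" << f.name
       << "(const unsigned char* rec) {\n"
       << "    return *(const " << f.ctype << "*)(rec + " << f.offset << ");\n"
       << "}\n";
    return os.str();
}
```

The parsing and compilation cost of such snippets is paid once per schema, which is why it amortizes well over millions of records.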

So, the approach is definitely workable, but I must warn about the non-trivial amount of effort required to figure things out in both clang and LLVM codebases. For example, how to traverse or mutate ASTs produced by clang is pretty much a FAQ on the clang list but there is no good documentation addressing this very common use case.

HTH,
Vlad

David Piepgrass <dpiepgrass@mentoreng.com> writes:

LLVM has a JIT for this purpose. You generate LLVM IR code (a sort of
generic assembler) and it produces optimized native code ready to be
executed.

x86-win32 is fine; I don't think that's the case for arm-wince.

What's wrong with running LLVM on ARM?

LLVM can generate code for ARM, but the JIT requires extra target and
platform dependent stuff, and that's not done for arm-wince.

It's supposed to support ARM as a target, and since it's written in C
it should theoretically compile for ARM.

LLVM is C++, although it has C bindings.

CMake doesn't support Visual Studio 9 for Smart Devices, so I would
probably have to go to quite a bit of trouble to create a project
file. Still, if I did so, shouldn't it theoretically work?

The VS project files should be the least of your concerns wrt this
issue. You can ask here for a rough estimation of how hard would be to
add support for arm-wince and then consider your options.

> What's wrong with running LLVM on ARM?

LLVM can generate code for ARM, but the JIT requires extra target and
platform dependent stuff, and that's not done for arm-wince.

The release notes say "compiler_rt now supports ARM targets". What else is needed? Keep in mind that I do not need (or want) Clang or any of the optimizers: I just want to generate small sequences of machine code in-memory (on the heap, I assume) and execute them.

By the way, LLVM comes with 66 projects (BrainF, bugpoint, Kaleidoscope, llc, etc.) and it's not obvious which one(s) I actually need to accomplish this "simple" use of LLVM. Is there a list of all the targets and what they are for?

> It's supposed to support ARM as a target, and since it's written in C
> it should theoretically compile for ARM.

LLVM is C++, although it has C bindings.

Right, sorry, that's what I meant.

Have you ever tried running dynamically generated code that cannot call
malloc() or any runtime libraries or library-based functions because they
haven't been ported to your architecture? That's what it would be like trying
to run ARM Linux code on your WinCE target. Windows CE requires an LLVM port
before it will be useful to you. Put simply, the core of the LLVM ARM backend
requires porting before it will work there.

Have you ever tried running dynamically generated code that cannot call
malloc() or any runtime libraries or library-based functions because
they haven't been ported to your architecture? That's what it would
be like trying to run ARM Linux code on your WinCE target. Windows CE
requires an LLVM port before it will be useful to you. Put simply,
the core of the LLVM ARM backend requires porting before it will work
there.

WinCE offers most of the standard C APIs (such as malloc), as well as most standard C++ libraries including, I think, the same STL that is available under Win32. There could be some gaps, but I hope, given that I only need part of LLVM, that any "porting" I have to do will involve only small tweaks.

If you mean that dynamically generated code cannot call malloc (why not? Couldn't I provide a table that tells LLVM the address of malloc and any other methods I want to call?), even then it is not a problem, since the code I want to generate will simply access records and do some arithmetic. For example, I might generate a function that does this:

typedef unsigned char byte;

int ReadArrayElement(byte* record, int index, int defValue) {
    unsigned arraySize = *(unsigned*)(record + 4) & 0x7F;
    if ((unsigned)index >= arraySize)  // >= since indices are zero-based
        return defValue;
    return *(int*)(record + 8 + index * 4);
}

No method calls there.
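To make the intended use concrete, here is a sketch, under the record layout assumed in the example above (length word at offset 4, elements from offset 8), of how the host process would invoke such a function through a plain typed function pointer. The statically compiled body stands in for the JIT-emitted one; with LLVM's JIT the address would come from the execution engine rather than the linker:

```c
#include <string.h>

typedef unsigned char byte;

/* Same logic as the example above; at run time this body would be
 * JIT-emitted and reached through the function pointer below. */
static int ReadArrayElement(byte* record, int index, int defValue) {
    unsigned arraySize = *(unsigned*)(record + 4) & 0x7F;
    if ((unsigned)index >= arraySize)
        return defValue;
    return *(int*)(record + 8 + index * 4);
}

typedef int (*ReaderFn)(byte*, int, int);

int readThird(byte* record) {
    /* With a real JIT this address would come from the code generator,
     * e.g. LLVM ExecutionEngine::getPointerToFunction(). */
    ReaderFn fn = ReadArrayElement;
    return fn(record, 2, -1);  /* read element 2, default -1 */
}
```

The only platform-sensitive part of this sketch is the cast from generated-code address to `ReaderFn`, which is exactly where the calling-convention concerns raised below come in.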

There's one very big one named ReadArrayElement() right in the middle of what
you just typed. You'd have to assume the calling conventions of the ARM code
are the same as the x86 code (which is impossible because of the different
architectures), or that the calling convention is the same as ARM Linux. You'd
need to implement the Windows CE ABIs into the backend to be able to define the
C calling convention to call that code. I think you're trivializing something
that is more complicated than you are giving it credit for.

He does mean exactly that. Being able to call a function means that you need to know how the system's calling convention works (what parameters go in register/on the stack, where return values go, etc.). Not just that: accessing a record requires you to know how the record is laid out in memory, which is also governed by platform-specific rules in the translation from C structs to actual memory layouts (complete with padding, etc).

That said, if you're very, very careful to control layouts by hand and only to call functions whose calling convention you can model (maybe write your own prologue in ASM?), you could probably coerce it into working.

--Owen
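Owen's "control layouts by hand" can be made concrete. A minimal sketch (names invented) contrasting a compiler-laid-out struct, whose padding is ABI-specific, with fixed byte offsets that are the same on every platform:

```c
#include <stddef.h>
#include <string.h>

/* A compiler-laid-out struct: the offset of 'value' depends on the
 * platform's alignment rules (commonly 8, but not guaranteed). */
struct Naive {
    char tag;
    double value;
};

/* Hand-controlled layout: fixed byte offsets, independent of ABI. */
enum { TAG_OFF = 0, VALUE_OFF = 1, REC_SIZE = 9 };

void packRecord(unsigned char* buf, char tag, double value) {
    buf[TAG_OFF] = (unsigned char)tag;
    memcpy(buf + VALUE_OFF, &value, sizeof value); /* unaligned-safe */
}

double unpackValue(const unsigned char* buf) {
    double v;
    memcpy(&v, buf + VALUE_OFF, sizeof v);
    return v;
}
```

Generated code that reads and writes such hand-specified offsets sidesteps the struct-layout half of the portability problem; the calling-convention half still has to be modeled per platform.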

If you mean that dynamically generated code cannot call malloc (why not? Couldn't I provide a table that tells LLVM the address of malloc and any other methods I want to call?), even then it is not a problem, since the code I want to generate will simply access records and do some arithmetic. For example, I might generate a function that does this:

At the least, you will need to implement the WinCE ARM ABI and port the JIT to WinCE...

Vlad wrote:

I happen to be using LLVM for just this reason. I process large volumes
of data records with schemas that are only known at runtime and/or can
change dynamically as various transforms are applied to such records at
various stages.

To this end, I auto-generate C99 syntax at run time, parse it using
clang, do some AST transformations, compile using LLVM JIT, and then
execute within the same (C++) process. As a point of comparison, I've
done similar things with Java bytecode and while the JVM approach was
[much] easier learning curve- and documentation-wise, it is hard to
complain about the level of control you get with LLVM. It is like having
a WISS (When I Say So) dynamic compiler producing -O3-level native code.
In my case, I target x86-64 and had only minor trouble supporting the
same toolchain on Linux and Darwin.

I strongly recommend that anyone doing this sort of specialization not write a system that generates C code as strings and then parses it, unless you happen to be starting with a system that already prints C.

Instead, break the chunks of C you would generate into functions and compile those ahead-of-time. At run time you use llvm only (no clang) to generate a series of function calls into those functions.

Then you can play tricks with that. Instead of fully compiling those functions ahead of time (i.e., to .o files), you can compile them into .bc and create an LLVM Module out of it, either by loading it from a file or by using 'llc -march=cpp' to create C++ code using the LLVM API that produces said module when run. With your run-time generated code and the library code in the same Module, you can run the inliner before the other optimizers.

Alternatively, if your chunks of C are very small, you may find it easiest to just produce LLVM IR in memory using the LLVM API directly. See the LLVM language reference at llvm.org/docs/LangRef.html and IRBuilder at http://llvm.org/doxygen/classllvm_1_1IRBuilder.html .

Either of these techniques avoids the need to use clang at run time and the cost of generating large strings just to re-parse them. Since the optimizers are all in LLVM proper, you should get the exact same assembly out.

Nick

In general, the suggestion makes sense. In my case, however, some of the content is actually human-generated and is somewhat free-form. So, some parseable entry language is a requirement and I chose C99 since it seemed to be well supported by clang. The auto-generated part comes into play only as glue that binds the C code variables with the rest of the parent process' runtime.

Also, any overhead of parsing and compiling is well amortized over subsequent processing time (millions of records).

Cheers,
Vlad

Martin C. Martin wrote:

Nick wrote:

Instead, break the chunks of C you would generate into functions and
compile those ahead-of-time. At run time you use llvm only (no clang) to
generate a series of function calls into those functions.

Compelling. I hadn't considered that.

In our application, we have a tree of primitive operations, where each
one calls into its children and returns to its parent. There are various
types of nodes, and we don't know the topology or types of nodes until
runtime (i.e. Clang/LLVM invocation time). Each operation is pretty
simple, so we'd like to inline the children's operations into the parent
where a C compiler would do it.

Could your technique be extended to that? e.g. precompile to LLVM IR
with calls to a non-existent "node_do_foo()" call, and then replace it
with the specific "childtype_do_foo()" call when we know the type of the
child?

Will you know the prototype of the function call in advance? If so, you can do something really simple where you write the C functions with a function pointer parameter. Then at run-time, use llvm::CloneAndPruneFunctionInto to produce the specialized function by giving it a valuemap that maps the Argument* for the fptr to the concrete Function* you want it to call.

If you don't know the type of the call you intend to place, the question becomes "why not?" What arguments were you planning to pass it, if you don't know how many arguments it takes in advance? I don't see any reason to doubt that it's possible to do, but I would need more details before I could suggest an implementation.
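The specialization Nick describes happens on LLVM IR, but its effect can be sketched at the source level (all names here are hypothetical): a generic function takes the child operation as a function-pointer parameter, and "cloning with a value map" corresponds to baking a concrete callee into a copy so the compiler's inliner can collapse it:

```cpp
// Generic tree operation parameterized by a child callback,
// mirroring the C functions compiled ahead of time to bitcode.
typedef int (*SeekFn)(int key);

int genericNode(SeekFn childSeek, int key) {
    // ... generic node bookkeeping would go here ...
    return childSeek(key) + 1;
}

// Two possible children; which one is used is only known at run time.
int intChildSeek(int key)    { return key * 2; }
int doubleChildSeek(int key) { return key * 3; }

// Source-level analogue of CloneAndPruneFunctionInto with a value map:
// instantiation bakes the callee in as a constant, so the compiler can
// inline it, just as LLVM's inliner can once the fptr is a Function*.
template <SeekFn Child>
int specializedNode(int key) {
    return genericNode(Child, key);
}
```

At the IR level the value map plays the role of the template argument here: it substitutes the concrete `Function*` for the `Argument*` while cloning.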

Since the
optimizers are all in LLVM proper, you should get the exact same
assembly out.

This is something I've been wondering. Since Clang has different
information than the LLVM IR, it seems there should be some
optimizations that would be easy to do in Clang, but
difficult/impossible to do in LLVM. No? Not even for C++?

Yes. Copy constructor elimination is permitted by C++ even though it may change the visible behaviour of the program. I think NRVO is also done in the frontend.

The sterling example is type-based alias analysis (which most people know of as -fstrict-aliasing in GCC). C++ has rules which state that sometimes pointers can't alias depending on their types. The LLVM type system is not the C++ type system so we can't just apply those rules, and Clang doesn't have the sort of low-level optimizations that would benefit from this information. The eventual plan is to make clang tag loads and stores with metadata indicating what 'aliasing group' they belong to, then provide an alias analysis pass in LLVM which uses that information.

Nick

Martin C. Martin wrote:

Nick wrote:

Instead, break the chunks of C you would generate into functions and
compile those ahead-of-time. At run time you use llvm only (no clang) to
generate a series of function calls into those functions.

Compelling. I hadn't considered that.

In our application, we have a tree of primitive operations, where each
one calls into its children and returns to its parent. There are various
types of nodes, and we don't know the topology or types of nodes until
runtime (i.e. Clang/LLVM invocation time). Each operation is pretty
simple, so we'd like to inline the children's operations into the parent
where a C compiler would do it.

Could your technique be extended to that? e.g. precompile to LLVM IR
with calls to a non-existent "node_do_foo()" call, and then replace it
with the specific "childtype_do_foo()" call when we know the type of the
child?

Will you know the prototype of the function call in advance? If so, you
can do something really simple where you write the C functions with a
function pointer parameter. Then at run-time, use
llvm::CloneAndPruneFunctionInto to produce the specialized function by
giving it a valuemap that maps the Argument* for the fptr to the
concrete Function* you want it to call.

Great! I wasn't aware of that, so that's really helpful.

If you don't know the type of the call you intend to place, the question
becomes "why not?" What arguments were you planning to pass it, if you
don't know how many arguments it takes in advance? I don't see any
reason to doubt that it's possible to do, but I would need more details
before I could suggest an implementation.

We're processing large amounts of data, so imagine something like a SQL implementation. In one query I might want to JOIN on an int field. In another query, I'm JOINing on a pair of fields of type unsigned & double. For those two queries, I'd generate:

rightChildType_seekTo(rightChild, left.getIntColumn3());

vs.

rightChildType_seekTo(rightChild, left.getUIntColumn9(), left.getDoubleColumn23());

Is that possible in LLVM?

Best,
Martin

Martin C. Martin wrote:

Martin C. Martin wrote:

Nick wrote:

Instead, break the chunks of C you would generate into functions and
compile those ahead-of-time. At run time you use llvm only (no
clang) to
generate a series of function calls into those functions.

Compelling. I hadn't considered that.

In our application, we have a tree of primitive operations, where each
one calls into its children and returns to its parent. There are various
types of nodes, and we don't know the topology or types of nodes until
runtime (i.e. Clang/LLVM invocation time). Each operation is pretty
simple, so we'd like to inline the children's operations into the parent
where a C compiler would do it.

Could your technique be extended to that? e.g. precompile to LLVM IR
with calls to a non-existent "node_do_foo()" call, and then replace it
with the specific "childtype_do_foo()" call when we know the type of the
child?

Will you know the prototype of the function call in advance? If so, you
can do something really simple where you write the C functions with a
function pointer parameter. Then at run-time, use
llvm::CloneAndPruneFunctionInto to produce the specialized function by
giving it a valuemap that maps the Argument* for the fptr to the
concrete Function* you want it to call.

Great! I wasn't aware of that, so that's really helpful.

If you don't know the type of the call you intend to place, the question
becomes "why not?" What arguments were you planning to pass it, if you
don't know how many arguments it takes in advance? I don't see any
reason to doubt that it's possible to do, but I would need more details
before I could suggest an implementation.

We're processing large amounts of data, so imagine something like a SQL
implementation. In one query I might want to JOIN on an int field. In
another query, I'm JOINing on a pair of fields of type unsigned &
double. For those two queries, I'd generate:

rightChildType_seekTo(rightChild, left.getIntColumn3());

vs.

rightChildType_seekTo(rightChild, left.getUIntColumn9(),
left.getDoubleColumn23());

Is that possible in LLVM?

It's possible, but it's a bit more work. Breaking it down into:

   a = left.getUIntColumn9()
   b = left.getDoubleColumn23()
   rightChildType_seekTo(rightChild, a, b)

shows that you'd have to emit the calls to your getters first and then pass their results as arguments to seekTo. This isn't that hard; it would look something like:

   Function *seekTo = ...;  // looked up based on the join fields
   std::vector<Value *> Args;
   Args.push_back(rightChild);
   for each field:
     Args.push_back(Builder.CreateCall(...));  // call that field's getter
   Builder.CreateCall(seekTo, Args.begin(), Args.end());

Maybe the solution is to produce entire functions through the LLVM API, where those functions call out to the leaf functions that you wrote in C and compiled to IR.

If your functions are small enough then you should probably just produce each one with the IRBuilder. If they're large and you only want to add specialized code right in the middle, it may work best to write the function in C with a call to a special 'replace_me' function. Then, your custom specialization pass would clone the function and find the special call using a technique like this:

   http://llvm.org/docs/ProgrammersManual.html

erase it via eraseFromParent(), then insert the replacement code.
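The 'replace_me' pattern might look like the following at the C level (names hypothetical; the stub definition only makes the sketch self-contained, since in the real system the specializer supplies that code on IR):

```c
/* Marker function the specialization pass would search for. */
int replace_me(int key);

/* A larger function written in C and compiled to bitcode ahead of
 * time. At run time, the pass clones it, locates the call to
 * replace_me(), erases it with eraseFromParent(), and splices the
 * specialized code in its place. */
int processRecord(int key) {
    int acc = key * 31;        /* ...generic preamble... */
    acc += replace_me(key);    /* placeholder call to be replaced */
    return acc ^ 0x5A;         /* ...generic postamble... */
}

/* Stub so the sketch links and runs as-is; the real body is
 * supplied by the run-time specializer instead. */
int replace_me(int key) { return key; }
```

Keeping the marker a real, declared function means the ahead-of-time compile stays well-formed, and the pass only has to find call sites rather than parse anything.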

Nick