Can I use Clang to parse snippets of C++ code?

Hello guys,

I'd like to use Clang to parse snippets of (and emit bytecode for) C++ code that come from larger files that don't contain only C++, but looking at the clang interpreter example, either I didn't get it, or it looks like the driver expects only files, and not strings or char buffers.

Is there a simple way to achieve this? Do I have to split my input into small files and pass them to clang, or is there a better way?

Félix

The semantics of C++ depend heavily on what comes before the given
fragment. How do you plan to address this? For example, if you know
all the headers you think these snippets will include, you can do
something similar to PCH to parse the fragment in context of all of
the headers.
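
As a rough illustration of parsing a fragment "in context", here is a minimal sketch that builds a complete translation unit around one snippet by prepending the known headers and a standard `#line` directive, so that diagnostics point back at the original file. The helper name `wrapSnippet` and its parameters are hypothetical, and the resulting string still has to be handed to Clang somehow (as a temporary file or an in-memory buffer); that part is not shown.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of Clang): wraps one snippet in the context
// it needs. `headers` are the includes known in advance; `originFile` and
// `originLine` say where the snippet came from in the larger file.
std::string wrapSnippet(const std::vector<std::string>& headers,
                        const std::string& snippet,
                        const std::string& originFile,
                        unsigned originLine) {
    assert(originLine >= 1 && "line numbers are 1-based");

    std::string unit;
    for (const std::string& header : headers)
        unit += "#include \"" + header + "\"\n";

    // #line makes the compiler report the next source line as
    // `originLine` of `originFile`.
    // Assumption: originFile contains no quotes or backslashes.
    unit += "#line " + std::to_string(originLine) + " \"" + originFile + "\"\n";

    unit += snippet;
    if (snippet.empty() || snippet.back() != '\n')
        unit += '\n';
    return unit;
}
```

Since the header block is the same for every snippet, it is also the part that something PCH-like could precompile once.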

I don't know much about feeding clang buffers instead of files, but I
believe it can be done with some of the "virtual" file support that has
been added recently.

Reid

Yup, I know beforehand everything that needs to be included or declared, so this is not an issue. It *would* work if I made individual files, but it doesn't look like a very intuitive way to do it.

I'll look into PCH. I was a little bit startled when I opened the index.html file in docs/; at least there seems to be a lot of documentation inside the code.

Félix

It depends on exactly what you're trying to achieve. LLDB uses clang to parse individual expressions, and does this by hooking into various name lookup routines to dynamically/lazily populate symbol tables from debug info.

This is all possible, but it's a nontrivial amount of work.

-Chris

I believe that what I’m trying to do with Clang is fairly simple; the final goal, however, might be a little harder.

Knowing myself, chances are I’ll never go through with this project (like Mikael who posted earlier, I’m nothing but an enthusiastic student with lots of time on my hands), but it feels cool enough to me to announce my idea. Besides, I’ll probably need help from more knowledgeable people anyways.

The way LLVM works makes it pretty easy and straightforward to generate code from nested structures (like ASTs), which is totally commendable since LLVM is a compiler back-end. However, in the past months I’ve set out to make an emulator back-end with LLVM that would translate machine code to LLBC, then compile it to native code with the JIT, and my experience hasn’t been so great, especially because of the following:

  • it’s stupid-hard to debug just-in-time generated code with the version of gdb that ships with Xcode (it repeatedly crashed on me);
  • the sheer number of cases to treat is, in itself, daunting enough: a NES 6502 is ‘fine’ with just under 60 distinct operations, but the full-fledged PowerPC you get with a GameCube has roughly six times as many;
  • when faced with subtle bugs, it’s much easier to deal with C++ code representing what you want to do (like interpreter code) than IRBuilder::Create* calls.

I figured that while I can’t do much about the first, if I could get LLVM to generate code that would generate code, the two others would be much less cumbersome.

So my plan is to write a tool that accepts a specification of how instructions should be interpreted, with handlers written in C++, and turns that into a usable recompiler (that would also use LLVM libraries). The grammar would be a shell for C++ code, and I’d use Clang to turn the actual code into LLBC; then, I would walk through that code (à la llvm2cpp) and create calls to an IRBuilder to generate equivalent code. Once this generated class is compiled (through regular means), clients would call the appropriate methods on the object to generate code, and would finally be able to get a Function to use with the JIT.
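
To make the "code that generates code" step more concrete, here is a toy sketch of the last stage only: it takes an already-flattened list of two-operand operations and prints the C++ source that would rebuild them through an IRBuilder. The `Op` struct, the `emitBuilderCalls` function, and the variable name `builder` in the emitted text are all hypothetical; a real tool would walk the actual LLVM instructions rather than this simplified stand-in, and would have to handle far more than five opcodes.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one two-operand instruction.
struct Op {
    std::string result;  // name of the value being defined
    std::string opcode;  // "add", "sub", "and", "or" or "xor"
    std::string lhs;     // name of the left operand
    std::string rhs;     // name of the right operand
};

// Emits C++ source text, one IRBuilder call per operation. The emitted code
// assumes a variable named `builder` is in scope where it gets pasted.
std::string emitBuilderCalls(const std::vector<Op>& ops) {
    static const std::map<std::string, std::string> creators = {
        {"add", "CreateAdd"}, {"sub", "CreateSub"}, {"and", "CreateAnd"},
        {"or", "CreateOr"},   {"xor", "CreateXor"},
    };

    std::string out;
    for (const Op& op : ops) {
        assert(!op.result.empty() && "every operation needs a result name");
        auto it = creators.find(op.opcode);
        if (it == creators.end())
            throw std::invalid_argument("unsupported opcode: " + op.opcode);
        out += "llvm::Value* " + op.result + " = builder." + it->second +
               "(" + op.lhs + ", " + op.rhs + ", \"" + op.result + "\");\n";
    }
    return out;
}
```

For an `Op` with result `sum`, opcode `add`, and operands `ra` and `rb`, this produces the line `llvm::Value* sum = builder.CreateAdd(ra, rb, "sum");`, which is the kind of statement the generated recompiler class would contain.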

I’ve attached an example grammar and an example of the expected output (made on the train, it’s not actually working, but it gives a good idea).

example.txt (2.64 KB)

Sounds a lot like LLVM's TableGen .td file format, no? It uses [{ ... }] syntax to include snippets of C++ code.
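
For what it's worth, pulling `[{ ... }]` blocks out of a larger file takes only a few lines. The sketch below is a naive scanner, not TableGen's actual lexer; the `Snippet` struct and `extractSnippets` function are hypothetical names. It records the line on which each block opens so that diagnostics can later be mapped back to the original file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one embedded code block.
struct Snippet {
    std::string code;  // text between "[{" and "}]", delimiters excluded
    unsigned line;     // 1-based line on which the "[{" appears
};

// Naive scan for "[{ ... }]" blocks. Limitation: a "}]" inside the C++ code
// itself (in a string literal, or something like `a[T{}]`) ends the block early.
std::vector<Snippet> extractSnippets(const std::string& input) {
    std::vector<Snippet> snippets;
    std::size_t pos = 0;
    unsigned line = 1;

    while (true) {
        std::size_t open = input.find("[{", pos);
        if (open == std::string::npos) break;
        std::size_t close = input.find("}]", open + 2);
        if (close == std::string::npos) break;  // unterminated block: stop
        assert(close >= open + 2);

        // Advance the line counter up to the opening delimiter.
        for (std::size_t i = pos; i < open; ++i)
            if (input[i] == '\n') ++line;

        snippets.push_back({input.substr(open + 2, close - open - 2), line});

        // Account for newlines inside the block before moving on.
        for (std::size_t i = open; i < close; ++i)
            if (input[i] == '\n') ++line;
        pos = close + 2;
    }
    return snippets;
}
```

Each extracted snippet can then be compiled on its own, with whatever headers and declarations it needs placed in front of it.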

If this is specifically about emulating PowerPC, I would recommend extending an existing emulator like QEMU, which also emulates some of the hardware you'll need for the CPU to do anything useful. Having said that, it might be interesting to use LLVM as a TCG backend.

Andreas