Rolling my own LLVM assembly language parser

Hi everyone,

I'm currently in the first year of my PhD, and I'm going to be looking at an
experimental IR for my thesis. After looking at a variety of research
compilers I've come to the conclusion that LLVM is the nicest to work with
for my purposes. I was considering writing the code to construct this
experimental IR from LLVM assembly, and then at the end of the process (i.e.
new optimisations and transformations, etc.) I'll translate back into LLVM
assembly to allow compilation to continue. This keeps everything modular,
means I can work with "real" languages rather than toy ones, and so on.

I was thinking of generating my own lexer and parser for LLVM assembly. I'm
aware that between the specification here:

http://llvm.org/docs/LangRef.html

and also the comments in LLParser.cpp there is information about the grammar
for .ll files, but is there any documentation that simply states the full
grammar, much in the style of this C grammar:

http://www.lysator.liu.se/c/ANSI-C-grammar-y.html

or this Python grammar?

http://www.python.org/doc/2.5.2/ref/grammar.txt
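
For a sense of what the lexer half of that job involves, here is a minimal sketch of a regex-based tokenizer for a very small subset of .ll syntax. The token classes and their names are assumptions made up for this example and are not taken from LangRef or LLParser.cpp. The real lexer handles much more (quoted names, floating-point and hex constants, metadata, string constants, attributes), so treat this as an illustration rather than a grammar.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Token classes for a small, illustrative subset of .ll syntax.
# Order matters: earlier alternatives win at the same position.
TOKEN_SPEC = [
    ("COMMENT", r";[^\n]*"),                      # ';' to end of line
    ("WS", r"\s+"),
    ("GLOBAL", r"@[A-Za-z$._][A-Za-z$._0-9]*"),   # e.g. @main
    ("LOCAL", r"%[A-Za-z$._0-9]+"),               # e.g. %sum or %0
    ("INT_TYPE", r"i[0-9]+\b"),                   # e.g. i32
    ("INT", r"-?[0-9]+"),
    ("WORD", r"[A-Za-z_][A-Za-z0-9_.]*"),         # keywords, opcodes, labels
    ("PUNCT", r"[=,(){}\[\]*:]"),
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens from source, skipping whitespace and comments."""
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = match.end()
        if match.lastgroup not in ("COMMENT", "WS"):
            yield Token(match.lastgroup, match.group())


if __name__ == "__main__":
    example = "define i32 @add(i32 %a, i32 %b) {\n  %sum = add i32 %a, %b\n  ret i32 %sum\n}"
    for token in tokenize(example):
        print(token.kind, token.text)
```

Even this toy version shows why a single published grammar would help: every token class above had to be guessed from examples and would need checking against the actual lexer.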

Does anyone have any thoughts or experience with parsing .ll files? I'd like
to add that if I've managed to somehow miss the document I'm looking for,
then I'm willing to make a paper dunce hat and wear it for the rest of the
week.

Best,

James

jstanier wrote:

Hi everyone,

I'm currently in the first year of my PhD, and I'm going to be looking at an
experimental IR for my thesis. After looking at a variety of research
compilers I've come to the conclusion that LLVM is the nicest to work with
for my purposes. I was considering writing the code to construct this
experimental IR from LLVM assembly, and then at the end of the process (i.e.
new optimisations and transformations, etc.) I'll translate back into LLVM
assembly to allow compilation to continue. This keeps everything modular,
means I can work with "real" languages rather than toy ones, and so on.
  

It seems to me that an easier approach would be to convert in-memory
LLVM IR to your experimental IR using the LLVM libraries, then do
whatever you do with your experimental IR, and then convert the code
back into in-memory LLVM IR for LLVM based optimizations and code
generation. The libraries for manipulating LLVM IR are well designed,
reasonably documented, and seem to me to be easier to work with than
creating your own LLVM IR parser.

-- John T.

Hello,

I was thinking of generating my own lexer and parser for LLVM assembly. I'm
aware that between the specification here:

Why do you need this?
There is already a parser library inside LLVM framework and you can
use it directly without any problems.

Thank you both for your answers.

The only reason I was interested in not using the built-in parsing library
was that it would give me more flexibility over the language I program in,
but if it means brushing up on my C++ then this isn't too much of a problem
either.

With regards to using the in-memory LLVM, that's also a good approach.
However, I was thinking of structuring my thesis work as standalone tools
(much like llvm-as and the others) as it would help me structure my work
better. I hope this makes sense.

Best,

James

You can achieve the 'standalone tools' effect using LLVM bitcode,
which is a binary IR format.

Then you can have:
source code -> [llvm-gcc frontend] -> bc
bc -> [your tool using llvm bitcode reader writer library, doing
whatever mutations to the IR you want and spitting out bc again] -> bc
bc -> [llvm backend (llc)/llvm-as/...] -> native code.

Doesn't that meet your requirements?
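
As a rough illustration, the three stages above could be driven from a small script like the one below. The name `my-ir-tool` and its command line are hypothetical stand-ins for the standalone tool described in the second stage. The `llvm-gcc` and `llc` invocations follow the usual `-emit-llvm -c` and `-o` forms, and `llc` is assumed here to emit native assembly (a `.s` file) that still has to be assembled and linked.

```python
import os
import subprocess
from typing import List


def build_pipeline(source: str, my_tool: str = "my-ir-tool") -> List[List[str]]:
    """Return the command lines for the three-stage bitcode pipeline.

    my_tool is a hypothetical executable that reads bitcode, transforms
    it, and writes bitcode back out.
    """
    stem, _ext = os.path.splitext(source)
    return [
        # source code -> bc
        ["llvm-gcc", "-emit-llvm", "-c", source, "-o", f"{stem}.bc"],
        # bc -> bc (hypothetical tool and flags)
        [my_tool, f"{stem}.bc", "-o", f"{stem}.opt.bc"],
        # bc -> native assembly
        ["llc", f"{stem}.opt.bc", "-o", f"{stem}.s"],
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each stage in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in build_pipeline("hello.c"):
        print(" ".join(command))
```

Because each stage only communicates through .bc files, the middle tool can be swapped out or rerun without touching the frontend or backend steps.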

That sounds promising, as far as I can tell!

Thanks very much. It's the end of the day now, but I'll be getting stuck in
tomorrow.

Best,

James

I assume James is only considering reinventing this wheel because he is not
using C++. LLVM does not play nice with other languages because C++ is quite
uninteroperable. There are C bindings to LLVM that make it a lot easier but,
of course, they are far from complete.

So I can fully appreciate the desire to do something like this. However, my
recommendation would be to augment LLVM with better interop rather than
reimplement bits of it in other languages. So I would advise James to work on
a more language agnostic machine-readable format for LLVM's IR (e.g. XML
based) and contribute code to LLVM that lets it read and write that format as
well as the current human-readable form.

As we discussed before, something like an autogenerated XML-RPC server to the
whole of LLVM would be a much better solution offering easy interop with a
huge variety of languages without having to write and maintain all of these
bindings.

Jon -- you're right with your intuition. I also think the XML idea is great,
and something I might look into contributing when I have some more free
time.

For the time being I'm setting myself up in the /projects directory of the
source, and I'm going to have a play from there.

For what it's worth, I'm working on creating an LLVM backend for a small
stack machine, and so far it seems a very poor fit for any of LLVM's
CodeGen backends.

My current approach -- which may change, but is the best I've come up with
so far -- is to make an LLVM pass in C++ where I do nothing but spit
out the module in a target-specific format that I can easily read from
Python, where I've implemented the bulk of my actual backend. I'm doing
this largely to avoid writing an .ll or .bc parser (which wouldn't be
terrible, but I'm not sure how stable those formats really are).
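
The post does not describe the dump format itself, so the following is only a sketch of what the Python side of such a setup might look like. It assumes a made-up line-based format with `func`, `block` and `inst` records. Every record name and class name here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Block:
    name: str
    # Each instruction is stored as its whitespace-separated fields.
    instructions: List[List[str]] = field(default_factory=list)


@dataclass
class Function:
    name: str
    blocks: List[Block] = field(default_factory=list)


def read_dump(text: str) -> Dict[str, Function]:
    """Parse the hypothetical dump format into Function objects.

    Expected records, one per line:
        func <name>
        block <name>
        inst <opcode> [operands...]
    """
    functions: Dict[str, Function] = {}
    current_fn: Optional[Function] = None
    current_block: Optional[Block] = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        tag, rest = fields[0], fields[1:]
        if tag == "func" and len(rest) == 1:
            current_fn = Function(rest[0])
            functions[current_fn.name] = current_fn
            current_block = None
        elif tag == "block" and len(rest) == 1 and current_fn is not None:
            current_block = Block(rest[0])
            current_fn.blocks.append(current_block)
        elif tag == "inst" and rest and current_block is not None:
            current_block.instructions.append(rest)
        else:
            raise ValueError(f"line {lineno}: cannot parse {line!r}")
    return functions


if __name__ == "__main__":
    dump = "func add\nblock entry\ninst add %sum %a %b\ninst ret %sum\n"
    for fn in read_dump(dump).values():
        print(fn.name, [block.name for block in fn.blocks])
```

The appeal of this design is that the format is owned by the pass that writes it, so it only changes when that pass changes, independently of how the .ll or .bc formats evolve.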

Wesley J. Landaker wrote:

My current approach...

I was just looking at the documentation for writing an LLVM pass, and I
believe this is what I am going to do too. Once I've constructed my new IR,
I'm not completely sure whether I'll keep working "inside" LLVM, although I
know I'd like to generate LLVM assembly or bitcode at the other end of my
own IR's transformations and optimisations. It's nice to have options
though!