seeking advice

I recently subscribed to the LLVM and Clang developer mailing lists, and earlier today read a message requesting advice on “Contributing to LLVM Projects”. I’m hoping to eventually get involved as well, but since my situation is very different from the one described in that thread, I thought I would solicit advice separately. If this is not the right place to ask, though, then please let me know and ignore the remainder of this message.

Right now, I only ‘really’ know Java, which I use in my current job, plus some Prolog, Haskell and Coq that I picked up during my linguistics studies (specializing in categorial grammar, supplemented with CS and AI courses on logic and type theory). I thus still have a long road ahead, even when it comes to bare theory, let alone implementation. Below is a detailed study plan for my spare time that I think will last me the next 1.5 years or so (naturally to be supplemented with the LLVM documentation), and I would be very happy to get any feedback on it.

Instead of beginning with C++, I started by studying C from K&R and (once finished) the book by Fraser and Hanson on LCC. After that, I have a modest reading list prepared of articles on SSA (see below), which I hope to combine with a study of the LLVM IR. Based on these readings, and as an exercise, I can take what I learnt from LCC and write a small compiler in C targeting the LLVM IR for, say, a subset of Prolog. Ideally, I should reach this stage by the end of this year.

While far from ‘done’ with C, I suppose that at this point I can also start picking up some C++ (I have in mind at least Accelerated C++ and the Effective (Modern) C++ books, coupled perhaps with Modern C++ Design for template metaprogramming), by which time I should be prepared to follow some of the developments around Clang and start digging into its code. All this, in turn, should keep me busy through at least the first half of next year.

I hope, however, to also come to thoroughly understand the LLVM backend, and I suppose that, besides following the literature on compiler optimizations, going through a book or two on computer architecture would give me a start for digging deeper into its source (I am currently thinking of Computer Systems: A Programmer’s Perspective). There will surely never be any shortage of good books to read (I am hoping in particular to also squeeze in Sedgewick’s Algorithms in C and a book on OS architecture somewhere), but all this should at least give me a good base from which to plan further, and hopefully eventually get involved somehow. Still, given my current state of knowledge and being limited to my spare time, I am guessing it will take at least five years or so to reach a lower-intermediate understanding of most of the important concepts involved.

As for the articles I had in mind for coming to grips with the basics of SSA, I came up with the following list, planning to gradually work my way up to an understanding of control and data dependence.

  1. Prosser '59: Applications of Boolean matrices to the analysis of flow diagrams (introduces the concept of dominators, I believe)
  2. Lowry & Medlock '69: Object code optimization
  3. Allen '70: Control flow analysis
  4. Allen & Cocke '76: A program data flow analysis procedure
  5. Ferrante et al. '87: The program dependence graph and its use in optimization
  6. Rosen et al. '88: Global value numbers and redundant computations
  7. Alpern et al. '88: Detecting equality of variables in programs
  8. Cytron et al. '91: Efficiently computing static single assignment form and the control dependence graph

Many thanks for taking the time to read through all this. I would be very happy to receive any feedback on these plans, and again, if this was not the right place to ask, then my apologies.

Kind regards,

Arno Bastenhof

> Based on these readings, as a matter of exercise I can take what I learnt
> from LCC to write a small compiler in C targeting the LLVM IR for, say, a
> subset of Prolog.

Hi Arno,

This is not a monumental task. There are many tutorials in the LLVM
tree, and producing IR from a simple AST is actually simple and
straightforward.
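As a minimal sketch of how direct that mapping is (plain C, no LLVM libraries; all names are my own, and materializing literals via `add i32 0, N` is a simplification a real front-end would avoid), a toy two-operator AST can be lowered to textual IR in a few dozen lines:

```c
#include <stdio.h>
#include <string.h>

/* Toy AST: integer literals and binary '+'/'*' nodes. */
typedef struct Node {
    char op;                /* '+', '*', or 0 for an integer literal */
    int value;              /* used when op == 0 */
    struct Node *lhs, *rhs; /* used otherwise */
} Node;

static char ir[1024];       /* the textual module being built */
static int next_reg;        /* next virtual register number */

/* Emit IR for a subtree; return the register holding its value. */
static int emit(const Node *n)
{
    char line[64];
    if (n->op == 0) {       /* literal: materialize into a register */
        int r = next_reg++;
        snprintf(line, sizeof line, "  %%%d = add i32 0, %d\n", r, n->value);
        strcat(ir, line);
        return r;
    }
    int l = emit(n->lhs), r = emit(n->rhs);
    int dst = next_reg++;
    snprintf(line, sizeof line, "  %%%d = %s i32 %%%d, %%%d\n",
             dst, n->op == '+' ? "add" : "mul", l, r);
    strcat(ir, line);
    return dst;
}

/* Build the module for (1 + 2) * 3 and return its text. */
const char *compile_example(void)
{
    Node one = {0, 1, NULL, NULL}, two = {0, 2, NULL, NULL},
         three = {0, 3, NULL, NULL};
    Node sum  = {'+', 0, &one, &two};
    Node prod = {'*', 0, &sum, &three};
    char tail[32];

    ir[0] = '\0';
    next_reg = 1;
    strcat(ir, "define i32 @main() {\n");
    snprintf(tail, sizeof tail, "  ret i32 %%%d\n}\n", emit(&prod));
    strcat(ir, tail);
    return ir;
}
```

Printing the result of `compile_example()` gives a module that `lli` will happily run; a real front-end would of course build IR through the LLVM C or C++ API (e.g. `IRBuilder`) rather than by string pasting, but the shape of the translation is the same.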

So many languages already target LLVM IR that it's unlikely you'll
find something that cannot be represented in it, or at least something
that other implementations haven't already encountered and worked
around.

With your broad knowledge in CS, I wouldn't expect you to take more
than a month or two to get a front-end working with the JIT compiler
on a reasonable subset of Prolog. Of course, the deeper you go, the
more time it takes to progress the same 1%, but you can always stop
when productivity drops as fast as enjoyment. :)

But you won't need to understand the back-end at all, modulo some
quirks on your platform of choice. A good trick is to write the same
logic in Prolog and C, compile the C with Clang, and see what they do.
If you spot hackery, the path of least resistance is to replicate the
hackery. Asking on the list is also a good way to understand why the
hackery is there, and it may even give you an entry point into changing
the middle or back end to support a proper construct in IR for better
codegen.

From there, I'd recommend trying different targets. Get some board,
device, or phone you have at your disposal (the Raspberry Pi 2 is a
great inexpensive kit), and make it work. It will take a lot less
effort than you're imagining...

cheers,
--renato

When I set out on my “Pascal Compiler project”, I had never written a compiler before - I had SOME experience writing interpreters and other parser-like projects. That’s about 14 months ago. I think I started around the end of February, and by the beginning of March, when I pushed the first commit to GitHub, I already had small programs working.

I only spend (some of my) evenings and weekends on this project, and there have been periods where I didn’t do any work for several weeks for one reason or another. I’m now at the stage where the parser and semantic analysis can handle iso7185pat.pas (“pat” = “Pascal Acceptance Test”). It’s not yet passing the code-gen phase - the current stumbling block is trampolining for nested function pointer calls, and I think there are going to be a few other problems along the way. Given that this file “tries to do everything the ISO 7185 standard allows”, it’s not surprising it finds a few interesting issues…

I have a few decent-sized home-grown tests, such as a sudoku solver that “works” - and because LLVM provides good optimisation, my compiler beats the “regular” Free Pascal compiler on all like-for-like comparisons I’ve made so far. In particular, the sudoku solver is about 2x faster with my compiler than with Free Pascal (OK, I cheat: some 5% of that comes from having implemented a “popcount” intrinsic that takes a “set” and gives the number of elements in it, which standard Pascal and Free Pascal don’t have).
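The popcount trick can be sketched as follows (illustrative C, not the actual Pascal compiler's code; all names are mine): representing a Pascal-style `set of 0..63` as a single 64-bit word makes cardinality one population count, which Clang lowers via the GCC/Clang builtin `__builtin_popcountll` to the `llvm.ctpop` intrinsic - a single hardware POPCNT where the target has one.

```c
#include <stdint.h>

/* A Pascal-style "set of 0..63" as a 64-bit bitmask (illustrative). */
typedef uint64_t Set64;

/* Return the set with element 'elem' (0..63) added. */
static inline Set64 set_add(Set64 s, unsigned elem)
{
    return s | ((uint64_t)1 << elem);
}

/* Cardinality of the set: a single popcount on most targets.
 * __builtin_popcountll is a GCC/Clang builtin, not standard C. */
static inline int set_card(Set64 s)
{
    return __builtin_popcountll(s);
}
```

So `set_card(set_add(set_add(0, 3), 7))` counts two elements without any loop over the bits.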

Of course, like Renato says, every small step towards “full functionality” gets harder - and the good thing with a hobby project is the ability to choose how much time you spend on which features, and how you implement them.

I thoroughly recommend “try it, see how it goes”. I’ve found it surprisingly easy… [Mostly because other people have done the REALLY hard part of making the LLVM IR to machine code phase! - Thanks guys!]

Good recommendations on the thread already, but let me add that you
don't _need_ to learn C or C++ to use LLVM; of course, contributions
would require familiarity with C++. You can use a library that talks
to LLVM from your favorite language, and if that's Java, this seems to
be the most up-to-date project with bindings:
https://github.com/robovm/robovm/tree/master/llvm

Let me also critique your reading list, if your goal is writing a
first compiler. Understanding SSA, optimizations, etc. is ten times
more lucid (and you appreciate it more) when you have seen some IR or
assembly and the impact a certain optimization or concept can have on
it. To be clear, your reading list is bang on for some topics, but I
feel you'll miss some "aha" moments if you haven't actually written
some code first-hand, or at least gone through some simple
code-generation steps (even if just by reading code).

In college, I took a course (and did well) on compiler optimizations
(program analysis, to be specific) and we went through pretty much all
the theory needed to build an optimizing compiler: CFG construction,
dataflow frameworks, SSA form, fixed points, lattices and abstract
interpretation, etc. But a few years on, without having used any of
it, this past year or so has been somewhat of a re-learning exercise.
The formal education there was absolutely critical for the concepts,
but I didn't get something tangible at the end of it.

If optimizations are really what you want to understand or implement,
might I suggest starting with one of the simplest ones: constant
folding. It's simple to implement, yet there is a feeling of
accomplishment.
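For what it's worth, here is a sketch of how small such a pass can be, as bottom-up constant folding over a toy expression AST (illustrative C; the names and the two-operator AST are my own invention, not LLVM's):

```c
#include <stddef.h>

/* Toy expression AST: integer literals and binary '+'/'*' nodes. */
typedef struct Expr {
    char op;               /* '+', '*', or 0 for an integer literal */
    int value;             /* used when op == 0 */
    struct Expr *lhs, *rhs;
} Expr;

/* Bottom-up constant folding: fold both children first; if both are
 * then literals, replace this node with the computed literal. */
void fold(Expr *e)
{
    if (e->op == 0)
        return;            /* literals are already folded */
    fold(e->lhs);
    fold(e->rhs);
    if (e->lhs->op == 0 && e->rhs->op == 0) {
        int v = (e->op == '+') ? e->lhs->value + e->rhs->value
                               : e->lhs->value * e->rhs->value;
        e->op = 0;         /* this node becomes a literal */
        e->value = v;
        e->lhs = e->rhs = NULL;
    }
}
```

Folding (2 + 3) * 4 collapses the whole tree to the single literal 20. LLVM itself does this sort of thing in its constant-folding utilities and in InstCombine, but the core principle really is this small.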

Finally, to be honest, since learning is an individual experience my
suggestions may not be helpful at all; but if you learn like I do ...
by doing, they just might.

P.S. - I have no idea what your level of CS understanding is, so
please don't be offended if what I've suggested is too naive or simple
for you.

Many thanks for the detailed replies and advice. If I may attempt to summarize: my plans as I described them were rather heavy on theory, and the foremost piece of advice, I think, is that theory settles into memory much better when it can be attached to prior experience. It is therefore better introduced gradually, after first gaining familiarity with the IR by writing a front-end for it, using the ample documentation already available. This will provide knowledge of something very concrete (the IR) that will make the implications of the more advanced concepts easier to grasp once I start learning about them, as well as provide a basis for digging more deeply (say, writing optimizations or even a back-end).

As for a concrete project, while most of this is still new to me, I did recently dabble a little with Prolog implementations, having written a parser and bytecode interpreter for a small fragment using a stack-based machine (then still in Java, though I would now like to practice with C). There are several papers showing how abstract machines for Prolog can be used as intermediate languages for actual compilation down to machine code (e.g., “GNU Prolog: Beyond compiling Prolog to C”), and targeting the LLVM IR instead should then not be too hard once the latter is well understood.

Again, many thanks for the inspiring advice and success stories. Also, Mukul, I’m very happy to receive any sort of advice and would never feel offended by it. (So, to repeat myself once more, many thanks!)

Kind regards,

Arno