LLVM targeting HLLs

I am interested in using LLVM to translate C and C++ into high-level
language code. (This would be an update to an earlier project of mine,
Clue, which used the Sparse compiler library to do the same thing: it
targets Lua, Javascript, Perl 5, C, Java and Common Lisp, with a
disturbing amount of success. See http://cluecc.sourceforge.net for
details.)

The obvious place to start on this is the C backend, except in these 2.8
days the C backend is so hedged about with caveats I'm rather wary of
basing anything on it. I also recall seeing comments here that it's due
for a rewrite from scratch, and that various people were looking into
it. Can anyone go into more detail as to what exactly is wrong with the
C backend, and whether this rewrite is happening?

The other thing I could do is to use the LLVMTargetMachine and treat my
HLL as a low-level machine; this gets me a certain amount of good stuff
like register allocation and more optimisations, but the documentation
is still pretty basic (e.g.
http://wiki.llvm.org/Absolute_Minimum_Backend is three short paragraphs)
and I'm not certain as to whether LLVMTargetMachine is suitable. For
example: my HLL can largely be treated as a register machine with an
arbitrary number of registers. Can LLVMTargetMachine handle this?

You could create a different code generator from clang or use the rewriting
machinery?

-eric

If you're familiar with Sparse, then I strongly recommend basing this project on Clang ASTs, not basing it on LLVM IR.

-Chris

-- Send from my Jacquard Loom

A better approach would probably be to use Clang's CodeGen lib as inspiration, and write an equivalent that emitted your high-level language code instead of LLVM IR. For example, consider C++ classes:

When you convert these to LLVM IR, you lose all of the information about them other than their structure, and the vtable is explicitly created for the target ABI. Mapping them to something like JavaScript, you'd actually want to create a new prototype object for each class, with one slot for each field and another slot for each method (and some extra mixin-style stuff if you wanted to support multiple inheritance).
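As a rough sketch of that mapping (hand-written, not the output of any existing tool; the classes Shape and Circle and the helper describe are invented for illustration), single inheritance could go onto the prototype chain something like this:

```javascript
// Hypothetical C++ input, made up for this sketch:
//
//   class Shape  { public: double x, y; virtual double area() { return 0; } };
//   class Circle : public Shape { public: double r; double area() override; };

// Each class becomes a constructor; instances get one slot per field.
function Shape() {
  this.x = 0;
  this.y = 0;
}

// Each method becomes a slot on the prototype, so virtual dispatch is an
// ordinary property lookup instead of an ABI-specific vtable.
Shape.prototype.area = function () {
  return 0;
};

function Circle() {
  Shape.call(this); // initialise the inherited fields
  this.r = 0;
}

// Single inheritance maps onto the prototype chain.
Circle.prototype = Object.create(Shape.prototype);
Circle.prototype.constructor = Circle;

// Overriding a virtual method is just shadowing the prototype slot.
Circle.prototype.area = function () {
  return Math.PI * this.r * this.r;
};

// A call through a "base class pointer" still reaches the override.
function describe(shape) {
  return shape.area();
}
```

Multiple inheritance would need the extra mixin-style copying mentioned above; this sketch does not attempt it.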

The same is true even for pure C structures - you'd want to represent these as objects with named fields. This information is in the Clang AST, but it's lost by the time you get to LLVM IR. Taking an example from Apple's Foundation framework, you have two structures:

typedef struct
{
  CGFloat x, y;
} NSPoint;

typedef struct
{
  CGFloat width, height;
} NSSize;

In LLVM IR, these are both something like {double, double}. In JavaScript, you'd probably want something like:

function NSPoint()
{
  this.x = 0;
  this.y = 0;
}
function NSSize()
{
  this.width = 0;
  this.height = 0;
}

This is pretty simple to generate from the Clang AST, but will be a huge amount of effort to generate from LLVM IR.

David

-- Sent from my PDP-11

David Given <dg@cowlark.com> writes:

> The obvious place to start on this is the C backend, except in these 2.8
> days the C backend is so hedged about with caveats I'm rather wary of
> basing anything on it. I also recall seeing comments here that it's due
> for a rewrite from scratch, and that various people were looking into
> it. Can anyone go into more detail as to what exactly is wrong with the
> C backend, and whether this rewrite is happening?

The rewrite is happening. I've got the skeleton of the codegen done,
but I have to get it to build before I can check it in. After that,
everyone can start adding patterns.

The main problem with the current C backend is that there is no legalize
phase. So you end up seeing vector types and all sorts of non-C
nonsense. It's just overall much cleaner to generate code using the
generic framework.
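To illustrate what a legalize step buys you, here is a toy sketch of one kind of legalization, splitting a vector operation into scalar ones before emitting C. The instruction format and the names scalarize and emitC are invented for this example; none of it is LLVM's actual API or data model.

```javascript
// Toy instruction format, invented for this sketch:
//   { op, dst, lhs, rhs, lanes }   where lanes > 1 means a vector operation.

// Rewrite one vector instruction as one scalar instruction per lane.
// Scalar instructions pass through unchanged.
function scalarize(inst) {
  if (inst.lanes === 1) {
    return [inst];
  }
  var out = [];
  for (var i = 0; i < inst.lanes; i++) {
    out.push({
      op: inst.op,
      dst: inst.dst + "_" + i,
      lhs: inst.lhs + "_" + i,
      rhs: inst.rhs + "_" + i,
      lanes: 1
    });
  }
  return out;
}

// Emit a C statement for a scalar instruction only. After legalization the
// emitter never has to think about vector types.
function emitC(inst) {
  var symbols = { add: "+", mul: "*" };
  return inst.dst + " = " + inst.lhs + " " + symbols[inst.op] + " " + inst.rhs + ";";
}

// A 4-wide vector add becomes four plain C assignments.
var lines = scalarize({ op: "add", dst: "v2", lhs: "v0", rhs: "v1", lanes: 4 }).map(emitC);
```

Without such a step the emitter has to handle every vector type itself, which is where the non-C output comes from.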

> The other thing I could do is to use the LLVMTargetMachine and treat my
> HLL as a low-level machine; this gets me a certain amount of good stuff
> like register allocation and more optimisations, but the documentation
> is still pretty basic (e.g.
> http://wiki.llvm.org/Absolute_Minimum_Backend is three short paragraphs)
> and I'm not certain as to whether LLVMTargetMachine is suitable. For
> example: my HLL can largely be treated as a register machine with an
> arbitrary number of registers. Can LLVMTargetMachine handle this?

Once I get the new C backend checked in (next week, hopefully), it may
be helpful as a guide.

                            -Dave

Chris Lattner <clattner@apple.com> writes:

[...]

> The rewrite is happening. I've got the skeleton of the codegen done,
> but I have to get it to build before I can check it in. After that,
> everyone can start adding patterns.

Is the new C backend 'register' based, that is, generating lots of
little statements operating on lots of variables, rather than producing
the huge mangled expressions that the old one does? If so, that would be
ideal for what I want.
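To make the distinction concrete, here is a hand-written sketch of the same computation in both styles. It is not output from either C backend, and the function names are made up.

```javascript
// Expression style: the whole computation is one nested expression.
function hypotExpr(a, b) {
  return Math.sqrt((a * a) + (b * b));
}

// Register style: one small statement per operation, each writing its
// result to a temporary that later statements read.
function hypotRegs(a, b) {
  var r0, r1, r2, r3;
  r0 = a * a;
  r1 = b * b;
  r2 = r0 + r1;
  r3 = Math.sqrt(r2);
  return r3;
}
```

The second form maps directly onto a target that looks like a register machine with an unbounded number of registers.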

[...]

> Once I get the new C backend checked in (next week, hopefully), it may
> be helpful as a guide.

Excellent --- I'll wait for that, then. Will it be announced here?

[...]

> LLVM IR throws too much information away to target a HLL
> effectively.

The thing is, I explicitly don't want to use the Clang AST --- I'm not
interested in producing an idiomatic translation, merely a
fast-performing one. Clue in its current lousy state has proven that
this is possible; without any optimisation I'm getting C-to-Java at 60%
of native, and C-to-Luajit at 10%. I now want to see what sort of
results I get when applying LLVM's optimisations and some more
intelligence to the code generation. (Plus, Sparse is buggy and really
awkward to work with.)

For giggles, here's some example Javascript produced by Clue.

function _dtime(fp, stack) {
var sp;
var H0;
var H1;
var H2;
var H3;
var H4;
var state = 0;
for (;;) {
switch (state) {
case 0:
sp = 2;
sp = fp + sp;
H1 = null;
H0 = 0;
H2 = fp;
H3 = _gettimeofday;
H4 = H3(sp, stack, H2, stack, H0, H1);
H0 = fp;
H1 = stack[H0 + 0];
H0 = fp;
H2 = stack[H0 + 1];
H0 = 1000000.000000;
H3 = H2 / H0;
H0 = H1 + H3;
return H0;
} } }
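That function has a single basic block, so only case 0 appears. As a hand-written sketch in the same style (not actual Clue output; the function sumTo and the block numbering are made up), a function with a loop would presumably get one case per basic block, with branches turned into assignments to state:

```javascript
// Hypothetical C input:
//   int sum_to(int n) { int t = 0; for (int i = 0; i < n; i++) t += i; return t; }
function sumTo(n) {
  var H0; // running total
  var H1; // loop counter
  var H2; // result of the loop test
  var state = 0;
  for (;;) {
    switch (state) {
      case 0: // entry block: initialise and fall into the loop header
        H0 = 0;
        H1 = 0;
        state = 1;
        break;
      case 1: // loop header: conditional branch on i < n
        H2 = H1 < n;
        state = H2 ? 2 : 3;
        break;
      case 2: // loop body: accumulate, increment, jump back to the header
        H0 = H0 + H1;
        H1 = H1 + 1;
        state = 1;
        break;
      case 3: // exit block
        return H0;
    }
  }
}
```

Each break leaves the switch only, so the enclosing for (;;) dispatches to whichever block state now names.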

(PS. Can people please reply to the list instead of to me directly?)

David Given <dg@cowlark.com> writes:

> Is the new C backend 'register' based, that is, generating lots of
> little statements operating on lots of variables, rather than producing
> the huge mangled expressions that the old one does? If so, that would be
> ideal for what I want.

It will probably end up looking that way. I've considered adding a pass
to expand expressions but its only benefit would be aesthetic, so it's
not a high priority for me.

>> Once I get the new C backend checked in (next week, hopefully), it may
>> be helpful as a guide.

> Excellent --- I'll wait for that, then. Will it be announced here?

Absolutely. I will send it for a pre-checkin code review.

                        -Dave