LLVM for dynamic languages

How suitable do the developers think that LLVM would be as a code-generator
for a dynamically typed langage? Would the lack of static type information
make the traditional code optimizations performed by LLVM relatively
ineffective?

Sincerely,
  Rayiner Hashem

LLVM is designed to be able to support "any" language, efficiently,
without restriction. However, the mapping for some languages may not be a
clean as for others (lots of casting might be involved for example). That
said, many dynamic languages should be cleanly mappable to the LLVM layer.
What are you thinking about in particular?

As far as the optimizations already in LLVM: LLVM provides an excellent
infrastructure for interprocedural optimizations. As such, I would
recommend writing some language tuned optimizations (as necessary) that
would lower the dynamic objects into the primitives the optimizers
already understand.

For example, think of a language like smalltalk (the most dynamic language
that I know of). Say you have a program that looks like this:

(8*(4/2)) print

The LLVM constant folding optimizations will not be able to fold these
constants, because they are actually instances of the SmallInteger class,
which have dynamically dispatched messages to perform the operations.

Now, however, if you preceed the standard LLVM optimizations with some
simple partial evaluation/resolution transformations, you can show that *
and / will resolve to the implementations in the SmallInteger class. As
such, you can convert it to a direct call, have it inlined, and the
objects themselves would be turned into primitives.

If you can express the method dispatch in a way that is useful to the
preexisting optimizations, of course, you wouldn't have to do anything
special to support this (C++/Java v-tables fit into the catagory).

Basically the idea is that some optimizations are much more important to
certain languages than others. In this case, it's method resolution, for
functional languages, it's tail call elimination, for C++ it's scalar
replacement of aggregates, etc.

Since LLVM provides the facilities needed to efficiently implement these
types of optimizations, it is a pretty good infrastructure for doing them
in, even though it currently does not support "all possible" optimizations
out-of-the-box (yet :slight_smile:

Please let me know if I have confused the issue or missed your question
completely. :slight_smile:

-Chris

That said, many dynamic languages should be cleanly mappable to the LLVM
layer. What are you thinking about in particular?

Not thinking of one in particular right now, but it would probably something
like an object-oriented Scheme.

recommend writing some language tuned optimizations (as necessary) that
would lower the dynamic objects into the primitives the optimizers
already understand.

Yep. I'm assuming that a language-specific optimizer would handle things like
type inference, so the llvm code would either see plain primitives or a
dynamic representation:

struct object {
  int typecode;
};

struct integer {
  int typecode;
  int value;
};

Thus, the calling code could manipulate integers with pointers to objects.

If you can express the method dispatch in a way that is useful to the
preexisting optimizations, of course, you wouldn't have to do anything
special to support this (C++/Java v-tables fit into the catagory).

As far as I can see, casting needs to be done only once. All the calling code
knows is that it has a pointer to an object. After using the typecode to
lookup the method to call, it calls a given function, passing it the object
parameter. This function knows the precise llvm type of the object, so it
needs to cast the object* to (say) an integer*. Come to think of it, C++/Java
code needs to do the same thing, though the dispatch mechanism is different.
Since llvm doesn't appear to have any notion of inheritence, C++/Java virtual
methods should need to cast their 'this' pointer to the precise type. I
presume this doesn't interfere too much with the other optimizations?

Sincerely,
  Rayiner Hashem

> That said, many dynamic languages should be cleanly mappable to the LLVM
> layer. What are you thinking about in particular?
Not thinking of one in particular right now, but it would probably something
like an object-oriented Scheme.

Ok.

> recommend writing some language tuned optimizations (as necessary) that
> would lower the dynamic objects into the primitives the optimizers
> already understand.

Yep. I'm assuming that a language-specific optimizer would handle things like
type inference, so the llvm code would either see plain primitives or a
dynamic representation:

struct object {
  int typecode;
};

struct integer {
  int typecode;
  int value;
};

Exposing a dynamic representation like this is a good idea. In fact, if
you do so like this:

struct integer {
  struct object object_base;
  int value;
}

There won't be as much casting in the LLVM code, and a lot of the
optimizers will do good things for you automatically (like limited type
inference).

Thus, the calling code could manipulate integers with pointers to objects.

Yup, or a variety of other potential "subclasses".

> If you can express the method dispatch in a way that is useful to the
> preexisting optimizations, of course, you wouldn't have to do anything
> special to support this (C++/Java v-tables fit into the catagory).

As far as I can see, casting needs to be done only once. All the calling code
knows is that it has a pointer to an object. After using the typecode to
lookup the method to call, it calls a given function, passing it the object
parameter. This function knows the precise llvm type of the object, so it
needs to cast the object* to (say) an integer*.

Yup. In general, up-casts don't require pointer arithmetic (if the
"classes" are structured like above), only down-casts do.

Come to think of it, C++/Java code needs to do the same thing, though
the dispatch mechanism is different. Since llvm doesn't appear to have
any notion of inheritence, C++/Java virtual methods should need to cast
their 'this' pointer to the precise type.

Yes, in exactly the same way.

I presume this doesn't interfere too much with the other optimizations?

Not so far. I have toyed with the idea of adding a "safe_cast" operation
or something, which the language front-end guarantees to the optimizer
that the resultant type is correct. For something like a dynamic_cast in
C++ and virtual method dispatch in many languages, it would be very
useful. e.g.:

Derived *X = dynamic_cast<Derived*>(base);

would become LLVM code:

  %cond = bool isInstanceOfDerived(base* %base)
   br bool %cond, label %Yes, label %Cont
Yes:
  ; This is guaranteed to be correct by checks the front-end does
  %Casted = safe_cast base* %base to %Derived*
  br label %Cont
Cont:
  %X = phi [%Casted], [null]

This is something that I have thought about adding in the future, but it
is not important enough to our short term goals to add. In particular,
there is a variety of other types of "front-end hints" which can be passed
into LLVM, which would be useful to represent in a structured way. Here
are some ideas:

1. Casts that are known correct
2. Functions that never "throw", return, access memory, ...
3. Aliasing properties guaranteed by the language (e.g., parameters in
   Fortran cannot alias, tbaa in C, mutable and immutable objects cannot
   alias in ML, ...)

It's very important that we don't pollute LLVM with a bunch of ad-hoc
wierd stuff though, so we have focused on the core issues, such as the
design, stability, and basic implementation. In the future, I expect that
some (small, well thought out) extensions can be used to address these
issues...

-Chris