Using LLVM for a dynamically typed language

I recently ran into the following problem.

I'm prototyping a compiler for a dynamically typed language in which
functions are first class objects. Assuming I have something like
this:

if(rand() > 5)
  i = define(x, y, z) { return x + y + z; }
else
  i = define(x, y) { return x + y; }

At this point I cannot know the type of 'i' at compile time. At
runtime 'i' is a structure that contains a type and a function
pointer. What I can't figure out is how to cast my llvm function
pointer to an appropriate function type. I cannot know until runtime
what the type will be. Naturally I can add runtime code that will do
all appropriate checks (number of parameters, etc.) but how do I do
the actual cast?

I hope I'm being clear.

Thanks,
- Slava.

I'm not sure that I understand. LLVM permits you to cast any pointer to
any other pointer. If you are using LLVM to generate your code, this
basically has 3 steps:

1. Get a Type* for the function type you want to cast to.
2. Use a cast instruction to actually perform the type cast.

In the code I'm currently working on, I do the following to cast a
function pointer from one type to another. Hopefully the line wrapping
won't totally destroy this.

std::vector<const Type*> paramTypes;

// First parameter: a void pointer
paramTypes.push_back( PointerType::get( Type::getPrimitiveType( Type::SByteTyID ) ) );
    
FunctionType* pthreadFunctionType = FunctionType::get(
  PointerType::get( Type::getPrimitiveType( Type::SByteTyID ) ),
  paramTypes, false );
assert( pthreadFunctionType != NULL );

// Cast the function to the pthread function type
Value* castedFunction = new CastInst( function,
  PointerType::get( pthreadFunctionType ), "", &terminator );

I hope this helps,

Evan Jones

Evan,

The problem is that I do not know the type of a target function at
compile time. If you consider my code example, I don't know the type
of 'i' until runtime (in fact, I can't even know a possible range of
types 'i' may assume).

Thanks,
- Slava.

But... in a dynamic language each variable must have an associated type
information? So, why can't you use LLVM struct consisting of
- integer type code (or anything else you like)
- pointer to void which holds the actual data.

What you're doing is rather interesting. There's IronPython which implements
Python (a highly dynamic language) on top of .NET and which claims to be 2x
faster than regular Python. However, I never seen any implementation details
so don't know how dynamic features are dealt with.

- Volodya

I don't think I properly understand what the issue is here. For my
perspective, if you can figure out the type of 'i' at runtime, you
either have to:

a) Make all functions the same type. For example, make them all return
void, take a vector of parameters as the first argument, and a vector
for return values as the second argument.

b) Call the appropriate LLVM functions to emit the code for the cast at
runtime.

I don't see how this is a specific challenge with LLVM. It seems to me
that this is a challenge that you will encounter when implementing a
dynamic language in *any* low-level language. I would encourage you to
prototype something in C, and then figure out how to translate that to
use LLVM. There are lots of dynamic languages you can use as examples.
Python would be appropriate here, since it has function objects. There
are also likely to be some textbooks that discuss these sorts of
implementation challenges. As compilers are not my research area, I
can't recommend any.

Good luck,

Evan Jones

I am not an expert about this, but I have done a bit of reading about
IronPython and Jython. As far as I am aware of, these issues are dealt
with in the same way that they are handled in CPython. Basically, when
they compile Python code they just emit the sequence of function calls
that the interpreter would make. Thus, doing something like this:

pyobject.foo += 1

Turns into something like this:

PythonObject temp = pyobject.get_attr( 'foo' );
Array parameters = [ temp, new PythonInteger( 1 ) ];
PythonObject addMethod = temp.get_attr( '__add__' );
PythonObject result = addMethod.call( parameters );
pyobject.set_attr( 'foo', result );

This ends up executing much faster than having to go around an
interpreter loop multiple times. You could actually use the same
approach to compile Python to native code.

Evan Jones

a) Make all functions the same type. For example, make them all return
void, take a vector of parameters as the first argument, and a vector
for return values as the second argument.

This is something I was considering. I guess I'll end up going with this option.

I don't see how this is a specific challenge with LLVM. It seems to me
that this is a challenge that you will encounter when implementing a
dynamic language in *any* low-level language.

I disagree. If I could push a bunch of arguments on a stack (or
specify a list of arguments, etc.) and just use a "call" instruction
with a pointer to a memory address I wouldn't run into this problem.
This is a specific challenge with LLVM because it is strictly typed.

Ah! Right. You can't manually construct a function call on the stack
with LLVM. However, an alternative worth considering would be to use
varargs functions. This would basically compile to native code that just
stuffs a bunch of things on the stack.

As in your example, the function itself knows the number and type of
parameters that are expected. Before executing the "body" of the
function, it can verify that things are being passed as expected and
raise an exception if the are not.

This is basically similar to Chris's suggestion of using a structure to
pass the arguments, except you don't actually need to construct the
structure itself.

Evan Jones

I agree with Evan. To be more specific, for your initial example:

if(rand() > 5)
   i = define(x, y, z) { return x + y + z; }
else
   i = define(x, y) { return x + y; }

As you've noticed here, the result of a "define" has to have the same LLVM type. I suggest compiling this to the equivalent of this C code:

void anon1(int *Result, void *Args) {
   RealArgs = (sometype*)Args;
   *Result = RealArgs->X + RealArgs->Y + RealArgs->Z;
}

void anon2(int *Result, void *Args) {
   RealArgs = (sometype2*)Args;
   *Result = RealArgs->X + RealArgs->Y;
}

Of course, the result pointer would want to be generic somehow, but I just made it int* to simplify the issue. Despite this, "i" can now just be a function pointer. If you have closures to deal with, you make each function a pair of function pointer and environment pointer, and pass them around together.

-Chris

I disagree. If I could push a bunch of arguments on a stack (or
specify a list of arguments, etc.) and just use a "call" instruction
with a pointer to a memory address I wouldn't run into this problem.
This is a specific challenge with LLVM because it is strictly typed.

Ah! Right. You can't manually construct a function call on the stack
with LLVM. However, an alternative worth considering would be to use
varargs functions. This would basically compile to native code that just
stuffs a bunch of things on the stack.

I strongly recommend avoiding varargs functions if possible. It would be better to pass an explicit list/array of parameters instead. The optimizer is severely hobbled for varargs functions, but it does a good job of ripping appart simple arrays and structures.

As in your example, the function itself knows the number and type of
parameters that are expected. Before executing the "body" of the
function, it can verify that things are being passed as expected and
raise an exception if the are not.

I agree. The structure/array passed in should indicate how many params there are and the callee should verify things are happy before continuing. Alternatively, instead of putting the # args in the array, you could pass it in as an explicit extra argument, which would help out the optimizer even more.

-Chris