GSoC 2012 Proposal: Python bindings for LLVM

Hello all,
Here is my GSoC 2012 proposal: Python bindings for LLVM. Any feedback is welcome!

Title: Python bindings for LLVM

Abstract: llvm-py provides Python bindings for LLVM. The latest llvm-py supports only Python 2.x and LLVM 2.x. This project aims to port llvm-py to LLVM 3 and to make it compatible with both Python 2.x and Python 3.

Motivation
LLVM is used as a static compiler backend and as a dynamic (JIT) backend on many platforms. It has a modular design and provides extensive optimization support. llvm-py provides Python bindings for LLVM [1]. The project began in 2008 and aims to expose enough of the LLVM API to implement a compiler backend in pure Python. The latest llvm-py works only with LLVM 2.x, not LLVM 3. Since LLVM 3 introduces several major changes, especially to its internal APIs, llvm-py must be updated to work with it. In addition, the current llvm-py supports only Python 2.x, not Python 3. Supporting Python 3 would make llvm-py more complete and open LLVM to more users, which in turn helps its development. This project therefore has two tasks: make llvm-py work with LLVM 3, and add Python 3 support.

Project Detail
Before writing this proposal, I read through the llvm-py source code to get a basic understanding of how it works, and wrote a short document analyzing how it is implemented (please see the appendix at the end of this proposal).
In this section, I list some details relevant to this project: first the changes needed to work with LLVM 3, then the changes needed for Python 3 support.

1. Working with LLVM 3
LLVM 3 changes some internal APIs, so the llvm-py code must be changed to be consistent with the modified APIs.
a. IR type system. The IR type system is reimplemented in LLVM 3. For instance, OpaqueType is gone, so this type should also be removed from llvm-py.
b. Value class. Two new subclasses of Value are added:
ConstantDataArray, an array constant;
ConstantDataVector, a vector constant.
llvm-py should expose them.
c. Instruction class. Four new subclasses of Instruction are added:
FenceInst, an instruction for ordering other memory operations;
AtomicCmpXchgInst, an instruction that atomically checks and exchanges values in a memory location;
AtomicRMWInst, an instruction that atomically reads a memory location, combines it with another value and stores the result back;
LandingPadInst, an instruction that holds the information necessary to generate correct exception handling.
llvm-py should support them.
d. Passes. Some passes are removed, for instance the LowerSetJmp pass. The corresponding API functions, such as LLVMAddLowerSetJmpPass, should also be removed from llvm-py.
e. PHINode. Two new functions are added to the PHINode class: block_begin and block_end. The list of incoming BasicBlocks can be accessed with these functions. At the same time, the reserveOperandSpace function is removed, so creating a PHINode now takes an extra argument that specifies how many operands to reserve space for.
When making llvm-py work with LLVM 3.0, we should focus on these changes. The list above may not be complete; I will cover more changes during the project.
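One possible way to keep track of such removals on the Python side is a small table of wrapper names per LLVM major version. The sketch below is only an illustration: the function exported_wrappers and the table REMOVED_IN_LLVM3 are hypothetical names, not part of llvm-py, and the only entry comes from item d above.

```python
# Hypothetical sketch: the table and helper are illustrative, not llvm-py code.
# The single entry is the pass wrapper named in item d above.
REMOVED_IN_LLVM3 = {"LLVMAddLowerSetJmpPass"}


def exported_wrappers(candidates, llvm_major):
    """Return the wrapper names to expose for the given LLVM major version."""
    if llvm_major >= 3:
        return [name for name in candidates if name not in REMOVED_IN_LLVM3]
    return list(candidates)


names = ["LLVMAddFunction", "LLVMAddLowerSetJmpPass"]
print(exported_wrappers(names, 2))  # both names are kept for LLVM 2.x
print(exported_wrappers(names, 3))  # the LowerSetJmp wrapper is dropped
```

Whether llvm-py should keep supporting LLVM 2.x behind such a gate, or simply drop the removed wrappers, is a design decision to settle during the project.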

2. Python 3 support
When adding support for Python 3, we should also pay attention to the C API changes between Python 2.x and Python 3. Here I list some of them.

  1. Extension module initialization and finalization (PEP 3121) [2]
    In Python 3, the module initialization function should have the form:
    PyObject *PyInit_<modulename>(void)
    When creating a module, a pointer to a struct PyModuleDef should be passed as a parameter.
  2. Making PyObject_HEAD conform to standard C (PEP 3123) [3]
    Some macros are added, for instance Py_TYPE, Py_REFCNT and Py_SIZE. So an expression such as func->ob_type->tp_name in Python 2.x should be replaced with Py_TYPE(func)->tp_name in Python 3.
  3. Byte vectors and String/Unicode Unification (PEP 0332) [4]
    The Python 2 str type and its accompanying C support functions (PyString_*) are gone. Byte strings are handled by the bytes type (PyBytes_*), and text strings are Unicode (PyUnicode_*).

When supporting Python 3 in llvm-py, we should focus on these C API changes.
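The text/bytes split also affects the Python layer wherever a name is passed down to a C wrapper. As a minimal sketch for Python 3, assuming a hypothetical helper to_text (not part of llvm-py), names could be normalized to text before they reach the wrapper:

```python
def to_text(name, encoding="utf-8"):
    """Return `name` as a Python 3 text string (str).

    Hypothetical helper for illustration only; llvm-py does not define it.
    """
    if isinstance(name, bytes):
        # Byte strings must be decoded explicitly in Python 3.
        return name.decode(encoding)
    if isinstance(name, str):
        return name
    raise TypeError("expected str or bytes, got %s" % type(name).__name__)


print(to_text("sum"))
print(to_text(b"sum"))
```

The sketch normalizes to str rather than bytes on the assumption that the wrappers keep parsing names with a text-oriented format code; if a wrapper were changed to expect bytes, the conversion would go the other way.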

Timeline

Before the coding period starts, I will study the llvm-py source code in depth and read the LLVM 3 documentation and code, to speed up the project.

The coding period is divided into two stages: before the midterm evaluation, I will port llvm-py to LLVM 3; after the midterm, I will add Python 3 support to llvm-py.

May 21 ~ May 27 Support IR Type System for LLVM 3
May 28 ~ June 3 Support new Value sub classes and instruction sub classes
June 4 ~ June 10 Deal with Pass Framework
June 11 ~ June 17 Improve PHINode class support.
June 18 ~ June 24 Deal with other features, such as intrinsics.
June 25 ~ July 1 Test LLVM 3 support and get it into good shape.
July 2 ~ July 8 Document LLVM 3 support for llvm-py.
July 9 ~ July 15 Midterm evaluation.
July 16 ~ July 22 Add Python 3 support and make it basically work.
July 23 ~ July 29 Debug and improve Python 3 support.
July 30 ~ August 5 Test Python 3 support and get it into good shape.
August 6 ~ August 12 Document Python 3 support.

Project experience

In GSoC 2009, I took part in a project to support the Scilab language in SWIG [5]. I added a Scilab backend module to SWIG that supports all the C features: variables, functions, constants, enums, structs, unions, pointers and arrays.

In GSoC 2010, I also successfully finished a project called “epfs” [6], which stands for embedding Python from Scilab. This project introduces a mechanism to load and use Python code from Scilab.

I have about one year of experience with LLVM. I use it mainly to implement control flow integrity for operating systems and thus improve system security.
I recently submitted a patch to the Target.h file to improve compatibility with SWIG, which has been applied to trunk.

Biography
Name: Baozeng Ding
University: Institute of Software, Chinese Academy of Science
Email: sploving1@gmail.com
IRC name: sploving

References
[1]. http://code.google.com/p/llvm-py/
[2]. http://www.python.org/dev/peps/pep-3121/
[3]. http://www.python.org/dev/peps/pep-3123/
[4]. http://www.python.org/dev/peps/pep-0332/
[5]. http://code.google.com/p/google-summer-of-code-2009-swig/downloads/list
[6]. http://forge.scilab.org/index.php/p/epfs/

Appendix

llvm-py Implementation

Here I give a small example to show the relationship between the Python function in llvm-py and the C function in LLVM.

Let us analyze an example in llvm-py:

f_sum = my_module.add_function(ty_func, "sum")

How does the above statement end up calling the LLVM C function?

The llvm-py package has six modules, of which the most important is the core module, consisting of the following files:

core.py   high-level support code
_core.c   low-level wrapper code for the LLVM Core libraries
wrap.h    header files needed by the low-level wrapper code

In core.py, there is a class “Module”, which has a method “add_function”, defined as follows:

def add_function(self, ty, name):
    """Add a function of given type with given name."""
    return Function.new(self, ty, name)

This method calls the constructor of class “Function” (Function.new). Let’s take a look at this constructor. It is also defined in core.py in llvm-py, as follows:

class Function(GlobalValue):

    @staticmethod
    def new(module, func_ty, name):
        check_is_module(module)
        check_is_type(func_ty)
        return _make_value(_core.LLVMAddFunction(module.ptr, name,
                                                 func_ty.ptr))

The most important statement in the above constructor is:

_core.LLVMAddFunction(module.ptr, name, func_ty.ptr)

If you are familiar with C extensions for Python, you might guess that LLVMAddFunction is defined in the low-level wrapper file _core.c. Let’s find out how it is defined there.
In _core.c, the following statements are what we are looking for:

static PyMethodDef core_methods[] = {
    ...
    /* Functions */
    _method( LLVMAddFunction )
    ...
};

LLVMAddFunction is registered through the macro _method. What does this macro mean? It is defined in _core.c:

#define _method( func ) { # func , _w ## func , METH_VARARGS },

In the above macro, func is the name used in Python, and _w ## func is the name of the corresponding wrapper function. That is, when we call a function func in Python, it actually calls the C wrapper function _w ## func. So when we use the LLVMAddFunction method in Python, it actually calls _wLLVMAddFunction. Then how is _wLLVMAddFunction defined?

Also in _core.c, there is a statement related to LLVMAddFunction:

_wrap_objstrobj2obj(LLVMAddFunction, LLVMModuleRef, LLVMTypeRef, LLVMValueRef)

This macro is defined in wrap.h:

/**
 * Wrap LLVM functions of the type
 * outtype func(intype1 arg1, const char *arg2, intype3 arg3)
 */
#define _wrap_objstrobj2obj(func, intype1, intype3, outtype)    \
static PyObject *                                               \
_w ## func (PyObject *self, PyObject *args)                     \
{                                                               \
    PyObject *obj1, *obj3;                                      \
    intype1 arg1;                                               \
    const char *arg2;                                           \
    intype3 arg3;                                               \
                                                                \
    if (!PyArg_ParseTuple(args, "OsO", &obj1, &arg2, &obj3))    \
        return NULL;                                            \
                                                                \
    arg1 = ( intype1 ) PyCObject_AsVoidPtr(obj1);               \
    arg3 = ( intype3 ) PyCObject_AsVoidPtr(obj3);               \
                                                                \
    return ctor_ ## outtype ( func (arg1, arg2, arg3));         \
}

After macro expansion, the above statement becomes:

static PyObject *
_wLLVMAddFunction (PyObject *self, PyObject *args) /* This is what we are looking for! */
{
    PyObject *obj1, *obj3;
    LLVMModuleRef arg1;
    const char *arg2;
    LLVMTypeRef arg3;

    if (!PyArg_ParseTuple(args, "OsO", &obj1, &arg2, &obj3))
        return NULL;

    arg1 = ( LLVMModuleRef ) PyCObject_AsVoidPtr(obj1);
    arg3 = ( LLVMTypeRef ) PyCObject_AsVoidPtr(obj3);

    return ctor_LLVMValueRef( LLVMAddFunction (arg1, arg2, arg3));
}

We get the function _wLLVMAddFunction that we are looking for. As shown in the last statement of this function:

return ctor_LLVMValueRef( LLVMAddFunction (arg1, arg2, arg3));

we finally reach the C function that my_module.add_function in the example calls: LLVMAddFunction, which is declared in the file Core.h of the LLVM libraries.

LLVMValueRef LLVMAddFunction(LLVMModuleRef M, const char *Name, LLVMTypeRef FunctionTy);