Python bindings available.

Hi all,

I'd like to announce the availability of Python bindings for LLVM.

It is built on top of llvm-c, and currently exposes enough of the API to
build an in-memory IR (and dump it!). It needs the latest LLVM (2.3) and
Python 2.5 (2.4 should be sufficient, but I haven't tested it). It has
been tested only on Linux/i386.

Would love to hear your comments.

[Needless to say, it's all work in progress, but mostly it works as
expected. More tests, documentation and APIs will follow.]

It's all here: http://mdevan.nfshost.com/llvm-py.html

Thanks & Regards,
-Mahadevan.

Hi Mahadevan,

Very nice! The OO syntax is pleasantly succinct. :)

Constant.string(value, dont_null_terminate) -- value is a string
Constant.struct(consts, packed) -- a struct, consts is a list of other constants, packed is boolean

I did this in Ocaml initially, but found the boolean constants pretty confusing to read in code. I kept asking "What's that random true doing there?" Therefore, the bindings expose these as const_string/const_stringz and const_struct/const_packed_struct respectively. I figure the user can always write her own in the (very) rare cases that it is necessary to conditionalize such things:

     let const_string_maybez nullterm =
       if nullterm then const_stringz else const_string
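
The same trick carries straight over to Python. Here is a sketch with stub constructors standing in for the real bindings (const_string/const_stringz below are fakes, not llvm-py calls):

```python
def const_string(s):
    # Stub: stands in for the real const_string binding.
    return ("str", s)

def const_stringz(s):
    # Stub: stands in for the real const_stringz binding.
    return ("strz", s + "\0")

def const_string_maybez(nullterm):
    # Pick the constructor once, instead of threading a boolean
    # argument through every call site.
    return const_stringz if nullterm else const_string

make = const_string_maybez(True)
assert make("hi") == ("strz", "hi\0")
```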

Memory Buffer and Type Handles are not yet implemented.

:) Type handles in particular are very important. You can't form a recursive type without using them, so you can't build any sort of data structure.

Builder wraps an llvm::IRBuilder object. It is created with the static method new (builder = Builder.new()).

Uninitialized builders are very dangerous (they leak instructions if you use them), so you might want to add overloads for new in order to avoid boilerplate code.

It can be positioned using the methods position(block, instr=None), position_before(instr) and position_at_end(block).

There's an "IR navigator" concept you can implement to avoid writing so many overloads here. It provides a complete "position" or "iterator" concept. It's not entirely explicit in the C bindings—it would be memory-inefficient if it were. But you can build it atop them easily. It's useful whenever the C bindings have Before/AtEnd functions, and you can implement it wherever you see First/Last/Next/Prev functions. The C bindings support this for functions, global variables, arguments, basic blocks, and instructions.

In Ocaml, we coded it up using a variant type, like (Before element | At_end parent). The basic operations for forward iteration are Parent.begin and Element.succ, which were implemented like this:

     Parent.begin =
       if this.first_element is null
         return At_end this
       else
         return Before this.first_element

     Element.succ =
       if this.next_element is null
         return At_end this.parent
       else
         return Before this.next_element

Then the user could build many IR navigation algorithms. The simplest one, "for each", is thus:

     for_elements(parent, callback) =
       pos = parent.begin
       loop
         match pos with
         > At_end _ -> break
         > Before element ->
             callback(element)
             pos = element.succ

     for_elements(parent, do_my_thing)

This representation was idiomatic in a functional language because it's compatible with recursion (you can translate for_elements into a tail recursive loop), but perhaps an enumerator class would be more idiomatic in Python:

     for_elements(parent, callback) =
       pos = parent.begin
       while pos.has_next()
         callback(pos.current)
         pos.advance()
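
For what it's worth, the enumerator idea can be made concrete in a self-contained Python sketch, using a toy parent/element pair in place of the real IR (all names here are illustrative, not llvm-py's API):

```python
class Element(object):
    def __init__(self, name, parent):
        self.name, self.parent, self.next = name, parent, None

class Parent(object):
    def __init__(self):
        self.first = self.last = None

    def append(self, name):
        e = Element(name, self)
        if self.last:
            self.last.next = e
        else:
            self.first = e
        self.last = e
        return e

class Position(object):
    """Either 'before' an element, or 'at the end' of a parent."""
    def __init__(self, parent, element):
        self.parent = parent
        self.element = element          # None means "at end"

    def has_next(self):
        return self.element is not None

    @property
    def current(self):
        return self.element

    def advance(self):
        self.element = self.element.next

def begin(parent):
    # Parent.begin from the OCaml sketch.
    return Position(parent, parent.first)

def for_elements(parent, callback):
    pos = begin(parent)
    while pos.has_next():
        callback(pos.current)
        pos.advance()

p = Parent()
for name in ("a", "b", "c"):
    p.append(name)
seen = []
for_elements(p, lambda e: seen.append(e.name))
assert seen == ["a", "b", "c"]
```

Because the position always carries its parent, an empty parent yields a position that is immediately at-end, with no special-casing by the caller.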

The upshot, aside from being able to iterate the IR, was that it's easy to create builders anywhere with just one overload:

     // At the start or end of a BB:
     Builder.new(At_end bb)
     Builder.new(bb.begin)

     // Before or after a given instruction:
     Builder.new(Before instr)
     Builder.new(instr.succ)

This is actually more succinct than C++ because unlike BasicBlock::iterator, the position always knows its parent element (it's either parent or element.parent), so there's no need to pass it in separately as in builder.position(block, instr). Also, this could return a precise position:

The current block is returned via the r/o property insert_block.

Finally, just as the C++ STL has reverse_iterator, it did prove necessary to have a separate (At_begin parent | After element) type in order to walk the IR backwards.

Cheers,
Gordon

Hi Gordon,

Thanks for your comments.

> Constant.string(value, dont_null_terminate) -- value is a string
> Constant.struct(consts, packed) -- a struct, consts is a list of
> other constants, packed is boolean

I did this in Ocaml initially, but found the boolean constants pretty
confusing to read in code. I kept asking "What's that random true
doing there?" Therefore, the bindings expose these as const_string/
const_stringz and const_struct/const_packed_struct respectively.

OK, will do.

:) Type handles in particular are very important. You can't form a
recursive type without using them, so you can't build any sort of data
structure.

On it already. BTW, where can I find a good example of how to use
it?

Uninitialized builders are very dangerous (they leak instructions if
you use them), so you might want to add overloads for new in order to
avoid boilerplate code.

By 'uninitialized', I guess you're referring to builders that are not
yet positioned on a block/instruction? Maybe it makes more sense to
create one 'from' a block, something like:

builder = basic_block_obj.builder()

with it being positioned at the end of the block by default. But then,
your ocaml syntax is much cleaner:

     // At the start or end of a BB:
     Builder.new(At_end bb)
     Builder.new(bb.begin)

     // Before or after a given instruction:
     Builder.new(Before instr)
     Builder.new(instr.succ)

so I'll see how this can be done a bit, ah, Pythonically.

Finally, just as the C++ STL has reverse_iterator, it did prove
necessary to have a separate (At_begin parent | After element) type in
order to walk the IR backwards.

Well, it's possible to do:

for inst in reversed(block.instructions):
  # do stuff with inst

which will iterate backwards over the instructions of a block.

Thanks & Regards,
-Mahadevan.

Hi Gordon,

Thanks for your comments.

No problem.

:) Type handles in particular are very important. You can't form a recursive type without using them, so you can't build any sort of data structure.

On it already. BTW, where can I find a good example of how to use it?

To close the loop with C++ syntax, LLVMTypeHandleRef is really a pointer to a heap-allocated PATypeHolder, which is discussed here:

   http://llvm.org/docs/ProgrammersManual.html#BuildRecType

The file test/Bindings/Ocaml/vmcore.ml contains this fragment:

   (* RUN: grep -v {RecursiveTy.*RecursiveTy} < %t.ll
    *)
   let ty = opaque_type () in
   let th = handle_to_type ty in
   refine_type ty (pointer_type ty);
   let ty = type_of_handle th in
   insist (define_type_name "RecursiveTy" ty m);
   insist (ty == element_type ty)

This constructs %RecursiveTy = type %RecursiveTy*.
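
The mechanism can also be modeled in plain Python as a list of uses that get patched when the opaque type is refined. This is only a toy model of what PATypeHolder does, not llvm-py's API:

```python
class OpaqueType(object):
    def __init__(self):
        self.uses = []                  # (object, attribute) pairs to patch

    def refine_to(self, new_type):
        # The opaque type dies; every recorded use is redirected.
        for obj, attr in self.uses:
            setattr(obj, attr, new_type)

class PointerType(object):
    def __init__(self, pointee):
        self.pointee = pointee
        if isinstance(pointee, OpaqueType):
            pointee.uses.append((self, "pointee"))

class TypeHandle(object):
    """Toy PATypeHolder: keeps following a type through refinement."""
    def __init__(self, ty):
        self.type = ty
        if isinstance(ty, OpaqueType):
            ty.uses.append((self, "type"))

# Build the recursive type %RecursiveTy = type %RecursiveTy*:
opaque = OpaqueType()
th = TypeHandle(opaque)
opaque.refine_to(PointerType(opaque))
ty = th.type                            # the handle survived the refinement
assert isinstance(ty, PointerType)
assert ty.pointee is ty                 # the type now points to itself
```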

Finally, just as the C++ STL has reverse_iterator, it did prove necessary to have a separate (At_begin parent | After element) type in order to walk the IR backwards.

Well, it's possible to do:

for inst in reversed(block.instructions):
  # do stuff with inst

which will iterate backwards over the instructions of a block.

Certainly. There are advantages to using the "up, prev, next" pointers embedded within the IR in some circumstances, though. Consider what happens if 'do stuff with inst' entails deleting another, related instruction which might appear later in reversed(block.instructions) -- dangling pointer, boom! :) On the other hand, creating a copy, as your code presumably does, is also sometimes exactly what's wanted... TMTOWTDI, as they say.

— Gordon

Hi Mahadevan,

One more thing I noticed that may be a problem. Automatic finalizers like this one are very dangerous when cooperating with the C++ object model:

void dtor_LLVMModuleRef(void *p)
{
    LLVMModuleRef m = (LLVMModuleRef)p;
    LLVMDisposeModule(m);
}

Consider the case where a function creates and populates a Module, stuffs it in an ExistingModuleProvider for the JIT, then returns the ModuleProvider, dropping its direct reference to the Module. (ModuleProvider takes ownership of the Module.) I presume that your Python object is under the impression it owns the Module; when it goes out of scope, its refcount goes to zero and it invokes its dtor, disposing of the Module. D'oh -- now the ModuleProvider has a dangling pointer. :)

The routine LLVMModuleRef LLVMGetGlobalParent(LLVMValueRef Global); poses a related problem; in this case, the returned reference is non-owning, so you must not dtor it from Python.
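
The double free is easy to reproduce in miniature with a stub dispose function. The class names and dispose calls below are stand-ins, not the real bindings:

```python
disposed = []

def LLVMDisposeModule(ref):
    # Stub for the C dtor; just records what got freed.
    disposed.append(ref)

class Module(object):
    def __init__(self, ref):
        self.ref = ref

    def __del__(self):
        # Automatic finalizer: wrongly assumes we always own the ref.
        LLVMDisposeModule(self.ref)

class ModuleProvider(object):
    def __init__(self, module):
        self.module_ref = module.ref    # takes ownership of the C object

    def __del__(self):
        # The C++ ModuleProvider dtor deletes the module it owns.
        LLVMDisposeModule(self.module_ref)

m = Module("modref")
mp = ModuleProvider(m)
del m                                   # first dispose: Python's finalizer
del mp                                  # second dispose of the same ref: boom
assert disposed == ["modref", "modref"]
```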

The fix, of course, is providing a dispose routine and requiring the user to call it, since you can’t know what they’ve done with the pointer. Luckily, the IR is not subject to these subtleties. None of your LLVMValueRef wrappers need destructors, either manual or automatic, because LLVMDisposeModule will destroy the contained objects recursively.

Builders and type handles are unlikely to ever be subject to these sorts of circumstances, though, so letting Python garbage collect them is advisable.

— Gordon

Certainly. There are advantages to using the "up, prev, next" pointers
embedded within the IR in some circumstances, though. Consider what if
'do stuff with inst' might entail deleting another related
instruction, where that instruction might appear later in
reversed(block.instructions)--dangling pointer, boom! :) On the other

OK, that makes sense. I'll use first-class iterators.

Python iterators don't support reverse iteration, though, even if the
container supports it (a la rbegin/rend in C++). The "iterator protocol" is
defined only for forward iteration; hence the reversed() idiom.
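
If the copy ever becomes a concern, note that reversed() will also use a container's __reversed__ hook when one is defined, so a wrapper could walk the IR's own prev pointers instead of materializing a list. A toy linked list (illustrative names only, not the bindings) shows the idea:

```python
class InstList(object):
    """Toy doubly-linked list standing in for a block's instruction list."""
    def __init__(self, values):
        self.first = self.last = None
        for v in values:
            node = {"value": v, "prev": self.last, "next": None}
            if self.last:
                self.last["next"] = node
            else:
                self.first = node
            self.last = node

    def __iter__(self):
        # Forward: the only direction the iterator protocol itself defines.
        node = self.first
        while node:
            yield node["value"]
            node = node["next"]

    def __reversed__(self):
        # reversed() calls this when present; we walk the prev pointers,
        # so no copy of the list is made.
        node = self.last
        while node:
            yield node["value"]
            node = node["prev"]

insts = InstList(["add", "mul", "ret"])
assert list(insts) == ["add", "mul", "ret"]
assert list(reversed(insts)) == ["ret", "mul", "add"]
```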

BTW, I didn't find any APIs for deleting an instruction?

hand, creating a copy as your code presumably does is also sometimes
exactly what's wanted... TMTOWTDI, as they say.

Pythoners will disagree -- see 'python -c "import this"' ;)

Regards,
-Mahadevan.

That surprises me. Guess I was all tuckered out after all those builders!

— Gordon

Consider the case where a function creates and populates a Module, stuffs it
in an ExistingModuleProvider for the JIT, then returns the ModuleProvider,
dropping direct reference to the Module. (ModuleProvider takes ownership of
the Module.) I presume that your Python object is under the impression it
owns the Module; when that goes out of scope, its refcount goes to zero and
it invokes its dtor, disposing of the Module. D'oh -- now the ModuleProvider
has a dangling pointer. :)

Ah. Good one. Would the following fix it?

1) Have ModuleProvider maintain a reference to the Module it owns,
     so that the ref count is at least 1 at any time. This is easily done.
     The only thing left is when an MP goes away, the module's dtor
     will be called first, deleting the module, then the MP's dtor will
     be called, which will try to delete the same module again.

2a) So either we can prevent the actual destruction of modules that
    are attached to MPs, or

2b) Do not do anything in the dtors of MPs (while letting the dtor of
    modules do the work)

Both options have the disadvantage of assuming the C/C++ implementation
(like MP::dtor deletes only the module and nothing else).

The routine LLVMModuleRef
LLVMGetGlobalParent(LLVMValueRef Global); poses a related problem; in this
case, the returned reference is non-owning, so you must not dtor it from
Python.

If I do this:

  m1 = Module.new()
  g1 = m1.add_global_variable(ty, "name")
  m2 = g1.module

will the LLVMModuleRef pointer returned in the last call be the
same as that of m1? If so, we can probably get "g1.module" to
return the original object itself.
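
If the pointers do come back identical, one way to hand back the original object is an identity map keyed on the underlying ref. Here is a sketch with strings standing in for LLVMModuleRef pointers; a real version would likely hold weak references so the table doesn't keep wrappers alive forever:

```python
_wrappers = {}                          # underlying ref -> Python wrapper

class Module(object):
    def __init__(self, ref):
        self.ref = ref
        _wrappers[ref] = self           # register on creation

def wrap_module(ref):
    # Hand back the existing wrapper for this ref, if there is one.
    if ref in _wrappers:
        return _wrappers[ref]
    return Module(ref)

m1 = wrap_module("modref-1")            # first wrap: creates the object
m2 = wrap_module("modref-1")            # e.g. what g1.module would return
assert m2 is m1
```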

The fix, of course, is providing a dispose routine and requiring the user to
call it, since you can't know what they've done with the pointer.

It'd be much easier to use it without an explicit destruction call.
I'd prefer to do that only if there's absolutely no other way.

Regards,
-Mahadevan.

Consider the case where a function creates and populates a Module, stuffs it in an ExistingModuleProvider for the JIT, then returns the ModuleProvider, dropping direct reference to the Module. (ModuleProvider takes ownership of the Module.) I presume that your Python object is under the impression it owns the Module; when that goes out of scope, its refcount goes to zero and it invokes its dtor, disposing of the Module. D'oh -- now the ModuleProvider has a dangling pointer. :)

Ah. Good one. Would the following fix it?

1) Have ModuleProvider maintain a reference to the Module it owns,
    so that the ref count is at least 1 at any time. This is easily done.
    The only thing left is when an MP goes away, the module's dtor
    will be called first, deleting the module, then the MP's dtor will
    be called, which will try to delete the same module again.

2a) So either we can prevent the actual destruction of modules that
   are attached to MPs, or

2b) Do not do anything in the dtors of MPs (while letting the dtor of
   modules do the work)

The ModuleProvider itself also needs to be destroyed or you have a memory leak.

Both options have the disadvantage of assuming the C/C++ implementation (like MP::dtor deletes only the module and nothing else).

The routine LLVMModuleRef LLVMGetGlobalParent(LLVMValueRef Global); poses a related problem; in this case, the returned reference is non-owning, so you must not dtor it from Python.

If I do this:

m1 = Module.new()
g1 = m1.add_global_variable(ty, "name")
m2 = g1.module

will the LLVMModuleRef pointer returned in the last call be the
same as that of m1? If so, we can probably get "g1.module" to
return the original object itself.

It will, so that's a possibility.

The fix, of course, is providing a dispose routine and requiring the user to call it, since you can't know what they've done with the pointer.

It'd be much easier to use it without an explicit destruction call. I'd prefer to do that only if there's absolutely no other way.

It's certainly the easiest option. The thing to remember is that C++ won't play with your garbage collector at all.

The only other option I can see is to

  1. Add an 'owns pointer?' flag to each Python wrapper object.
  2. Update that flag dynamically as the value is used, in accordance with the C++ ownership semantics.
  3. Maintain a LLVMFooRef -> Python object table.

So you would reset the 'owns pointer?' flag on m if the user calls ModuleProvider.new(m).

The table is necessary; otherwise mp = ModuleProvider.new(m2) [using your example fragment above] would trigger a double free by mp and m1. Python users are probably not so performance-sensitive that routing lookups through the dictionary would hurt, so this might be quite workable.
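
Steps 1 and 2 can be sketched in miniature, again with stub dispose calls standing in for the real bindings (the ref-to-wrapper table of step 3 is omitted for brevity):

```python
disposed = []

def LLVMDisposeModule(ref):
    # Stub for the C dtor; records what got freed.
    disposed.append(ref)

class Module(object):
    def __init__(self, ref):
        self.ref = ref
        self.owns = True                # step 1: 'owns pointer?' flag

    def __del__(self):
        if self.owns:
            LLVMDisposeModule(self.ref)

class ModuleProvider(object):
    def __init__(self, module):
        module.owns = False             # step 2: ownership transfers to us
        self.module_ref = module.ref

    def __del__(self):
        # The C++ ModuleProvider dtor deletes the module it owns.
        LLVMDisposeModule(self.module_ref)

m = Module("modref")
mp = ModuleProvider(m)
del m                                   # no dispose: m no longer owns the ref
del mp                                  # exactly one dispose, by the provider
assert disposed == ["modref"]
```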

But even with all that work, you'll still have the possibility of surprising results:

   def f(m):
     mp = ModuleProvider.new(m)
     # do something useful?

   m = Module.new("")
   f(m)
   print(m.name) # Segfault, heap corruption, etc.

For Value and subclasses, you can perform the optimization that the Value is always owned by the Module and skip invoking all of this machinery. Likewise for Pass, if it ever gets bound.

— Gordon

Riiight. Let's try again:

1) The MP dtor does a no-op (deletes self, but not the module it owns)

2) The module has a 'provider' attribute, initially None.

3) When a MP is created for a module, it sets module.provider to itself.

4) MP itself has a reference to the module it owns.

Essentially, this creates a mutual reference between the two objects,
and both remain active until *all* references to *both* objects
go away from the user code.

It does appear that Python collects these properly.

Regards,
-Mahadevan.

That's not how the object works...

— Gordon

Yes, I know ;) I was hoping you could add a detach() or some such API
for the MP...

> That's not how the object works...

Gordon, I think I can make it work if we have the following additional
function in LLVM-C:

LLVMModuleRef LLVMGetModule(LLVMModuleProviderRef MP) {
  return wrap(unwrap(MP)->getModule());
}

Do you think you could add that? Can I send you a patch?

Thanks & Regards,
-Mahadevan.

That's not how the object works...

Gordon, I think I can make it work if we have the following additional
function in LLVM-C:

LLVMModuleRef LLVMGetModule(LLVMModuleProviderRef MP) {
  return wrap(unwrap(MP)->getModule());
}

Can I ask, how general is your solution? I only intended this particular example as a proxy for the general problem: the interaction between GC and manual memory management is complex. Will you need to change your Module bindings if another kind of object can take ownership of a Module?

Do you think you could add that? Can I send you a patch?

Certainly.

— Gordon

Well, the root cause (in this case) was that the transfer of the module's
ownership to the module provider was not reflected in the design. Initially,
there was a module, which owned an LLVMModuleRef:

module_obj has-a LLVMModuleRef

After transferring ownership to the module provider, the situation becomes:

mp_obj has-a LLVMModuleProviderRef
mp_obj has-a module_obj
module_obj belongs-to-a mp_obj

and also:
LLVMModuleProviderRef has-a LLVMModuleRef

All relations are necessary to satisfy the use cases. The only way for the
module_obj to get the LLVMModuleRef *should* be via
mp_obj->LLVMModuleProviderRef->getModule.

Like so:

class ModuleProvider(object):

    def __init__(self, ptr, module):
        self.ptr = ptr
        self.module = module
        module._attach_to_provider(self)

    @property
    def module_ptr_non_owning(self):
        return _core.LLVMGetModule(self.ptr)

class Module(object):

    def __init__(self, ptr):
        self._ptr = ptr
        self.provider = None

    def _attach_to_provider(self, provider):
        self.provider = provider
        self._ptr = None

    @property
    def ptr(self):
        if self.provider:
            return self.provider.module_ptr_non_owning
        else:
            assert self._ptr
            return self._ptr

    # Use self.ptr temporarily, for making LLVM*() calls that
    # require a LLVMModuleRef.

I have to admit I assumed that only ModuleProviders own Modules.
The concept of non-owning wrapper objects is also missing.
The design will have to evolve as I cover more APIs, I guess.

Regards,
-Mahadevan.

Hm. I may misunderstand, but I'm not sure that's an improvement over the problem you're trying to solve. How about something like this? (Please forgive any accent; I don't speak snake fluently.)

class Pet(object):
   @staticmethod
   def new():
     # Create a 'free' pet. It can later become owned, but not by more than one owner.
     return Pet(_core.LLVMCreatePet(), None)

   def __init__(self, ptr, owner):
     self.ptr = ptr
     self.owner = owner

   def __del__(self):
     # Do not call the dtor if something else has committed to do so for us!
     if not self.owner:
       _core.LLVMDisposePet(self.ptr)

class Owner(object):
   @staticmethod
   def new(pet):
     if pet.owner:
       raise OwnershipError("Pet already owned!")
     return Owner(_core.LLVMCreateOwner(pet.ptr), pet)

   def __init__(self, ptr, pet):
     self.ptr = ptr
     pet.owner = self

   def add(self, pet):
     if pet.owner:
       raise OwnershipError("Pet already owned!")
     _core.LLVMAddPet(self.ptr, pet.ptr)
     pet.owner = self

   @property
   def first_pet(self):
     # Create an 'owned' pet.
     return Pet(_core.LLVMGetFirstPet(self.ptr), self)

   def __del__(self):
     _core.LLVMDisposeOwner(self.ptr)

This will create gobs of child->parent references; I don't remember whether that's a problem in Python. This pattern is composable, which is an important property: Owner could play the role of Pet to some OtherOwner.

But this is really delving into the realm of the excessively elaborate, so I think I'll bow out now. Do whatever you think best. :)

Um. Unfortunately in this case, Python will not collect an instance of
Pet which is owned by an Owner, because both objects in the reference
cycle have finalizers (__del__). Python will just add both objects to
a garbage list (and will not call any of the finalizers).

It does this because it cannot guess which finalizer to call first.

But I get your point though, and will try to work this into the code.
Let me also add some tests to exercise all these well.

Thanks for your inputs.

Thanks & Regards,
-Mahadevan.

Actually, there's no reference cycle, which is what I presume is the problem you're referring to:

class Owner(object):
   ...

   def __init__(self, ptr, pet):
     self.ptr = ptr
     pet.owner = self

Note there is no Owner.pet field. So it should be clear to Python that Pet must be collected before Owner, which is precisely what we're trying to tell the GC ('if owner is collected before pet, then pet becomes a dangling pointer, so don't do that').
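
Under CPython's refcounting this gives a deterministic finalization order, which a toy trace (with the dispose calls stubbed out as log entries, and assuming prompt refcount-based finalization) can illustrate:

```python
log = []

class Pet(object):
    def __init__(self):
        self.owner = None

    def __del__(self):
        if self.owner is None:
            log.append("dispose pet")   # free pet: we must free the C object
        else:
            log.append("pet: skipped, owner will dispose")

class Owner(object):
    def __init__(self, pet):
        pet.owner = self                # child -> parent ref; no Owner.pet

    def __del__(self):
        log.append("dispose owner (frees its pets too)")

pet = Pet()
owner = Owner(pet)
del pet                                 # only our ref held it: finalized now
del owner                               # then the owner, in that order
assert log == ["pet: skipped, owner will dispose",
               "dispose owner (frees its pets too)"]
```

Because Owner never stores the pet, dropping the user's reference to the pet finalizes it immediately, and it correctly declines to dispose the underlying object.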

— Gordon

Ah, that's true. My mistake.

Actually the self.ptr object is a PyCObject with "owned" semantics
right now -- that is, when self.ptr is finalized, an LLVMDisposeXXX() will
be called. This has to be changed (so that self.ptr finalization does nothing),
and the ownership maintained via Python finalizers (like you've done above).

-Mahadevan.