dynamic typing system

This isn’t a strictly llvm-related problem, but I thought I’d ask anyway to see if anyone can help.

I’m trying to write a dynamically typed language on top of llvm. My initial idea was to have a general object type for all objects in my language. I came up with:

{ i8, i8* }

the first element of the structure would hold the type of the object, and the second is a pointer to the actual data.

Now, I’m not exactly sure how to get my data allocated somewhere in order to be able to get a pointer to it. My initial thought was heap allocations, though there doesn’t seem to be any llvm instructions that perform heap allocations, though I imagine you just use C’s malloc and free? If you do use those, however, is there a way of getting the byte-size of a type, to know what to pass to malloc? There’s also the issue of having to know when to be able to free() the pointers.

The other option, I guess, would be stack allocations with alloca instructions? I don’t need to worry about the sizes of types or about calling free, but now my objects can’t live on past the scope of a function, which may complicate things. For instance, if at my jiting repl (set up like the Kaleidoscope tutorial, where top-level expressions are wrapped in lambdas and then executed), I type in “5”, the repl should spit 5 back to me. If I use allocas here there isn’t a problem. But if I define a global variable and assign 5 to it, the data I alloca’d is going to be gone after the anonymous function returns. This makes it seem like heap allocations would be a better choice.

So basically, I’m sort of stuck not knowing the best way to implement this (or which way will even be possible). I’d appreciate any input/guidance on how to proceed.

Alec Benzer wrote:

This isn't a strictly llvm-related problem, but I thought I'd ask anyway to see if anyone can help.

I'm trying to write a dynamically typed language on top of llvm. My initial idea was to have a general object type for all objects in my language. I came up with:

{ i8, i8* }

the first element of the structure would hold the type of the object, and the second is a pointer to the actual data.

Now, I'm not exactly sure how to get my data allocated somewhere in order to be able to get a pointer to it. My initial thought was heap allocations, though there doesn't seem to be any llvm instructions that perform heap allocations, though I imagine you just use C's malloc and free? If you do use those, however, is there a way of getting the byte-size of a type, to know what to pass to malloc? There's also the issue of having to know when to be able to free() the pointers.

For heap allocations, you need to insert a call to a function that performs heap allocations. Creating a call to malloc is one possibility. However, depending on your front-end language, you may wish to use a garbage-collection allocator (e.g., the Boehm conservative allocator) or a region-based allocator (e.g., the allocators used in the Automatic Pool Allocator and SAFECode projects or the kmem_cache_alloc() allocator used in Linux). Whatever heap allocator you use is a design choice for your language.

To get the allocation size, use the TargetData analysis pass. This pass has methods that can tell you the allocation size of various LLVM types.

Note that if you go with a garbage-collected heap allocator, you will probably need to do some additional work so that the GC knows where to find the roots to use for starting scans of the heap. I believe LLVM provides intrinsics to aid with this.

-- John T.

That's putting it mildly.
If locally created objects can outlive the stack frame, you need a heap. (Unless your language is specifically geared towards region analysis, but your post doesn't sound as if that's the part of the learning curve you're on.)
Some language infrastructures actually allocated everything on the heap. Stack-based allocation would be added as an optimization then. (This might be reasonable, or the worst idea ever, depending on specifics of the language.)

Regards,
Jo

If your data needs to outlive the function it was allocated in you
need to use heap allocation (either malloc or a custom allocation/gc
-- just call it as a C function for now, for some allocators you can
inline it directly into your code).
To calculate the size for allocation either use TargetData or use GEP
trick (http://nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt).
Note that in your struct you don't gain much by declaring type as i8
-- pointer that follows it will be aligned at 4 or 8 bytes, thus
creating a hole after i8. Type is usually stored as a pointer to some
type info or method table. Dynamically typed languages usually put it
as the first word of every object and use "plain" pointers to objects.
Sometimes "fat" pointers (a pair of pointer to type and pointer to
data) are used.

Eugene

This is a huge subject, but briefly:
1. Unless your type system is really weird, you will need a garbage-collected heap. The design and quality of your GC will determine a lot of the performance of your language. However, it is relatively easy to improve your garbage collector and allocator without fundamentally changing your object representation.
2. You should design your object representation to make it as cheap as possible to create and recognize the values you expect to be working with the most. If your dynamic programs are going to do a lot of text processing, don't make it expensive to check whether a value is a string. In particular, booleans and small integers should be easy to recognize.
3. If you have any control over your language's semantics, I strongly suggest avoiding implicit coercions; it should be a (dynamic) type error to use a string where an integer was expected. This can substantially cut down on the amount of code you have to generate.
4. So-called "fat pointers" (pointers that include extra bytes of information, like {i8, i8*}) can significantly increase the amount of memory you use, since most architectures require pointers to be aligned on 4- or 8-byte boundaries (or perform better when they are). Try to see if you can get away with tagged pointers, i.e. using the low bits of a value to determine how to interpret the rest of it. Getting good low-level performance out of a dynamically-typed language requires a lot of type-unsafe tricks like this.

John.