Hello- I'm looking to implement a new programming language using LLVM as a back-end. Generally it's looking very good, I only have one question. The language is going to be an ML-style language, similiar to Haskell or Ocaml, except explicitly multithreaded and (like Haskell but unlike Ocaml) purely functional. But this means that speed of allocation is essential- purely functional languages have been measured to allocate an average of 1 word every 6 instructions. Generally they allocate larger blocks (often 10-20 words at a shot) less often, but we're still talking about an insane (to imperitive programming language standards) allocation rate.
What I'd like is something similiar to the Ocaml garbage collector, but with a unique young heap for each thread (so I don't have to acquire a lock to allocate). But this requires me to have two words of thread-local storage for every thread (young_heap_pointer and young_heap_limit). So this is the question I have: what is the best way to implement this, given exceedingly tight constraints on performance? I'd like something that could be implemented on both Unix/Mac and Windows, just to make it more of a problem.
By far the best way I can think of is to play games with the virtual memory- have the two pointers (and maybe a little bit of other per-thread information) on a global page all by themselves, and remap that virtual page for every thread. Then I could simply dereference the globals. On Unix I think I can do this playing games with mmap(), but I'm not sure how to do this on Windows. I played around a little with the VirtualAlloc function, but it doesn't look like it does what I need. Almost as good would be to add a level of indirection, have a global pointer to where the page is to be mapped, and just map it in the same place in every thread.
Another possibility, and I'm not sure how to do this in LLVM, would be to sacrifice a register to hold the pointer to the unique per-thread structure. This would be worthwhile to me even on the register-starved x86-32. I suppose I could also just add a "hidden" (compiler-added and -maintained) argument to every function which is the pointer to the per-thread data.
Using the normal thread-local storage scares me, because I don't know the performance implications. Obviously calling a syscall every allocation would be unacceptable, nor would I like an O(log N) tree-walking hit. Ocaml's allocation is like 3-5 clock cycles for the common case (no minor GC needed), so adding another 5 clock cycles to get the thread-local data doubles the cost of allocation. Adding 50 or 1000 clock cycles is right out. Basically, I'd like allocation to remain single-digit clock cycle cost for the common case. If someone can convince me that thread-local storage is that fast, then we're good. Otherwise, I need to look elsewhere.
Opinions, advice, and/or alternative ideas about this problem would all be greatly welcomed. Which of the above alternatives do you think would work best? Or some other approach entirely?
Thanks.
Brian