I’m wondering if some of you old hands might offer me some guidance. I’ve been working for some time with a virtual machine platform that is based loosely on the instruction set of a Forth hardware processor that Charles Moore designed some time back. He used what he called a “MISC” or “Minimal Instruction Set Computer,” and the original instruction set I was working from (and have modified slightly in the process) had a total of 32 byte code primitives. He chose this limit because in his hardware design he encoded each instruction in 5 bits and stored four of them in a 20-bit word. I’ve developed my platform for the Intel processor family, so I have 32-bit words and store four 8-bit instructions in one 32-bit word. I’m still only using a fraction of the 256 codes a byte can represent (last time I checked I was up to around 50 of them), but like Moore, I rely heavily on the original 32.
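Just to make the encoding concrete, the packing amounts to roughly the following (the opcode names and the little-endian slot order here are purely illustrative):

#include <cstdint>

// Illustrative opcode values only; the real set currently has ~50 entries.
enum Op : uint8_t { OP_NOP = 0, OP_ADD, OP_PUSH, OP_POP, OP_CALL };

// Pack four 8-bit byte codes into one 32-bit instruction word.
inline uint32_t packWord(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return uint32_t(a) | (uint32_t(b) << 8) | (uint32_t(c) << 16) | (uint32_t(d) << 24);
}

// Extract the byte code in slot n (0..3) of a packed word.
inline uint8_t opAt(uint32_t word, unsigned n) {
    return uint8_t(word >> (8 * n));
}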
I’ve constructed a typical Forth dictionary mechanism on top of this VM, and the code stream consists of a combination of four types of execution. Like most Forth environments, there is an interpreter mode that involves immediate lookup of words in a dictionary and execution of them as they are encountered during parsing of the input stream. The remaining three execution types involve code that has been compiled into these dictionary words (which happens when the execution engine is switched from interpreter to compile mode). As in any classic Forth system, once these compiled words are created they are stored in the dictionary and are then available to the interpreter and to any other compiled words that wish to use them.

When executing a compiled word (whether from the interpreter or in the course of executing some other compiled word that uses it), the execution engine always technically executes byte codes. But the reason there are three distinct code execution types is that one of the byte codes is a CALL primitive used to invoke higher-level Forth dictionary words, and another is an ASM! primitive that drops into Intel assembler execution mode and executes Intel code directly. So a typical compiled word might be some combination of Forth-level calls to other dictionary words, sequences of VM byte codes that ultimately execute VM primitives, and Intel assembler bits that are executed directly by the processor.

Just to complete the picture, each VM byte code is itself implemented by a small hand-coded Intel assembly routine that, among other things, limits its register use to preserve those registers reserved by the underlying execution engine. There are only a few free registers available, as I reserve EAX for the top of the VM data stack, EDI for the top of the VM return stack, EDX for the Forth code instruction pointer, ESI for the VM address register, and ESP for the Intel stack pointer. This leaves EBX, ECX, and EBP for general use, although some of my routines do push some of the reserved registers for use while they are executing and then pop them before returning to the dispatch loop.
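For anyone who prefers code to prose, the inner loop is conceptually something like the sketch below, except that the real engine is hand-coded assembler using the register assignments above rather than C++; the names and primitive bodies here are simplified stand-ins:

#include <cstdint>

enum Op : uint8_t { OP_ADD, OP_ASTORE, OP_CALL, OP_EXIT /* ... */ };

struct VM {
    uint32_t dataStack[64]; int sp = 0;   // top of data stack lives in EAX in the real engine
    uint32_t retStack[64];  int rp = 0;   // top of return stack lives in EDI
    const uint8_t *ip = nullptr;          // Forth instruction pointer, EDX in the real engine
    uint32_t a = 0;                       // VM address register, ESI in the real engine
};

void run(VM &vm) {
    for (;;) {
        switch (*vm.ip++) {               // fetch the next byte code and dispatch
        case OP_ADD:                      // + : add the top two data stack items
            vm.dataStack[vm.sp - 2] += vm.dataStack[vm.sp - 1];
            --vm.sp;
            break;
        case OP_ASTORE:                   // pop the data stack into the address register
            vm.a = vm.dataStack[--vm.sp];
            break;
        case OP_CALL:                     // push ip on the return stack, enter a word body
            /* ... */
            break;
        case OP_EXIT:
            return;
        default:                          // the remaining primitives
            break;
        }
    }
}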
OK, so I am looking to build a compilation mechanism using LLVM and imagine leveraging optimizations at two distinct levels. I’m sure there must be opportunities to optimize my VM byte code sequences (for example, eliminating OP_PUSH/OP_POP or OP_RPUSH/OP_RPOP pairs, or reordering stack use in general to be more efficient). This is obviously something I will need to do myself and probably won’t leverage much of LLVM for (although if I’m wrong about this and there is support at this level, I would love to understand what and where it is).

Where I see making most use of LLVM is in optimizing the assembly language sequences that end up getting executed, whether through sequences of VM byte codes or through my direct assembler coding (I forgot to mention that I have built an in-line assembler so that I can include Intel assembler instructions in my code streams). So what I envision is, for example, taking a sequence of 20 byte code instructions that might make up a word’s definition and optimizing the entire sequence of Intel assembly instructions that these translate to. This obviously has the potential to increase code size dramatically, since I could be turning a sequence of 20 bytes into a much longer sequence of Intel code. On the other hand, the VM-level stack manipulation that makes up a substantial amount of VM-level code can probably be optimized away through judicious use of the full Intel register set, and other VM instructions can probably benefit from similar savings. The +, 2*, COM, and address register store and fetch byte codes can each become a single assembler instruction instead of a call to a byte code routine. Other savings would likely come from moving memory accesses into registers, eliminating the calls and returns between successive byte code routines, converting tail calls to jumps, and so on. So in many cases the code growth won’t be so dramatic and will be more than worth it from an execution cost standpoint.
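To make the byte-code-level part concrete, the simplest version of what I have in mind is a single peephole pass over a word’s unpacked byte codes that deletes adjacent cancelling pairs, something like the sketch below (it assumes plain one-byte opcodes with no inline operands, and that a PUSH immediately followed by a POP really is a no-op in my encoding):

#include <cstdint>
#include <vector>

enum Op : uint8_t { OP_PUSH, OP_POP, OP_RPUSH, OP_RPOP /* ... */ };

// True if the second opcode undoes the first, so the pair can be deleted.
static bool cancels(uint8_t first, uint8_t second) {
    return (first == OP_PUSH  && second == OP_POP) ||
           (first == OP_RPUSH && second == OP_RPOP);
}

// One pass over a word's byte codes, dropping adjacent cancelling pairs
// (including pairs that only become adjacent after an earlier deletion).
std::vector<uint8_t> peephole(const std::vector<uint8_t> &in) {
    std::vector<uint8_t> out;
    for (uint8_t op : in) {
        if (!out.empty() && cancels(out.back(), op))
            out.pop_back();        // remove the earlier op and skip this one
        else
            out.push_back(op);
    }
    return out;
}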
So here are my questions (sorry to take so long to get to them). If given the choice, am I generally better off taking my byte code level and translating it to LLVM intermediate code, then driving newly optimized Intel code generation from there? Or, given that I already have each of these byte codes translated to Intel assembler, am I better off just gathering all those assembler sequences together and optimizing at the assembly language level? I’m going to guess the former, since working at the byte code level I may be able to replace entire byte codes with more judicious use of registers and end up with a much shorter piece of aggregated assembler to work with. But it could also be the case that there are no good rules of thumb and that I simply need to evaluate on a case-by-case basis. If anyone does have any wisdom to offer, or sources they can direct me to, I would be most appreciative.
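To illustrate why I lean toward the former: as I understand it, during translation the VM data stack can be modeled as a compile-time stack of llvm::Value*, so none of the stack shuffling ever reaches the IR; only the actual operations do, and LLVM’s optimizer and register allocator handle the rest. A rough sketch of what I mean, with made-up opcode names and an arbitrary word signature:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include <vector>

using namespace llvm;

enum Op : uint8_t { OP_LIT, OP_ADD, OP_2STAR, OP_COM, OP_EXIT };

// Translate one word's byte codes into an LLVM function that returns the
// value left on top of the data stack (signature chosen only for the sketch).
Function *compileWord(Module &M, ArrayRef<uint8_t> code) {
    LLVMContext &C = M.getContext();
    IRBuilder<> B(C);
    FunctionType *FTy = FunctionType::get(Type::getInt32Ty(C), /*isVarArg=*/false);
    Function *F = Function::Create(FTy, Function::ExternalLinkage, "word", &M);
    B.SetInsertPoint(BasicBlock::Create(C, "entry", F));

    std::vector<Value *> stack;            // models the VM data stack at compile time
    for (size_t i = 0; i < code.size(); ++i) {
        switch (code[i]) {
        case OP_LIT:                       // assume a one-byte inline literal follows
            stack.push_back(B.getInt32(code[++i]));
            break;
        case OP_ADD: {                     // + : becomes a single IR add, no stack traffic
            Value *b = stack.back(); stack.pop_back();
            Value *a = stack.back(); stack.pop_back();
            stack.push_back(B.CreateAdd(a, b));
            break;
        }
        case OP_2STAR:                     // 2* : shift left by one
            stack.back() = B.CreateShl(stack.back(), 1);
            break;
        case OP_COM:                       // COM : one's complement
            stack.back() = B.CreateNot(stack.back());
            break;
        case OP_EXIT:                      // return whatever is on top of the stack
            B.CreateRet(stack.back());
            return F;
        }
    }
    B.CreateRet(B.getInt32(0));
    return F;
}

Feeding a sequence like OP_LIT 3, OP_LIT 4, OP_ADD, OP_2STAR, OP_EXIT through this yields straight-line IR that folds down to ret i32 14 even before any passes run, since IRBuilder folds constants as it goes; there is no trace of the stack left.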
The other question is that a lot of my code isn’t even written in byte code, but rather directly in assembly language using my built-in assembler. For this code, does LLVM offer opportunities to optimize, even though the source assembler isn’t something that was generated from LLVM intermediate code? Also, I imagine that if I succeed in compiling a particular word into an optimized sequence of assembly language, then any call to that word from another word (through my CALL op) can be translated to a jmp instruction, thereby eliminating the substantial call overhead of my high-level Forth code. Once again, any guidance would be greatly appreciated, even if it’s just directing me to good source material that might be helpful.
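At the LLVM level I gather this amounts to marking the call as a tail call; a rough sketch of what I mean, assuming the caller and callee words have already been compiled to LLVM functions with matching (non-void, as in the earlier sketch) signatures, with placeholder names:

#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Emit a trailing "CALL callee; EXIT" as a guaranteed tail call, so the
// backend lowers it to a jmp rather than a call/ret pair. musttail requires
// matching signatures and calling conventions on caller and callee; fastcc
// is one of the conventions for which the x86 backend guarantees tail calls.
void emitTailCall(IRBuilder<> &B, Function *caller, Function *callee) {
    caller->setCallingConv(CallingConv::Fast);
    callee->setCallingConv(CallingConv::Fast);

    CallInst *call = B.CreateCall(callee);
    call->setCallingConv(CallingConv::Fast);
    call->setTailCallKind(CallInst::TCK_MustTail);
    B.CreateRet(call);   // a musttail call must be immediately followed by the return
}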
Thanks very much for slogging through this and for any assistance anyone might be able to provide.