Creating a virtual machine: stack, regs alloc & other problems

Hello.

It is a stack VM, designed to utilize all the advantages of an assembly-language implementation.

This sounds very, very familiar. Are you willing to share which
VM/language you're working on?

I would like to because I think the additional context would be helpful...let me ask for permission first.

doing this in C (via C function calls, CPS) led to a significant performance loss.

I assume the continuation is a tail call? If so, have you examined
where the performance loss originated? An obvious tail call in C code
being compiled by modern Clang should be code generated as a tail call.
You might be stumbling across an implementation limit and adjusting your
input slightly might bypass that.

Yes, both the executor and its continuation are tail calls. We tried gcc 4.8, 4.9, 5.1
and clang 3.6. All of them infer a tail call when the function call is explicit. But if
a continuation function pointer is popped off the VM stack, none of them produced a jump,
leading to machine stack overflow. Not sure why clang couldn't do this, because with
LLVM IR it works (in test code, of course).
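To make the pattern concrete, here is a minimal C sketch (all names hypothetical, not the actual VM code) of a handler that pops its continuation off the VM stack and calls it in tail position; that final indirect call is the one the compilers above fail to turn into a jump:

```c
struct vm;                            /* forward declaration for the typedef */
typedef void (*cont_fn)(struct vm *);

struct vm {
    cont_fn stack[64];  /* VM stack; holds only continuations in this sketch */
    int     sp;         /* next free slot */
    long    acc;        /* one VM register, for illustration */
};

static void vm_push(struct vm *v, cont_fn f) { v->stack[v->sp++] = f; }
static cont_fn vm_pop(struct vm *v)          { return v->stack[--v->sp]; }

static void op_halt(struct vm *v) { (void)v; /* result is left in acc */ }

static void op_add1(struct vm *v) {
    v->acc += 1;
    cont_fn next = vm_pop(v);  /* continuation is popped off the VM stack */
    next(v);                   /* tail position: a jump is wanted here    */
}
```

When the tail call is not optimized, a long chain of such handlers grows the machine stack by one frame per bytecode, which is the overflow described above.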

We are considering extending LLVM by creating a special calling convention which
forces a function (using this convention) to pass args in registers and to
be force tail-call optimized.

You absolutely will need a custom calling convention for the register
assignments and such. If your source IR uses musttail, in principle you
shouldn't need to do anything special for the tail calls, provided you're
running on one of the architectures where musttail has been implemented.

musttail has some limitations (e.g., the caller and callee prototypes must match), but tail
with -tailcallopt works just fine.
Looks like we were on the right track with that question. Thanks!

Creating a new calling convention is easy. It will require a custom
LLVM build to get started, since you have to change the .td and .cpp files in
the target. For examples, see the existing ones in
Target/X86/X86CallingConv.td

Yes, we already have a custom build. Thanks!

2. Because the existing VM runtime is written in x86 assembly and doesn't do function calls, it uses the ESP register for VM-stack purposes
(again, it is not in use for low-level calls). We want to do the same.

This will be tricky... Do you absolutely need this?

We still have to support 32-bit x86, and this arch lacks GP registers, so
a) the esp register becomes almost unused, and
b) we would have to do stack operations using other regs, which may lead to more spilling.
So it's good to have the esp register doing what it has to do... but for the VM stack.
We don't want to keep the original x86 assembly version alongside our new LLVM-based one (hopefully we will make it).

We think it could be implemented with intrinsics as well? Or perhaps we should create intrinsics for arbitrary machine-stack access?
We tried, for example, a stacksave-sub-store-stackrestore sequence, but it never folds into a single push operation.

I would suggest just implementing your virtual machine stack as a normal
bit of memory. There's no reason the compiler needs to know that
this is the VM stack versus some other buffer. You will need to provide
aliasing facts, but that's a much smaller extension*. Trying to muck
with the frames at runtime using intrinsics is going to end very badly.
* Don't underestimate this point. Providing aliasing metadata will be
*really* important for this scheme to work reasonably well. You may
need to add custom extensions locally or propose extensions upstream to
encode the information you need.
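As a sketch of what "a normal bit of memory" means here (names hypothetical): a VM push becomes an ordinary store plus a pointer bump, with nothing stack-shaped for the compiler to special-case:

```c
#include <stdlib.h>

typedef struct {
    long *base;  /* start of the VM stack buffer (heap-allocated) */
    long *top;   /* next free slot */
} vmstack;

static vmstack vmstack_new(size_t slots) {
    vmstack s;
    s.base = malloc(slots * sizeof *s.base);
    s.top  = s.base;
    return s;
}

/* A VM push/pop is just a store/load plus a pointer bump -- no stack
 * intrinsics, nothing the compiler must treat as the machine stack. */
static void vms_push(vmstack *s, long v) { *s->top++ = v; }
static long vms_pop(vmstack *s)          { return *--s->top; }
```

The aliasing facts mentioned above are what let the optimizer keep `top` and recently pushed values in registers across other memory operations.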

Could you please be more specific? Pointing to some docs or examples will work just fine. :slight_smile:

Another option: if you know the size of your VM frame statically, you
can emit loads from the VM stack locations into SSA values (or allocas,
which will become SSA values) and spill as needed to ensure the VM stack
is up to date as required by your language semantics. This will
likely a) decrease your dependence on the pass ordering above, and b)
give slightly better results, since LLVM is going to have to be
conservative about calls into your runtime, and your custom lowering gets
to use language-specific knowledge.
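A rough C-level analogue of this option, with a hypothetical two-slot frame and a stand-in runtime call: the slots live in locals (which become SSA values) and are only spilled back to the materialized frame right before the runtime can observe it.

```c
struct vm_frame { long slot0, slot1; };

/* Stands in for a call into the runtime that may inspect the VM frame. */
static long runtime_inspect(struct vm_frame *f) {
    return f->slot0 + f->slot1;
}

static long run_block(struct vm_frame *f) {
    /* Work on locals: these become SSA values, not memory traffic. */
    long s0 = 10, s1 = 32;
    s0 += s1;        /* bytecode-level arithmetic, all in registers */

    /* Spill: bring the VM frame up to date before the runtime sees it. */
    f->slot0 = s0;
    f->slot1 = s1;
    return runtime_inspect(f);
}
```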

Hmm, this sounds interesting, thanks.
So, if I understand correctly: when we need to allocate a VM stack frame, the idea is to create
enough allocas, then store into them the values which need to be in the VM frame?
But how can it survive the optimizing passes?
(I assume that we did stackload before, and esp points to the VM stack.)
Could you please explain more?

3. Since the machine stack is the VM stack, we are not allowed to use alloca. That's not a problem, but the machine register allocator/spiller
can still use the machine stack for register spilling purposes.
How could this be solved? Should we provide our own register allocator? Or could it be solved by providing a machine function pass
which runs on functions marked with our calling convention and substitutes the machine instructions responsible for spilling to the stack
with load/store instructions to our heap memory?

I don't understand what you're trying to ask here. If you can spill to
the machine frame (instead of the VM stack frame), what's the problem?

I mean that if we do stackload and the machine stack pointer points to the VM stack (and we
somehow solved the problem above), LLVM still wants to spill regs to the stack.
It would be good to have the spill slots in the VM context, but not on a stack
(neither machine nor VM).

Thank you for your time.

At a meta level, let me give my standard warning: implementing a
functional compiler for a language of your choice on LLVM is relatively
easy; you're looking at around 1-3 man-years of effort depending on the
language. Implementing a *performant* compiler is far, far harder.
Unless you're willing to budget for upwards of 10 man-years of skilled
compiler-engineering time, you may need to adjust your expectations.
How hard the problem will be also depends on how good your current
implementation is, of course. :slight_smile:

I appreciate the warning.
Strictly speaking, we aren't implementing the compiler itself. It's only a runtime
for interpreting bytecodes compiled earlier. A JIT is upcoming, but not for now.
It also doesn't include a memory manager - that works well enough written in C.
Our task is really not easy, but not as hard as you think. :slight_smile: We think. :slight_smile:

To give a flavour for the tuning involved, you might find this document
helpful:
http://llvm.org/docs/Frontend/PerformanceTips.html
If you're serious about the project, I highly recommend that you make an
effort to attend the developers' conference in October. You'll want to have
a bunch of high-bandwidth conversations with people who've been down
this road before, and email just doesn't quite work for that.

Thanks! We will keep this in mind.


Yes, both the executor and its continuation are tail calls. We tried gcc 4.8, 4.9, 5.1
and clang 3.6. All of them infer a tail call when the function call is explicit. But if
a continuation function pointer is popped off the VM stack, none of them produced a jump,
leading to machine stack overflow. Not sure why clang couldn't do this, because with
LLVM IR it works (in test code, of course).

I would suggest looking into this. The smallest C reproducer which doesn't get a tail call would be interesting to see and might get fixed. :slight_smile:

One thing you might be running into, if your VM is in C++ rather than C, is that C++ pointers to member functions aren't just function pointers.


2. Because the existing VM runtime is written in x86 assembly and doesn't do function calls, it uses the ESP register for VM-stack purposes
(again, it is not in use for low-level calls). We want to do the same.

This will be tricky... Do you absolutely need this?

We still have to support 32-bit x86, and this arch lacks GP registers, so
a) the esp register becomes almost unused, and
b) we would have to do stack operations using other regs, which may lead to more spilling.
So it's good to have the esp register doing what it has to do... but for the VM stack.
We don't want to keep the original x86 assembly version alongside our new LLVM-based one (hopefully we will make it).

How does your current runtime track spills inserted by the compiler? Is that integrated with the VM stack, or is that a distinct stack? If you didn't mind spills being interwoven with VM frames, you could potentially model the VM stack operations as dynamic allocas.


Could you please be more specific? Pointing to some docs or examples will work just fine. :slight_smile:

See LangRef. Search for noalias, tbaa, invariant.load, inbounds, readonly, readnone, argmemonly, nonnull...

See the alias analysis docs and the TBAA pass as an example of how to write and integrate a custom AA pass.
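For a C-level flavour of the same idea, `restrict` expresses roughly what noalias does in IR. A hypothetical illustration of the kind of aliasing fact that matters here:

```c
#include <stddef.h>

/* Without restrict, the compiler must assume a store through vmstack may
 * clobber *heap_cell and reload it on every iteration. With restrict, the
 * promise that the two regions never overlap lets heap_val stay in a
 * register for the whole loop. */
static long fill_slots(long *restrict vmstack,
                       const long *restrict heap_cell, size_t n) {
    long heap_val = *heap_cell;           /* loaded once, kept in a register */
    for (size_t i = 0; i < n; i++)
        vmstack[i] = heap_val;            /* cannot alias *heap_cell */
    return heap_val * (long)n;
}
```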

So, if I understand correctly: when we need to allocate a VM stack frame, the idea is to create
enough allocas, then store into them the values which need to be in the VM frame?
But how can it survive the optimizing passes?

The allocas specifically shouldn't survive optimization. That's the point. :slight_smile: If the VM stack has escaped at all the relevant points, the spills to the VM stack memory can't be eliminated. As a result, you'd get the effect of having a materialized VM stack when you need it, and everything in SSA/the execution stack the rest of the time.

(I assume that we did stackload before, and esp points to the VM stack.)
Could you please explain more?

I was assuming you still had a separate execution stack and vm stack. Mixing the two without letting LLVM spill things between VM stack sections would be "interesting".


I mean that if we do stackload and the machine stack pointer points to the VM stack (and we
somehow solved the problem above), LLVM still wants to spill regs to the stack.
It would be good to have the spill slots in the VM context, but not on a stack
(neither machine nor VM).

This is going to be a really problematic design point. The entire LLVM backend assumes it owns the execution stack. That's a really, really built-in assumption. Trying to change that would be extremely challenging. (See comment below.)


I appreciate the warning.
Strictly speaking, we aren't implementing the compiler itself. It's only a runtime
for interpreting bytecodes compiled earlier. A JIT is upcoming, but not for now.
It also doesn't include a memory manager - that works well enough written in C.
Our task is really not easy, but not as hard as you think. :slight_smile: We think. :slight_smile:

Wait, what? I think I got something confused at some point. All of my answers above were with a JIT in mind. :slight_smile: Doing an interpreter is a slightly different beast.

For the record, trying not to have an execution stack for spilling just started seeming a lot more sane. :slight_smile: I would still start with a design that uses an extra (ESP) register for the execution stack, but tuning the IR for an interpreter to not spill, or changing the base pointer to be something special with a fixed scratch pad, seems approachable. Dealing with a restricted bit of code is much more approachable than whatever a JIT might emit. :slight_smile: Still challenging, though.
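For reference, the restricted bit of code in question is typically a dispatch loop like this minimal sketch (opcodes and layout hypothetical); keeping `pc` and `sp` in registers across dispatch, unspilled, is most of the battle described above:

```c
enum { OP_PUSH, OP_ADD, OP_HALT };

static long interp(const long *code) {
    long stack[32];
    long *sp = stack;        /* VM stack pointer: ideally pinned to a register */
    const long *pc = code;   /* VM program counter */
    for (;;) {
        switch (*pc++) {
        case OP_PUSH: *sp++ = *pc++; break;           /* push immediate     */
        case OP_ADD:  sp[-2] += sp[-1]; sp--; break;  /* add top two slots  */
        case OP_HALT: return sp[-1];                  /* result on stack top */
        default:      return -1;                      /* bad opcode */
        }
    }
}
```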

Alex:

I'm not sure you're taking the right approach with this. You can have
portability, or you can play games with the calling convention assumed
by the back end, or you can modify the compiler to suit your desired
calling convention - but you probably can't get all three.

I'm the guy behind HDTrans (dynamic binary translation for x86), and we
used direct x86 instruction emission as well, and we cheated like crazy on
calling conventions, stacks, you name it. So I understand where you are
coming from. I've also done some bytecode VM work. You just aren't going to
get a portable result that way, and as others have said already, using
LLVM IR isn't going to get you there. I think you are better off stepping
back and looking at this as a new engineering problem rather than trying to
translate your existing solution piece by piece. The bad news is that this
infrastructure may not let you get quite as far down toward the bare metal.
The good news is that it can be exploited to do more in the way of dynamic
optimization than is typically feasible with directly hand-generated
machine code.

If you like, get in touch with me off-line. I don't want to go spouting off
useless ideas here, because I don't understand what you are trying to do
yet. But I'd be happy to talk with you to get a slightly better sense and
see if I can offer some practical help.

Jonathan
shap (at) eros-osdogorg
Dog should be dot, of course. :slight_smile:

Philip:

And just as an aside, I suspect I understand why the register-spill issue
is important. My guess would be that you need to keep your VM stack
in a state that is bitwise compliant with some external specification at
every point where a sequence point potentially exists?

shap

I’m not sure what the OP requirements are. That’s what I was trying to clarify/understand. I certainly don’t have that requirement for my VM, but I can’t speak for the OP.

Philip