Creating a virtual machine: stack, regs alloc & other problems

Not sure I understand. Are you talking about writing extensions to Clang?
Our general idea is to code the VM in LLVM IR, then run llc to produce object files, and so on.

Yes. Writing in LLVM IR is not considered a portable solution.
For example, you state that you are having this problem:
"After a high-level method executed by such a low-level function, there is a continuation that follows. The continuation is passed by VM stack and doing this using C (by C function calls, CPS) led to significant performance loss."

What I was alluding to is writing annotations in the source code and then writing LLVM passes that specifically target your performance problem and produce high-performance code. The annotations could be attributes on functions, calling conventions, or a set of target-independent intrinsic functions. All of these can be handled by custom passes that expand them into target-specific code. This gives you the portability of C while solving the performance bottlenecks it causes. You can even keep a bitcode library of hand-written inline assembly that these functions expand to, and link in the right version depending on the architecture.
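
To make that a bit more concrete, here is a rough sketch in C of what the annotated source might look like. Everything named here is made up for illustration: the `VM_HANDLER` and `VM_TAILCALL` macros, the "vm_handler" annotation string, the three opcodes, and `run_vm` are all hypothetical. It assumes a Clang that accepts `__attribute__((annotate(...)))` on functions (Clang records these in `@llvm.global.annotations`, where a custom pass could find them) and `__attribute__((musttail))` on return statements. The custom pass itself is not shown, and on compilers without these attributes the macros expand to nothing, so the code stays plain portable C.

```c
#include <assert.h>
#include <stddef.h>

/* Fall back gracefully on compilers that lack __has_attribute. */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

/* Hypothetical marker: a custom LLVM pass could look for this annotation
   and rewrite the handler (calling convention, register pinning, etc.). */
#if __has_attribute(annotate)
#define VM_HANDLER __attribute__((annotate("vm_handler")))
#else
#define VM_HANDLER
#endif

/* Where supported, force the continuation call to be a real tail call. */
#if __has_attribute(musttail)
#define VM_TAILCALL __attribute__((musttail))
#else
#define VM_TAILCALL
#endif

typedef struct VM {
    const unsigned char *pc; /* next opcode to execute */
    long acc;                /* single accumulator register */
} VM;

typedef long (*handler_fn)(VM *vm);

enum { OP_HALT = 0, OP_INC = 1, OP_DBL = 2 };

static long op_halt(VM *vm);
static long op_inc(VM *vm);
static long op_dbl(VM *vm);

/* Dispatch table indexed by opcode. */
static const handler_fn dispatch[] = { op_halt, op_inc, op_dbl };

/* Continue with the next opcode's handler (continuation-passing style). */
#define NEXT(vm) VM_TAILCALL return dispatch[*(vm)->pc++](vm)

VM_HANDLER static long op_halt(VM *vm) { return vm->acc; }
VM_HANDLER static long op_inc(VM *vm) { vm->acc += 1; NEXT(vm); }
VM_HANDLER static long op_dbl(VM *vm) { vm->acc *= 2; NEXT(vm); }

/* Run bytecode until OP_HALT and return the accumulator. */
long run_vm(const unsigned char *code) {
    assert(code != NULL);
    VM vm = { code, 0 };
    return dispatch[*vm.pc++](&vm);
}
```

Calling `run_vm` on the bytes `{ OP_INC, OP_INC, OP_DBL, OP_HALT }` returns 4. The point is that the handlers stay ordinary C, and the compiler-specific part is confined to two macros that a pass you control can act on.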

This, however, rests on the assumption that you control the compiler. If that is the case, there are lots of changes you can make. The downside is maintenance when the LLVM version changes, but if the version you build against is relatively static, that doesn’t factor in too much.