Global register variables/custom calling conventions

Hi all,

I'm implementing an LLVM backend for qemu-arm (by creating an ARM frontend for LLVM - from what I understand, a slightly different approach than the original llvm-qemu project) and I've got to the point where I have to deal with QEMU's use of global register variables. From what I've read, LLVM doesn't support GCC-style register variables, and the suggested workaround was to implement a custom calling convention to ensure they weren't clobbered (at least from a caller's point of view) [1]. I haven't played with the LLVM code base much, just used the interfaces, so I was wondering if anyone had some pointers on what to look at/how to go about this?

Cheers,

Andrew

[1] http://markmail.org/message/ko3rrf5zacoocvg5#query:+page:1+mid:5uygbscyiqmuxqi4+state:results

Hello

I'm implementing an LLVM backend for qemu-arm (by creating an ARM frontend
for LLVM - from what I understand a slightly different approach than the
original llvm-qemu project)

I don't see the difference so far. Could you please explain?

and I've got to the point where I have to deal with Qemu's use of global register variables.

Why? The whole point of the llvm-qemu project was that you don't need
all those hacks and workarounds - you just emit LLVM IR and let
the code generator deal with register allocation, etc. For example, you
have many more registers available on x86-64 (and even on x86-32!).

Anton Korobeynikov wrote:

Hello

I'm implementing an LLVM backend for qemu-arm (by creating an ARM frontend
for LLVM - from what I understand a slightly different approach than the
original llvm-qemu project)

I don't see the difference so far. Could you please explain?

Again, from what I understand, llvm-qemu worked by emitting LLVM IR from QEMU IR API calls. This project goes straight from ARM to LLVM IR, bypassing QEMU's IR, partially in the hope that more information about the original intent of the code is retained. I don't have any numbers handy for comparison against the argument "it'd optimise away"; however, we have to have an implementation to test this theory anyway :wink:

and I've got to the point where I have to deal with Qemu's use of global register variables.

Why? The whole point of the llvm-qemu project was that you don't need
all those hacks and workarounds

The point of this is to provide an alternative backend to QEMU that can be run in a separate thread to generate optimised blocks, while working as transparently as possible. A nice property of TCG (QEMU's current JIT, which was dyngen when llvm-qemu was written) is that it's extremely fast at generating reasonable code - this approach keeps it in place while we do extra, possibly more expensive work out of sight. It might not be a pretty idea, but LLVM does generate some very tight code :slight_smile: It's an experiment - humour me...

Sorry if this is somewhat OT.

Cheers,

Andrew

Hello

Again, from what I understand, llvm-qemu worked by emitting LLVM IR from
QEMU IR API calls. This project goes straight from ARM to LLVM IR, bypassing
QEMU's IR, (partially) in the hope that more information about the original
intent of the code is retained.

Ok, what's left from QEMU then? :slight_smile:

generating reasonable code - this approach keeps it in place while we do
extra, possibly more expensive work out of sight. It might not be a pretty
idea, but LLVM does generate some very tight code :slight_smile: It's an experiment -
humour me...

Well, but I still don't get the reason why you need to pin (some)
internal QEMU state variables to fixed registers?

Anton Korobeynikov wrote:

Ok, what's left from QEMU then? :slight_smile:

The hardware emulation (interrupts, condition flags, register file, etc.) and the execution framework (block selection and execution) from QEMU are still used - translating the ARM code to the native architecture is only part of the story :slight_smile:

generating reasonable code - this approach keeps it in place while we do
extra, possibly more expensive work out of sight. It might not be a pretty
idea, but LLVM does generate some very tight code :slight_smile: It's an experiment -
humour me...

Well, but I still don't get the reason why you need to pin (some)
internal QEMU state variables to fixed registers?

TCG separates the guest (ARM) code into blocks - my front end translates these to LLVM IR for LLVM to translate to x86. The assumption is that LLVM will produce a better translation than TCG*. At some future point the TCG-generated native block is replaced by LLVM's, and as such it needs to hand control back to QEMU in a state that it would expect from TCG. Essentially the idea is to take the same input and produce the same output as the original TCG block, but munge things in the middle to (hopefully) be more efficient using LLVM's local optimisations.

All up, the LLVM-generated code needs to conform to what QEMU is expecting from TCG in terms of emulated register state and, as it's pinning values in host registers, some parts of the host register state. The pinned register in this case holds a pointer to the emulated CPU's state structure, as it's frequently accessed. Clobbering this would break things such as direct block chaining (which avoids the need to jump out to the dispatch loop to find the next block to execute) and possibly other parts of the execution framework.

Hope it makes some sense :slight_smile:

Cheers,

Andrew

* At this point the frontend only supports a few ARM instructions, and it avoids generating IR that could lead to positional dependence. The general approach has the nice property that if an instruction that isn't implemented is encountered, the block simply isn't translated by LLVM and the TCG version lives on.

Hi Andrew,

TCG separates the guest (ARM) code into blocks - my front end translates
these to LLVM IR for LLVM to translate to x86. The assumption is that LLVM
will produce a better translation than TCG*. At some future point the
TCG-generated native block is replaced by LLVM's, and as such it needs
to hand control back to QEMU in a state that it would expect from TCG.
Essentially the idea is to take the same input and produce the same output
as the original TCG block, but munge things in the middle to (hopefully) be
more efficient using LLVM's local optimisations.

I'm curious, are you translating directly from ARM to LLVM IR or from
TCG IR to LLVM IR?

llvm-qemu just put the pinned variables into memory (and lived with
the performance penalty), but I agree it would be much nicer to have a
custom calling convention in order to avoid this.

Cheers,

Tilmann

I'm translating straight from ARM to LLVM IR, avoiding TCG's IR. I've taken the approach of implementing each basic block determined by TCG as a function in LLVM, which takes a pointer to the ARM CPU state struct as a parameter and returns the pointer to the struct. Currently I'm using LLVM's JIT to generate the target (x86_64) instructions, and the plan is to copy the instructions it generates into the translated code buffer. As mentioned in the previous email, the front-end has been designed to avoid location-specific code, so its output should be suitable for copying around in memory.
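To make the shape of that scheme concrete, here is a minimal sketch in C of what each block-as-function amounts to. The struct layout and the block body are invented for illustration - QEMU's real CPUARMState is far larger, and the actual blocks are emitted as LLVM IR rather than written by hand:

```c
#include <stdint.h>

/* Illustrative slice of the emulated CPU state: guest registers r0-r15
   plus the status word. */
typedef struct {
    uint32_t regs[16];
    uint32_t cpsr;
} CPUARMState;

/* Translation of a hypothetical guest block at 0x8000 containing
   "add r0, r1, r2" followed by "mov pc, lr": the function takes the
   CPU state pointer and returns it, as described above. */
static CPUARMState *block_00008000(CPUARMState *env)
{
    env->regs[0] = env->regs[1] + env->regs[2];
    env->regs[15] = env->regs[14];
    return env;
}
```

With this shape, the only ABI question left is how the state pointer gets into the function's parameter register, which is what the rest of the thread is about.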

At the moment I'm not explicitly specifying a calling convention for the
functions, but LLVM seems to consistently put the ARM CPU state struct pointer (the parameter) into %rdi. As a hack-around (since posting the original message to the list) I'm injecting a couple of MOV instructions: one to move the pointer from %r14 (AREG0 - TCG's pointer to the CPU state struct on x86_64) to %rdi before the copied function block, and one to move it back (from %rdi) to %r14 afterwards. Implementing a custom calling convention would avoid the injected MOVs, saving two instructions per block, I guess. This isn't a huge win, and as such isn't a big priority, but it'd be a nice thing to have... It'd make things slightly less hackish anyway.

Cheers,

Andrew

At the moment I'm not explicitly specifying a calling convention for the
functions, but LLVM seems to consistently put the ARM CPU state struct
pointer (parameter) into %rdi.

That's correct - please consider reading the x86-64 ABI document.

As a hack-around (since posting the original
message to the list) I'm injecting a couple of MOV instructions to move the
pointer from %r14 (AREG0 - TCG's pointer to the CPU state struct on x86_64)

Use inline assembler to load the address of this struct; this is much
better than implementing a new calling convention.

to %rdi before and back (from %rdi) to %r14 after the copied function block.

You don't need this - I doubt you're changing the pointer to the struct.

Hi Anton

At the moment I'm not explicitly specifying a calling convention for the
functions, but LLVM seems to consistently put the ARM CPU state struct
pointer (parameter) into %rdi.

That's correct - please consider reading the x86-64 ABI document.

Hmm - good idea :slight_smile: I had a look at [1] and it clears up a lot.

Use inline assembler to load the address of this struct; this is much
better than implementing a new calling convention.

Ok - it certainly seems a lot less effort.

to %rdi before and back (from %rdi) to %r14 after the copied function block.

You don't need this - I doubt you're changing the pointer to the struct.

Yes, I'm not changing the value. Before reading the ABI document it was just a bit of defensive programming to ensure that the pointer got placed back in %r14, but clearly %r14 shouldn't be changing in value anyway, since it's callee-saved.

Thanks for your help :slight_smile:

Andrew

[1] http://www.x86-64.org/documentation/abi-0.99.pdf