Dynamic (JIT) type resolution

Hi everyone,

I would like to implement a mechanism equivalent to the JIT's function
callbacks, but for fields. Typically in Java, when you compile a
method, there may be some instructions (getfield, putfield) that depend
on a type which is not yet resolved.

I think the best way to do this in LLVM is to add an intrinsic. The
intrinsic would only be valid when JITing, and would be lowered only in
the codegen part (I don't know yet whether it would be in the
target-dependent or target-independent part).

The callback method will resolve the type and patch the store/load
instruction so that the correct address is used (exactly like the JIT
function callback compiles a method and patches the call).

Now there is one issue to deal with here: how to represent the
intrinsic? It can either be 1) a single llvm.getfieldptr.{type}
intrinsic, or 2) two kinds of intrinsics, llvm.getfield.{type} and
llvm.storefield.{type}.
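
For concreteness, for a double field I have something like this in mind
(only a sketch, I don't know yet what the exact signatures should be):

     ; option 1: one address-computing intrinsic, used with an ordinary load/store
     declare double* @llvm.getfieldptr.f64(i8* %object, i8* %class, i8* %field)

     ; option 2: separate load and store intrinsics
     declare double @llvm.getfield.f64(i8* %object, i8* %class, i8* %field)
     declare void @llvm.storefield.f64(i8* %object, i8* %class, i8* %field, double %value)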

I'd prefer using 1) since it's closer to the LLVM instruction set
(GetElementPtrInst); however, there may be some tricky issues with where
and how the callback function must patch the code. For example, how are
move instructions (for spilling registers) inserted in LLVM? By choosing
1), could I end up with a move instruction between the getfieldptr call
and the load/store? I could also run into problems with code
optimization, where the store/load ends up not being next to the
callback call.

Will I also have these issues with 2)? I don't know whether LLVM does
optimization on DAG nodes. The DAG nodes that I would like to generate
for a llvm.getfield.{type} would be:

DAG.getCall(FieldCallback); // Or something similar, I don't know
exactly the syntax :wink:
DAG.getLoad();

At what point (if at all) can I be sure that these two instructions will
end up next to each other in the native code?

(Oh, and also, I would like codegen not to clobber caller-saved
registers when doing the call. Is that even possible? This is just an
optimization problem, so we can skip it for now.)

Thanks,
Nicolas

I don't think this is really the right way to go. Can you give an example snippet of Java code that would need this and what you are proposing? With a concrete example that shows how the lazy class loading stuff works we can talk about different possibilities,

-Chris

Nicolas,

I guess you're trying to solve the fragile base-class problem by deferring field offset calculations until JIT compilation time?

Perhaps I'm missing something, but can't you accomplish this by using external constants in the source program, to be supplied at JIT/link time?

     @obj.x.offs = external constant i32
     @obj.y.offs = external constant i32

     define float @xPlusY(i8* %obj) {
     entry:
       %x.offs = load i32* @obj.x.offs
       %x.ptr = getelementptr i8* %obj, i32 %x.offs
       %x.ptr2 = bitcast i8* %x.ptr to float*
       %x = load float* %x.ptr2
       %y.offs = load i32* @obj.y.offs
       %y.ptr = getelementptr i8* %obj, i32 %y.offs
       %y.ptr2 = bitcast i8* %y.ptr to float*
       %y = load float* %y.ptr2
       %sum = add float %x, %y
       ret float %sum
     }

Or, quite similarly, accessor functions also to be supplied by the JIT/linker:

     declare float @obj.x(i8* %obj)
     declare float @obj.y(i8* %obj)

     define float @xPlusY(i8* %obj) {
     entry:
       %x = call float @obj.x(i8* %obj);
       %y = call float @obj.y(i8* %obj);
       %sum = add float %x, %y
       ret float %sum
     }

In either case, an optimization pass could trivially zero out the overhead with no need to modify LLVM.
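
For instance, if the JIT eventually supplies @obj.x.offs = 8 and
@obj.y.offs = 12 (numbers picked out of thin air), constant propagation
would fold the first form down to something like:

     define float @xPlusY(i8* %obj) {
     entry:
       %x.ptr = getelementptr i8* %obj, i32 8
       %x.ptr2 = bitcast i8* %x.ptr to float*
       %x = load float* %x.ptr2
       %y.ptr = getelementptr i8* %obj, i32 12
       %y.ptr2 = bitcast i8* %y.ptr to float*
       %y = load float* %y.ptr2
       %sum = add float %x, %y
       ret float %sum
     }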

Chris Lattner wrote:

I don't think this is really the right way to go. Can you give an
example snippet of Java code that would need this and what you are
proposing? With a concrete example that shows how the lazy class
loading stuff works we can talk about different possibilities,

Field operations in Java (getfield, putfield, getstatic, putstatic) do
_not_ need what I'm proposing. What I'm proposing is purely performance
related (just as method patching in callbacks is an optimization to
avoid calling the callback every time).

Here's a simple example: consider class One:

public class One {
  double a;
}

and class Two:
public class Two {
   static double getDoubleFromOne(One arg) {
        return arg.a;
   }
}

Here's the bytecode generated for the getDoubleFromOne method:

aload_0
getfield "One", "a"
dreturn

What happens in Java is that types are created lazily, which means you
can compile getDoubleFromOne without knowing the layout of the class
One. Therefore, if you compile getDoubleFromOne without the layout
information, the compiler will generate the code:

r2 = arg0
r1 = getElementOffsetOf("One", "a");
return r1(r2);

getElementOffsetOf will trigger resolution of the class "One" and return
some offset for "a".

My goal here is to avoid the call to getElementOffsetOf on subsequent
executions of getDoubleFromOne. One solution is to recompile
getDoubleFromOne now that we have the layout information, but this may
not be desirable (it increases compilation time, and what if we have
many getElementOffsetOf calls in one method which belong to "if"
statements, etc.).

What I would like to do is for getElementOffsetOf to dynamically patch
the native code so that after the execution of getElementOffsetOf in the
getDoubleFromOne method the native code looks like:

r2 = arg0
nop
return offset(r2)

The problem is that getElementOffsetOf does not know where the code for
loading the field is. The compiler may have moved it past some other
instructions, or there may be some move instructions inserted for
register spills, etc.

Thus the introduction of the llvm.getfield* intrinsic :slight_smile:
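
To make that concrete, getDoubleFromOne with such an intrinsic could
look roughly like this (the intrinsic name, its signature, and the way
the class and field names are passed are only a sketch):

     @.str.One = internal constant [4 x i8] c"One\00"
     @.str.a = internal constant [2 x i8] c"a\00"

     ; hypothetical intrinsic, lowered by the JIT to a patchable callback + load
     declare double @llvm.getfield.f64(i8*, i8*, i8*)

     define double @getDoubleFromOne(i8* %arg) {
     entry:
       %cls = getelementptr [4 x i8]* @.str.One, i32 0, i32 0
       %fld = getelementptr [2 x i8]* @.str.a, i32 0, i32 0
       %a = call double @llvm.getfield.f64(i8* %arg, i8* %cls, i8* %fld)
       ret double %a
     }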

Hi Gordon,

Gordon Henriksen wrote:

Nicolas,

I guess you're trying to solve the fragile base-class problem by
deferring field offset calculations until JIT compilation time?

No. I'm deferring field offset calculation until _execution_ time.

Perhaps I'm missing something, but can't you accomplish this by using
external constants in the source program, to be supplied at JIT/link
time?
  
If the JIT could make a callback to dynamically resolve the type, that's
the idea (I'm thinking, if a field can have a GhostLinkage, can the
compiler generate a callback? In this case I don't need a new intrinsic).

Or, quite similarly, accessor functions also to be supplied by the JIT/
linker:
  
You lose performance here, because you have to call the accessor
function each time you call the xPlusY function. My goal is to patch the
native code so that you do not need to call these accessor functions again.

     declare float @obj.x(i8* %obj)
     declare float @obj.y(i8* %obj)

     define float @xPlusY(i8* %obj) {
     entry:
       %x = call float @obj.x(i8* %obj);
       %y = call float @obj.y(i8* %obj);
       %sum = add float %x, %y
       ret float %sum
     }

After the first execution of xPlusY, the new native code of xPlusY
(shown as LLVM for readability; imagine there are also some nops
replacing the previous accessor calls) would be:

%x = offsetx(%obj)
%y = offsety(%obj)
%sum = add float %x, %y
ret float %sum

Thanks,
Nicolas

Just run the inliner.

— Gordon

Gordon Henriksen wrote:

Just run the inliner.
  
No. :wink: The inliner would require recompiling the function, and I don't
want that. And recall that at compilation time, I do not have the type
layout. It's only at execution time that I have it.

Nicolas

I'm missing something here. If you lazily compile getDoubleFromOne the first time it is called, it seems like you are guaranteed to have layout information, as an instance of "One" is required to be passed in for this to run.

I guess a better example would be:

static void x() {
   new One().a = 4;
}

or something like that. The best way (I think) to handle this is to either:

1. interpret the first time through the code, then jit compile after the class is loaded, or:
2. jit compile code that is slower than need be (using function calls to cause the lazy stuff to happen) and then replace it when the class is loaded.

-Chris

Chris Lattner wrote:

I'm missing something here. If you lazily compile getDoubleFromOne the
first time it is called, it seems like you are guaranteed to have layout
information, as an instance of "One" is required to be passed in for this
to run.
  
Right, you got me :wink: Thanks for giving a better example.

1. interpret the first time through the code, then jit compile after the
class is loaded, or:
  
Not the best way to do it: what if you have many field operations on many
classes in one method? Would you keep interpreting the function until all
the field operations are resolved?

2. jit compile code that is slower than need be (using function calls to
cause the lazy stuff to happen) and then replace it when the class is
loaded.
  
You mean replace the code at the IR level, right? And then recompile the
function. You then have the same issue as 1).

By the way, this is a classic optimization in Java that all JIT-based VMs
do (sorry, I don't have links here).

Nicolas

Not the best way to do it: what if you have many field operations on many
classes in one method? Would you keep interpreting the function until all
the field operations are resolved?

right

2. jit compile code that is slower than need be (using function calls to
cause the lazy stuff to happen) and then replace it when the class is
loaded.

You mean replace the code at the IR level, right? And then recompile the
function. You then have the same issue as 1).

I don't see how it's the same issue. Once you jit it, it is just machine code. Updating the machine code with better code seems like a reasonable and obvious optimization.

-Chris

Chris Lattner wrote:

I don't see how it's the same issue. Once you jit it, it is just machine
code. Updating the machine code with better code seems like a reasonable
and obvious optimization.

That's what I want to do! :wink: However, from what I understand, your
solution is to recompile the method. And I don't want to do that. I only
want to dynamically patch the field operation in the native code.

Chris Lattner wrote:

I don't see how it's the same issue. Once you jit it, it is just machine
code. Updating the machine code with better code seems like a reasonable
and obvious optimization.

That's what I want to do! :wink: However, from what I understand, your
solution is to recompile the method. And I don't want to do that. I only
want to dynamically patch the field operation in the native code.

maybe a tradeoff is possible:
the function to get the offset is replaced by a function pointer and a stub (avoiding many of the general problems involved with using self-modifying-code).

the first time it is called, the function pointer points to 'stub A', which calls the function to look up the slot offset;
this function then stores the value in a variable, and updates the function pointer to point to 'stub B'.

'stub B' simply returns the value stored in the variable.

this should not be too difficult to implement I would think (albeit admittedly I still don't know a whole lot about LLVM).
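
in LLVM IR terms the idea would be roughly this (all names made up, and
@One.a.resolve stands in for whatever runtime call actually computes the
offset):

     @One.a.offs = internal global i32 0                  ; cache for the resolved offset
     @One.a.get = internal global i32 ()* @One.a.stubA    ; function pointer, starts at stub A

     declare i32 @One.a.resolve()                         ; stands in for the runtime lookup

     define internal i32 @One.a.stubA() {
     entry:
       %offs = call i32 @One.a.resolve()                  ; triggers class resolution
       store i32 %offs, i32* @One.a.offs                  ; remember the offset
       store i32 ()* @One.a.stubB, i32 ()** @One.a.get    ; later calls go straight to stub B
       ret i32 %offs
     }

     define internal i32 @One.a.stubB() {
     entry:
       %offs = load i32* @One.a.offs                      ; just return the cached offset
       ret i32 %offs
     }

a getfield would then load the pointer from @One.a.get and call through
it, using the returned offset for its address computation.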

hope this is of some use, in any case.

BGB wrote:

maybe a tradeoff is possible:
the function to get the offset is replaced by a function pointer and a stub
(avoiding many of the general problems involved with using
self-modifying-code).
  
For me there is no problem with self-modifying code (the LLVM JIT
already does it).

the first time it is called, the function pointer points to 'stub A', which
calls the function to look up the slot offset;
this function then stores the value in a variable, and updates the function
pointer to point to 'stub B'.

'stub B' simply returns the value stored in the variable.

That's again what I want to avoid. This is my current implementation,
and I _really_ would like to avoid unnecessary calls once the type is
resolved.

BGB wrote:

maybe a tradeoff is possible:
the function to get the offset is replaced by a function pointer and a stub
(avoiding many of the general problems involved with using
self-modifying-code).

For me there is no problem with self-modifying code (the LLVM JIT
already does it).

there are 'further' problems beyond simply having a JIT generate the code...

the first time it is called, the function pointer points to 'stub A', which
calls the function to look up the slot offset;
this function then stores the value in a variable, and updates the function
pointer to point to 'stub B'.

'stub B' simply returns the value stored in the variable.

That's again what I want to avoid. This is my current implementation,
and I _really_ would like to avoid unnecessary calls once the type is
resolved.

ok, here is another option then:
the stub is set up to manually modify the code near the return address (may be arch specific and require spiffy use of assembler).

an example is this (assuming x86):
for the stub, as opposed to executing a 'ret', you have something like this in its place:
pop edx
mov byte [edx-5], 0xB8 ;mov eax, imm32
mov dword [edx-4], eax ;store return value in imm
jmp edx

this will replace the 'call' instruction that made the call with a mov, so further runs will not invoke a call (note: be very sure this stub is only ever reached via a direct 5-byte 'call' instruction).

(in LLVM, though I am not certain as I don't know much of the internals, this could require either tweaking the codegen, or hand-crafted stub generation).

now, how you would go about implementing this is up to you.

BGB wrote:

ok, here is another option then:
the stub is set up to manually modify the code near the return address (may
be arch specific and require spiffy use of assembler).

That's not another option. That's what I want to do.

now, how you would go about implementing this is up to you.
  
And I need to know if I can currently do that with LLVM. Thus my initial
question :slight_smile:

for x86:
I presume LLVM supports inline assembler, and also uses the traditional frame pointer-based stack layout, but I don't know the details (people who know LLVM could probably be more helpful here).

ok, maybe here is an option:
after getting the value (in the created stub), execute a chunk of inline assembler to modify the caller (note: will probably need to modify inline asm syntax, as compilers tend to differ on this point...).

something like this (figure out sort of how it works, and maybe mutate into something usable):
int lookup_stub()
{
    int i;

    i=getElementOffsetOf("One", "a");

    asm {
        mov eax, i //return value
        mov edx, [ebp+4] //return addr
        mov byte [edx-5], 0xB8 //mov eax, imm32
        mov [edx-4], eax //store return value in imm
    }

    return(i);
}

BGB wrote:

for x86:
I presume LLVM supports inline assembler, and also uses the traditional
frame pointer-based stack layout, but I don't know the details (people who
know LLVM could probably be more helpful here).

ok, maybe here is an option:
after getting the value (in the created stub), execute a chunk of inline
assembler to modify the caller (note: will probably need to modify inline
asm syntax, as compilers tend to differ on this point...).

Thanks BGB, but at this point I can handle it ;-). The problem is not
how to do it in theory (patch the caller), but how to do it within the
LLVM code generation process. There are three steps for that:

1) How to represent the stub in the IR? An intrinsic? An external field
with a ghost linkage?
2) At which point in LLVM can we be sure there will be no more
optimization, so that the field operation stays next to the stub call?
3) Make LLVM think the stub call is not a real call, so that
caller-saved registers do not get clobbered.

Thanks BGB, but at this point I can handle it ;-). The problem is not
how to do it in theory (patch the caller), but how to do it within the
LLVM code generation process. There are three steps for that:

1) How to represent the stub in the IR? An intrinsic? An external field
with a ghost linkage?

Probably an intrinsic. The trick is making it general purpose, not language specific.
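
For instance (purely illustrative, not an existing intrinsic), a
language-neutral shape would be something that knows nothing about
classes or fields, only about a lazily resolved constant and a resolver
the runtime provides:

     ; illustrative only: the runtime supplies a resolver; the first execution
     ; calls it, and the JIT patches the call site so later executions see the
     ; resolved value directly
     declare i32 @llvm.lazy.value.i32(i32 ()* %resolver)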

2) At which point in LLVM can we be sure there will be no more
optimization, so that the field operation stays next to the stub call?

You'll probably want to do this late in the code generator.

3) Make LLVM think the stub call is not a real call, so that
caller-saved registers do not get clobbered.

LLVM supports calling conventions like "coldcc". The idea of coldcc is that the function is compiled so that a call to it clobbers as few registers as possible (in effect, the normally caller-saved registers are treated as callee-saved). However, in practice, no current code generator does anything special with it; it's compiled like fastcc or ccc.
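
For reference, the convention is just spelled out on the declaration and
at each call site, e.g. (the resolver name here is only a placeholder):

     declare coldcc i32 @One.a.resolve()        ; hypothetical resolver, marked coldcc

     define i32 @example() {
     entry:
       ; the call site has to use the same calling convention as the callee
       %offs = call coldcc i32 @One.a.resolve()
       ret i32 %offs
     }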

-Chris

Hi Chris,

Chris Lattner wrote:

Thanks BGB, but at this point I can handle it ;-). The problem is not
how to do it in theory (patch the caller), but how to do it within the
LLVM code generation process. There are three steps for that:

1) How to represent the stub in the IR? An intrinsic? An external field
with a ghost linkage?
    
Probably an intrinsic. The trick is making it general purpose, not
language specific.

I'm not sure actually how I could make that language specific ;). OK,
currently only Java does that (from what I know), but intrinsics like
getElementDouble or setElementDouble do look general purpose. Unless
I'm too Java-minded?

2) At which point in LLVM can we be sure there will be no more
optimization, so that the field operation stays next to the stub call?
    
You'll probably want to do this late in the code generator.
  
So, when instructions are lowered to target-specific instructions?

3) Make LLVM think the stub call is not a real call, so that
caller-saved registers do not get clobbered.
    
LLVM supports calling conventions like "coldcc". The idea of coldcc is
that the function is compiled so that a call to it clobbers as few
registers as possible (in effect, the normally caller-saved registers are
treated as callee-saved). However, in practice, no current code generator
does anything special with it; it's compiled like fastcc or ccc.

As long as there is room for it, and obviously coldcc is here for that,
it works fine for me :slight_smile:

Thanks,
Nicolas

I'm not sure actually how I could make that language specific ;). OK,
currently only Java does that (from what I know), but intrinsics like
getElementDouble or setElementDouble do look general purpose. Unless
I'm too Java-minded?

Focus on the mechanism behind what you want to do, not on what you want to do itself :). Think about a way to separate what you want to accomplish from how it gets done.

You'll probably want to do this late in the code generator.

So, when instructions are lowered to target-specific instructions?

yep,

-Chris