nested function's static link gets clobbered

Fellow developers,

I’m parallelizing loops to be called by pthread. The thread body that I pass to pthread_create looks like

define i8* @loop1({ i32*, i32* }* nest %parent_frame, i8* %arg)

where parent_frame is a pointer to the shared variables in the original function. It compiles to:

0x00007f0de11c41f0: mov (%r10),%rax
0x00007f0de11c41f3: cmpl $0x63,(%rax)
0x00007f0de11c41f6: jg 0x7f0de11c420c
0x00007f0de11c41fc: mov 0x8(%r10),%rax
0x00007f0de11c4200: incl (%rax)
0x00007f0de11c4202: mov (%r10),%rax
0x00007f0de11c4205: incl (%rax)
0x00007f0de11c4207: jmpq 0x7f0de11c41f0
0x00007f0de11c420c: xor %rax,%rax
0x00007f0de11c420f: retq

I use init_trampoline to generate code that sets up the static link:

0x00007fffee982316: mov $0x7f48e1a08fb0,%r11
0x00007fffee982320: mov $0x7fffee982330,%r10    # the static link
0x00007fffee98232a: rex.WB jmpq *%r11

The program crashes in loop1 on the second instruction: r10, which should hold the static link, no longer has the value set by the trampoline.

Upon closer inspection, it looks like the trampoline first jumps to a stub that compiles loop1:

0x00007f48e1a08fb0: mov $0x5c61c0,%r10
0x00007f48e1a08fba: callq *%r10
0x00007f48e1a08fbd: int $0x0

But that clobbers r10, which loop1 needs. According to the x86-64 ABI, r10 isn’t preserved across function calls, but here it needs to be. Is there any way to force LLVM to do that? I tried telling lli to compile the entire program (-no-lazy) so that the stub won’t be generated, but it gives the error:

LLVM JIT requested to do lazy compilation of function ‘_Z41__static_initialization_and_destruction_0ii’ when lazy compiles are disabled!

Any ideas?

Note, I had to compile lli with -z execstack in order for trampolines on the stack to work.

Hmm, lli.cpp does this:

  if (NoLazyCompilation)
    EE->DisableLazyCompilation();
  [....]
  // Run static constructors.
  EE->runStaticConstructorsDestructors(false);

  if (NoLazyCompilation) {
    for (Module::iterator I = Mod->begin(), E = Mod->end(); I != E; ++I) {
      Function *Fn = &*I;
      if (Fn != MainFn && !Fn->isDeclaration())
        EE->getPointerToFunction(Fn);
    }
  }

If you actually have static constructors and destructors, then -no-lazy may
not work. You could try moving the runStatic... call below the NoLazy block,
but compiling the functions themselves might need those constructors to
have run already.

The easiest way out seems to be to move the DisableLazyCompilation call to
just after the static constructors have run.

Best regards,
--Edwin

Hi,

I'm parallelizing loops to be called by pthread. The thread body that I pass
to pthread_create looks like

define i8* @loop1({ i32*, i32* }* nest %parent_frame, i8* %arg)
parent_frame is pointer to shared variables in original function

0x00007f0de11c41f0: mov (%r10),%rax
0x00007f0de11c41f3: cmpl $0x63,(%rax)
0x00007f0de11c41f6: jg 0x7f0de11c420c
0x00007f0de11c41fc: mov 0x8(%r10),%rax
0x00007f0de11c4200: incl (%rax)
0x00007f0de11c4202: mov (%r10),%rax
0x00007f0de11c4205: incl (%rax)
0x00007f0de11c4207: jmpq 0x7f0de11c41f0
0x00007f0de11c420c: xor %rax,%rax
0x00007f0de11c420f: retq

I use init_trampoline to generate code that sets up the static link:

0x00007fffee982316: mov $0x7f48e1a08fb0,%r11
0x00007fffee982320: mov $0x7fffee982330,%r10    # the static link
0x00007fffee98232a: rex.WB jmpq *%r11

The program crashes in loop1 on the 2nd instruction. r10, which contained
the static link was different from the value set by the trampoline.

Upon closer inspection, it looks like the trampoline first jumps to a stub
that compiles loop1:

0x00007f48e1a08fb0: mov $0x5c61c0,%r10
0x00007f48e1a08fba: callq *%r10
0x00007f48e1a08fbd: int $0x0

But that clobbers r10, which loop1 needs. According to the x86-64 ABI, r10
isn't preserved across function calls, but here it needs to be. Is there any
way to force LLVM to do that?

You must be the first person to try using nest functions with the JIT :)
If you look in X86JITInfo.cpp, in the function X86JITInfo::emitFunctionStub,
you will see the code generating the stub and using r10. I think the right
solution is to change r10 to a different call clobbered register. It would
also be possible to have the trampoline use a different register, but since
the x86-64 ABI explicitly states that r10 should be used for the static chain,
I'd rather not.
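
To make the register swap concrete, here is a rough sketch of the bytes involved. It is not the code in X86JITInfo.cpp: encode_stub and ScratchReg are hypothetical names made up for illustration, and the sketch only reproduces the 15-byte stub from the disassembly above (mov imm64, call, int $0x0) with the scratch register as a parameter. It also assumes r11 is free at that point, which looks plausible because the trampoline is done with r11 once it has jumped through it. Whether the rest of the JIT (for example the compilation callback) leaves r10 alone is not checked here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical selector for the stub's scratch register. The value is the
// low three bits of the register number (r10 -> 2, r11 -> 3); the fourth
// bit is supplied by the REX.B prefix bit.
enum class ScratchReg : std::uint8_t { R10 = 2, R11 = 3 };

// Hypothetical helper: encodes
//   mov  $callback, %reg
//   call *%reg
//   int  $0x0
// which is the shape of the lazy-compilation stub shown in the thread.
std::vector<std::uint8_t> encode_stub(std::uint64_t callback, ScratchReg reg) {
    const std::uint8_t r = static_cast<std::uint8_t>(reg);
    assert(r == 2 || r == 3);  // only r10 and r11 are modelled here

    std::vector<std::uint8_t> out;
    out.push_back(0x49);                                // REX.W + REX.B
    out.push_back(static_cast<std::uint8_t>(0xB8 + r)); // mov $imm64, %reg
    for (int i = 0; i < 8; ++i)                         // little-endian imm64
        out.push_back(static_cast<std::uint8_t>(callback >> (8 * i)));
    out.push_back(0x41);                                // REX.B
    out.push_back(0xFF);                                // opcode group 5
    out.push_back(static_cast<std::uint8_t>(0xD0 + r)); // ModRM: mod=11, /2 = call, rm=reg
    out.push_back(0xCD);                                // int
    out.push_back(0x00);                                // $0x0
    return out;
}
```

With ScratchReg::R10 this gives 49 BA ... 41 FF D2 CD 00, matching the stub at 0x00007f48e1a08fb0; with ScratchReg::R11 only two bytes change (BA becomes BB, D2 becomes D3) and r10 is left untouched for the static chain.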

I'm also wondering about the x86-32 case. There are no comments in the
JIT stub code in this case, so I'm not sure which register it is using.
The problem with x86-32 is that there are so few registers, and for some
calling conventions there is only one spare call clobbered register
available. This is used by trampolines, so if it's also used by the JIT,
which is almost surely the case, that will cause trouble. Even worse,
it looks like the JIT is wrong even without trampolines, because for
the C and X86_StdCall conventions it is ECX that is spare, while for
X86_FastCall and Fast it is EAX. Yet the JIT always uses the same
hardwired code, and does not adjust according to the calling convention.
So presumably it is broken for one of these sets of calling conventions.

Hopefully Anton can comment on this.

I tried telling lli to compile the entire program
(-no-lazy) so that the stub won't be generated, but gives the error:

LLVM JIT requested to do lazy compilation of function
'_Z41__static_initialization_and_destruction_0ii' when lazy compiles are
disabled!

Any ideas?

Note, I had to compile lli with -z execstack in order for trampolines on the
stack to work.

Maybe lli can be taught to mark itself as having an executable stack when
it sees a trampoline. I'm not sure how this can best be done. On Linux
I guess it can be done using mmap.
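
As a related illustration (a different technique from making the stack executable, and not something lli or init_trampoline does), the trampoline bytes could live in a page obtained from mmap and then be made executable with mprotect. The sketch below assumes a POSIX system with MAP_ANONYMOUS; make_trampoline_page and put_imm64 are hypothetical names, and the byte layout simply copies the 23-byte trampoline from the disassembly above. It only builds the page; it does not jump into it.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Writes a 64-bit immediate in little-endian order, as x86-64 expects.
static unsigned char *put_imm64(unsigned char *p, std::uint64_t v) {
    for (int i = 0; i < 8; ++i)
        *p++ = static_cast<unsigned char>(v >> (8 * i));
    return p;
}

// Hypothetical helper: builds
//   mov  $target, %r11
//   mov  $static_link, %r10
//   jmpq *%r11
// in a fresh page and returns it read+execute, or nullptr on failure.
void *make_trampoline_page(std::uint64_t target, std::uint64_t static_link) {
    unsigned char code[23];
    unsigned char *p = code;
    *p++ = 0x49; *p++ = 0xBB;               // mov $imm64, %r11
    p = put_imm64(p, target);
    *p++ = 0x49; *p++ = 0xBA;               // mov $imm64, %r10
    p = put_imm64(p, static_link);
    *p++ = 0x49; *p++ = 0xFF; *p++ = 0xE3;  // rex.WB jmpq *%r11
    assert(p == code + sizeof code);

    const long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return nullptr;
    const std::size_t len = static_cast<std::size_t>(page_size);

    // Map writable first, copy the code in, then drop write for execute.
    void *mem = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return nullptr;
    std::memcpy(mem, code, sizeof code);
    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return nullptr;
    }
    return mem;
}
```

The catch is lifetime: a stack trampoline disappears with its frame, whereas a page like this has to be released with munmap by whoever created it.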

Ciao,

Duncan.

I admit I got carried away with trying to use an extra static link when the arg parameter would have sufficed. Using a static link is probably still the better idea: if a function has more than one loop to parallelize, the loops would share the same parent frame struct but might each have a separate struct describing their parameters.
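
For reference, the arg-only variant would look roughly like this when written by hand. This is just a sketch, not my generated code: ParentFrame, Loop1Args and run_loop1_once are names I made up here, the frame mirrors the { i32*, i32* } struct from the IR, and the loop body is my reading of the disassembly (increment both shared ints until the first exceeds 99). It runs a single thread, so there is no synchronization.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical mirror of the { i32*, i32* } parent frame.
struct ParentFrame {
    int *i;      // loop counter owned by the original function
    int *count;  // second shared variable
};

// Hypothetical per-loop parameter struct. It carries the frame pointer
// that the nest parameter would otherwise deliver in r10.
struct Loop1Args {
    ParentFrame *parent;
    int limit;   // 99 (0x63) in the disassembly
};

// Thread body with the plain pthread signature: no static link needed.
extern "C" void *loop1(void *arg) {
    Loop1Args *a = static_cast<Loop1Args *>(arg);
    ParentFrame *f = a->parent;
    while (*f->i <= a->limit) {
        ++*f->count;
        ++*f->i;
    }
    return nullptr;
}

// Runs loop1 on one thread and returns the final value of count.
int run_loop1_once() {
    int i = 0, count = 0;
    ParentFrame frame = {&i, &count};
    Loop1Args args = {&frame, 99};

    pthread_t t;
    int rc = pthread_create(&t, nullptr, loop1, &args);
    assert(rc == 0);
    rc = pthread_join(t, nullptr);
    assert(rc == 0);
    (void)rc;  // silence unused-variable warnings when NDEBUG is set
    return count;
}
```

A second loop in the same function would get its own args struct pointing at the same ParentFrame, which is the sharing the static link would have given for free.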

"you must be the first person to try using nest functions with the JIT :slight_smile: "

Well, this is a project in a dynamic optimization course. The JIT lacks a lot of things for this purpose like recompiling, patching old callers to refer to the new code, and deleting old machine code - currently, it just overwrites the old code with a branch to the new code and makes no attempt to patch the callers. We’ll probably come up with something more sophisticated and submit it.

“If you look in X86JITInfo.cpp, in the function X86JITInfo::emitFunctionStub,
you will see the code generating the stub and using r10”

I didn’t expect it to be that easy. I thought I needed to add special rules to the register allocator. I’ll take a look at it.