A weird, reproducible problem with MCJIT

I switched my Common Lisp compiler to use MCJIT over the weekend and ran
into a weird problem compiling one particular function.

It crashes with an EXC_BAD_ACCESS error in MCJIT::finalizeObject when
calling processFDE.

The weird part is that the function does not appear to do anything
special. I've whittled it down to the minimum size that still causes
the crash. If I remove even one statement it compiles fine. Note: the
function doesn't make much sense anymore, but it does compile fine.
It does have a lot of nested scopes.

I can single-step through processFDE and I see it read a Length of 1
and then a Length of 0x1000000 (16777216). Clearly something has
been corrupted.
   
Here is the top of the backtrace from lldb:

Hi Christian,

Thanks for sharing this.

Yaron Keren has been investigating some problems in the EH frame registration code recently, and I think this may be related. It at least sounds similar to the type of variations in behavior based on code size that Yaron was seeing.

-Andy

Hi,

I had similar problems with EH in ELF in RTDyldMemoryManager::registerEHFrames() calling __register_frame().

I’m not sure these problems are related to this problem since your crash happens in RuntimeDyldMachO::registerEHFrames() in its own processFDE (there are two functions named processFDE(), one in RuntimeDyldMachO.cpp and one in RTDyldMemoryManager.cpp) before RTDyldMemoryManager::registerEHFrames() and __register_frame() are called.

It would seem that even if RTDyldMemoryManager::registerEHFrames() and __register_frame() got problematic input (as with the ELF dynamic linker), it should not cause a crash in the calling code but rather either a malfunction of exceptions or a crash in RTDyldMemoryManager::registerEHFrames() / __register_frame(). A crash like the one you see should be related to the inputs of RuntimeDyldMachO::registerEHFrames() only.

Yaron

Hi,

One possibility (discussed on IRC) is that zero-sized atoms are being created where basic blocks (BBs) contain only a single 'unreachable'.
This situation (BBs with only 'unreachable') occurs in a number of places in Christian's code.

If this were the case (on OSX, with ld64) then ld64 would complain about it.
I'm not sure what happens if a similar situation is presented to MCJIT.

Also, now that I see the debugging comment below, it seems that the size might be '1' rather than 0, so this is perhaps a red herring.

cheers
Iain

Andrew,

Thanks for following up.
Some people on IRC suggested that perhaps my BasicBlocks that contain
only an "unreachable" IR instruction were being removed and two
BasicBlock labels were getting the same address. I inserted a "call
@unreachableError()" instruction before every "unreachable" instruction
that I generated and that caused the function below to compile fine but
another function now exhibits the same crash.

It seems very sensitive to the size of something within the function.

Another thing: I inserted printf statements into processFDE to print
the lengths of the FDEs each time it is called. Here's what it looks
like when everything compiles and finalizeObject doesn't crash:

"/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:81 In registerEHFrames P = 0x10975ec20
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 28/1c
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 68/44
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 44/2c
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 68/44
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 68/44
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 68/44
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 52/34
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 20/14
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 28/1c
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 28/1c
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 44/2c
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 36/24
"Loading bitcode file: /Users/meister/Development/cando/brcl/src/lisp/kernel/lsp/setf.bc
"/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:81 In registerEHFrames P = 0x231875000
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 28/1c
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 68/44
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 44/2c
... and so on

Here's what it looks like when it crashes:
"" In environment: COMMON-LISP:NIL
"/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:81 In registerEHFrames P = 0x27ddf71b8
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 1/1
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 16777216/1000000
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 17267682/1077be2
/Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:26 processFDE Length = 1515870810/5a5a5a5a
Process 85800 stopped
* thread #1: tid = 0x59640d, 0x0000000106627f2c libLLVM-3.4svn.dylib`llvm::processFDE(unsigned char*, long, long) + 44 at /Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:25, stop reason = EXC_BAD_ACCESS (code=1, address=0x2da414805)
    frame #0: 0x0000000106627f2c libLLVM-3.4svn.dylib`llvm::processFDE(unsigned char*, long, long) + 44 at /Users/meister/Development/cando/brcl/externals/src/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldMachO.cpp:25
   22 namespace llvm {
   23
   24 static unsigned char *processFDE(unsigned char *P, intptr_t DeltaForText, intptr_t DeltaForEH) {
-> 25 uint32_t Length = *((uint32_t*)P);
   26 printf("%s:%d processFDE Length = %u/%x\n", __FILE__, __LINE__, Length, Length);
   27 P += 4;
   28 unsigned char *Ret = P + Length;

Notice that the "Length" field of the very first FDE has the value 1,
which is crazy. After that the lengths are garbage and it crashes.
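To make the failure mode concrete, here is a minimal sketch of a bounds-checked walk over 32-bit length-prefixed records, which is the layout processFDE steps through. The helper name walkRecordLengths and its signature are hypothetical and not part of LLVM, and the sketch ignores the 64-bit extended-length form. One guess that is consistent with the log above: if the first record is a CIE whose length word was overwritten with 1, the next little-endian read starts at offset 5 and picks up three zero bytes of the CIE id followed by the version byte 1, which would give exactly 0x1000000.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper, not LLVM code: walks records laid out as
// [uint32 length][length bytes of payload] inside [data, data + size).
// Returns the lengths that were read. Sets ok to false as soon as a
// record claims to extend past the end of the buffer.
std::vector<uint32_t> walkRecordLengths(const uint8_t *data, size_t size,
                                        bool &ok) {
  std::vector<uint32_t> lengths;
  size_t offset = 0;
  ok = true;
  while (offset + 4 <= size) {
    uint32_t length;
    // memcpy avoids an unaligned load through a casted pointer.
    std::memcpy(&length, data + offset, sizeof(length));
    if (length == 0)
      break; // a zero length word acts as the terminator
    if (length > size - offset - 4) {
      ok = false; // corrupted length: stop instead of reading wild memory
      break;
    }
    lengths.push_back(length);
    offset += 4 + length;
  }
  return lengths;
}
```

Run over a buffer whose first length word has been clobbered, a walk like this reports the corruption after the first record instead of following garbage lengths off the end of the section.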

I'm trying to track down where the FDE entries are created - I don't
know the code well at all.

I can run any other tests you suggest. This problem is reproducible.

Best,

.Chris.

"Kaylor, Andrew" <andrew.kaylor@intel.com> writes:

Yaron,

Did you find a way around the problem?

It looks like the problem comes before processFDE because by the time it
gets to processFDE the eh_frame data is already corrupted.

Do ELF and Mach-O share the same eh_frame format?

I am developing this code in parallel on an Ubuntu Linux system but I
haven't tried to run it on there for a couple of weeks. I'll bring it
up to date and try my test case on it and we'll see what happens.

Best,

.Chris.

Yaron Keren <yaron.keren@gmail.com> writes:

Hi,

There may be two problems with __register_frame usage. However based on

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-April/061768.html

I think the existing code is correct for OS-X but likely buggy for Linux
and Windows systems.

Your crash is on OS-X, right?

Anyhow, the first problem is very easy to fix. On Linux and Windows (at
least) __register_frame should be called once, not once per FDE as
processFDE in RTDyldMemoryManager.cpp does.

So RTDyldMemoryManager::registerEHFrames was modified to:

void RTDyldMemoryManager::registerEHFrames(uint8_t *Addr,
                                           uint64_t LoadAddr,
                                           size_t Size) {
  __register_frame(Addr);
}

On Windows 7 / MinGW (gcc) this completely solved the problems I had with
erratic exception behaviour.

The second issue is a bit more complicated. With executable files, the
linker combines the .eh_frame sections and appends four zero bytes from
crtend to mark the end of the .eh_frame section. As Rafael writes, this
can't be done in codegen since it's a linker function, done when all
.eh_frame sections are combined.

The dynamic linker must perform the same function, or else
__register_frame(.eh_frame) might continue processing past the end of
.eh_frame, depending on whether four zero bytes happen to follow it.

However, this again isn't likely to be the source of your problem, as
__register_frame on OS-X processes one FDE at a time, and the calling
function processFDE() in RTDyldMemoryManager.cpp does know the size of
eh_frame, so it will not overrun the frame.

The solution would be to allocate a larger buffer and copy .eh_frame into
it with four zero bytes appended. This buffer needs to live for as long as
it is registered in the runtime library.
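A minimal sketch of that buffer idea, assuming the section is available as a plain byte range. The helper name makeTerminatedEHFrame is hypothetical and not an LLVM function, and the sketch leaves out the actual __register_frame call:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not LLVM code: returns a copy of the .eh_frame
// bytes followed by four zero bytes that act as the end marker.
// The caller must keep the returned vector alive (and must not resize
// it) for as long as its data pointer stays registered.
std::vector<uint8_t> makeTerminatedEHFrame(const uint8_t *addr, size_t size) {
  std::vector<uint8_t> buf(size + 4, 0);   // zero-filled, 4 bytes larger
  std::copy(addr, addr + size, buf.begin()); // original section contents
  return buf;
}
```

The pointer handed to the runtime would then be the data pointer of the returned buffer rather than the original section address, so the buffer's lifetime has to be tied to the lifetime of the registration.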

Yaron

With the help of iain@codesourcery.com and andrew.kaylor@intel.com we
tracked the problem down to a bad relocation that was clobbering the
first bytes of the eh_frame. I think this problem/solution may be OS X
specific.

On akaylor's suggestion I made the change below and my reproducible test
case now compiles fine with MCJIT.

As well, my Common Lisp code base now compiles using MCJIT - that's about 1,000
functions at one MCJIT module per function.

In llvm/lib/ExecutionEngine/RuntimeDyld

Index: RuntimeDyldImpl.h

Whether it is correct or not I don’t know, but this change will affect all x86-64 targets, including Linux and Windows, as getMaxStubSize() is called from the ELF linker as well as the Mach-O linker.

Yes, you are correct Yaron. Before we commit this we ought to put a check in to see what the target OS is. I just suggested the change below as a quick and easy way to verify that this was the cause of the problem. I’ll clean it up.

-Andy

If I spoke incorrectly about which systems this problem/change affects, I
apologize. I'll leave it to Andrew to determine that.

Best,

.Chris.

"Kaylor, Andrew" <andrew.kaylor@intel.com> writes:

This issue should be fixed in r192737.

I moved the getMaxStubSize and getStubAlignment functions into the RuntimeDyldELF and RuntimeDyldMachO classes since the code that generates the stubs is specific to those classes and a difference between them is what led to this problem.

In the case of RuntimeDyldMachO I removed the stub size handling for architectures that are not currently handled in that class because I thought it would be best not to have artifacts that might seem to be correct (but possibly weren't) hanging around if someone later added support for one of these architectures.
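The shape of that refactoring can be sketched roughly as follows. This is an illustration only: the class names, the Arch enum, and the byte counts are hypothetical placeholders and are not taken from r192737.

```cpp
#include <cassert>

// Illustrative only: Arch, LinkerBase, ElfLinker and MachOLinker are
// hypothetical stand-ins, and the stub sizes are placeholder values.
enum class Arch { X86_64, Other };

class LinkerBase {
public:
  explicit LinkerBase(Arch A) : TheArch(A) {}
  virtual ~LinkerBase() = default;
  // Each object format answers for itself, since each one generates
  // its own kind of stub.
  virtual unsigned getMaxStubSize() const = 0;

protected:
  Arch TheArch;
};

class ElfLinker : public LinkerBase {
public:
  using LinkerBase::LinkerBase;
  unsigned getMaxStubSize() const override {
    return TheArch == Arch::X86_64 ? 6 : 0; // placeholder size
  }
};

class MachOLinker : public LinkerBase {
public:
  using LinkerBase::LinkerBase;
  unsigned getMaxStubSize() const override {
    // Architectures this format does not handle report no stub space.
    return TheArch == Arch::X86_64 ? 8 : 0; // placeholder size
  }
};
```

With the query made virtual per format, a stub-size change for one object format can no longer silently alter the space reserved by the other.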

-Andy