Illegal instruction exception (perhaps due to a heap corruption)

I maintain a library (GitHub - Tekenlight/luaffifb: FFI package for Lua 5.1 and Lua 5.2) which is a fork of an archived Facebook library (GitHub - facebookarchive/luaffifb: FFI package for Lua 5.1 and Lua 5.2).
The library exposes a Foreign Function Interface (FFI) to the C runtime for Lua programs, i.e. using it, it is possible to invoke C functions such as printf, scanf etc. from Lua files.

In the fork I have extended this capability to ARM64-based Linux (Ubuntu specifically), and that is working OK.

Currently I am working on porting this ARM64 extension to Apple M1 (ARM64 chip and OSX combination).

While doing this I have come across this issue (it looks like a defect in the library itself): when I run a test case, it sometimes results in an "Illegal instruction" signal and the program terminates.

The lldb output indicates that the generated machine instructions are correct.

@c.lua 23
ffi.c:3023
ffi.c:3035 L[2] = add_i8
ffi.c:3037 L[2] = add_i8
ffi.c:3038 L[add_i8] = 7,0x100268478
@c.lua 25 function: 0x600001700ac0 function
Process 35339 stopped

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0xd10103ff)
    frame #0: 0x0000000100268478
->  0x100268478: sub sp, sp, #0x40
    0x10026847c: stp x29, x30, [sp, #0x30]
    0x100268480: stp x23, x24, [sp, #0x20]
    0x100268484: stp x21, x22, [sp, #0x10]
Target 0: (lua) stopped.

(lldb) q

The address where the signal “illegal instruction” is generated has the correct code.

Any thoughts that could help me debug this would be appreciated.

In another run the register values were captured as follows (they all seem correct):

(lldb) register read
General Purpose Registers:
x0 = 0x0000000100808808
x1 = 0x0000000100304360
x2 = 0x0000000000000001
x3 = 0x0000000000000000
x4 = 0x0000000000000000
x5 = 0x0000000000000000
x6 = 0x000000000000000a
x7 = 0x0000000000000ec0
x8 = 0x0000000100268478
x9 = 0x0000600002100eb0
x10 = 0x0000000100025cb0 lua`___lldb_unnamed_symbol29$$lua
x11 = 0x0000000000000024
x12 = 0x0000000000000000
x13 = 0x0000000000000000
x14 = 0x0000000000000001
x15 = 0x0000000000000002
x16 = 0x0000000193de6358 libsystem_kernel.dylib`__error
x17 = 0x00000001ee651858 (void *)0x0000000193de6358: __error
x18 = 0x0000000000000000
x19 = 0x0000000100118060
x20 = 0x00000001000022c8 lua`main at lua.c:596
x21 = 0x00000001000c4070 dyld`dyld4::sConfigBuffer
x22 = 0x0000000000000000
x23 = 0x0000000000000000
x24 = 0x0000000000000000
x25 = 0x0000000000000000
x26 = 0x0000000000000000
x27 = 0x0000000000000000
x28 = 0x0000000000000000
fp = 0x000000016fdfda60
lr = 0x000000010000d508 lua`luaD_precall + 428 at ldo.c:449:9
sp = 0x000000016fdfd9e0
pc = 0x0000000100268478
cpsr = 0x20001000

So from the dump it seems to be crashing on the instruction sub sp, sp, #0x40, which shouldn’t be able to fault? From your description it sounds like you might be doing some JIT-like things, if so did you insert the required memory barriers after writing the instruction stream? See section K10.5.2 (barrier litmus tests) in the Architecture Reference Manual for the gory details.

If you were testing on a simpler Linux core, it might have been more tolerant of mistakes there (even more likely if you were emulating ARM-Linux on X86); the M1 is pretty aggressive in exploiting permitted optimizations in the memory model.

Other than that, are you aware that the MacOS ABI for varargs functions (like printf, scanf) is different from both normal functions and Linux?

Linux has a complex va_list struct that allows it to treat varargs calls exactly the same as normal ones. MacOS pushes any anonymous argument onto the stack (in 8-byte slots) so that va_list can just be a single pointer into the stack.

Not handling that difference could obviously lead to crashes, particularly in scanf though printf isn’t invulnerable. I wouldn’t quite expect the backtrace it looks like you’re getting though.
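
As a small illustration of what sits behind that difference, here is a sketch of an ordinary variadic function in C. The names sum_ints and va_list_size are made up for this example and are not part of luaffifb. The source code is identical on both systems; what changes is how the compiler lays out va_list and where the caller places the anonymous arguments, which is exactly what an FFI shim has to reproduce by hand.

```c
#include <stdarg.h>
#include <stddef.h>

/* Sum `count` int arguments passed through the "..." part of the call. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);   /* fetch the next anonymous argument */
    va_end(ap);
    return total;
}

/* Size of va_list for the platform this file is compiled for.
 * A multi-field struct is larger than a single pointer, so comparing
 * this value across platforms shows which scheme is in use. */
size_t va_list_size(void)
{
    return sizeof(va_list);
}
```

Printing va_list_size() on an ARM64 Linux build and on an Apple ARM64 build should show the struct-versus-pointer difference described above.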

Thank you for the reply

This is indeed a JIT module that provides the capability to invoke C functions from within Lua (FFI).

From the literature, it seems there are different instructions for introducing code and data barriers.
Thus atomic_thread_fence might not be sufficient to create a barrier for code (not sure about this).

Are you aware of any library functions (like atomic_thread_fence) that can generate the barrier?

Yes, an atomic_thread_fence will definitely not be enough. That’s only supposed to cover things in the scope of the C++11 memory model.

There’s nothing standardized, and in general it gets pretty complicated. For example this is how LLVM does it, just on Unix platforms.

At the core it seems that on Linux you can probably use __clear_cache, and on MacOS you want sys_icache_invalidate. Careful, they have different interfaces.
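
A minimal sketch of a wrapper that hides the interface difference, assuming GCC or Clang. The name flush_code_range is hypothetical, and the non-Apple branch uses the compiler builtin __builtin___clear_cache rather than calling __clear_cache directly. This is not what LLVM or luaffifb actually does, just the shape of the call on each side.

```c
#include <stddef.h>

#if defined(__APPLE__)
#include <libkern/OSCacheControl.h>   /* declares sys_icache_invalidate */
#endif

/*
 * Hypothetical helper: make `len` bytes of freshly written code at `start`
 * visible to instruction fetch on the calling core.
 * Returns 0 on success, -1 if the arguments are unusable.
 */
int flush_code_range(void *start, size_t len)
{
    if (start == NULL || len == 0)
        return -1;

#if defined(__APPLE__)
    /* macOS interface: start pointer plus length. */
    sys_icache_invalidate(start, len);
#else
    /* GCC/Clang builtin: begin pointer plus end pointer. */
    char *begin = (char *)start;
    __builtin___clear_cache(begin, begin + len);
#endif
    return 0;
}
```

The call belongs after the last byte of the shim has been written and before the first jump into it.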

Thank you very much

This seems to have worked. I tested the code by running it repeatedly many times, and it does not result in any error.

Now I will move on to other parts of the FFI (JIT) features, like multiple arguments of different sizes and types, variable arguments etc.

While reviewing this once again, one thought occurred to me.

In a multi-core scenario, the functions (sys_icache_invalidate and __clear_cache) clear the cache on the core where they run. What will happen to any cached instructions on other cores?

Should the programmer take any specific steps for that?

Yes, an example sequence is in the spec I linked to. The full invalidate isn’t needed on the second core. It has to wait for synchronization from the core that writes the instructions and calls sys_icache_invalidate, signalling that this is done (normal C++11 primitives are fine for this), and then execute an isb instruction before jumping to the code.

The best way to do an isb is probably to #include <arm_acle.h> and call __isb(). That should work across all platforms because ARM themselves define what’s in that header.
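
A rough sketch of that handshake using C11 atomics. The names shim_slot, shim_publish, shim_acquire and instruction_barrier are invented for illustration. The sketch issues isb through inline assembly instead of __isb from <arm_acle.h>, only so that it does not depend on that header being available with every compiler, and on non-ARM64 targets it falls back to a compiler-only fence so the file still builds.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical slot through which one thread hands a finished shim to others. */
typedef struct {
    _Atomic(void *) code;   /* NULL until the shim is ready to run */
} shim_slot;

static inline void instruction_barrier(void)
{
#if defined(__aarch64__)
    /* Discard instructions this core fetched before the barrier. */
    __asm__ __volatile__("isb" ::: "memory");
#else
    /* Not ARM64: only stop the compiler from reordering around this point. */
    atomic_signal_fence(memory_order_seq_cst);
#endif
}

/* Writer side. Call only after the code bytes have been written and the
 * instruction cache has been invalidated for that range. */
void shim_publish(shim_slot *slot, void *code)
{
    atomic_store_explicit(&slot->code, code, memory_order_release);
}

/* Reader side. Returns NULL if nothing has been published yet; otherwise
 * issues the instruction barrier and returns the entry point to jump to. */
void *shim_acquire(shim_slot *slot)
{
    void *code = atomic_load_explicit(&slot->code, memory_order_acquire);
    if (code != NULL)
        instruction_barrier();
    return code;
}
```

The release store pairs with the acquire load, so a reader that sees a non-NULL pointer also sees everything the writer did before publishing. The isb then covers the reader's own pipeline.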

One question:

The new code segment should be visible to whichever core ends up running it. So once the code segment has changed, it will be necessary to issue the isb instruction before every jump into it, because we don’t know whether the core the code is currently running on has been refreshed yet. (Thread-level synchronization mechanisms will be insufficient, because the same thread can run on another core after some time.)

My question: is there a way to know that the isb need not be issued anymore, so as to benefit from the cache?

Not intrinsically. I’d probably be thinking about some kind of generational system, where adding an FFI shim increments a global “number of FFIs”, then a caller checks whether it’s had an isb for the thing it’s about to call and issues one if not.

May or may not be worth it depending on how expensive an isb is (CPU dependent), how fast thread-local access is, and how fast the interpreter is.

Oh, actually that’s complete nonsense isn’t it. The issue is per-core, not per-thread, and there isn’t really any way to check what core you’re on (because it could change as soon as you’ve looked with bad scheduling luck).

Probably best to just put an unconditional isb in and hope the kernel does between task switches too.

Agree :)

An unconditional isb is what I have done for now. I have ensured that once the JIT compilation of the new shim is through, the code issues an unconditional isb, on whichever core it happens to run.

Case 1: If it runs on the same core, the isb will flush the pipeline and things will work.
Case 2: If this thread runs on another core, that core will do a fresh instruction fetch anyway.

Thanks

I am referring to the Apple Developer Documentation.

This page, in the section "Update Code that Passes Arguments to Variadic Functions", talks about updating the NSRN (Next SIMD and Floating-point Register Number) to 8, and not about the integer registers (x0 to x7, or the NGRN).

Can you please help me understand this better?

Any pointer to detailed literature would help.

Well, that was clearly rubbish. And I have the horrible suspicion I wrote it a few years back.

What I was trying to describe is that any anonymous argument (i.e. one that’s part of the ... or in an unprototyped function) gets assigned to the next stack slot, with a minimum size and alignment of 8 bytes (possibly more depending on the type, as with normal args). That is, they’re not allowed to be passed in x0-x7 or q0-q7 like normal.
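
To make the slot rule concrete, here is a sketch that computes where each anonymous argument would land on the stack. layout_anonymous_args and round_up are hypothetical helpers, not luaffifb code. The sketch assumes the anonymous arguments start at offset 0 of the outgoing argument area (no named arguments spilled to the stack before them) and that the total is rounded up to 16 bytes to keep sp aligned.

```c
#include <stddef.h>

/* Round `value` up to the next multiple of `align`. */
static size_t round_up(size_t value, size_t align)
{
    return (value + align - 1) / align * align;
}

/*
 * Hypothetical helper: lay out `count` anonymous arguments on the stack.
 * sizes[i] and aligns[i] are the size and natural alignment of argument i.
 * offsets[i] receives its byte offset from the start of the argument area.
 * Returns the total number of stack bytes needed, rounded up to 16.
 */
size_t layout_anonymous_args(const size_t *sizes, const size_t *aligns,
                             size_t count, size_t *offsets)
{
    size_t next = 0;

    for (size_t i = 0; i < count; i++) {
        /* Every slot has a minimum size and alignment of 8 bytes. */
        size_t align = aligns[i] > 8 ? aligns[i] : 8;
        size_t size = sizes[i] > 8 ? sizes[i] : 8;

        next = round_up(next, align);
        offsets[i] = next;
        next += size;
    }
    return round_up(next, 16);
}
```

For example, an int followed by a 16-byte, 16-aligned value gets offsets 0 and 16: the int occupies a full 8-byte slot, and the second argument is pushed up to its own alignment.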

Thanks a lot for the clarifications and the help, it has made a huge difference to the effort.