A problem with the arm64 unwind plans I'm looking at

Hi Tamas & Pavel, I thought you might have some ideas so I wanted to show a problem I'm looking at right now. The arm64 instruction unwinder forwards the unwind state based on branch instructions within the function. So if one block of code ends in an epilogue, the next instruction (which is presumably a branch target) will have the correct original unwind state. This change went in to UnwindAssemblyInstEmulation.cpp mid-2015 in r240533 - the code it replaced was poorly written, we're better off with this approach.

However I'm looking at a problem where clang will come up with a branch table for a bunch of case statements. e.g. this function:

    0x100007df0 <+0>: stp x22, x21, [sp, #-0x30]!
    0x100007df4 <+4>: stp x20, x19, [sp, #0x10]
    0x100007df8 <+8>: stp x29, x30, [sp, #0x20]
    0x100007dfc <+12>: add x29, sp, #0x20 ; =0x20
    0x100007e00 <+16>: sub sp, sp, #0x10 ; =0x10
    0x100007e04 <+20>: mov x19, x1
    0x100007e08 <+24>: mov x20, x0
    0x100007e0c <+28>: add w21, w20, w20, lsl #2
    0x100007e10 <+32>: bl 0x100007f58 ; symbol stub for: getpid
    0x100007e14 <+36>: add w0, w0, w21
    0x100007e18 <+40>: mov w8, w20
    0x100007e1c <+44>: cmp w20, #0x1d ; =0x1d
    0x100007e20 <+48>: b.hi 0x100007e4c ; <+92> at a.c:112
    0x100007e24 <+52>: adr x9, #0x90 ; switcher + 196
    0x100007e28 <+56>: nop
    0x100007e2c <+60>: ldrsw x8, [x9, x8, lsl #2]
    0x100007e30 <+64>: add x8, x8, x9
    0x100007e34 <+68>: br x8
    0x100007e38 <+72>: sub sp, x29, #0x20 ; =0x20
    0x100007e3c <+76>: ldp x29, x30, [sp, #0x20]
    0x100007e40 <+80>: ldp x20, x19, [sp, #0x10]
    0x100007e44 <+84>: ldp x22, x21, [sp], #0x30
    0x100007e48 <+88>: ret
    0x100007e4c <+92>: add w0, w0, #0x1 ; =0x1
    0x100007e50 <+96>: b 0x100007e38 ; <+72> at a.c:115
    0x100007e54 <+100>: orr w8, wzr, #0x7
    0x100007e58 <+104>: str x8, [sp, #0x8]
    0x100007e5c <+108>: sxtw x8, w19
    0x100007e60 <+112>: str x8, [sp]
    0x100007e64 <+116>: adr x0, #0x148 ; "%c %d\n"
    0x100007e68 <+120>: nop
    0x100007e6c <+124>: bl 0x100007f64 ; symbol stub for: printf
    0x100007e70 <+128>: sub sp, x29, #0x20 ; =0x20
    0x100007e74 <+132>: ldp x29, x30, [sp, #0x20]
    0x100007e78 <+136>: ldp x20, x19, [sp, #0x10]
    0x100007e7c <+140>: ldp x22, x21, [sp], #0x30
    0x100007e80 <+144>: b 0x100007f38 ; f3 at b.c:4
    0x100007e84 <+148>: sxtw x8, w19
    0x100007e88 <+152>: str x8, [sp]
    0x100007e8c <+156>: adr x0, #0x127 ; "%c\n"
    0x100007e90 <+160>: nop
    0x100007e94 <+164>: bl 0x100007f64 ; symbol stub for: printf
    0x100007e98 <+168>: bl 0x100007f40 ; f4 at b.c:7
    0x100007e9c <+172>: sxtw x8, w19
    0x100007ea0 <+176>: str x8, [sp]
    0x100007ea4 <+180>: adr x0, #0x10f ; "%c\n"
    0x100007ea8 <+184>: nop
    0x100007eac <+188>: bl 0x100007f64 ; symbol stub for: printf
    0x100007eb0 <+192>: bl 0x100007f4c ; symbol stub for: abort

It loads data from the jump table and branches to the correct block in the +52 .. +68 instructions. We have epilogues at 88, 144, and 192. And we get an unwind plan like

row[0]: 0: CFA=sp +0 =>
row[1]: 4: CFA=sp+48 => x21=[CFA-40] x22=[CFA-48]
row[2]: 8: CFA=sp+48 => x19=[CFA-24] x20=[CFA-32] x21=[CFA-40] x22=[CFA-48]
row[3]: 12: CFA=sp+48 => x19=[CFA-24] x20=[CFA-32] x21=[CFA-40] x22=[CFA-48] fp=[CFA-16] lr=[CFA-8]
row[4]: 20: CFA=sp+64 => x19=[CFA-24] x20=[CFA-32] x21=[CFA-40] x22=[CFA-48] fp=[CFA-16] lr=[CFA-8]
row[5]: 80: CFA=sp+64 => x19=[CFA-24] x20=[CFA-32] x21=[CFA-40] x22=[CFA-48] fp= <same> lr= <same>
row[6]: 84: CFA=sp+64 => x19= <same> x20= <same> x21=[CFA-40] x22=[CFA-48] fp= <same> lr= <same>
row[7]: 88: CFA=sp +0 => x19= <same> x20= <same> x21= <same> x22= <same> fp= <same> lr= <same>
row[8]: 92: CFA=sp+64 => x19=[CFA-24] x20=[CFA-32] x21=[CFA-40] x22=[CFA-48] fp=[CFA-16] lr=[CFA-8]
row[9]: 108: CFA=sp+64 => x8=[CFA-56] x19=[CFA-24] x20=[CFA-32] x21=[CFA-40] x22=[CFA-48] fp=[CFA-16] lr=[CFA-8]
row[10]: 136: CFA=sp+64 => x8=[CFA-56] x19=[CFA-24] x20=[CFA-32] x21=[CFA-40] x22=[CFA-48] fp= <same> lr= <same>
row[11]: 140: CFA=sp+64 => x8=[CFA-56] x19= <same> x20= <same> x21=[CFA-40] x22=[CFA-48] fp= <same> lr= <same>
row[12]: 144: CFA=sp +0 => x8=[CFA-56] x19= <same> x20= <same> x21= <same> x22= <same> fp= <same> lr= <same>

where we have no unwind state for the range 148..192 (I complicated it a little by calling a noreturn function that ended up being the last one -- that's why it doesn't do an epilogue sequence at the very end of the function).

I'm not sure how we should address this one - our branch-target approach can't do the right thing here, there is no indication (for lldb) of the branch from instruction +68 to +148. Should we recognize "ret" and "b" (with a range outside the bounds of the current function) as epilogues and try to find something that looks like valid unwind state from earlier in the function body?

What actually happens right now (if lldb is stopped in the range 148 - 192) is that we find a LR with a save status of IsSame which RegisterContextLLDB rejects as impossible (yes, this IS actually possible - if we're stopped in the range of +80 - +88 then the return address is in the lr and the ret instruction will use it when we get there - but let's ignore that separate bug for now) is that we reject the unwind plan because of the lr being IsSame and we fall back to the ABI's architectural default unwind plan which does work in this case, although we won't know about the x19 & x20 register spills to the stack.

This is a synthetic example, of course, the real place where I hit this issue is in some apple internal code, but it's the same instruction pattern.

fwiw I reproduced it with these two source files linked together with clang and -Os optimization; gcc is unlikely to generate the same code. It's probably possible to reduce the test case further but I'll add a unit test of the raw instructions once we figure out how to handle this.

a.c (1.45 KB)

b.c (100 Bytes)

Hi Jason,

I thought about this situation when implemented the original branch following code and haven’t been able to come up with a really good solution.

My only idea is the same what you mentioned. We should try to recognize all unconditional branches and returns (but not calls) and then if the following instruction don’t have any unwind information yet (e.g. haven’t been a branch target so far) then we try to find some reasonable unwind info from the previous lines.

The difficult question is how to find the correct information. One possible heuristic I have in mind is to try to find any call instruction inside the function before the current PC and use the unwind info from there. The reason I like this heuristic because there won’t be a call instruction inside the prologue or epilogue and on ARM based on the ABI every call instruction have to have the same unwind info. Other possible alternative (or if we don’t have a call instruction) is to use the unwind info line with the information about the highest number of registers. If multiple lines have the same number of information then either use the earliest one or the one with the fewest registers being set to IsSame to avoid picking something from an epilogue.

I don’t think any of my suggestions are really good but I don’t have any better idea at the moment.

Tamas

Yeah I was thinking that maybe if we spot an epilogue instruction (ret, b <some-other-function>), and the next instruction doesn't have a reinstated register context, we could backtrack to the initial register context of this block of instructions (and if it's not the beginning of the function), re-instate that register context for the next instruction.

It doesn't help if we have a dynamic dispatch after the initial part of the function. For that, we'd need to do something like your suggestion of finding the biggest collection of register saves.

e.g. if I rearrange/modify my example function a little to make it more interesting (I didn't fix up the +offsets)

prologue:

    0x100007df0 <+0>: stp x22, x21, [sp, #-0x30]!
    0x100007df4 <+4>: stp x20, x19, [sp, #0x10]
    0x100007df8 <+8>: stp x29, x30, [sp, #0x20]
    0x100007dfc <+12>: add x29, sp, #0x20 ; =0x20

direct branch:

    0x100007e1c <+44>: cmp w20, #0x1d ; =0x1d
    0x100007e20 <+48>: b.hi 0x100007e4c ; <+92> { block #3 }

dynamic dispatch:

    0x100007e24 <+52>: adr x9, #0x90 ; switcher + 196
    0x100007e28 <+56>: nop
    0x100007e2c <+60>: ldrsw x8, [x9, x8, lsl #2]
    0x100007e30 <+64>: add x8, x8, x9
    0x100007e34 <+68>: br x8

block #1

    0x100007e9c <+172>: sxtw x8, w19
    0x100007ea0 <+176>: str x8, [sp]
    0x100007ea4 <+180>: adr x0, #0x10f ; "%c\n"
    0x100007ea8 <+184>: nop
    0x100007eac <+188>: bl 0x100007f64 ; symbol stub for: printf
    0x100007e70 <+128>: sub sp, x29, #0x20 ; =0x20
    0x100007e74 <+132>: ldp x29, x30, [sp, #0x20]
    0x100007e78 <+136>: ldp x20, x19, [sp, #0x10]
    0x100007e7c <+140>: ldp x22, x21, [sp], #0x30
    0x100007eb0 <+192>: b 0x100007f4c ; symbol stub for: abort

block #2

    0x100007e38 <+72>: sub sp, x29, #0x20 ; =0x20
    0x100007e3c <+76>: ldp x29, x30, [sp, #0x20]
    0x100007e40 <+80>: ldp x20, x19, [sp, #0x10]
    0x100007e44 <+84>: ldp x22, x21, [sp], #0x30
    0x100007e48 <+88>: ret

block #3

    0x100007e4c <+92>: add w0, w0, #0x1 ; =0x1
    0x100007e50 <+96>: b 0x100007e38 ; <+72> at a.c:115
    0x100007e54 <+100>: orr w8, wzr, #0x7
    0x100007e58 <+104>: str x8, [sp, #0x8]
    0x100007e5c <+108>: sxtw x8, w19
    0x100007e60 <+112>: str x8, [sp]
    0x100007e64 <+116>: adr x0, #0x148 ; "%c %d\n"
    0x100007e68 <+120>: nop
    0x100007e6c <+124>: bl 0x100007f64 ; symbol stub for: printf
    0x100007e70 <+128>: sub sp, x29, #0x20 ; =0x20
    0x100007e74 <+132>: ldp x29, x30, [sp, #0x20]
    0x100007e78 <+136>: ldp x20, x19, [sp, #0x10]
    0x100007e7c <+140>: ldp x22, x21, [sp], #0x30
    0x100007e80 <+144>: b 0x100007f38 ; f3 at b.c:4

block #4

    0x100007e38 <+72>: sub sp, x29, #0x20 ; =0x20
    0x100007e3c <+76>: ldp x29, x30, [sp, #0x20]
    0x100007e40 <+80>: ldp x20, x19, [sp, #0x10]
    0x100007e44 <+84>: ldp x22, x21, [sp], #0x30
    0x100007e48 <+88>: ret

First, an easy one: When we get to the first instruction of 'block #4', we've seen a complete epilogue ending in 'B other-function' and the first instruction of block #4 is not branched to. If we find the previous direct branch target -- to the first instruction of 'block #3' was conditionally branched to, we reuse that register context for block #4. This could easily go wrong for hand-written assembly where you might undo the stack state part-way and then branch to another part of the function. But I doubt compiler generated code is ever going to do that.

Second, a trickier one: When we get to the first instruction of 'block #2', we have no previous branch target register context to re-instate. We could look further into the function (to block target #3 again) and reuse that register state, the assumption being that a function has one prologue that sets up the complete register state and then doesn't change anything outside mid-function epilogues. I'm not opposed to that idea. The other way would be to look backwards in the instruction stream for the row with the most registers saved, as you suggested, maybe reusing the earliest one if there are multiple entries with the same # of registers (this would need to ignore IsSame registers).

Let me see if I can code something along these lines and we can look at how that turns out.

Based on your comments I have one more idea for a good heuristic. What if we detect a dynamic branch (e.g. “br ”, “tbb …”, etc…) and store the register state for that place. Then when we find a block with no unwind info for the first instruction then we use the one we saved for the dynamic branch (as we know that the only way that block can be reached is through a dynamic branch). If there is exactly 1 dynamic branch in the code then this should gave us the “perfect” result while if we have multiple dynamic branches then we will pick one “randomly” but for compiler generated code I think it will be good enough. The only tricky case is if we fail to detect the dynamic branch but that should be easy to fix as we already track every branch on ARM (for single stepping) and doing it on AArch64 should be easy as well.

I like that idea. A bunch of other work just landed on my desk so it might be a bit before I do it, but I'll see how that patch looks.