[AArch64] Generated assembly differs depending on whether debug information is generated or not

Hi,

we at Arm have noticed that assembly can differ when compiling for AArch64 depending on whether debug information is generated or not.

The issue is reproducible for the following small example compiled with -O1 for aarch64-arm-linux-gnu:

a() {
  b(a);
  for (;;)
    c("", b);
}
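A difference like this can be checked by compiling the example twice, with and without -g, and comparing only the instruction lines, since the -g output naturally contains extra directives (.loc, .file, .cfi_*, debug sections). Below is a minimal sketch in Python. The filtering heuristic and the function names are assumptions made for illustration; this is not part of any LLVM tooling.

```python
def instruction_lines(asm_text):
    """Return only lines that look like instructions or function labels.

    Heuristic (an assumption, not an exact parser): drop blank lines,
    comment lines, assembler directives and local labels, i.e. every
    line that starts with '.', '//' or '#' once stripped.
    """
    kept = []
    for line in asm_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith((".", "//", "#")):
            continue
        # Remove trailing AArch64-style comments such as "// =0x10".
        stripped = stripped.split("//")[0].rstrip()
        kept.append(stripped)
    return kept


def same_code(asm_a, asm_b):
    """True if both listings contain the same instruction sequence."""
    return instruction_lines(asm_a) == instruction_lines(asm_b)
```

The two inputs could be produced with something like `clang --target=aarch64-arm-linux-gnu -O1 -S`, once with and once without `-g`; if `same_code` returns False for them, the instruction stream itself depends on debug information.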

The reason for the difference is that AArch64 frame lowering emits CFI instructions if debug information is enabled but not otherwise. CFI instructions act as scheduling boundaries during instruction scheduling and therefore lead to different scheduling regions and, overall, a different instruction schedule.
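To illustrate the effect, here is a toy model in Python (not LLVM code) of how boundary instructions split a block into scheduling regions. The instruction strings and the `is_cfi` predicate are hypothetical and only serve the example.

```python
def scheduling_regions(instrs, is_boundary):
    """Split a block into maximal runs of schedulable instructions.

    An instruction for which is_boundary() is true ends the current
    region and is not part of any region itself.
    """
    regions, current = [], []
    for instr in instrs:
        if is_boundary(instr):
            if current:
                regions.append(current)
            current = []
        else:
            current.append(instr)
    if current:
        regions.append(current)
    return regions


def is_cfi(instr):
    # Hypothetical marker: CFI pseudo-instructions are spelled "CFI ...".
    return instr.startswith("CFI")


# Hypothetical prologue-like block, with and without CFI instructions.
with_cfi = [
    "sub sp, sp, #16",
    "CFI def_cfa_offset 16",
    "str x19, [sp]",
    "CFI offset x19",
    "mov x19, x0",
    "add x1, x0, #1",
]
without_cfi = [i for i in with_cfi if not is_cfi(i)]

regions_debug = scheduling_regions(with_cfi, is_cfi)        # 3 small regions
regions_nodebug = scheduling_regions(without_cfi, is_cfi)   # 1 region of 4
```

With CFI instructions present the scheduler only ever sees regions of one or two instructions here, whereas without them it may reorder all four, which is how the two builds can end up with different code.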

We see several ways to fix the issue and would welcome comments on this:

  1. Enabling unwind tables by default for AArch64: By enabling unwind tables by default, CFI instructions will be inserted in both debug and non-debug mode. This should lead to smaller scheduling regions and probably to less scheduling potential.

     However, I’ve measured the average size of scheduling regions for randomly generated programs with and without default unwind tables and found an average difference of 0.5 to 1 instruction. Other architectures such as x86 do exactly this and therefore don’t face the issue.

     The following patch on Phabricator introduces the said change: https://reviews.llvm.org/D68076

  2. Postpone insertion of CFI instructions until after instruction scheduling. This would require a new pass running after instruction scheduling that inserts CFI instructions if needed. The downside I see is increased compile time and probably some code duplication with frame lowering.

  3. Change instruction scheduling such that CFI instructions get tied to the relevant instructions in such a way that they get scheduled together. If this could work, it would probably be the cleanest solution.

To summarize: Option 1 would make scheduling in the non-debug case behave like in the debug case and therefore probably cost some scheduling potential. However, it would be by far the easiest to implement. Options 2 and 3 would probably lead to better scheduling but seem to be more complex to implement.

Comments and additional ideas are welcome.

David

Hi David,

This is PR37240 (https://bugs.llvm.org/show_bug.cgi?id=37240). I suspect this problem affects all targets; your patch D68076 would address it only for AArch64. Although I would suggest you do some careful measurements to determine the runtime performance effect, to decide whether this is acceptable.

The more complete approach in your steps 2 + 3 would solve this for all targets, assuming the solution did not have to be very target-specific. This would benefit the entire community.

–paulr

Hi Paul,

thanks for your comments.

> This is PR37240 (https://bugs.llvm.org/show_bug.cgi?id=37240). I suspect this problem affects all targets; your patch D68076 would address it only for AArch64. Although I would suggest you do some careful measurements to determine the runtime performance effect, to decide whether this is acceptable.

Yes, in principle the problem that instruction scheduling depends on the presence of CFI instructions should affect more targets than AArch64. However, this does not imply that all of these targets produce inconsistent assembly depending on debug information.

> The more complete approach in your steps 2 + 3 would solve this for all targets, assuming the solution did not have to be very target-specific. This would benefit the entire community.

At least option 2 would require a lot of target-dependent changes because the insertion of CFI instructions would have to be moved from target-specific frame lowering into a (probably again target-specific) insertion pass.

David

Hi David,

Thanks for looking into this.

It seems like D68076 might not address the underlying issue here (e.g. it probably doesn’t improve the situation for projects using -g -fno-unwind-tables?).

Would you mind elaborating a bit on your proposals to delay/change CFI instruction insertion? In particular, it’d help to hear a bit about how CFI instructions are inserted today (is some of it done by CFIInstrInserter, and the rest by target-specific frame lowering code?).

best,
vedant

Hi Vedant,

thanks for your answer and sorry for the late response.

> It seems like D68076 might not address the underlying issue here (e.g. it probably doesn’t improve the situation for projects using -g -fno-unwind-tables?).

Yes, D68076 is a somewhat conservative patch that aligns the behaviour on AArch64 (for GNU targets) so that the generated assembly is consistent. As you said, it does not help if unwind tables are explicitly disabled (-fno-unwind-tables). It is conservative in the sense that it decreases scheduling potential (due to smaller scheduling regions) in the non-debug case, but it fixes the bug of generating inconsistent assembly.

> Would you mind elaborating a bit on your proposals to delay/change CFI instruction insertion? In particular, it’d help to hear a bit about how CFI instructions are inserted today (is some of it done by CFIInstrInserter, and the rest by target-specific frame lowering code?).

CFI instructions are inserted during target-specific frame lowering. The CFIInstrInserter is only run on X86 targets and seems to verify the correctness of CFI instructions after they got inserted during X86 frame lowering.

Regarding the three possible ways to solve the issue (see my first email), I currently think option 3 (changing instruction scheduling such that CFI instructions are scheduled together with stack-altering instructions) is the most promising one because it wouldn’t require target-specific changes. Since, for example, X86 and, if I remember correctly, AArch64 on Darwin targets insert CFI instructions in both debug and non-debug mode, solution 3 would increase scheduling potential for these targets.

To summarize: D68076 would align non-debug mode with debug mode and therefore potentially decrease scheduling potential. Solution 3 would align debug mode with non-debug mode (in terms of instruction scheduling) and therefore increase scheduling potential.

David

Thanks for breaking things down so clearly. My gut instinct would be to push for changes that make scheduling decisions the same modulo CFI instructions, but I really don’t know how much work that entails, or if it would pay for its own complexity. OTOH the “option 1” patch you have is an immediate fix.

CC’ing some folks who probably have more experience working with CFI/scheduling than me (+ Amara, Adam, Florian).

vedant

Hi David,
I indeed forgot to cc the list.

The last time I checked, the scheduling/tracking of debug values was done in a best-effort way by simply “remembering” all consecutive dbg instructions that followed some other instruction in ScheduleDAGInstrs::buildSchedGraph. This works for most cases, but sometimes it can produce wrong debug info by rescheduling unrelated instructions (or not scheduling the related ones) since, IIRC, it’s perfectly valid to have something like

R1 = …
DEBUG_VALUE R1, …
<some instruction that doesn’t touch R1>
DEBUG_VALUE R1,

And for this example, if the first instruction is moved, the first dbg value would be moved as well (as it should), while the second one will stay after the second instruction (which would produce wrong dbg info at that point).
If there was a way to properly associate each instruction with all affected dbg_values wherever they are, it could solve this problem, although there might be other approaches as well.
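The best-effort scheme described above can be mimicked with a small toy model in Python. This is only a sketch of the idea, not the code in ScheduleDAGInstrs::buildSchedGraph; the function names and the "first"/"second" labels are made up for the example.

```python
def is_dbg(instr):
    return instr.startswith("DEBUG_VALUE")


def attach_debug_values(block):
    """Attach each run of DEBUG_VALUEs to the instruction directly before it.

    Toy simplification: debug values at the very start of the block,
    with no preceding instruction, are dropped.
    """
    attached = []  # list of (instruction, [debug values]) pairs
    for instr in block:
        if is_dbg(instr):
            if attached:
                attached[-1][1].append(instr)
        else:
            attached.append((instr, []))
    return attached


def emit(attached, order):
    """Emit instructions in the new order, each followed by its debug values."""
    out = []
    for index in order:
        instr, dbgs = attached[index]
        out.append(instr)
        out.extend(dbgs)
    return out


block = [
    "R1 = ...",
    "DEBUG_VALUE R1, first",
    "R2 = ... (does not touch R1)",
    "DEBUG_VALUE R1, second",
]
attached = attach_debug_values(block)
# Move the first instruction below the second one.
reordered = emit(attached, [1, 0])
```

After the reordering, the second DEBUG_VALUE for R1 sits in front of the definition of R1 and the first one comes last, so the location described at the end of the block is the stale one. Associating each debug value with the instruction it actually depends on, rather than with its neighbour, would avoid this in the toy model.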

Hi,

thanks for adding some more people to this thread.

I’ve just finished a first version of a patch that implements scheduling of CFI
instructions, currently controllable via a flag:

Enabling scheduling of CFI instructions by default will currently break some existing tests. I would like to get this patch accepted with scheduling of CFI instructions disabled by default. Tests that would currently fail can then be fixed in a follow-up patch, and we could eventually enable CFI instruction scheduling by default (if this gets community consent).

I’ve added Paul, Tim and Vedant as reviewers. If someone else is interested, I would be really happy about some feedback.

David

Hi all,

thinking a bit harder about the problem, I now believe it is not fixable by changing scheduling, and my patch becomes obsolete.

Since a CFI instruction describes the stack with respect to all stack-altering instructions that precede it, no such stack-altering instruction is allowed to be scheduled below the CFI instruction. But the opposite is also true: no stack-altering instruction is allowed to be moved above a CFI instruction. CFI instructions have to act as barriers for stack-altering instructions.

Currently they act as barriers for all instructions; my patch relaxed this behaviour too much. By letting CFI instructions act as barriers for stack-altering instructions only, scheduling could be improved. However, the generated assembly would still differ from the one without CFI instructions (e.g. the non-debug case).

Next plan: postpone insertion of CFI instructions until after machine scheduling. This is option 2 in the original email and will probably require target-specific code.

I’ll keep you updated.

David
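The barrier rule from this last message can be written down as a small check. The following Python sketch is a toy model under stated assumptions: instructions are unique strings, `is_cfi` and `alters_stack` are hypothetical predicates, and data dependencies between instructions are ignored entirely.

```python
def is_cfi(instr):
    # Hypothetical marker for CFI pseudo-instructions.
    return instr.startswith("CFI")


def alters_stack(instr):
    # Hypothetical predicate: only direct sp adjustments count here.
    return instr.startswith(("sub sp,", "add sp,"))


def cfi_position(instrs):
    """Map each stack-altering instruction to the number of CFIs before it."""
    seen_cfi = 0
    position = {}
    for instr in instrs:
        if is_cfi(instr):
            seen_cfi += 1
        elif alters_stack(instr):
            position[instr] = seen_cfi
    return position


def respects_cfi_barriers(original, scheduled):
    """True if no stack-altering instruction crossed a CFI instruction."""
    cfis_before = [i for i in original if is_cfi(i)]
    cfis_after = [i for i in scheduled if is_cfi(i)]
    return (cfis_before == cfis_after
            and cfi_position(original) == cfi_position(scheduled))


original = [
    "mov w8, #1",
    "sub sp, sp, #32",
    "CFI def_cfa_offset 32",
    "mov w9, #2",
    "sub sp, sp, #16",
    "CFI def_cfa_offset 48",
]
# Only a non-stack instruction crosses a CFI: allowed by the rule.
ok_schedule = [original[0], original[1], original[3],
               original[2], original[4], original[5]]
# A stack adjustment is hoisted above the CFI that precedes it: not allowed.
bad_schedule = [original[0], original[1], original[4],
                original[2], original[3], original[5]]
```

Here `respects_cfi_barriers(original, ok_schedule)` holds while `respects_cfi_barriers(original, bad_schedule)` does not. Even the allowed schedule places `mov w9, #2` relative to a CFI instruction that does not exist in the non-debug build, which is why relaxing the barrier alone does not guarantee identical output.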