LLVM trunk generates different machine code for JCC instruction w/ or w/o debug info

Zhiwei_Chen · December 29, 2020, 2:25pm

Hi folks, this is my first post to the llvm-dev mailing list, and definitely not the last.

Recently, I found that an ELF file gets different machine code depending on whether it is built with or without debug info. Sadly, I cannot reproduce it with a small piece of code. Here is my investigation.

clang -S -emit-llvm foo.cc -O3 -ggdb3 -o dbg.ll
clang -S -emit-llvm foo.cc -O3 -o rel.ll

Where foo.cc is a source file at my company with 10k+ LOC that depends on tons of third-party libraries.

The only difference between dbg.ll and rel.ll is the LLVM debug intrinsics. Emmmm, looks fine.

llc dbg.ll -o dbg.s
llc rel.ll -o rel.s

And the asm instructions are the same. Emmm, fine again.

llvm-mc -filetype=obj dbg.s -o dbg.o
llvm-mc -filetype=obj rel.s -o rel.o

The two object files generated by the LLVM assembler have DIFFERENT machine code.

74 19 je f20

The object compiled with debug info uses 0x74 to encode the JE instruction, while

0f 84 15 00 00 00 je f20

The object compiled without debug info uses 0x0f 0x84 instead.
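For readers unfamiliar with the two byte patterns: x86 has a 2-byte short form of JE (opcode 0x74 plus an 8-bit displacement) and a 6-byte near form (opcode 0x0f 0x84 plus a 32-bit little-endian displacement). The sketch below is only an illustration of that encoding; the helper names are made up for this example and are not LLVM APIs.

```cpp
#include <cstdint>
#include <vector>

// Short form: opcode 0x74 followed by an 8-bit signed displacement (2 bytes).
std::vector<std::uint8_t> encodeJeShort(std::int8_t disp) {
  return {0x74, static_cast<std::uint8_t>(disp)};
}

// Near form: opcode 0x0F 0x84 followed by a 32-bit little-endian
// signed displacement (6 bytes).
std::vector<std::uint8_t> encodeJeNear(std::int32_t disp) {
  const auto u = static_cast<std::uint32_t>(disp);
  return {0x0F, 0x84,
          static_cast<std::uint8_t>(u & 0xFF),
          static_cast<std::uint8_t>((u >> 8) & 0xFF),
          static_cast<std::uint8_t>((u >> 16) & 0xFF),
          static_cast<std::uint8_t>((u >> 24) & 0xFF)};
}

// The displacement is relative to the END of the jump instruction, so the
// distance from the start of the jump to its target is size + displacement.
std::int64_t targetFromJumpStart(const std::vector<std::uint8_t> &bytes,
                                 std::int32_t disp) {
  return static_cast<std::int64_t>(bytes.size()) + disp;
}
```

With the displacements from the two listings, encodeJeShort(0x19) gives 74 19 and encodeJeNear(0x15) gives 0f 84 15 00 00 00. Both reach the same distance from the start of the jump (2 + 0x19 = 6 + 0x15 = 0x1b), so the longer form takes up 4 bytes that would otherwise sit between the jump and its target.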

What? Why does the debug info affect the generated machine code? As an LLVM beginner, I’m willing to dive deeper to find the root cause.

Thanks in advance.

dblaikie · December 29, 2020, 7:45pm

Yeah - we try to ensure that LLVM’s debug info doesn’t change what code is generated, but it’s best effort - no one’s done fuzzing/etc to make it especially robust.

If you want to investigate this, I’d suggest using CReduce ( https://embed.cs.utah.edu/creduce/ ) to reduce the example to something small and manageable, and then possibly report it here and/or investigate it yourself. LLVM/Clang support dumping the intermediate representation after every pass (-mllvm -dump-after-all/-print-after-all, something like that; I forget the precise spelling), so you could see where the IR or machine IR diverges between the debuginfo and non-debuginfo cases.

Neil_Nelson · December 29, 2020, 7:54pm

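Bug 37728 - [meta] Make llvm passes debug info invariant
https://bugs.llvm.org/show_bug.cgi?id=37728

Further discussion on methods.
https://groups.google.com/g/llvm-dev/c/yvbWr4azdh0/m/gy1tQIzIDwAJ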

Neil Nelson

MaskRay · December 29, 2020, 8:09pm

Thanks for the links:)

llvm.dbg.* are intrinsics (subset of Instruction).

DbgInfoIntrinsic
   DbgLabelInst
   DbgVariableIntrinsic
     DbgValueInst: llvm.dbg.value
     DbgAddrIntrinsic: llvm.dbg.addr
     DbgDeclareInst: llvm.dbg.declare (similar to llvm.dbg.addr, but not control-dependent)

It is very easy to forget to account for their existence in an optimization pass.

for (Instruction &I : BB) {
   if (isa<DbgInfoIntrinsic>(I))
     continue;
   ...
}

for (Instruction &I : instructions(F)) {
   if (isa<DbgInfoIntrinsic>(I))
     continue;
   ...
}

If an optimization pass does not skip llvm.dbg.* but makes their occurrences affect its heuristics (for example, counting the number of instructions in a basic block), the transformation result may be different with and w/o llvm.dbg.*.
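To make the instruction-counting example concrete, here is a toy model in plain C++. It assumes nothing about real LLVM classes: ToyInst, countAll, countNonDebug, and shouldTransform are hypothetical names invented for this sketch, and the threshold heuristic is made up.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for an IR instruction; NOT the real llvm::Instruction.
struct ToyInst {
  bool isDebugIntrinsic; // true for an llvm.dbg.* stand-in
};

// Buggy counting: every instruction counts, debug stand-ins included.
std::size_t countAll(const std::vector<ToyInst> &block) {
  return block.size();
}

// Debug-invariant counting: debug stand-ins are skipped.
std::size_t countNonDebug(const std::vector<ToyInst> &block) {
  std::size_t n = 0;
  for (const ToyInst &inst : block)
    if (!inst.isDebugIntrinsic)
      ++n;
  return n;
}

// Hypothetical heuristic: only transform blocks that are small enough.
bool shouldTransform(std::size_t instructionCount, std::size_t threshold) {
  return instructionCount <= threshold;
}
```

Take a block with three real instructions and a threshold of 3. Without debug info both counters return 3 and the block is transformed. Add two debug stand-ins and countAll returns 5, so the decision flips, while countNonDebug still returns 3.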

GCC has -fcompare-debug and it seems that in the past they had fought diligently with the debug-affecting-codegen problems as well. (I am happy to take a stab at implementing it if others think it is mildly useful)

It is not clear how serious the problem in LLVM is. If, for example, the llvm-project codebase can be fixed relatively easily, we could probably add a build bot to detect new problems.

Yes, reducing the source with a tool like creduce is important.

With the new pass manager (-fno-legacy-pass-manager, which will hopefully become the default in the next release), you can dump changed IR with -print-changed, e.g.

clang -fno-legacy-pass-manager -mllvm -print-changed -S -O2 a.c 2> log

This is usually more readable than -print-after-all.

vedantk · January 4, 2021, 2:11am

Thanks for diving into this. Fwiw, we already have some tooling for identifying and investigating debug-affecting-codegen issues [1][2][3]. I’m not familiar with gcc’s -fcompare-debug: while it could be better than what we’ve got, imho it makes sense to focus on addressing issues we already know about or can trivially detect. (To find lots more of these issues, simply build LNT [4] with the Os and Os-g profiles and diff the object files, or run [3] on the tests for your backend of choice.)

To elaborate on [3] a bit: there appears to be a long tail of codegen difference bugs lurking around in the various backends, but not many (if any? – it’s been a while since I looked) at the IR level. I believe one of the root causes for this is that IR-level use-def chains ignore llvm.dbg.* uses by default (thanks to the ValueAsMetadata abstraction), while MIR-level use-def chains include debug uses by default (see MachineRegisterInfo::use_*). It appears to be way too easy to write backend code that incorrectly assumes that debug uses are not there.
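As a rough illustration of that pitfall, here is a toy model in plain C++. ToyUse and both helper functions are hypothetical names for this sketch; they only mimic the difference between a query that sees debug uses and one that ignores them. In LLVM itself, MachineRegisterInfo provides use_nodbg_* counterparts to the use_* accessors for the second kind of query.

```cpp
#include <algorithm>
#include <vector>

// Toy stand-in for one use of a virtual register; NOT a real LLVM type.
struct ToyUse {
  bool isDebug; // true if the use comes from a debug-value instruction
};

// Mimics a query that includes debug uses by default.
bool hasOneUse(const std::vector<ToyUse> &uses) {
  return uses.size() == 1;
}

// Mimics a query that ignores debug uses.
bool hasOneNonDebugUse(const std::vector<ToyUse> &uses) {
  return std::count_if(uses.begin(), uses.end(),
                       [](const ToyUse &u) { return !u.isDebug; }) == 1;
}
```

A register with one real use satisfies both predicates in a build without debug info. Once a debug use is added, hasOneUse becomes false, so a backend transformation guarded by it would fire in one build and not the other. hasOneNonDebugUse gives the same answer in both.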

I went on a bit of a spree trying to fix some of those issues in the AArch64 backend, starting with [5]. For a brief moment it was possible to add debug info to all the tests in test/CodeGen/AArch64 and still have all of them pass. Alas, that’s no longer true. Adding a buildbot could help with this. It could also be valuable to change the MachineRegisterInfo default to ignore debug uses – that’s a larger change that would require a fair amount of community review and buy-in.

[1] Object file level diffing: https://github.com/vedantk/scripts/blob/master/objdiff_driver.sh
[2] IR-level debug-affecting-codegen detection: https://github.com/vedantk/scripts/blob/master/opt-check-dbg-invar.sh
[3] MIR-level debug-affecting-codegen detection: https://llvm.org/docs/HowToUpdateDebugInfo.html#mutation-testing-for-mir-level-transformations (e.g. llvm-lit test/CodeGen/AArch64 -Dllc="llc -debugify-and-strip-all-safe")
[4] https://github.com/llvm/llvm-test-suite
[5] https://reviews.llvm.org/rG5c04274dab4858180d756329d11499df247e9d2d

vedant

MaskRay · January 11, 2021, 11:37pm

Really appreciate the links:) I'll study them. A build bot will definitely be helpful.

For Zhiwei's original problem, the JCC + .p2align 4, 0x90 difference (reported as bug 42138, "Different codegen with/without -g"), I have found the root cause: an assembler optimization implemented in X86AsmBackend. I have attached more information on D75203, "[X86] Relax existing instructions to reduce the number of nops needed for alignment purposes".
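The idea named in that patch title can be sketched with a toy layout model. This is a simplified illustration under stated assumptions, not the actual X86AsmBackend logic: it assumes a single short jcc (2 bytes) directly before an aligned label, that the jcc can grow to its 6-byte near form, and that it is grown whenever the padding is at least 4 bytes. All names are hypothetical.

```cpp
#include <cstdint>

// Result of laying out one short jcc followed by an aligned label.
struct ToyLayout {
  bool jumpRelaxed;       // true if the jcc was grown to its 6-byte form
  std::uint64_t nopBytes; // nop padding still needed before the label
};

// Bytes needed to advance `offset` to the next multiple of `alignment`.
std::uint64_t paddingTo(std::uint64_t offset, std::uint64_t alignment) {
  return (alignment - offset % alignment) % alignment;
}

// `offsetAfterShortJump` is the offset just past the 2-byte jcc.
ToyLayout layoutBeforeAlignedLabel(std::uint64_t offsetAfterShortJump,
                                   std::uint64_t alignment) {
  constexpr std::uint64_t kGrowth = 4; // 6-byte near jcc minus 2-byte short jcc
  const std::uint64_t pad = paddingTo(offsetAfterShortJump, alignment);
  if (pad >= kGrowth)
    return {true, pad - kGrowth}; // the longer jcc absorbs 4 bytes of padding
  return {false, pad};            // not enough padding to absorb the growth
}
```

In this model, a jcc ending at offset 10 before a 16-byte-aligned label needs 6 bytes of padding, so it is relaxed and only 2 nop bytes remain. A jcc ending at offset 14 needs just 2 bytes and stays short. Anything that shifts those offsets by a few bytes can therefore change which encoding is chosen.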

pogo59 · January 12, 2021, 1:54pm

Please mention this in PR42138 as well.
Thanks,
--paulr
