LLVM trunk generates different machine code for JCC instruction w/ or w/o debug info

Hi folks, this is my first post to the llvm-dev mailing list, and definitely not the last :)

Recently, I found that an ELF file gets different machine code depending on whether it is built with or without debug info. Sadly, I cannot reproduce it with a small piece of code. Here is my investigation.

clang -S -emit-llvm foo.cc -O3 -ggdb3 -o dbg.ll
clang -S -emit-llvm foo.cc -O3 -o rel.ll

Here foo.cc is a C++ source file at my company with 10k+ LOC that depends on many third-party libraries.

The difference between dbg.ll and rel.ll is the LLVM debug intrinsics. Emmmm, looks fine.

llc dbg.ll -o dbg.s
llc rel.ll -o rel.s

And the asm instructions are the same. Emmm, fine again.

llvm-mc -filetype=obj dbg.s -o dbg.o
llvm-mc -filetype=obj rel.s -o rel.o

The two object files generated by the LLVM assembler have DIFFERENT machine code.

74 19 je f20

The object file compiled with debug info uses 0x74 to encode the JE instruction, while

0f 84 15 00 00 00 je f20

The object file compiled without debug info uses 0x0f 0x84 instead.
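
For readers comparing the two byte sequences: x86 has a short JE form (opcode 0x74 with an 8-bit displacement) and a near JE form (opcode 0x0f 0x84 with a 32-bit little-endian displacement), and both displacements are measured from the end of the instruction. The sketch below is only an illustration, not LLVM code. The helper name `encodeJe` is hypothetical, and the instruction address 0xf05 used in the usage note is an assumption derived from the dumps above, taking `f20` to be the same absolute target in both object files.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode an x86 JE located at `addr` that jumps to `target`.
// `nearForm` selects the 6-byte rel32 form instead of the 2-byte rel8 form.
// (Hypothetical helper for illustration; not part of LLVM.)
std::vector<std::uint8_t> encodeJe(std::uint32_t addr, std::uint32_t target,
                                   bool nearForm) {
  if (!nearForm) {
    // Short form: 0x74 followed by an 8-bit displacement measured from the
    // end of the 2-byte instruction.
    std::int64_t disp = std::int64_t(target) - (std::int64_t(addr) + 2);
    assert(disp >= -128 && disp <= 127 && "target out of rel8 range");
    return {0x74, static_cast<std::uint8_t>(disp)};
  }
  // Near form: 0x0f 0x84 followed by a little-endian 32-bit displacement
  // measured from the end of the 6-byte instruction.
  std::int64_t disp = std::int64_t(target) - (std::int64_t(addr) + 6);
  std::uint32_t d = static_cast<std::uint32_t>(disp);
  return {0x0f,
          0x84,
          static_cast<std::uint8_t>(d),
          static_cast<std::uint8_t>(d >> 8),
          static_cast<std::uint8_t>(d >> 16),
          static_cast<std::uint8_t>(d >> 24)};
}
```

Under that assumption, `encodeJe(0xf05, 0xf20, false)` yields `74 19` and `encodeJe(0xf05, 0xf20, true)` yields `0f 84 15 00 00 00`. In other words, the two object files encode the same jump to the same target, and the second encoding is simply 4 bytes longer.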

What? Why does debug info affect the generation of machine code? As an LLVM beginner, I’m willing to dive deeper to find the root cause.

Thanks in advance.

Yeah - we try to ensure that LLVM’s debug info doesn’t change what code is generated, but it’s best effort - no one’s done fuzzing/etc to make it especially robust.

If you want to investigate this, I’d suggest using CReduce ( https://embed.cs.utah.edu/creduce/ ) to reduce the example to something small and manageable, and then report it here and/or investigate it yourself. LLVM/Clang can dump the intermediate representation after every pass (-mllvm -print-after-all), so you can see where the IR or machine IR diverges between the debug-info and non-debug-info cases.

Bug 37728 - [meta] Make llvm passes debug info invariant
https://bugs.llvm.org/show_bug.cgi?id=37728

Further discussion on methods.
https://groups.google.com/g/llvm-dev/c/yvbWr4azdh0/m/gy1tQIzIDwAJ

Neil Nelson

Thanks for the links:)

llvm.dbg.* are intrinsics (subset of Instruction).

DbgInfoIntrinsic
   DbgLabelInst
   DbgVariableIntrinsic
     DbgValueInst: llvm.dbg.value
     DbgAddrIntrinsic: llvm.dbg.addr
     DbgDeclareInst: llvm.dbg.declare (similar to llvm.dbg.addr, but not control-dependent)

It is very easy to forget to account for their existence in an optimization pass.

for (Instruction &I : BB) {
   if (isa<DbgInfoIntrinsic>(I))
     continue;
   ...
}

for (Instruction &I : instructions(F)) {
   if (isa<DbgInfoIntrinsic>(I))
     continue;
   ...
}

If an optimization pass does not skip llvm.dbg.* but lets their occurrences affect its heuristics (for example, by counting the number of instructions in a basic block), the transformation result may differ with and without llvm.dbg.*.
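
To make the instruction-counting example concrete, here is a minimal self-contained sketch. It does not use LLVM headers: `Inst`, `blockSize`, and `isSmallEnough` are hypothetical stand-ins, and the size threshold is an invented heuristic, so treat it as a model of the failure mode rather than code from any real pass.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for llvm::Instruction; it only records whether the
// instruction is an llvm.dbg.* intrinsic.
struct Inst {
  bool isDebugIntrinsic;
};

// Count the instructions in a block, optionally skipping debug intrinsics.
std::size_t blockSize(const std::vector<Inst> &block, bool skipDebug) {
  std::size_t n = 0;
  for (const Inst &i : block) {
    if (skipDebug && i.isDebugIntrinsic)
      continue;
    ++n;
  }
  return n;
}

// Hypothetical heuristic: only transform blocks at or below a size threshold.
bool isSmallEnough(const std::vector<Inst> &block, std::size_t threshold,
                   bool skipDebug) {
  return blockSize(block, skipDebug) <= threshold;
}
```

With a threshold of 4, a block of three real instructions passes the check, but the same block with two debug intrinsics interleaved fails it when `skipDebug` is false. Passing `skipDebug = true` gives the same answer for both builds, which is the invariant the loops above are meant to preserve.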

GCC has -fcompare-debug and it seems that in the past they had fought diligently with the debug-affecting-codegen problems as well. (I am happy to take a stab at implementing it if others think it is mildly useful)

It is not clear how serious the problem in LLVM is. If, for example, the llvm-project codebase can be fixed relatively easily, we could probably add a build bot to detect new problems.

Yes, reducing the source with a tool like creduce is important.

With the new pass manager (-fno-legacy-pass-manager, which will hopefully become the default in the next release),
you can dump changed IR with -print-changed, e.g.

   clang -fno-legacy-pass-manager -mllvm -print-changed -S -O2 a.c 2> log

This is usually more readable than -print-after-all.

Thanks for diving into this. Fwiw, we already have some tooling for identifying and investigating debug-affecting-codegen issues [1][2][3]. I’m not familiar with gcc’s -fcompare-debug: while it could be better than what we’ve got, imho it makes sense to focus on addressing issues we already know about or can trivially detect. (To find lots more of these issues, simply build LNT [4] with the Os and Os-g profiles and diff the object files, or run [3] on the tests for your backend of choice.)

To elaborate on [3] a bit: there appears to be a long tail of codegen difference bugs lurking around in the various backends, but not many (if any? – it’s been a while since I looked) at the IR level. I believe one of the root causes for this is that IR-level use-def chains ignore llvm.dbg.* uses by default (thanks to the ValueAsMetadata abstraction), while MIR-level use-def chains include debug uses by default (see MachineRegisterInfo::use_*). It appears to be way too easy to write backend code that incorrectly assumes that debug uses are not there.
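
The use-def point can be modeled in a few lines. The sketch below is a simplified illustration that does not use LLVM headers. `RegUse`, `hasOneUse`, and `hasOneNonDebugUse` are hypothetical stand-ins, loosely modeled on the all-uses versus non-debug-uses accessors that MachineRegisterInfo provides (for example `hasOneNonDBGUse` and the `use_nodbg_*` family).

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a machine operand that reads a register.
struct RegUse {
  bool isDebug;  // true for a use inside a debug instruction such as DBG_VALUE
};

// Analogue of a query that walks every use, debug uses included.
bool hasOneUse(const std::vector<RegUse> &uses) { return uses.size() == 1; }

// Analogue of a query that walks only the non-debug uses.
bool hasOneNonDebugUse(const std::vector<RegUse> &uses) {
  return std::count_if(uses.begin(), uses.end(),
                       [](const RegUse &u) { return !u.isDebug; }) == 1;
}
```

For a register with one real use, `hasOneUse` returns true in a build without debug info but false once a debug use is attached, so a backend transformation guarded by it would fire in only one of the two builds. `hasOneNonDebugUse` answers the same way in both.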

I went on a bit of a spree trying to fix some of those issues in the AArch64 backend, starting with [5]. For a brief moment it was possible to add debug info to all the tests in test/CodeGen/AArch64 and still have all of them pass. Alas, that’s no longer true. Adding a buildbot could help with this. It could also be valuable to change the MachineRegisterInfo default to ignore debug uses – that’s a larger change that would require a fair amount of community review and buy-in.

[1] Object file level diffing: https://github.com/vedantk/scripts/blob/master/objdiff_driver.sh
[2] IR-level debug-affecting-codegen detection: https://github.com/vedantk/scripts/blob/master/opt-check-dbg-invar.sh
[3] MIR-level debug-affecting-codegen detection: https://llvm.org/docs/HowToUpdateDebugInfo.html#mutation-testing-for-mir-level-transformations (e.g. llvm-lit test/CodeGen/AArch64 -Dllc="llc -debugify-and-strip-all-safe")
[4] https://github.com/llvm/llvm-test-suite
[5] https://reviews.llvm.org/rG5c04274dab4858180d756329d11499df247e9d2d

vedant

Really appreciate the links:) I'll study them. A build bot will
definitely be helpful.

For Zhiwei's original problem, the (JCC + .p2align 4, 0x90) difference
(reported as PR42138, "Different codegen with/without -g"),
I have found the root cause: an assembler optimization implemented in
X86AsmBackend.
I have attached more information on D75203, "[X86] Relax existing instructions to reduce the number of nops needed for alignment purposes".

Please mention this in PR42138 as well.
Thanks,
--paulr