Expressiveness of column numbers in DWARF using clang 3.0?

Hi all,

I am processing DWARF line and column information in (x86 and ARM) executables in order to produce a mapping from the machine instructions back to the original source code (C/C++). Using the line numbers is quite straightforward (“libdwarf” [1] is doing the work for me). But when comparing the column numbers (extracted from the DWARF line table) with the corresponding source code locations, it becomes clear that they are not very “useful”.

Consider the following small example (C++):

1: #include <iostream>
2: #include <ctime>
3: #include <cstdlib>
4: using namespace std;
5: int main() {
6: int j = 0; cin >> j; long sum = (j < 0 ? -5 : 4) + rand();
7: for(int i = 0; i < j; i++) { sum += j*j-2; cout << (sum / 2) << endl; }
8: srand(time(NULL));
9: double d = rand() / 10.341; int t = (int)d+j*sum;
10: cout << sum << d << t << j;
11: return (0);
12: }

Compiling this with “clang++ Main.cpp -g -O3 -o column” results in the following location information within the generated executable:

$ dwarfdump -l column

.debug_line: line number info for a single cu
Source lines (from CU-DIE at .debug_info offset 11):
<source file> [line,column] <pc> //<new stmt or basic block
.../locale_facets.h: [868, 2] 0x80488f0 // new statement
[...]
.../Main.cpp: [ 8, 2] 0x804896f // new statement
.../Main.cpp: [ 9,28] 0x8048983 // new statement
.../ostream: [165, 9] 0x8048990 // new statement
.../Main.cpp: [ 9,28] 0x80489a0 // new statement
.../ostream: [209, 9] 0x80489ac // new statement
.../Main.cpp: [ 9,28] 0x80489b5 // new statement
.../ostream: [209, 9] 0x80489bb // new statement
[...]
.../basic_ios.h: [ 48, 2] 0x8048a23 // new statement // end of text sequence

Now, have a look at source code line 9. The extracted debug info above says that we have 3 “instruction sets” (beginning at 0x8048983, 0x80489a0 and 0x80489b5 respectively) which correspond to line 9. But all of them are labeled with column number 28! According to my understanding, this does not contribute any further information to support my task (= mapping assembler code back to the source lines or even to statements within a line). Did I miss anything?
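For readers following along, here is a minimal sketch of the address-to-source lookup described in this post. It assumes the line table has already been decoded (for example with libdwarf) into a vector sorted by address; the `LineRow` struct and the `lookup` and `excerpt_rows` functions are hypothetical names for illustration, not part of libdwarf. The rows are copied from the dwarfdump excerpt, with the elided parts left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One decoded line-table row. Hypothetical struct for illustration,
// not a libdwarf type.
struct LineRow {
    std::uint64_t pc;   // first address covered by this row
    std::string file;   // file name as printed by dwarfdump
    unsigned line;
    unsigned column;
    bool end_sequence;  // true for the row that terminates a sequence
};

static bool pc_less(const LineRow& a, const LineRow& b) { return a.pc < b.pc; }

// Returns the row whose range [row.pc, next_row.pc) contains pc, or nullptr
// if pc lies before the first row or at/after an end-of-sequence row.
// Assumes rows are sorted by pc and belong to a single sequence.
const LineRow* lookup(const std::vector<LineRow>& rows, std::uint64_t pc) {
    assert(std::is_sorted(rows.begin(), rows.end(), pc_less));
    auto it = std::upper_bound(
        rows.begin(), rows.end(), pc,
        [](std::uint64_t value, const LineRow& row) { return value < row.pc; });
    if (it == rows.begin()) return nullptr;  // pc precedes every row
    --it;                                    // last row with row.pc <= pc
    return it->end_sequence ? nullptr : &*it;
}

// Rows copied from the dwarfdump excerpt. The "[...]" parts are not
// reconstructed, so results are only meaningful inside this address window.
std::vector<LineRow> excerpt_rows() {
    return {
        {0x804896f, ".../Main.cpp", 8, 2, false},
        {0x8048983, ".../Main.cpp", 9, 28, false},
        {0x8048990, ".../ostream", 165, 9, false},
        {0x80489a0, ".../Main.cpp", 9, 28, false},
        {0x80489ac, ".../ostream", 209, 9, false},
        {0x80489b5, ".../Main.cpp", 9, 28, false},
        {0x80489bb, ".../ostream", 209, 9, false},
    };
}
```

With these rows, `lookup(rows, 0x8048988)` returns the row for line 9, column 28, which is the same answer every other address attributed to line 9 gets.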

Furthermore, I would like to use clang as a cross-compiler for ARM (as mentioned above). Is there any “native” or “default” way to achieve that? I have already successfully cross-compiled for ARM using arm-elf-gcc/g++ and newlib. But, for example, compiling with

clang++ -march=armv7-a -mfloat-abi=soft -ccc-host-triple arm-elf -integrated-as -g Main.cpp -o a.out

results in the following error message:

Main.cpp:1:10: fatal error: 'iostream' file not found
#include <iostream>
^
1 error generated.

Obviously, clang++ cannot find the C++ standard header “iostream”. As far as I know, I have to tell clang++ to use the newlib headers and libs but I don’t know how to do that…

I would be grateful for any help/hints!

Best regards
Adrian

PS: I am using LLVM/clang 3.0, SVN rev. 131589.

[1] http://wiki.dwarfstd.org/index.php?title=Libdwarf_And_Dwarfdump

You are looking at the line table produced at -O3, i.e. after the optimizer has had plenty of opportunities to transform the code. Try -O0 and see if it helps.

First of all, thanks for your reply!

I’ve already checked that at -O0 but it results in the same information. (The documentation about “Source Level Debugging with LLVM” says “LLVM debug information always provides information to accurately read the source-level state of the program, regardless of which LLVM optimizations have been run, and without any modification to the optimizations themselves.” [1])

Any other ideas?

Best regards
Adrian

[1]

You mean the instructions with a given line and column number do not match the source code construct at that location?

(The documentation about “Source Level Debugging with LLVM” says “LLVM debug information always provides information to accurately read the source-level state of the program, regardless of which LLVM optimizations have been run, and without any modification to the optimizations themselves.” [1])

It means the instructions with a given line and column number match the source code construct at that line/column. It does not mean that the optimizer/code generator will not reorder instructions. It also does not mean that the optimizer/code generator will not emit instructions without line number information. It means that, if there is line number information, it maps to the source construct as accurately as possible.

  • LLVM debug information does not prevent many important optimizations from happening (for example inlining, basic block reordering/merging/cleanup, tail duplication, etc), further reducing the amount of the compiler that eventually is “aware” of debugging information.

FWIW, I just updated the docs to match reality. Please let me know if there is still any confusion.
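One consequence of the reordering mentioned in this reply is that a single source line can own several non-contiguous address ranges. The sketch below illustrates this under the assumption that the line table is already decoded and sorted by address; `LineRow`, `Range`, `SourceLine` and `ranges_by_line` are hypothetical names, not LLVM or libdwarf API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical decoded line-table row (not a libdwarf type).
struct LineRow {
    std::uint64_t pc;
    std::string file;
    unsigned line;
    unsigned column;
};

// Half-open address range [begin, end).
struct Range {
    std::uint64_t begin;
    std::uint64_t end;
};

using SourceLine = std::pair<std::string, unsigned>;  // (file, line)

// Collects the address ranges attributed to each source line.
// Row i covers [rows[i].pc, rows[i + 1].pc); the last row only closes the
// range of the row before it. Assumes rows are sorted by pc.
std::map<SourceLine, std::vector<Range>>
ranges_by_line(const std::vector<LineRow>& rows) {
    assert(std::is_sorted(rows.begin(), rows.end(),
                          [](const LineRow& a, const LineRow& b) {
                              return a.pc < b.pc;
                          }));
    std::map<SourceLine, std::vector<Range>> result;
    for (std::size_t i = 0; i + 1 < rows.size(); ++i) {
        std::vector<Range>& ranges = result[SourceLine(rows[i].file, rows[i].line)];
        if (!ranges.empty() && ranges.back().end == rows[i].pc) {
            ranges.back().end = rows[i + 1].pc;  // extend an adjacent range
        } else {
            ranges.push_back(Range{rows[i].pc, rows[i + 1].pc});
        }
    }
    return result;
}
```

Fed with the seven rows visible in the dwarfdump excerpt (0x804896f through 0x80489bb), this yields three separate ranges for line 9 of Main.cpp, interleaved with ranges that belong to the ostream header.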

You mean the instructions with a given line and column number do not match the source code construct at that location?

No, they do.

(The documentation about “Source Level Debugging with LLVM” says “LLVM debug information always provides information to accurately read the source-level state of the program, regardless of which LLVM optimizations have been run, and without any modification to the optimizations themselves.” [1])

It means the instructions with a given line and column number match the source code construct at that line/column. It does not mean that the optimizer/code generator will not reorder instructions. It also does not mean that the optimizer/code generator will not emit instructions without line number information. It means that, if there is line number information, it maps to the source construct as accurately as possible.

Yes, that matches my understanding, too. But I thought that clang would be able to emit more than one (different) column number per line. In my example, there are three entries in the DWARF line table for line 9 (in Main.cpp), but all of them carry the same column. As a consequence, the associated assembler instructions are all mapped to the same position in the source line, and thus the column information is useless…? I mean, what additional information do the column numbers provide?

I extracted the assembler instructions for the 9th line (x86):
.../Main.cpp: 9
double d = rand() / 10.341; int t = (int)d+j*sum;
^
8048983: e8 40 fe ff ff call 80487c8 <rand@plt>
8048988: 89 c7 mov %eax,%edi
804898a: 8b 5d f0 mov -0x10(%ebp),%ebx
804898d: 0f af de imul %esi,%ebx
80489a0: f2 0f 2a c7 cvtsi2sd %edi,%xmm0
80489a4: f2 0f 5e 05 f0 8a 04 divsd 0x8048af0,%xmm0
80489ab: 08
80489b5: f2 0f 2c f0 cvttsd2si %xmm0,%esi
80489b9: 01 de add %ebx,%esi

I hope that makes it clearer… ;-)
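The question of whether the columns add anything can be checked mechanically. A minimal sketch, assuming the line table has been decoded into a vector of rows: collect the set of distinct columns recorded for each source line and see whether any line has more than one. `LineRow`, `SourceLine`, `columns_by_line` and `columns_distinguish` are hypothetical names used only for this illustration.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical decoded line-table row (not a libdwarf type).
struct LineRow {
    std::uint64_t pc;
    std::string file;
    unsigned line;
    unsigned column;
};

using SourceLine = std::pair<std::string, unsigned>;  // (file, line)
using ColumnMap = std::map<SourceLine, std::set<unsigned>>;

// Collects every distinct column number recorded for each source line.
ColumnMap columns_by_line(const std::vector<LineRow>& rows) {
    ColumnMap result;
    for (const LineRow& row : rows) {
        result[SourceLine(row.file, row.line)].insert(row.column);
    }
    return result;
}

// True only if at least two different columns were recorded for the line,
// i.e. the column can separate instructions within that line.
bool columns_distinguish(const ColumnMap& columns, const SourceLine& where) {
    ColumnMap::const_iterator it = columns.find(where);
    return it != columns.end() && it->second.size() > 1;
}
```

For the three line-9 rows in the excerpt (all at column 28) the set has a single element, so `columns_distinguish` returns false for that line.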

BTW, any hints regarding my cross-compilation question?

Best regards
Adrian

Update: I’ve found out that the location information is possibly incorrect
when it points to standard C/C++ headers, as shown in the following listing:
--------------------------------------------------------------------------------
[...]
/usr/include/c++/4.4/bits/basic_ios.h: 48
if (!__f)
^
58: 80489e5: e8 2e fd ff ff call 8048718 <_ZSt16__throw_bad_castv@plt>
59: 80489ea: 66 0f 1f 44 00 00 nopw 0x0(%eax,%eax,1)
60: 80489f0: 55 push %ebp
61: 80489f1: 89 e5 mov %esp,%ebp
62: 80489f3: 83 ec 18 sub $0x18,%esp
63: 80489f6: c7 04 24 94 a1 04 08 movl $0x804a194,(%esp)
64: 80489fd: e8 56 fd ff ff call 8048758 <_ZNSt8ios_base4InitC1Ev@plt>
65: 8048a02: c7 44 24 08 44 a0 04 movl $0x804a044,0x8(%esp)
66: 8048a09: 08
67: 8048a0a: c7 44 24 04 94 a1 04 movl $0x804a194,0x4(%esp)
68: 8048a11: 08
69: 8048a12: c7 04 24 78 87 04 08 movl $0x8048778,(%esp)
70: 8048a19: e8 ea fc ff ff call 8048708 <__cxa_atexit@plt>
71: 8048a1e: 83 c4 18 add $0x18,%esp
72: 8048a21: 5d pop %ebp
73: 8048a22: c3 ret
--------------------------------------------------------------------------------
/usr/include/c++/4.4/bits/basic_ios.h: 439
widen(char __c) const
^
74: 8048958: 8b 5c 30 7c mov 0x7c(%eax,%esi,1),%ebx
75: 804895c: 85 db test %ebx,%ebx
76: 804895e: 0f 84 81 00 00 00 je 80489e5 <main+0x135>
--------------------------------------------------------------------------------
/usr/include/c++/4.4/bits/locale_facets.h: 866
{
^
77: 8048964: 80 7b 1c 00 cmpb $0x0,0x1c(%ebx)
78: 8048968: 74 86 je 80488f0 <main+0x40>
--------------------------------------------------------------------------------
/usr/include/c++/4.4/bits/locale_facets.h: 867
if (_M_widen_ok)
^
79: 804896a: 8a 43 27 mov 0x27(%ebx),%al
80: 804896d: eb 99 jmp 8048908 <main+0x58>
--------------------------------------------------------------------------------
/usr/include/c++/4.4/bits/locale_facets.h: 868
return _M_widen[static_cast<unsigned char>(__c)];
^
81: 80488f0: 89 1c 24 mov %ebx,(%esp)
82: 80488f3: e8 50 fe ff ff call 8048748 <_ZNKSt5ctypeIcE13_M_widen_initEv@plt>
--------------------------------------------------------------------------------
/usr/include/c++/4.4/bits/locale_facets.h: 869
this->_M_widen_init();
^
83: 80488f8: 8b 03 mov (%ebx),%eax
84: 80488fa: 89 1c 24 mov %ebx,(%esp)
85: 80488fd: c7 44 24 04 0a 00 00 movl $0xa,0x4(%esp)
86: 8048904: 00
87: 8048905: ff 50 18 call *0x18(%eax)
--------------------------------------------------------------------------------
[...]
--------------------------------------------------------------------------------
/usr/include/c++/4.4/ostream: 538
endl(basic_ostream<_CharT, _Traits>& __os)
^
98: 8048908: 0f be c0 movsbl %al,%eax
99: 804890b: 89 44 24 04 mov %eax,0x4(%esp)
100: 804890f: 89 34 24 mov %esi,(%esp)
101: 8048912: e8 c1 fe ff ff call 80487d8 <_ZNSo3putEc@plt>
102: 8048953: 8b 06 mov (%esi),%eax
103: 8048955: 8b 40 f4 mov -0xc(%eax),%eax
--------------------------------------------------------------------------------
/usr/include/c++/4.4/ostream: 559
flush(basic_ostream<_CharT, _Traits>& __os)
^
104: 8048917: 89 04 24 mov %eax,(%esp)
105: 804891a: e8 79 fe ff ff call 8048798 <_ZNSo5flushEv@plt>
106: 804891f: 8b 75 ec mov -0x14(%ebp),%esi
107: 8048922: 47 inc %edi
--------------------------------------------------------------------------------
(The “^” marks the column position within the line.)

I am not completely sure, but the mapping of line 868 in file “locale_facets.h” might be wrong: the instructions attributed to it include a call instruction that calls “_M_widen_init”, but in the source this function is actually called on the next line (869).

Here is the extract from locale_facets.h:
char_type
widen(char __c) const
{
if (_M_widen_ok)
return _M_widen[static_cast<unsigned char>(__c)];
this->_M_widen_init();
return this->do_widen(__c);
}

In addition, the instructions attributed to line 48 of “basic_ios.h” include a ret instruction, which should map to a return or throw statement instead. The column numbers are obviously wrong.

Are these interpretations correct?

Best regards
Adrian

It's good. Nevertheless, do you have any suggestions regarding my recent replies?

Best regards
Adrian