More DWARF problems

I’ve been trying to track down the problem with the DWARF info that is being emitted by my front end, which has been broken for about a month now. Here’s what happens when I attempt to use gdb to debug one of my programs on OS X:

gdb stack crawl at point of internal error:
[ 0 ] /usr/libexec/gdb/gdb-i386-apple-darwin (align_down+0x0) [0x122300]
[ 1 ] /usr/libexec/gdb/gdb-i386-apple-darwin (find_partial_die_in_comp_unit+0x65) [0xc0e19]
[ 2 ] /usr/libexec/gdb/gdb-i386-apple-darwin (find_partial_die+0x2d4) [0xcf07f]
[ 3 ] /usr/libexec/gdb/gdb-i386-apple-darwin (fixup_partial_die+0x29) [0xcf0b3]
[ 4 ] /usr/libexec/gdb/gdb-i386-apple-darwin (scan_partial_symbols+0x26) [0xcf9e7]
[ 5 ] /usr/libexec/gdb/gdb-i386-apple-darwin (dwarf2_build_psymtabs+0xc54) [0xd093c]
[ 6 ] /usr/libexec/gdb/gdb-i386-apple-darwin (macho_symfile_read+0x145) [0x163b15]
[ 7 ] /usr/libexec/gdb/gdb-i386-apple-darwin (syms_from_objfile+0x62d) [0x52259]
[ 8 ] /usr/libexec/gdb/gdb-i386-apple-darwin (symbol_file_add_with_addrs_or_offsets_using_objfile+0x338) [0x561e7]
[ 9 ] /usr/libexec/gdb/gdb-i386-apple-darwin (symbol_file_add_with_addrs_or_offsets_using_objfile+0x2da) [0x56189]
[ 10 ] /usr/libexec/gdb/gdb-i386-apple-darwin (symbol_file_add_name_with_addrs_or_offsets+0x7a) [0x563c9]
[ 11 ] /usr/libexec/gdb/gdb-i386-apple-darwin (symbol_file_add_main_1+0xf2) [0x56e36]
[ 12 ] /usr/libexec/gdb/gdb-i386-apple-darwin (catch_command_errors+0x4d) [0x7ac88]
/SourceCache/gdb/gdb-966/src/gdb/dwarf2read.c:7593: internal-error: could not find partial DIE in cache
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n)

Now, all of this was working earlier, and I don’t know whether it was something I did or a change in LLVM, but that’s not important. The real question is how to track down the problem.

In the past, the way that I have dealt with DWARF-related problems is to try a number of strategies:

  1. Reduce the problem to the smallest reproducible case. In the past I have had some success with this, but not in this case. You see, one of the problems with object-oriented languages is that even simple operations, such as appending an element to an array, can end up pulling in a very large number of classes. (For example, the array class might throw an exception if your index is invalid, which pulls in the exception hierarchy, and so on.)

I have a special script which attempts to compile a “minimal” test case, without the standard library and with garbage collection disabled. Unfortunately, none of the “small” test cases that I have been able to come up with exhibit the problem, and any time I use certain language features I am forced to link in the standard library which makes the test program huge. I have plenty of example cases which exhibit the problem, but they are all bitcode files on the order of 100K or more in size. And I’m not going to have much luck tracking down a needle in such a large haystack.

  2. Use dwarfdump to try to verify the validity of the debug symbols.

Unfortunately, the information from dwarfdump is not too useful in this case. Here’s what I get:

  • On OS X, with the “small” test cases I created, I get no errors at all.
  • On OS X, with my normal unit tests (with the standard library) I get hundreds of error messages of the following form:

0x00000882: DIE attribute 0x00000883: AT_type/FORM_ref4 has a value 0x00000592 that is not in the current compile unit in the .debug_info section.

0x000009a9: DIE attribute 0x000009ae: AT_type/FORM_ref4 has a value 0x000001c2 that is not in the current compile unit in the .debug_info section.

0x00000b85: DIE attribute 0x00000b8a: AT_type/FORM_ref4 has a value 0x0000055c that is not in the current compile unit in the .debug_info section.

0x00000c88: DIE attribute 0x00000c89: AT_type/FORM_ref4 has a value 0x0000055c that is not in the current compile unit in the .debug_info section.

0x00000d2f: DIE attribute 0x00000d34: AT_type/FORM_ref4 has a value 0x0000055c that is not in the current compile unit in the .debug_info section.

0x00000d9a: DIE attribute 0x00000d9f: AT_type/FORM_ref4 has a value 0x00000584 that is not in the current compile unit in the .debug_info section.

0x00000e43: DIE attribute 0x00000e48: AT_type/FORM_ref4 has a value 0x000011ac that is not in the current compile unit in the .debug_info section.

0x00000ea3: DIE attribute 0x00000ea8: AT_type/FORM_ref4 has a value 0x00001225 that is not in the current compile unit in the .debug_info section.

0x00000ebe: DIE attribute 0x00000ebf: AT_type/FORM_ref4 has a value 0x00001248 that is not in the current compile unit in the .debug_info section.

0x00000ee3: DIE attribute 0x00000ee4: AT_type/FORM_ref4 has a value 0x00001285 that is not in the current compile unit in the .debug_info section.

  • On Linux - well the problem here is that even when my DWARF info was working, dwarfdump would spit out a ton of error messages about bad file DIEs and other spam - in other words, I’ve never been able to use LLVM to produce a binary on Linux that was dwarfdump-error free. So any “new” errors are mixed in with all of the “old” errors I was seeing before.
  3. Use llbrowse to manually inspect the DIEs and see if they make sense. (Which is part of the reason why I wrote llbrowse.) Again, the problem is that I don’t know where to look, and the files are simply too large to inspect manually.

Maybe you could try dwarflint:

https://fedorahosted.org/elfutils/wiki/DwarfLint

I don't know anything about it; I've just seen it mentioned over on the GCC
mailing lists.

Jay.

I have seen gdb crash with this backtrace when it sees a subprogram specification DIE at top level but cannot find the actual subprogram definition. The definition DIE may not be found either because it is hidden deep in a nested subclass or because it is missing altogether from the compiler output. One easy way to rule this out is to check the indentation level of every specification DIE in the dwarfdump output and compare it with the level of the definition DIE it refers to.

0x00000882: DIE attribute 0x00000883: AT_type/FORM_ref4 has a value 0x00000592 that is not in the current compile unit in the .debug_info section.

This indicates that while DwarfDebug.cpp was preparing the DWARF info, it created a DIE 0x00000592 that is referenced by another DIE 0x00000883, but somehow DIE 0x00000592 was never emitted. This could be a bug in DwarfDebug.cpp or in how the front end generates debug info.

In DwarfDebug.cpp, you’ll see code like:

addDIEEntry(VariableSpecDIE, dwarf::DW_AT_specification, dwarf::DW_FORM_ref4, VariableDIE);

Here VariableSpecDIE refers to VariableDIE, but VariableDIE is missing from the output. There are other uses of DW_FORM_ref4 as well. So check in your dwarfdump output what 0x00000883 is, set an appropriate breakpoint in the debugger, and see why the referenced DIE never reaches DwarfDebug::emitDIE().

Dwarflint sounds really useful. Unfortunately I wasn’t able to get it to ./configure on OS X: it complained about not finding support for __thread (I’m still running OS X 10.5 and apparently the version of gcc that I have doesn’t support this). I might try building it on Linux, but I wanted to try it on OS X since that’s where I am having the most problems.

OK, given that much information I was able to track it down: I was passing my struct type as the context parameter to DIBuilder.createMethod. If I change it to the compile unit, this problem goes away. I had thought I had read somewhere that it was legal to use the enclosing class definition as the subroutine context, but now I can’t find where I read it. In any case, I guess this means that I don’t know the proper way to declare member functions in DWARF. That is, how can I declare method A of class B so that I can say “B.A” in the debugger and gdb knows where to find it?

OK, I’m still trying to track this one down; it is apparently unrelated to the earlier problem. After fixing the problem with the subroutine context mentioned above, I now see the following in gdb:

Die: DW_TAG_formal_parameter (abbrev = 27, offset = 14760)
has children: FALSE
attributes:
DW_AT_name (DW_FORM_strp) string: "testType"
DW_AT_decl_file (DW_FORM_data1) constant: 74
DW_AT_decl_line (DW_FORM_data1) constant: 47
DW_AT_type (DW_FORM_ref4) constant ref: 43711 (adjusted)
DW_AT_location (DW_FORM_block1) block: size 2
Dwarf Error: Cannot find type of die [in module /Users/talin/Projects/tart/build-eclipse/test/stdlib/BitTricksTest.dSYM/Contents/Resources/DWARF/BitTricksTest]

This is good because I know exactly where that parameter is - now the question is to figure out what is wrong with it.

You put the subroutine declaration inside the struct and put the definition at compile unit level. Take a look at the dwarfdump output for the following simple C++ program.
--- c++ ---

class A {
public: int foo();
};
int A::foo() { return 42; }
A a;

Wow, I never would have guessed that. Is this a limitation of DWARF or of the LLVM generator? I ask because it seems somewhat C++-centric - many languages (mine included) don’t have separate definitions and declarations for functions.

I’d say that if this is required, then there should be a note in the debugging doc about it - or better yet, a helper method in DIBuilder that automatically creates both the definition and the declaration.

OK I’ve been checking this out some more, and the DIEs don’t look valid to me. Take a look at this output from dwarfdump -v:

0x000000c7: TAG_subprogram [3]
0x000000c8: AT_name( .debug_str[0x000001bd] = "construct" )
0x000000cc: AT_MIPS_linkage_name( .debug_str[0x000001c7] = "tart.reflect.Parameter.construct(tart.core.String)" )
0x000000d0: AT_decl_file( 0x3d ( "/Users/talin/Projects/tart/trunk/lib/std/tart/reflect/Parameter.tart" ) )
0x000000d1: AT_decl_line( 0x0d ( 13 ) )
0x000000d2: AT_type( cu + 0x00000066 => {0x00000103} ( ) )
0x000000d6: AT_external( 0x01 )
0x000000d7: AT_low_pc( 0x0000f780 )
0x000000db: AT_high_pc( 0x0000f7b1 )
0x000000df: AT_frame_base( <0x1> 55 ( reg5 ) )

0x000000e1: NULL

0x000000e2: Compile Unit: length = 0x00000071 version = 0x0002 abbr_offset = 0x00000000 addr_size = 0x04 (next CU at 0x00000157)

0x000000ed: TAG_compile_unit [1] *
0x000000ee: AT_producer( .debug_str[0x00000001] = "0.1 tartc" )
0x000000f2: AT_language( 0x0002 ( DW_LANG_C ) )
0x000000f4: AT_name( .debug_str[0x000001fa] = "range.tart" )
0x000000f8: AT_entry_pc( 0x00004360 )
0x000000fc: AT_stmt_list( 0x00000000 ( 0x00000000 ) )
0x00000100: AT_comp_dir( .debug_str[0x00000205] = "/Users/talin/Projects/tart/trunk/lib/std/tart/core" )
0x00000104: AT_APPLE_major_runtime_vers( 0x01 )

In particular, note that the DIE starting at 0xc7, which is a TAG_subprogram, has a return type (AT_type) that points to 0x103. However, if you look further down, you’ll see that there is no DIE at offset 0x103. Instead, it looks like it’s pointing into the middle of another DIE, inside the next compile unit.

At least, this is true if I’m interpreting this right.

Talin,

You're developing your own language and tool set, which brings a fresh perspective to our predominantly C-oriented environment, and that is good.

Is this a limitation of DWARF or of the LLVM generator?

Neither, IMO. My own reading of DWARF did not find clear wording on this subject. I am not even sure whether it is just a Darwin gdb implementation decision.

I ask because it seems somewhat C++-centric - many languages (mine included) don't have separate definitions and declarations for functions.

I'd say that if this is required, then there should be a note in the debugging doc about it - or better yet, a helper method in DIBuilder that automatically creates both the definition and the declaration.

Right now, DIBuilder's view is based on clang's needs. However, it is OK if you want to add a new helper method to DIBuilder for your needs. All you need to do is create two MDNodes, one for the definition and one for the declaration, with appropriate contexts.

Thanks for the kind words :) After roughly 4 years of working on this, I’m getting really close to an initial 0.1 release (I’ve spent most of my free time over the last couple weeks updating the documentation.)

OK sounds like I should. However, before I can do that, I need to get the debug output into a clean state that works with gdb and produces no warnings with dwarfdump. Do you have any thoughts on the output from dwarfdump that I posted in the previous message?

I’ve been trying to track down the problem with the DWARF info that is being emitted by my front end, which has been broken for about a month now. Here’s what happens when I attempt to use gdb to debug one of my programs on OS X:

gdb stack crawl at point of internal error:

[ 0 ] /usr/libexec/gdb/gdb-i386-apple-darwin (align_down+0x0) [0x122300]

[ 1 ] /usr/libexec/gdb/gdb-i386-apple-darwin (find_partial_die_in_comp_unit+0x65) [0xc0e19]

[ 2 ] /usr/libexec/gdb/gdb-i386-apple-darwin (find_partial_die+0x2d4) [0xcf07f]

[ 3 ] /usr/libexec/gdb/gdb-i386-apple-darwin (fixup_partial_die+0x29) [0xcf0b3]

[ 4 ] /usr/libexec/gdb/gdb-i386-apple-darwin (scan_partial_symbols+0x26) [0xcf9e7]

[ 5 ] /usr/libexec/gdb/gdb-i386-apple-darwin (dwarf2_build_psymtabs+0xc54) [0xd093c]

[ 6 ] /usr/libexec/gdb/gdb-i386-apple-darwin (macho_symfile_read+0x145) [0x163b15]

[ 7 ] /usr/libexec/gdb/gdb-i386-apple-darwin (syms_from_objfile+0x62d) [0x52259]

[ 8 ] /usr/libexec/gdb/gdb-i386-apple-darwin (symbol_file_add_with_addrs_or_offsets_using_objfile+0x338) [0x561e7]

[ 9 ] /usr/libexec/gdb/gdb-i386-apple-darwin (symbol_file_add_with_addrs_or_offsets_using_objfile+0x2da) [0x56189]

[ 10 ] /usr/libexec/gdb/gdb-i386-apple-darwin (symbol_file_add_name_with_addrs_or_offsets+0x7a) [0x563c9]

[ 11 ] /usr/libexec/gdb/gdb-i386-apple-darwin (symbol_file_add_main_1+0xf2) [0x56e36]

[ 12 ] /usr/libexec/gdb/gdb-i386-apple-darwin (catch_command_errors+0x4d) [0x7ac88]

/SourceCache/gdb/gdb-966/src/gdb/dwarf2read.c:7593: internal-error: could not find partial DIE in cache

A problem internal to GDB has been detected,

further debugging may prove unreliable.

Quit this debugging session? (y or n)

Now, all of this was working earlier, and I don’t know whether it was something I did or a change in LLVM, but that’s not important. The real question is how to track down the problem.

I have seen gdb crash with this back trace when it has seen a subprogram specification DIE at top level, but the actual subprogram definition is not found. The definition DIE may not be found because either it is hiding deep in nested subclass or it may be missing all together in compiler output. One easy way to rule out this is to check all specification DIE’s indentation level in dwarfdump output and check corresponding level of definition die referred by it.

In the past, the way that I have dealt with DWARF-related problems is to try a number of strategies:

  1. Reduce the problem to the smallest reproducible case. In the past I have had some success with this, but not in this case. You see, one of the problems with object-oriented languages is that even simple operations - such as appending an element to an array - can end up pulling in a very large number of classes (For example, the array class might throw an exception if your index is invalid, which pulls in the exception hierarchy and so on…)

I have a special script which attempts to compile a “minimal” test case, without the standard library and with garbage collection disabled. Unfortunately, none of the “small” test cases that I have been able to come up with exhibit the problem, and any time I use certain language features I am forced to link in the standard library which makes the test program huge. I have plenty of example cases which exhibit the problem, but they are all bitcode files on the order of 100K or more in size. And I’m not going to have much luck tracking down a needle in such a large haystack.

  2. Use dwarfdump to try to verify the validity of the debug symbols.

Unfortunately, the information from dwarfdump is not too useful in this case. Here’s what I get:

  • On OS X, with the “small” test cases I created, I get no errors at all.
  • On OS X, with my normal unit tests (with the standard library) I get hundreds of error messages of the following form:

0x00000882: DIE attribute 0x00000883: AT_type/FORM_ref4 has a value 0x00000592 that is not in the current compile unit in the .debug_info section.

This indicates that while DwarfDebug.cpp was preparing the DWARF info, it created a DIE at 0x00000592 that is referenced by the attribute at 0x00000883 (in the DIE at 0x00000882), but somehow the DIE at 0x00000592 was never emitted. This could be a bug in DwarfDebug.cpp or in how the front end generates debug info.

In DwarfDebug.cpp, you’ll see code like

addDIEEntry(VariableSpecDIE, dwarf::DW_AT_specification, dwarf::DW_FORM_ref4, VariableDIE);

Here VariableSpecDIE refers to VariableDIE, but VariableDIE is missing from the output. There are other uses of DW_FORM_ref4 as well. So check your dwarfdump output to see what is at 0x00000883, set an appropriate breakpoint in the debugger, and find out why the DIE it refers to never reaches DwarfDebug::emitDIE().
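With hundreds of these messages, it may help to group them by the missing target so you know which DIEs to chase first. The following is only a rough sketch: the regular expression is written against the exact message wording quoted in this thread (other dwarfdump versions may word it differently), the sample lines are error messages copied from this thread, and `group_dangling_refs` is a made-up helper name.

```python
import re
from collections import defaultdict

# Matches the error wording quoted in this thread; this is an assumption
# about the message format, not a documented dwarfdump interface.
ERROR_RE = re.compile(
    r"^(0x[0-9a-fA-F]+): DIE attribute (0x[0-9a-fA-F]+): "
    r"(\w+)/(\w+) has a value (0x[0-9a-fA-F]+) "
    r"that is not in the current compile unit"
)


def group_dangling_refs(lines):
    """Map each missing target offset to the (DIE, attribute) offsets that reference it."""
    by_target = defaultdict(list)
    for line in lines:
        m = ERROR_RE.match(line.strip())
        if m:
            die, attr, _attr_name, _form, target = m.groups()
            by_target[int(target, 16)].append((int(die, 16), int(attr, 16)))
    return dict(by_target)


# Error lines copied from this thread.
sample = [
    "0x00000889: DIE attribute 0x0000088a: AT_type/FORM_ref4 has a value 0x00001053 that is not in the current compile unit in the .debug_info section.",
    "0x000008b4: DIE attribute 0x000008b5: AT_type/FORM_ref4 has a value 0x000010a1 that is not in the current compile unit in the .debug_info section.",
    "0x000008d6: DIE attribute 0x000008d7: AT_type/FORM_ref4 has a value 0x00001053 that is not in the current compile unit in the .debug_info section.",
]

result = group_dangling_refs(sample)
for target, refs in sorted(result.items()):
    users = ", ".join(f"DIE {d:#x} (attr {a:#x})" for d, a in refs)
    print(f"missing {target:#x}: referenced by {users}")
```

A target that is referenced from many places is probably a single type that never got emitted, which narrows down where to put the breakpoint.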

OK, I’ve been checking this out some more, and the DIEs don’t look valid to me. Take a look at this output from dwarfdump -v:

0x000000c7: TAG_subprogram [3]
0x000000c8: AT_name( .debug_str[0x000001bd] = “construct” )
0x000000cc: AT_MIPS_linkage_name( .debug_str[0x000001c7] = “tart.reflect.Parameter.construct(tart.core.String)” )
0x000000d0: AT_decl_file( 0x3d ( “/Users/talin/Projects/tart/trunk/lib/std/tart/reflect/Parameter.tart” ) )
0x000000d1: AT_decl_line( 0x0d ( 13 ) )
0x000000d2: AT_type( cu + 0x00000066 => {0x00000103} ( ) )
0x000000d6: AT_external( 0x01 )
0x000000d7: AT_low_pc( 0x0000f780 )
0x000000db: AT_high_pc( 0x0000f7b1 )
0x000000df: AT_frame_base( <0x1> 55 ( reg5 ) )

0x000000e1: NULL

0x000000e2: Compile Unit: length = 0x00000071 version = 0x0002 abbr_offset = 0x00000000 addr_size = 0x04 (next CU at 0x00000157)

0x000000ed: TAG_compile_unit [1] *
0x000000ee: AT_producer( .debug_str[0x00000001] = “0.1 tartc” )
0x000000f2: AT_language( 0x0002 ( DW_LANG_C ) )
0x000000f4: AT_name( .debug_str[0x000001fa] = “range.tart” )
0x000000f8: AT_entry_pc( 0x00004360 )
0x000000fc: AT_stmt_list( 0x00000000 ( 0x00000000 ) )
0x00000100: AT_comp_dir( .debug_str[0x00000205] = “/Users/talin/Projects/tart/trunk/lib/std/tart/core” )
0x00000104: AT_APPLE_major_runtime_vers( 0x01 )

In particular, note that the DIE starting at 0x0c7, which is a TAG_subprogram, has a return type (AT_type) which points to 0x103. However, if you look further down, you’ll see that there is no DIE at offset 0x103. Instead, it points into the middle of another DIE, and that DIE belongs to the next compile unit.
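The arithmetic behind that observation can be checked by hand or with a few lines of Python. This is just a sketch using the numbers from the dump above; the helper names are made up, and it assumes 32-bit DWARF, where the unit length does not count its own 4-byte field and a FORM_ref4 value is an offset from the start of the referring compile unit's header.

```python
def cu_end(cu_header_offset, unit_length):
    """Offset just past a compile unit (32-bit DWARF: length excludes its own 4 bytes)."""
    return cu_header_offset + 4 + unit_length


def ref4_target(cu_header_offset, ref):
    """Absolute .debug_info offset of a CU-relative FORM_ref4 value."""
    return cu_header_offset + ref


# The second CU in the dump: header at 0xe2, length 0x71.
next_cu = 0xE2
assert cu_end(next_cu, 0x71) == 0x157  # agrees with "next CU at 0x00000157"

# dwarfdump prints "cu + 0x00000066 => {0x00000103}", so the CU that
# contains the subprogram starts at 0x103 - 0x66.
referring_cu = 0x103 - 0x66
target = ref4_target(referring_cu, 0x66)

# The referring CU stops where the next CU header begins (0xe2).
in_own_cu = referring_cu <= target < next_cu

# In the next CU, AT_comp_dir starts at 0x100 and the following
# attribute starts at 0x104, so 0x103 is the last byte of AT_comp_dir.
inside_comp_dir = 0x100 <= target < 0x104

print(hex(referring_cu), hex(target), in_own_cu, inside_comp_dir)
# prints: 0x9d 0x103 False True
```

So the reference escapes its own compile unit entirely, which is exactly what the dwarfdump error message is complaining about.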

Not to be a pest, but I’m still stuck on this one.

This means the subprogram type is invalid. Set a breakpoint inside createSubprogramDIE() where addType() is used to add AT_type.

OK that was useful. I figured out what at least one of my problems was - an incorrect encoding of type “void”.

However, I’m still seeing errors in the dwarfdump output, although fewer than before. Here’s a sample of one of the errors I am getting:

0x000192a4: DIE attribute 0x000192a5: AT_type/FORM_ref4 has a value 0x000194c7 that is not in the current compile unit in the .debug_info section.

And here’s the relevant DIE that it’s referring to:

0x0001929b: TAG_array_type [12] *

0x0001929c: AT_sibling( cu + 0x00000177 => {0x000192ab} )

0x000192a0: AT_type( cu + 0x000000b2 => {0x000191e6} ( uint32 ) )

0x000192a4: TAG_subrange_type [13]

0x000192a5: AT_type( cu + 0x00000393 => {0x000194c7} ( ) )

0x000192a9: AT_upper_bound( 0x01 )

0x000192aa: NULL

dwarfdump is complaining because the AT_type attribute of the subrange is pointing to an invalid offset. Now, I created this subrange with the call:

diBuilder_.getOrCreateSubrange(0, 1);

As you can see, there’s nothing there for me to screw up. So I’m puzzled as to where that AT_type is coming from.

Another strange thing: when I run dwarfdump on the .o file, I get far fewer error messages than when I run it on the final executable that was built from just that one .o file. For example, these errors (and many more) only show up for the executable:

0x00000889: DIE attribute 0x0000088a: AT_type/FORM_ref4 has a value 0x00001053 that is not in the current compile unit in the .debug_info section.

0x000008b4: DIE attribute 0x000008b5: AT_type/FORM_ref4 has a value 0x000010a1 that is not in the current compile unit in the .debug_info section.

0x000008d6: DIE attribute 0x000008d7: AT_type/FORM_ref4 has a value 0x00001053 that is not in the current compile unit in the .debug_info section.

Looking at the first of these, here’s what the DIE from the executable looks like:

0x0000087d: TAG_class_type [9] *

0x0000087e: AT_sibling( cu + 0x000000ad => {0x000008a0} )

0x00000882: AT_name( .debug_str[0x000001ff] = “tart.reflect.Type” )

0x00000886: AT_byte_size( 0x0c )

0x00000887: AT_decl_file( 0x42 ( “/Users/talin/Projects/tart/trunk/lib/std/tart/reflect/Type.tart” ) )

0x00000888: AT_decl_line( 0x05 ( 5 ) )

0x00000889: TAG_inheritance [10]

0x0000088a: AT_type( cu + 0x00000860 => {0x00001053} ( ) )

0x0000088e: AT_data_member_location( <0x2> 23 00 ( plus-uconst 0x0000 ) )

0x00000891: TAG_member [11]

0x00000892: AT_name( .debug_str[0x00000211] = “_typeKind” )

0x00000896: AT_type( cu + 0x00000063 => {0x00000856} ( NULL ) )

0x0000089a: AT_decl_file( 0x3a ( “/Users/talin/Projects/tart/trunk/lib/std/tart/reflect/Module.tart” ) )

0x0000089b: AT_decl_line( 0x17 ( 23 ) )

0x0000089c: AT_data_member_location( <0x2> 23 08 ( plus-uconst 0x0008 ) )

0x0000089f: NULL

Now, if I find the corresponding DIE from the .o file, here’s what it looks like:

0x000002c4: TAG_class_type [9] *

0x000002c5: AT_sibling( cu + 0x000002fb => {0x000002fb} )

0x000002c9: AT_name( “tart.reflect.Type” )

0x000002db: AT_byte_size( 0x0c )

0x000002dc: AT_decl_file( 0x42 ( “/Users/talin/Projects/tart/trunk/lib/gc1/tart/gc1/Type.tart” ) )

0x000002dd: AT_decl_line( 0x05 ( 5 ) )

0x000002de: TAG_inheritance [10]

0x000002df: AT_type( cu + 0x00000a7b => {0x00000a7b} ( tart.core.Object ) )

0x000002e3: AT_data_member_location( <0x2> 23 00 ( plus-uconst 0x0000 ) )

0x000002e6: TAG_member [11]

0x000002e7: AT_name( “_typeKind” )

0x000002f1: AT_type( cu + 0x00000221 => {0x00000221} ( TypeKind ) )

0x000002f5: AT_decl_file( 0x4b ( “/Users/talin/Projects/tart/trunk/lib/gc1/tart/gc1/GC1.tart” ) )

0x000002f6: AT_decl_line( 0x17 ( 23 ) )

0x000002f7: AT_data_member_location( <0x2> 23 08 ( plus-uconst 0x0008 ) )

0x000002fa: NULL

As you can see, the AT_type from the TAG_inheritance DIE is pointing to a valid DIE in the .o file, but not in the executable. Here’s the type it’s pointing to:

0x00000a7b: TAG_class_type [9] *

0x00000a7c: AT_sibling( cu + 0x00000ab9 => {0x00000ab9} )

0x00000a80: AT_name( “tart.core.Object” )

0x00000a91: AT_byte_size( 0x08 )

0x00000a92: AT_decl_file( 0x12 ( “/Users/talin/Projects/tart/trunk/lib/gc1/tart/gc1/Object.tart” ) )

0x00000a93: AT_decl_line( 0x07 ( 7 ) )

0x00000a94: TAG_member [11]

0x00000a95: AT_name( “__tib” )

0x00000a9b: AT_type( cu + 0x00000a75 => {0x00000a75} ( tart.core.TypeInfoBlock* ) )

0x00000a9f: AT_decl_file( 0x4b ( “/Users/talin/Projects/tart/trunk/lib/gc1/tart/gc1/GC1.tart” ) )

0x00000aa0: AT_decl_line( 0x08 ( 8 ) )

0x00000aa1: AT_data_member_location( <0x2> 23 00 ( plus-uconst 0x0000 ) )

0x00000aa4: TAG_member [11]

0x00000aa5: AT_name( “__gcstate” )

0x00000aaf: AT_type( cu + 0x00000094 => {0x00000094} ( int32 ) )

0x00000ab3: AT_decl_file( 0x4b ( “/Users/talin/Projects/tart/trunk/lib/gc1/tart/gc1/GC1.tart” ) )

0x00000ab4: AT_decl_line( 0x09 ( 9 ) )

0x00000ab5: AT_data_member_location( <0x2> 23 04 ( plus-uconst 0x0004 ) )

0x00000ab8: NULL

Now, the .o file is being produced by llc, and the executable is being produced by:

/usr/bin/c++ -g -o $PRGNAME $PRGNAME.o runtime/libruntime.a

So the question is - why are the DIEs which are correct in the .o file being garbled when they are in the executable?
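One way to compare the two files mechanically, rather than eyeballing DIEs one at a time, is to check that every CU-relative reference in a dump lands on the start of some DIE. The sketch below is only an illustration: the regular expressions are guesses based on the dwarfdump output format shown in this thread, `find_bad_refs` is a made-up name, and the two samples are trimmed excerpts from the dumps above. On a real run you would feed it the complete dump of each file, since an excerpt will flag any reference whose target has simply been cut off.

```python
import re

# Line shapes as they appear in the dumps quoted in this thread
# (assumed, not a documented format).
DIE_RE = re.compile(r"^\s*(0x[0-9a-fA-F]+):\s+TAG_\w+")
REF_RE = re.compile(
    r"^\s*(0x[0-9a-fA-F]+):\s+(AT_\w+)\(\s*cu \+ 0x[0-9a-fA-F]+ => \{(0x[0-9a-fA-F]+)\}"
)


def find_bad_refs(dump_lines):
    """Return (attr_offset, attr_name, target) for references that hit no DIE start."""
    die_starts = set()
    refs = []
    for line in dump_lines:
        m = DIE_RE.match(line)
        if m:
            die_starts.add(int(m.group(1), 16))
            continue
        m = REF_RE.match(line)
        if m:
            refs.append((int(m.group(1), 16), m.group(2), int(m.group(3), 16)))
    return [r for r in refs if r[2] not in die_starts]


# Trimmed excerpt from the .o dump: the inheritance target exists.
obj_excerpt = [
    "0x000002de: TAG_inheritance [10]",
    "0x000002df: AT_type( cu + 0x00000a7b => {0x00000a7b} ( tart.core.Object ) )",
    "0x00000a7b: TAG_class_type [9] *",
]

# Trimmed excerpt from the executable dump: nothing starts at 0x1053.
exe_excerpt = [
    "0x00000889: TAG_inheritance [10]",
    "0x0000088a: AT_type( cu + 0x00000860 => {0x00001053} ( ) )",
]

print(find_bad_refs(obj_excerpt))
print([(hex(a), name, hex(t)) for a, name, t in find_bad_refs(exe_excerpt)])
```

Running this over both full dumps and comparing the two lists should show which references are broken only in the executable, and therefore which ones to investigate on the linking side rather than in llc.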