Question about building line tables

This discussion originally started on a code review thread, but I figured I would continue it here since the patch has landed and I want to do more work as a followup.

So LLDB’s LineTable data structures have the notion of a “terminal entry”. As I understand it, LineTables are structured as a sequence of “entries”. An entry consists of:

Start Line
End Line
Start Address
Byte Length
Is Terminal Entry

and this represents some piece of source code. If you were to examine all line entries in the debug info, you would find that not all address ranges are covered. For example, line 10 might correspond to address 0x12345 - 0x12347, and line 11 might correspond to address 0x12360. So address 0x12348 - 0x12359 do not map to anything. Probably they are not even code, but just space between functions, or other random data in the .text section of a binary that isn’t code.

The problem is, building a line table requires the person parsing the debug info to specify where the terminal entries are. Why? Isn’t it trivial to calculate this automatically? Say you insert a bunch of line entries, now they’re all sorted by address into some array of LineEntry structures. It seems to me like n is a terminal entry if

(entries[n].address + entries[n].length) < entries[n+1].address

is true. Why do we need to store a separate flag for it?

Is there ever a case where the above condition is true but it is not the end of a sequence? Or where the above condition is false but it is the end of a sequence?

This discussion originally started on a code review thread, but I figured I would continue it here since the patch has landed and I want to do more work as a followup.

So LLDB's LineTable data structures have the notion of a "terminal entry". As I understand it, LineTables are structured as a sequence of "entries". An entry consists of:

Start Line
End Line
Start Address
Byte Length
Is Terminal Entry

Not quite correct. There is no end line and there is no byte length. To efficiently store line entries that can be efficiently searched, we need to not have a length and not have an end line.

and this represents some piece of source code. If you were to examine all line entries in the debug info, you would find that not all address ranges are covered. For example, line 10 might correspond to address 0x12345 - 0x12347, and line 11 might correspond to address 0x12360. So address 0x12348 - 0x12359 do not map to anything. Probably they are not even code, but just space between functions, or other random data in the .text section of a binary that isn't code.

The problem is, building a line table requires the person parsing the debug info to specify where the terminal entries are. Why? Isn't it trivial to calculate this automatically? Say you insert a bunch of line entries, now they're all sorted by address into some array of LineEntry structures. It seems to me like n is a terminal entry if

(entries[n].address + entries[n].length) < entries[n+1].address

is true. Why do we need to store a separate flag for it?

Because there is no size in each LineEntry. We rely on the next line entry always terminating the previous one. So one marked with "i am a terminal entry" is required to finish off the previous one.

Is there ever a case where the above condition is true but it is not the end of a sequence? Or where the above condition is false but it is the end of a sequence?

Not sure these questions apply after what I said above.

Does DWARF not store this information? Because it seems like it could be efficiently stored in an interval tree, the question is just whether it is efficient to convert what DWARF stores into that format.

PDB returns line entries in the format I described, with a start address and a byte length, so to determine whether something is a terminal entry I have to add them to some kind of data structure that collapses ranges and then manually scan through for breaks in the continuity of the range.

Is there some way we can make this more generic so that it’s efficient for both DWARF and PDB?

The one thing that might confuse you is we actually store line entries using lldb_private::LineTable::Entry structures which do not have the byte size or the file as a FileSpec:

class LineTable {
protected:
  struct Entry
  {
        lldb::addr_t file_addr; ///< The file address for this line entry
        uint32_t line; ///< The source line number, or zero if there is no line number information.
        uint16_t column; ///< The column number of the source line, or zero if there is no column information.
        uint16_t file_idx:11, ///< The file index into CompileUnit's file table, or zero if there is no file information.
                    is_start_of_statement:1, ///< Indicates this entry is the beginning of a statement.
                    is_start_of_basic_block:1, ///< Indicates this entry is the beginning of a basic block.
                    is_prologue_end:1, ///< Indicates this entry is one (of possibly many) where execution should be suspended for an entry breakpoint of a function.
                    is_epilogue_begin:1, ///< Indicates this entry is one (of possibly many) where execution should be suspended for an exit breakpoint of a function.
                    is_terminal_entry:1; ///< Indicates this entry is that of the first byte after the end of a sequence of target machine instructions.
  };
};

These are 16 bytes each since the file is represented as a file index and we only have the "lldb::addr_t file_addr" for the address.

But we dress them up into lldb_private::LineEntry structures for all other clients that consume line entries and those structures _do_ have information like the user wants to see it:

struct LineEntry
{
    AddressRange range; ///< The section offset address range for this line entry.
    FileSpec file;
    uint32_t line; ///< The source line number, or zero if there is no line number information.
    uint16_t column; ///< The column number of the source line, or zero if there is no column information.
    uint16_t is_start_of_statement:1, ///< Indicates this entry is the beginning of a statement.
                    is_start_of_basic_block:1, ///< Indicates this entry is the beginning of a basic block.
                    is_prologue_end:1, ///< Indicates this entry is one (of possibly many) where execution should be suspended for an entry breakpoint of a function.
                    is_epilogue_begin:1, ///< Indicates this entry is one (of possibly many) where execution should be suspended for an exit breakpoint of a function.
                    is_terminal_entry:1; ///< Indicates this entry is that of the first byte after the end of a sequence of target machine instructions.

};

Each instance is 64 bytes and quite a bit more expensive to copy around.

Does DWARF not store this information? Because it seems like it could be efficiently stored in an interval tree, the question is just whether it is efficient to convert what DWARF stores into that format.

No it stores it just like we do, but in a compressed format that is useless for searching.

PDB returns line entries in the format I described, with a start address and a byte length, so to determine whether something is a terminal entry I have to add them to some kind of data structure that collapses ranges and then manually scan through for breaks in the continuity of the range.

Is there some way we can make this more generic so that it's efficient for both DWARF and PDB?

We need an efficient memory format that LLDB can use to search things, which is how things currently are done: all plug-ins are expected to parse debug info and make a series of lldb_private::LineTable::Entry structs.

We could defer this functionality into the plug-ins directly where you always must say "hey SymbolFile, here is a section offset address, please get me the lldb_private::LineEntry:

bool
SymbolFile::GetLineEntryForAddress (const lldb_private::Address &addr, lldb_private::LineEntry &line_entry);

The thing I don't like about this approach where we don't supply the format we want the line tables to be in is this does make it quite painful to iterate over all line table entries for a compile unit. You would need to get the address range for all functions in a compile unit, then make a loop that would iterate through all addresses and try to lookup each address to find the lldb_private::LineEntry for that address. Right now we just get the LineTable from the compile unit and say "bool LineTable::GetLineEntryAtIndex(uint32_t idx, LineEntry &line_entry);".

Let’s suppose I’ve got this function (ignore the operands to branch instructions, I disassembled a real function and just manually adjusted addresses on the left side only just to create a contrived example).

infinite-dwarf.exemain at infinite.cpp:5 4 5 int main(int argc, char **argv) { 6 int n = 0; infinite-dwarf.exemain:
infinite-dwarf.exe[0x410000] <+0>: 55 push ebp
infinite-dwarf.exe[0x410001] <+1>: 89 e5 mov ebp, esp
infinite-dwarf.exe[0x410003] <+3>: 83 ec 18 sub esp, 0x18
infinite-dwarf.exe[0x410006] <+6>: 8b 45 0c mov eax, dword ptr [ebp + 0xc]
infinite-dwarf.exe[0x410009] <+9>: 8b 4d 08 mov ecx, dword ptr [ebp + 0x8]
infinite-dwarf.exe[0x41000c] <+12>: c7 45 fc 00 00 00 00 mov dword ptr [ebp - 0x4], 0x0
infinite-dwarf.exe[0x410013] <+19>: 89 45 f8 mov dword ptr [ebp - 0x8], eax
infinite-dwarf.exe[0x410016] <+22>: 89 4d f4 mov dword ptr [ebp - 0xc], ecx
infinite-dwarf.exemain + 25 at infinite.cpp:6 5 int main(int argc, char **argv) { 6 int n = 0; 7 while (n < 10) { infinite-dwarf.exe[0x410019] <+25>: c7 45 f0 00 00 00 00 mov dword ptr [ebp - 0x10], 0x0 infinite-dwarf.exemain + 32 at infinite.cpp:7
6 int n = 0;
7 while (n < 10) {
8 std::cout << n << std::endl;
infinite-dwarf.exe[0x410020] <+32>: 83 7d f0 0a cmp dword ptr [ebp - 0x10], 0xa
infinite-dwarf.exemain + 36 at infinite.cpp:7 6 int n = 0; 7 while (n < 10) { 8 std::cout << n << std::endl; infinite-dwarf.exe[0x410024] <+36>: 0f 8d 4a 00 00 00 jge 0x410074 infinite-dwarf.exemain + 42 at infinite.cpp:8
7 while (n < 10) {
8 std::cout << n << std::endl;
9 Sleep(1000);
infinite-dwarf.exe[0x41002a] <+42>: 8b 45 f0 mov eax, dword ptr [ebp - 0x10]
infinite-dwarf.exemain + 45 at infinite.cpp:8 7 while (n < 10) { 8 std::cout << n << std::endl; 9 Sleep(1000); infinite-dwarf.exe[0x41002d] <+45>: 89 e1 mov ecx, esp infinite-dwarf.exe[0x41002f] <+47>: 89 01 mov dword ptr [ecx], eax infinite-dwarf.exe[0x410031] <+49>: b9 80 c1 40 00 mov ecx, 0x40c180 infinite-dwarf.exe[0x410036] <+54>: e8 55 0a 00 00 call 0x410a90 infinite-dwarf.exe[0x41003b] <+59>: 83 ec 04 sub esp, 0x4 infinite-dwarf.exemain + 62 at infinite.cpp:8
7 while (n < 10) {
8 std::cout << n << std::endl;
9 Sleep(1000);
infinite-dwarf.exe[0x41003e] <+62>: 89 e1 mov ecx, esp
infinite-dwarf.exe[0x410040] <+64>: c7 01 50 0d 41 00 mov dword ptr [ecx], 0x410d50
infinite-dwarf.exe[0x410046] <+70>: 89 c1 mov ecx, eax
infinite-dwarf.exe[0x410048] <+72>: e8 e3 0c 00 00 call 0x410d30
infinite-dwarf.exe[0x41004d] <+77>: 83 ec 04 sub esp, 0x4

; function becomes discontiguous here

infinite-dwarf.exemain + 80 at infinite.cpp:9 8 std::cout << n << std::endl; 9 Sleep(1000); 10 n++; infinite-dwarf.exe[0x510050] <+80>: 89 e1 mov ecx, esp infinite-dwarf.exe[0x510052] <+82>: c7 01 e8 03 00 00 mov dword ptr [ecx], 0x3e8 infinite-dwarf.exe[0x510058] <+88>: 8b 0d 04 93 43 00 mov ecx, dword ptr [0x439304] infinite-dwarf.exe[0x51005e] <+94>: 89 45 ec mov dword ptr [ebp - 0x14], eax infinite-dwarf.exe[0x510061] <+97>: ff d1 call ecx infinite-dwarf.exe[0x510063] <+99>: 83 ec 04 sub esp, 0x4 infinite-dwarf.exemain + 102 at infinite.cpp:10
9 Sleep(1000);
10 n++;
11 }
infinite-dwarf.exe[0x510066] <+102>: 8b 45 f0 mov eax, dword ptr [ebp - 0x10]
infinite-dwarf.exe[0x510069] <+105>: 83 c0 01 add eax, 0x1
infinite-dwarf.exe[0x51006c] <+108>: 89 45 f0 mov dword ptr [ebp - 0x10], eax
infinite-dwarf.exemain + 111 at infinite.cpp:7 6 int n = 0; 7 while (n < 10) { 8 std::cout << n << std::endl; infinite-dwarf.exe[0x51006f] <+111>: e9 ac ff ff ff jmp 0x410020 infinite-dwarf.exe[0x510074] <+116>: 31 c0 xor eax, eax infinite-dwarf.exemain + 118 at infinite.cpp:13
12
13 return 0;
14 }
infinite-dwarf.exe[0x510076] <+118>: 83 c4 18 add esp, 0x18
infinite-dwarf.exe[0x510079] <+121>: 5d pop ebp
infinite-dwarf.exe[0x51007a] <+122>: c3 ret

About halfway down, the addresses suddenly increase by 0x100000. So the compiler decided that for some strange reason while unrolling the loop it was just going to start placing code somewhere else entirely. Am I correct in saying that 0x410050 should be a terminal entry in this example?

Yep

If the PDB line tables have sizes you could do:

PDBLineEntry curr = pdb->GetLineEntry(n);
PDBLineEntry next = pdb->GetLineEntry(n+1);

line_entries.insert(curr.addr, curr.line, curr.file, false /*is_terminal*/);

if (curr.addr + curr.size != next.addr)
{
    // Insert terminal entry for end of curr
    line_entries.insert(curr.addr + curr.size, curr.line, curr.file, true /*is_terminal*/);
}

Above is pseudo code, but you get the idea...

Yea, I already whipped up some changes locally that do almost exactly that.