[DWARFv5] The new line-table section header

The next feature on my DWARF 5 list is the line-table header. While this
is pretty easy to generate, it is a real bear to parse, so I thought I should
let y'all know what I'm up to and why as I head out to the yak farm. Any
thoughts and suggestions would be very much appreciated.

The v5 directory and file tables no longer have a fixed format; instead,
we have a list of field descriptors followed by the fields for each entry
in the directory or file table. Normally the directory table would have
one descriptor:
    DW_LNCT_path, DW_FORM_string
This tells us each entry contains a pathname encoded as an inline string.
(Which is essentially how the v4 directory table is encoded.) However,
because of the FORM code, we now have whole new worlds of complication
regarding where the actual string might be. We might have DW_FORM_strp
which puts the actual string in the .debug_str section; eventually we
could have DW_FORM_line_strp (pointing to .debug_line_str) or even
DW_FORM_strx (indirecting through .debug_str_offsets).
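To make the "where is the string" question concrete, here is a minimal sketch of the mapping from form code to the section holding the string bytes. The form codes are the ones assigned in the published DWARF v5 specification (where the form that points into .debug_line_str is spelled DW_FORM_line_strp, and DW_FORM_strp points into .debug_str). The helper name stringSectionForForm is hypothetical and is not an LLVM API; it covers only the four forms mentioned above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Form codes as assigned by the DWARF v5 specification.
enum : uint16_t {
  DW_FORM_string    = 0x08,
  DW_FORM_strp      = 0x0e,
  DW_FORM_strx      = 0x1a,
  DW_FORM_line_strp = 0x1f,
};

// Hypothetical helper (not an LLVM API): name the section that holds the
// string bytes for a string form found in a .debug_line header.
// Returns an empty string for any form this sketch does not cover.
std::string stringSectionForForm(uint16_t Form) {
  switch (Form) {
  case DW_FORM_string:    return ".debug_line";     // inline, NUL-terminated
  case DW_FORM_strp:      return ".debug_str";      // offset into the section
  case DW_FORM_line_strp: return ".debug_line_str"; // offset into the section
  case DW_FORM_strx:      return ".debug_str";      // index via .debug_str_offsets
  default:                return "";
  }
}
```

Only the DW_FORM_string case can be decoded without looking outside .debug_line; each of the other three needs a second section and, in an object file, a relocation applied to the offset or index.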

Conveniently, we have the DWARFFormValue class which knows how to decode
data based on what the form code is.

Inconveniently, DWARFFormValue assumes it is looking at a .debug_info
section, and picks up its relocations from a DWARFUnit. But if we're
using DWARFFormValue to decode data from .debug_line, then it needs a
different relocation map.

It's only the string data that causes a problem; all the other kinds
of data in the file table are constants, and retrieving constants
with DWARFFormValue is no problem.

I think the right tactic is a "top-down" approach, starting by teaching
DWARFDebugLine to parse a v5 line-table header but support only
DW_FORM_string for the paths. This should let me use an unmodified
DWARFFormValue to parse the directory and file tables.
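As a rough illustration of that first step, here is a standalone sketch of parsing a v5 directory table when every field uses DW_FORM_string. It does not use DWARFDebugLine or DWARFFormValue; the function names readULEB128 and parseDirectoryTable are hypothetical, and the layout assumed is the one in the v5 spec: a one-byte format count, ULEB128 (content type, form) pairs, a ULEB128 entry count, then the entries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

constexpr uint64_t DW_LNCT_path = 0x1;
constexpr uint64_t DW_FORM_string = 0x08;

// Decode one unsigned LEB128 value, advancing Pos. nullopt on truncation.
std::optional<uint64_t> readULEB128(const std::vector<uint8_t> &Data,
                                    size_t &Pos) {
  uint64_t Value = 0;
  unsigned Shift = 0;
  while (Pos < Data.size()) {
    uint8_t Byte = Data[Pos++];
    if (Shift < 64)
      Value |= uint64_t(Byte & 0x7f) << Shift;
    if ((Byte & 0x80) == 0)
      return Value;
    Shift += 7;
  }
  return std::nullopt;
}

// Hypothetical parser: returns the DW_LNCT_path strings of a directory table,
// or nullopt for truncated input or any form other than DW_FORM_string.
std::optional<std::vector<std::string>>
parseDirectoryTable(const std::vector<uint8_t> &Data, size_t &Pos) {
  if (Pos >= Data.size())
    return std::nullopt;
  uint8_t FormatCount = Data[Pos++]; // directory_entry_format_count (ubyte)

  std::vector<std::pair<uint64_t, uint64_t>> Format; // (content type, form)
  for (uint8_t I = 0; I < FormatCount; ++I) {
    auto Type = readULEB128(Data, Pos);
    auto Form = readULEB128(Data, Pos);
    if (!Type || !Form || *Form != DW_FORM_string)
      return std::nullopt; // this sketch only understands inline strings
    Format.emplace_back(*Type, *Form);
  }

  auto Count = readULEB128(Data, Pos); // directories_count
  if (!Count)
    return std::nullopt;

  std::vector<std::string> Dirs;
  for (uint64_t I = 0; I < *Count; ++I) {
    for (const auto &Desc : Format) {
      std::string Str; // every field here is an inline NUL-terminated string
      while (Pos < Data.size() && Data[Pos] != 0)
        Str.push_back(char(Data[Pos++]));
      if (Pos >= Data.size())
        return std::nullopt; // missing terminator
      ++Pos;                 // skip the NUL
      if (Desc.first == DW_LNCT_path)
        Dirs.push_back(Str);
    }
  }
  return Dirs;
}
```

Rejecting every form except DW_FORM_string is the point of the sketch: nothing in it needs a relocation map, which is why this stage can be done before touching the string-offset forms.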

From there, teaching DWARFFormValue to handle DW_FORM_strp from the
.debug_line section should be pretty well motivated and it should be
straightforward to see what's really needed in terms of the API.

Once we get that far, I would hope that the line_strp and strx<N> forms
would not require much additional effort. Actually Wolfgang is
separately working on the strx<N> forms so with any luck that would
Just Work for the .debug_line section.

Oh yeah, after all that I'd actually generate the v5 header from LLVM.
The idea is that by then, I can use llvm-dwarfdump to validate it and
be very confident that it would all work.

Does all that sound like a plan? The alternative would be to try to
teach DWARFFormValue to handle DW_FORM_strp from .debug_line up front,
but I think we might rather go at this in smaller pieces.

Thanks,
--paulr

The next feature on my DWARF 5 list is the line-table header. While this
is pretty easy to generate, it is a real bear to parse, so I thought I should
let y’all know what I’m up to and why as I head out to the yak farm. Any
thoughts and suggestions would be very much appreciated.

Thanks a bunch for sending this email! - I’d love to see more like this when large pieces are undertaken in LLVM for just these reasons, so we can all get a sense of where things are aiming, the motivation, etc.

The v5 directory and file tables no longer have a fixed format; instead,
we have a list of field descriptors followed by the fields for each entry
in the directory or file table. Normally the directory table would have
one descriptor:
    DW_LNCT_path, DW_FORM_string
This tells us each entry contains a pathname encoded as an inline string.
(Which is essentially how the v4 directory table is encoded.) However,
because of the FORM code, we now have whole new worlds of complication
regarding where the actual string might be. We might have DW_FORM_strp
which puts the actual string in the .debug_str section; eventually we
could have DW_FORM_line_strp (pointing to .debug_line_str)

What’s DW_FORM_line_strp/.debug_line_str for? (so the line table can be kept while stripping the rest of the debug info, including its strings?)

or even
DW_FORM_strx (indirecting through .debug_str_offsets).

Conveniently, we have the DWARFFormValue class which knows how to decode
data based on what the form code is.

Inconveniently, DWARFFormValue assumes it is looking at a .debug_info
section, and picks up its relocations from a DWARFUnit. But if we’re
using DWARFFormValue to decode data from .debug_line, then it needs a
different relocation map.

I’m going to assume there’s going to be similar inconvenience on the other side (the emission side).

It’s only the string data that causes a problem; all the other kinds
of data in the file table are constants, and retrieving constants
with DWARFFormValue is no problem.

I think the right tactic is a “top-down” approach, starting by teaching
DWARFDebugLine to parse a v5 line-table header but support only
DW_FORM_string for the paths. This should let me use an unmodified
DWARFFormValue to parse the directory and file tables.

Any idea what form you’ll be using for LLVM’s emission? LLVM currently only emits strp - figure the same for the line table? Or more likely to use _string unconditionally?

In any case - if/when you have the right format support in llvm-dwarfdump, you could go ahead and implement the output code in LLVM’s codegen, even before llvm-dwarfdump can handle every arcane format that any DWARF producer might decide to use. (& then you can continue implementing those - but it’d get you the LLVM functionality sooner, rather than gating it on having a fully general parser)

This approach has certainly been taken in the past - implementing enough dumping support as needed for LLVM’s generation functionality & expanding as-needed.

What’s DW_FORM_line_strp/.debug_line_str for? (so the line table can be kept while stripping the rest of the debug info, including its strings?)

Exactly. In prior versions of DWARF the line-table strings were always embedded directly in .debug_line so it was possible to strip everything else. With v5 we wanted to make sure it was straightforward to keep doing that.

Inconveniently, DWARFFormValue assumes it is looking at a .debug_info
section, and picks up its relocations from a DWARFUnit. But if we’re
using DWARFFormValue to decode data from .debug_line, then it needs a
different relocation map.

I’m going to assume there’s going to be similar inconvenience on the other side (the emission side).

I hope not. Emission of the .debug_line section is already prepared to conjure up relocations (to various .text sections) as needed, and I would anticipate that once we can get the line-table strings to come out in another section, emitting the corresponding relocations would be quite natural.

Any idea what form you’ll be using for LLVM’s emission? LLVM currently only emits strp - figure the same for the line table? Or more likely to use _string unconditionally?

I’d start out with _string. LLVM currently only emits strp for actual attributes in the .debug_info section; but these pathname strings are (currently for v4) dumped directly into the .debug_line section, and by specifying FORM_string I can keep doing that. Also this is the form that the first round of dumper changes will be able to handle. :-)

Later on we can change emission to use _line_strp; that entails emitting the actual strings into a different section. Once we do that, it becomes possible for the linker to do string pooling on the path names (the original motivation for this more complicated header format).
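For anyone unfamiliar with what pooling buys us, here is a minimal sketch of the idea: each distinct path is stored once in a separate string section, and the line-table header refers to it by offset. The StringPool class is hypothetical and purely illustrative; it is not how LLVM or any particular linker implements string merging.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical string pool: each distinct string is stored once, and callers
// get back the byte offset of its first character within the pooled section.
class StringPool {
  std::string Bytes;                                 // section contents
  std::unordered_map<std::string, uint32_t> Offsets; // string -> offset

public:
  uint32_t add(const std::string &S) {
    auto It = Offsets.find(S);
    if (It != Offsets.end())
      return It->second; // already pooled, reuse the existing offset
    uint32_t Off = static_cast<uint32_t>(Bytes.size());
    Bytes += S;
    Bytes.push_back('\0'); // strings are stored NUL-terminated
    Offsets.emplace(S, Off);
    return Off;
  }

  const std::string &contents() const { return Bytes; }
};
```

With inline FORM_string paths, a directory that appears in many line tables is repeated in each one; with offsets into a shared section, the repeats collapse to a single copy.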

I hope to be able to post the first patch next week.

Thanks!

--paulr

The idea is that by then, I can use llvm-dwarfdump to validate it and
be very confident that it would all work.

Staging it this way makes sense to me.

I hope to be able to post the first patch next week.
Thanks!

Looking forward to that! Thanks for outlining the (bumpy) road ahead.

-- adrian