[yaml2obj] GSoC-20: Add DWARF support to yaml2obj

Hi there,

I'm applying for the GSoC project: Add DWARF support to yaml2obj[1].
I've uploaded my proposal. If you have any suggestions or ideas, feel
free to leave a comment[2].

Thanks!

It's great to see someone interested in improving yaml2obj!

As far as I'm concerned, the main problem with yaml2obj for DWARF testcases is that at the moment, it is both too high-level and too low-level at the same time. For writing debug info testcases, yaml2obj currently does not add anything on top of the assembler. If it only provides an alternative syntax, and no other features, it isn't that useful. Let me explain what I mean:

1. Too high-level

For testing parsers we need to be able to create malformed input, so we need to be able to manually tweak offsets and headers where necessary. IIRC currently yaml2obj always creates section headers automatically, and can do so for only one DWARF version. It would be valuable to support more than one version of DWARF, and to support them on a per-section basis (it is not uncommon to mix a DWARF 5 .debug_info section with a DWARF 4 line table). Also there needs to be a way to optionally manually specify headers byte-for-byte, for when we do need this.

2. Too low-level

For functionality tests, however, we don't want to hardcode things like byte offsets, because it makes it extremely hard to write/modify tests by hand. It would be awesome if yaml2obj could automatically generate abbreviation sections if the user requests it, if their exact layout isn't relevant to the test. Similarly, having to manually adjust object file metadata, such as Mach-O section load commands or ELF headers every time we're editing a test in a way that changes the size of, e.g., the .debug_info section prevents us from using yaml2obj for any kind of hand-written tests.
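To make the two requirements above concrete, here is a minimal Python sketch (not yaml2obj code) of a compile-unit header builder that computes the length field by default but lets a test override it to produce malformed input. The function name `cu_header` and its parameters are made up for illustration; the field layouts are the 32-bit DWARF 4 and DWARF 5 compile-unit headers.

```python
import struct

DW_UT_compile = 0x01  # unit type used in DWARF 5 compile-unit headers


def cu_header(version, body_size, abbrev_offset=0, address_size=8,
              unit_length=None):
    """Build a 32-bit DWARF compile-unit header (little-endian).

    unit_length counts every byte after the length field itself. It is
    computed from body_size unless the caller overrides it, which is how
    a test would ask for a deliberately malformed header.
    """
    if version == 4:
        # version (2), debug_abbrev_offset (4), address_size (1)
        rest = struct.pack("<HIB", version, abbrev_offset, address_size)
    elif version == 5:
        # version (2), unit_type (1), address_size (1), debug_abbrev_offset (4)
        rest = struct.pack("<HBBI", version, DW_UT_compile, address_size,
                           abbrev_offset)
    else:
        raise ValueError("this sketch only handles DWARF 4 and 5")
    if unit_length is None:
        unit_length = len(rest) + body_size
    return struct.pack("<I", unit_length) + rest


# Functionality test: the length follows the body size automatically.
good = cu_header(5, body_size=10)
# Parser test: the same header with a length that lies about the unit size.
bad = cu_header(5, body_size=10, unit_length=0xFFFF)
```

The point of the optional `unit_length` argument is that the common case needs no byte counting, while the malformed case remains a one-field override rather than a hand-written blob.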

The tasks you are describing to add explicit syntax for more DWARF constructs are also good and necessary, but addressing one or both of these problems would be even more important to increase the usefulness of yaml2obj for DWARF testing.

-- adrian

Hello Xing,

I'd like to echo Adrian here. I currently find the assembly much more
readable and maintainable than the yaml dwarf representation, and so I
always write tests that way.

For me personally, the ability to write/edit syntactically correct dwarf
easily is much more important than being able to generate "incorrect"
dwarf -- I'm perfectly happy to continue to write the latter in
assembly, but there is a lot that could be improved about the experience
of writing "correct" dwarf. Ideally, I'd have a mode where I can just
write the logical DIE structure (tags, attributes and their values), and
the tool would split that into the abbrev/loclist/range/etc. sections.
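
As a rough illustration of such a mode, the Python sketch below takes a logical DIE tree and derives both the abbreviation table and the DIE bytes from it. The dictionary layout of a DIE and the `build` function are hypothetical, invented for this example; the DW_* constants and the abbreviation encoding follow the DWARF specification. Attribute values are passed in already encoded, so this only shows the abbrev splitting, not the loclist/range handling.

```python
def uleb128(value):
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


# A handful of constants from the DWARF specification.
DW_TAG_compile_unit = 0x11
DW_TAG_subprogram = 0x2E
DW_AT_name = 0x03
DW_FORM_string = 0x08


def build(dies):
    """Return (.debug_abbrev bytes, DIE bytes) for a list of DIE trees."""
    codes = {}  # abbreviation shape -> abbreviation code
    abbrev = bytearray()
    info = bytearray()

    def emit(die):
        children = die.get("children", [])
        shape = (die["tag"], bool(children),
                 tuple((at, form) for at, form, _ in die["attrs"]))
        if shape not in codes:
            # First DIE with this shape: emit a new abbreviation declaration.
            codes[shape] = len(codes) + 1
            abbrev.extend(uleb128(codes[shape]) + uleb128(die["tag"]))
            abbrev.append(1 if children else 0)  # DW_CHILDREN_yes / _no
            for at, form, _ in die["attrs"]:
                abbrev.extend(uleb128(at) + uleb128(form))
            abbrev.extend(b"\x00\x00")  # end of attribute list
        info.extend(uleb128(codes[shape]))
        for _, _, value in die["attrs"]:
            info.extend(value)
        for child in children:
            emit(child)
        if children:
            info.append(0)  # null entry closes the sibling chain

    for die in dies:
        emit(die)
    abbrev.append(0)  # end of abbreviation table
    return bytes(abbrev), bytes(info)


cu = {
    "tag": DW_TAG_compile_unit,
    "attrs": [(DW_AT_name, DW_FORM_string, b"a.c\x00")],
    "children": [
        {"tag": DW_TAG_subprogram,
         "attrs": [(DW_AT_name, DW_FORM_string, b"f\x00")]},
        {"tag": DW_TAG_subprogram,
         "attrs": [(DW_AT_name, DW_FORM_string, b"g\x00")]},
    ],
}
abbrev_bytes, die_bytes = build([cu])
```

Here the two subprogram DIEs share one abbreviation, so adding a third function to the test would not require touching the abbreviation table by hand.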

regards,
pavel

+1 to all that & cc’ing a few of the usual suspects as FYI

Do we think that yaml2obj is the best format for this, or would high-level DWARF DIE assembler directives be a more useful abstraction level? If you think about the .loc directive, there is actually some prior art in assembler.

– adrian

.loc is necessary because it depends on encoding bit lengths of intervening instructions, etc. Otherwise the line table would be like other DWARF sections and not involve assembler-awareness.

I don’t think it’d be a great idea to add the complexities of DWARF emission to the assembler (except where it’s fundamentally necessary, like debug_line/.loc) - it’d be a lot of surface area to expose to users of LLVM, support in the future as DWARF standards change, etc.

I also find YAML tests unwieldy, but for some tests (especially malformed
ones) we may have to use them because it is difficult for an assembly
directive to produce invalid output (invalid offset/relocation/string table/etc).

An assembly syntax can be more concise than its YAML counterpart, e.g. to
describe a section:

   assembly: .section .foo,"a",@progbits
   YAML: - Name: .foo
           Type: SHT_PROGBITS
           Flags: [ SHF_ALLOC ]

A symbol table entry is similar. A YAML entry usually takes several
lines of code.

Another advantage of assembly syntax is that it is composable. To define a
local symbol:

   label:

To make it global:

   .globl label
   label:

Some directives are more expressive, e.g. .file and .loc.
An assembler even supports some metaprogramming features (macros), though
the syntax may be strange.

We do need some composable directives to make DWARF tests easier to
write/read.

(Reviving this thread because a code review reminded me I want to reply
here. Sorry for the extremely long turnaround).

I agree with David that we shouldn't add first class DWARF-generation
directives to the assembler just for the sake of writing tests.

However, I do see the appeal of assembler metaprogramming, and I have
used it on occasion when generating some DWARF. A separate utility with
some (meta-)programming facilities could be interesting, though it would
be an additional burden to maintain, and I am not sure it is really
needed (in tests, repetition is often better than complex control flow).

I am curious about your comment on invalid relocations et al. I can see
how that is interesting for testing binary utilities (and I don't think
anyone wants to take that away), but I am not sure how useful that is in
the context of DWARF testing, except maybe in a couple of low-level
DWARF tests (which could be done in the traditional elf yaml and the
DWARF could be written as a blob of bytes). If you have some examples
like that, I'd very much like to hear about it.

regards,
pl

Personally I generate DWARF with a python DWARF generator I wrote so I can make minimal test cases. I often can’t get the compiler to emit what I need to test with DWARF. For example, I needed to create a compile unit with three functions: two of the functions need to look like they are dead stripped (with an address of zero) and they need to come first in the compile unit DW_AT_ranges. Also the compile unit needs to use a DW_AT_ranges attribute for the address ranges of each of the functions and I need to control the ordering. If I compile something like this from a source file, the compiler uses a DW_AT_low_pc and DW_AT_high_pc and I have no DW_AT_ranges on the compile unit.

I was reproducing a bug in “llvm-dwarfdump --verify”, which I then fixed. To check in a test case that isn’t the huge example I started with, I need to create exactly the DWARF I detailed above. I can’t get a .S file for my linked executable, and producing such an executable in the first place is not easy, so I find generating hand-crafted DWARF easier. Then I just obj2yaml it, and I have my reduced test case. In short, it can be hard to get the compiler to emit test cases that are close enough to what I need in order to modify them at the .S file level.

That being said, the whole reason I wrote a DWARF generator was because the yaml stuff doesn’t allow you to create a yaml file and edit it freely for reasons everyone else mentioned already (offsets mismatch, section sizes can’t change easily, can’t add attributes easily).

Does anyone have a way to create the kind of binary in a .S file that I mentioned above? The DWARF I need is in https://reviews.llvm.org/D78782 in the llvm/test/tools/obj2yaml/macho-DWARF-debug-ranges.yaml file on line 6 through line 31.

Greg

I believe the compiler will generate a .debug_ranges section if you use -ffunction-sections, since the addresses of sections will be non-contiguous. From there, you should be able to edit the .debug_ranges assembly as needed (replace references to symbols with 0s in the .debug_ranges content) to get the exact behaviour you want (I’m assuming you don’t want to have to hand-edit a .debug_abbrev/.debug_info data structure to manually create a .debug_ranges).
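
For reference, the bytes such an edited .debug_ranges list should end up containing can be sketched in Python. This is only an illustration of the DWARF 4 range-list encoding, not a tool from this thread: the function name `encode_debug_ranges` and the addresses are made up, and it assumes a compile-unit base address of zero so that the offsets in the list equal the addresses.

```python
import struct


def encode_debug_ranges(ranges, address_size=8):
    """Encode one DWARF 4 .debug_ranges list (little-endian).

    Each entry is a (begin, end) pair of address-sized values; a pair of
    zeros terminates the list.
    """
    fmt = {4: "<II", 8: "<QQ"}[address_size]
    out = bytearray()
    for begin, end in ranges:
        if begin == 0 and end == 0:
            # A (0, 0) pair is the end-of-list marker, so it cannot
            # also be used as a range.
            raise ValueError("(0, 0) would terminate the list early")
        out += struct.pack(fmt, begin, end)
    out += struct.pack(fmt, 0, 0)  # end-of-list entry
    return bytes(out)


# Two functions that look dead stripped (address zero) come first,
# followed by one live function. All addresses are hypothetical.
section = encode_debug_ranges([
    (0x0, 0x10),
    (0x0, 0x20),
    (0x1000, 0x1030),
])
```

Note that a dead-stripped function still needs a non-zero end value here, since an entry with both values zero would be read as the end of the list.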

Yeah, that's what I usually do -- get the compiler to emit something
close to what I need, and then edit the assembly to produce the exact
needed input, or to remove things which are completely irrelevant.

pl

That said, any non-trivial example involving .debug_info quickly becomes
difficult for humans to read because of the multiple sections involved--
.debug_info points to .debug_abbrev, possibly .debug_ranges/rnglists or
.debug_loc/loclists, maybe .debug_addr, certainly .debug_str, ....

A "DWARF compiler" or at least a "DWARF assembler" that took care of all
the picky details would be a nice tool to have, IMO.
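
One of those picky details, sketched in Python under the assumption of a hypothetical `StringTable` helper (not an existing LLVM or yaml2obj API): the tool would own the .debug_str contents and hand back the offset that a DW_FORM_strp attribute in .debug_info needs, so the test author never writes an offset.

```python
class StringTable:
    """Accumulates .debug_str contents and hands out DW_FORM_strp offsets."""

    def __init__(self):
        self._data = bytearray()
        self._offsets = {}  # string -> offset of its first byte

    def add(self, text):
        """Return the section offset of text, appending it if it is new."""
        if text not in self._offsets:
            self._offsets[text] = len(self._data)
            self._data += text.encode("utf-8") + b"\x00"
        return self._offsets[text]

    def data(self):
        """Return the bytes to place in .debug_str."""
        return bytes(self._data)


strtab = StringTable()
name_offset = strtab.add("main")
file_offset = strtab.add("foo.c")
again = strtab.add("main")  # reuses the existing entry
```

The same pattern (a per-section builder that returns offsets) would apply to .debug_abbrev, range lists and location lists.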
--paulr