Writing unit tests for DWARF?

One problem that has been vexing me of late: It seems that whenever I run into a problem that requires debugging one of my programs in gdb, before I can do that I have to fix my frontend’s broken generation of debugging info.

The code that generates debugging information is quite fragile - you have to generate metadata for each of your files, classes, and functions, and do so without error, because if you do make a mistake, the only way you'll find out is when gdb refuses to debug your program. And as I work on the code, occasionally bugs creep in, either from my side or from the LLVM side. The problem is that I don't always check whether the debug information is valid, so several weeks can go by before I notice that something broke.

What is needed is some way to write a unit test for DWARF information, so that if I broke something I would notice it immediately and could either fix it or roll back. Unfortunately, the various DIDescriptor.Verify() methods are nowhere near strict enough - you can create completely nonsensical DIEs that still pass through Verify(). And even if the Verify() methods were 100% reliable, they only test whether the LLVM metadata is valid - they don’t test whether the actual DWARF embedded in the final executable is correct.

I suppose you could do something with dwarfdump -ka, although it would be better to have something that worked on all platforms. Even dwarfdump itself has different option syntax on Linux vs. OS X. And I don’t think it’s possible right now to generate code that passes through dwarfdump with zero error messages, or at least, I’ve never been able to figure out how to do it.

I was thinking that since lldb needs to know how to interpret all this stuff anyway, perhaps there could be a way to use the same code to validate the debug information for an executable. I know lldb doesn’t run on every platform yet, but I suspect that the parts of lldb which decode DWARF are fairly generic.

> The code that generates debugging information is quite fragile - you have to
> generate metadata for each of your files, classes, and functions, and do so
> without error, because if you do make a mistake, the only way you'll find
> out is when gdb refuses to debug your program. And as I work on the code,
> occasionally bugs creep in, either from my side or from the LLVM side. The
> problem is that I don't always check whether the debug information is
> valid, so several weeks can go by before I notice that something broke.

Strongly agree.

> What is needed is some way to write a unit test for DWARF information, so
> that if I broke something I would notice it immediately and could either fix
> it or roll back. Unfortunately, the various DIDescriptor.Verify() methods
> are nowhere near strict enough - you can create completely nonsensical DIEs
> that still pass through Verify(). And even if the Verify() methods were 100%
> reliable, they only test whether the LLVM metadata is valid - they don't
> test whether the actual DWARF embedded in the final executable is correct.

Strongly agree. But I go further...

I could help with the verification process (since it's much better to
fail verification than to fail the gdb testsuite), but I don't know the
design decisions being taken for debug information/metadata, and they
change too frequently to be worth digging through the code to learn.
There is no API documentation, and the interface (IR metadata) docs are
old and inaccurate.

I'd say, in order of importance, the three things that need to be done ASAP are:

1. Stick to one representation and document it (like LangRef), so
other people could help
2. Enhance Validate() methods to be extremely strict (like Module's),
so it fails straight away
3. Create tests (unit and regression) and run them during check-all,
so we don't regress

The tests are last because it's much easier to catch an assertion than
a silent codegen error.

After the initial period, we iterate those three steps (and no fewer!)
again and again, until debug information is good. I see the importance
of changing the IR (as I've requested quite a few times), but I
understand that it's better for everyone to have a stable IR. Every
new version can have a few changes, not necessarily backward
compatible, but those also need to be documented beforehand (mailing
list, blog, release notes).

If we follow the three steps above in an iterative way, during every
release, we can achieve stability AND feature completeness. But
(IMHO), stability comes first.

cheers,
--renato

Talin,

If there is a magic wand, I would be interested to know!

DIDescriptor.Verify() is not suitable for your needs. It checks the structure of encoded debug info after the optimizer has modified the IR. Its main goal is to inform the DWARF writer, at the end of code generation, which IR constructs it should ignore.

If you want to test code gen, you have to link the compiled code and run it regularly. That's what the various LLVM build bots do. In the same way, if you want to validate generated debug info, you have to go through the debugger.

That said, there is a new unit test harness available. All it needs is more unit tests...

  http://llvm.org/docs/TestingGuide.html#quickdebuginfotests
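
To give a flavor, a test there is just a small program with the debugger commands and expected output embedded in comments, roughly like this (a sketch from memory; the exact RUN lines and substitutions may differ from what is in the tree):

  // RUN: %clangxx -O0 -g %s -o %t.out
  // RUN: %test_debuginfo %s %t.out
  // DEBUGGER: break 10
  // DEBUGGER: r
  // DEBUGGER: p i
  // CHECK: $1 = 42

  int main() {
    int i = 42;
    return i - 42; // debugger breaks here and prints i
  }

Printing the variable through a real debugger is also what verifies its location info, which no static dump can do.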

Renato,

> I could help with the verification process (since it's much better to
> fail verification than to fail the gdb testsuite), but I don't know the
> design decisions being taken for debug information/metadata, and they
> change too frequently to be worth digging through the code to learn.

I think you are mistaken here. I maintain and support debug info for two front ends (llvm-gcc and clang). Go ahead and check the svn archives for the last year and see how many times I had to update the llvm-gcc FE.

> There is no API
> documentation, and the interface (IR metadata) docs are old and
> inaccurate.

> I'd say, in order of importance, the three things that need to be done ASAP are:
>
> 1. Stick to one representation and document it (like LangRef), so
> other people could help

In the last five or so LLVM releases, the encoded debug info representation in LLVM IR has changed only once (using metadata instead of global variables). All other changes have been incremental *and* backward compatible.

Regarding documentation, it is on my list. However, your argument has the same disconnect as someone who looks at LangRef and says, "I do not know what exactly a FE has to generate to produce a working program." Well, what you need is a How To Write a Front End document.

> 2. Enhance Validate() methods to be extremely strict (like Module's),
> so it fails straight away

See my response regarding Verify().

> 3. Create tests (unit and regression) and run them during check-all,
> so we don't regress

I have already mentioned debuginfo-tests to you at least once before.

Could dwarfdump --verify be used to check the debug info?

- Jan

> I think you are mistaken here. I maintain and support debug info for two front ends (llvm-gcc and clang). Go ahead and check the svn archives for the last year and see how many times I had to update the llvm-gcc FE.

Hi Devang,

First, I'm not attacking anyone. I said before and will say again: the
work you've done is great. I know how complex it is to build something
stable and keep it that way, and my comments were about how hard it
was for me to help you in that matter.

Take my last patch on DWARF. I ran the tests, added my own, tested
on a Mac, and still we found a problem only after the commit. I'm not
saying that things like this won't happen, but it was really hard for
me to test it and make sure the patch would actually work.

> In the last five or so LLVM releases, the encoded debug info representation in LLVM IR has changed only once (using metadata instead of global variables). All other changes have been incremental *and* backward compatible.

Not entirely true. The metadata style is the same, but the mechanism
used to build it was changed (DIBuilder instead of DIFactory)
without warning in Clang. That, per se, wouldn't be a problem if the
metadata generated by the two were identical, which it was not.

As I said before, in December we merged the LLVM tree and it broke our
debug generation. It took me until February to find the time to fix
it, but when I did, lots of arguments were different. Some had
their "file" nulled, some integer arguments had become boolean (or
vice-versa), and some new arguments had appeared out of the blue.

However, migrating to DIBuilder took me only a couple of days and
everything went back to normal again. So, while the infrastructure
actually worked in the end, it took me by surprise and I had to guess
what was going on to fix it properly.
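
For anyone else making the same move, the calls have roughly this shape (a rough sketch from memory; check DIBuilder.h for the exact signatures, which may well differ):

  #include "llvm/Module.h"
  #include "llvm/Analysis/DIBuilder.h"
  #include "llvm/Analysis/DebugInfo.h"
  #include "llvm/Support/Dwarf.h"

  using namespace llvm;

  void emitDebugInfo(Module &M) {
    DIBuilder DIB(M);
    // One compile unit per module, then per-file and per-type descriptors.
    DIB.createCompileUnit(dwarf::DW_LANG_C99, "test.c", "/work",
                          "my-frontend", /*isOptimized=*/false,
                          /*Flags=*/"", /*RuntimeVersion=*/0);
    DIFile File = DIB.createFile("test.c", "/work");
    DIType IntTy = DIB.createBasicType("int", 32, 32, dwarf::DW_ATE_signed);
    // ... then createFunction(), createLocalVariable(), etc.,
    // feeding in File and IntTy ...
    DIB.finalize(); // resolves forward references, emits the named metadata
  }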

> Regarding documentation, it is on my list. However, your argument has the same disconnect as someone who looks at LangRef and says, "I do not know what exactly a FE has to generate to produce a working program." Well, what you need is a How To Write a Front End document.

The debug documentation is not up to date. Metadata generated by
DIBuilder doesn't look like what's in the docs. And the document
describes some types and declarations, but it doesn't explain the
relationships between them and doesn't describe all of them. So,
unlike LangRef, the debug doc is not a spec. It's just a
document.

I tried to write up how to write a front end for DWARF (on the wiki),
but as I didn't have enough knowledge of how to really use it, I
couldn't get very far.

> > 2. Enhance Validate() methods to be extremely strict (like Module's),
> > so it fails straight away
>
> See my response regarding Verify().

So, IR has a natural verification process while you're building it:
lots of assertions will prevent you from building rubbish, and that
makes up for the lack of information in LangRef. After building IR,
the validation process will catch most of what was left over, and
only a few bugs slip through to the codegen process, which also has
loads of assertions. So, the number of bugs that get through to
execution time is as low as possible.

But it's way harder to verify metadata, because of its inherently
variant nature. I get that, and am NOT asking for a magic wand (though,
if you have one... ;). And DWARF also doesn't help, because there is a
lot you can do with DWARF that is legal but won't amount to anything
in a debugger.

What I'm proposing is a simple rule set, enforced by a validation
pass, that will reject dubious metadata. It could start as an optional
pass, very restrictive, even failing most known code and unit
tests. With time, we can extend it and add corner cases to the
validation until we're comfortable enough to turn it on by default. I
personally think that it's much easier to relax strict asserts than to
rely on gdb for testing.
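
To make it concrete, I'm thinking of something with this shape (a hypothetical sketch, nothing that exists today; I'm assuming the usual pass boilerplate and that subprograms are still listed in the llvm.dbg.sp named metadata):

  #include "llvm/Module.h"
  #include "llvm/Metadata.h"
  #include "llvm/Pass.h"
  #include "llvm/Support/ErrorHandling.h"

  using namespace llvm;

  namespace {
  // Strict debug info verifier: rejects dubious metadata instead of
  // letting the DWARF writer silently drop it.
  struct StrictDebugInfoVerifier : public ModulePass {
    static char ID;
    StrictDebugInfoVerifier() : ModulePass(ID) {}

    virtual bool runOnModule(Module &M) {
      if (NamedMDNode *SPs = M.getNamedMetadata("llvm.dbg.sp")) {
        for (unsigned i = 0, e = SPs->getNumOperands(); i != e; ++i) {
          MDNode *SP = SPs->getOperand(i);
          // Example rule: a subprogram descriptor with no operands is
          // nonsense; fail straight away instead of confusing gdb later.
          if (!SP || SP->getNumOperands() == 0)
            report_fatal_error("malformed subprogram descriptor");
          // ... more rules: the file must be set, types must really
          // be type descriptors, and so on ...
        }
      }
      return false; // analysis only, never mutates the module
    }
  };
  }

  char StrictDebugInfoVerifier::ID = 0;
  static RegisterPass<StrictDebugInfoVerifier>
    X("strict-dbg-verify", "Strict debug info verifier (sketch)");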

cheers,
--renato

Yes, it could be used to validate the DWARF structure of the debug info. It does not check whether the information communicated through DWARF is correct or not. E.g., if the DWARF info says a variable is at frame pointer + x, then you need a debugger to verify that; dwarfdump won't help you.

> > I think you are mistaken here. I maintain and support debug info for two front ends (llvm-gcc and clang). Go ahead and check the svn archives for the last year and see how many times I had to update the llvm-gcc FE.

> Hi Devang,
>
> First, I'm not attacking anyone.

I understand. But you're missing the point of my comment :) If the IR used to encode debug info were changing rapidly, as you say, then I'd be forced to modify the llvm-gcc FE frequently. However, I have not modified the llvm-gcc FE in the last year or so, so I'd say the encoded IR has been stable. In the last 6+ months, the llvm-gcc build bot running the gdb testsuite has consistently reported the same number of passes and failures (if you ignore inherent gdb testsuite stability issues).

> I said before and will say again: the
> work you've done is great. I know how complex it is to build something
> stable and keep it that way, and my comments were about how hard it
> was for me to help you in that matter.
>
> Take my last patch on DWARF. I ran the tests, added my own, tested
> on a Mac, and still we found a problem only after the commit. I'm not
> saying that things like this won't happen, but it was really hard for
> me to test it and make sure the patch would actually work.

In other words, someone is changing target-independent code generation and expects the LLVM regression tests to catch all bugs. If that were true, we wouldn't need any build bots linking and running LLVM-generated code.

> In the last five or so LLVM releases, the encoded debug info representation in LLVM IR has changed only once (using metadata instead of global variables). All other changes have been incremental *and* backward compatible.

> Not entirely true. The metadata style is the same, but the mechanism
> used to build it was changed (DIBuilder instead of DIFactory)
> without warning in Clang. That, per se, wouldn't be a problem if the
> metadata generated by the two were identical, which it was not.

Again, you're mistaken. llvm-gcc and dragonegg still use DIFactory, and debug info quality has remained the same. This shows that the IR used to encode debug info has not been impacted by DIBuilder vs. DIFactory.

Note, DIBuilder etc. are utilities used to produce IR, not the interface defined by the IR. In other words, replacing an OldIRBuilder interface with a NewIRBuilder has nothing to do with the stability of the LLVM IR documented by LangRef.html.

> What I'm proposing is a simple rule set, enforced by a validation
> pass, that will reject dubious metadata. It could start as an optional
> pass, very restrictive, even failing most known code and unit
> tests. With time, we can extend it and add corner cases to the
> validation until we're comfortable enough to turn it on by default. I
> personally think that it's much easier to relax strict asserts than to
> rely on gdb for testing.

dwarfdump --verify will do this.

> In other words, someone is changing target-independent code generation and expects the LLVM regression tests to catch all bugs. If that were true, we wouldn't need any build bots linking and running LLVM-generated code.

Ok, that was a bad example... ;)

> Again, you're mistaken. llvm-gcc and dragonegg still use DIFactory, and debug info quality has remained the same. This shows that the IR used to encode debug info has not been impacted by DIBuilder vs. DIFactory.

I see, so that comes back to my original point. I couldn't build a
complete debug infrastructure with DIFactory because I was lost in
the many implementation details of the order and types of the metadata
in each IR statement. That's probably the reason why, in
my case (and probably Talin's), it all blew up.

> Note, DIBuilder etc. are utilities used to produce IR, not the interface defined by the IR. In other words, replacing an OldIRBuilder interface with a NewIRBuilder has nothing to do with the stability of the LLVM IR documented by LangRef.html.

Yes, I know. I'm more concerned with the 'what' and not the 'how'. For
me, up-to-date documentation on what's strictly needed to produce
legal DWARF, with a clear, short explanation of each field and how
the fields relate to each other (as this is more important for debug
info than for instructions), is of higher priority than a full-blown
validation system.

> dwarfdump --verify will do this.

Is this being used in LLVM tests? That's an idea.

I had a look at your debug tests in clang and they're similar to what I do here.

The problem with debug tests is that they don't depend only on the
compiler, but on the debugger for each host/target platform
combination. Though, dwarfdump could help us grep out the basic stuff
without the need to resort to a debugger: checking the DWARF structure
and the correct locations and line information.

I'm using LIT to also check the DWARF structure, but I have to say my
success is limited. While I could get far by capturing variables on
metadata lines and checking that they point to the right types, every
time one tiny thing changes I have to refactor most of the tests.
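
For instance, something of this shape (a made-up sketch; the exact metadata layout is whatever the current tree emits, which is precisely the problem):

  // RUN: %clangxx -g -S -emit-llvm %s -o - | FileCheck %s
  //
  // Capture the descriptor for 'int' in a FileCheck variable, then
  // make sure the descriptor naming 'a' refers to it. The tag numbers
  // and operand positions are exactly what keeps changing on us.
  // CHECK: [[INT:![0-9]+]] = metadata !{i32 {{.*}}"int"
  // CHECK: metadata !"a"{{.*}}[[INT]]
  int foo(int a) { return a; }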

I did the same with DWARF: checking for addresses of types and later
seeing if the variable refers to them, checking whether the location
points into debug_loc or is just an expression, etc. But debug
information is far too volatile to make that approach reasonable in
the long run... :(

cheers,
--renato

> > dwarfdump --verify will do this.

> Is this being used in LLVM tests? That's an idea.

It is not used in the llvm/test suite.

> I had a look at your debug tests in clang and they're similar to what I do here.

> The problem with debug tests is that they don't depend only on the
> compiler, but on the debugger for each host/target platform
> combination. Though, dwarfdump could help us grep out the basic stuff
> without the need to resort to a debugger: checking the DWARF structure
> and the correct locations and line information.

Yes, it'd be good to have a setup that builds the SingleSource and MultiSource tests with debug info and runs dwarfdump --verify on them.

I tried dwarfdump on a few examples I had, and comparison with
CodeSourcery's gcc is impossible; the resulting DWARF is very
different.

For instance, GCC declares the types at the beginning of the tree
while LLVM only does so when needed (metadata-style). The relocation
sections in GCC are huge, and it also uses debug_loc in many more
cases than LLVM, for instance extern functions, global variables, and
the cases I mentioned in my example before. Of course, the DWARF
produced by armcc is also different (though closer to what GCC does,
for obvious reasons).

One way we could do this, slowly and painfully but surely, is to
generate DWARF, use the debugger to make sure that DWARF actually
produces what GDB is expecting (you probably have many cases already),
and take a snapshot of that DWARF. Once we understand how that DWARF
works and what the required tags are, we create a dwarfdump test that
FileChecks on those.
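
Something like this, say (a hypothetical sketch; the attribute spelling varies between dwarfdump implementations, so take the CHECK lines as indicative only):

  // RUN: %clangxx -g -c %s -o %t.o
  // RUN: dwarfdump %t.o | FileCheck %s
  //
  // Snapshot of the tags we already know gdb needs for this program.
  // CHECK: DW_TAG_subprogram
  // CHECK: DW_AT_name{{.*}}main
  int main() { return 0; }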

This is more or less how I'm doing my local IR/DWARF/GDB tests. It
takes a while, but it has saved me from some regressions already... ;)

Renato,

> > Yes, it'd be good to have a setup that builds the SingleSource and MultiSource tests with debug info and runs dwarfdump --verify on them.

> I tried dwarfdump on a few examples I had, and comparison with
> CodeSourcery's gcc is impossible; the resulting DWARF is very
> different.

I did not mean comparing dwarfdump output; that is never going to work. Sorry for the confusion. I meant letting dwarfdump verify the structure of the DWARF info.

> For instance, GCC declares the types at the beginning of the tree
> while LLVM only does so when needed (metadata-style). The relocation
> sections in GCC are huge, and it also uses debug_loc in many more
> cases than LLVM, for instance extern functions, global variables, and
> the cases I mentioned in my example before. Of course, the DWARF
> produced by armcc is also different (though closer to what GCC does,
> for obvious reasons).

You'll find that the DWARF produced by llvm-gcc and clang is also different (even from the days when clang used DIFactory). And guess what: the DIE (Debug Info Entry) ordering clang generates is likely to change again in the near future!

> One way we could do this, slowly and painfully but surely, is to
> generate DWARF, use the debugger to make sure that DWARF actually
> produces what GDB is expecting (you probably have many cases already),
> and take a snapshot of that DWARF. Once we understand how that DWARF
> works and what the required tags are, we create a dwarfdump test that
> FileChecks on those.

Instead, isn't it easier and more straightforward to FileCheck the debugger output in the first place?

> I did not mean comparing dwarfdump output; that is never going to work. Sorry for the confusion. I meant letting dwarfdump verify the structure of the DWARF info.

Yes, using dwarfdump to verify is fine, but producing correct DWARF is
not the same as producing THE correct DWARF you need. You still need
some way of grepping for the symbols you want to have generated, or
the test is not testing the right thing.

You could have regressions that don't break the DWARF, but break what
the DWARF represents.

> Instead, isn't it easier and more straightforward to FileCheck the debugger output in the first place?

Oh, but that is only one level of testing, and it doesn't guarantee
you're generating correct DWARF, just "gdb-compatible" DWARF.

I'm doing all three levels (IR, DWARF, and gdb), and it's much easier
to see a failure at the IR level, or even the DWARF level, than to
debug problems using gdb output.

But I need some better validation for both the IR and the DWARF.

One way of testing without comparing IR and DWARF would be to have
several tests, ALL of them validating with dwarfdump AND stepping
through with gdb and testing every single detail in them. That way,
you ensure that the DWARF generated is gdb-compatible and avoid
regressions in that area. If all you care about is gdb compatibility,
then you're safe. But when you have a regression, you'll have two major
problems:

1. You won't know where the regression started. There could have been
multiple regressions where only the last one caused validation/gdb to
fail, and the last one could even be an unrelated commit that changes
the ELF section layout, for instance.

2. Even if there was only one regression, finding the bug would be
tiresome. You won't have a "good version" of the IR or the DWARF to
compare against to see what went wrong. Someone not used to DWARF
would take a long time to figure out what went wrong and how to fix it.

Depends on the level of compatibility you want...

We have the setup to test all three levels you mention above. All you need is several thousand tests. Do you think they will show up magically one day?

Hehehe, it might... ;)

I'm trying to get some debug tests here (maybe a few thousand, I don't
know yet). I'll let you know how it goes when we get something.

That’d be great!