DWARF 2/3 backwards compatibility?

My GDB Testsuite runs have been failing more tests, and now an internal test has started failing on some missing dwarf 3 records.

Changing compile flags to emit dwarf 2 didn't help.

In the past there has been some effort to pass GDB Testsuite, including Dwarf 2 backwards compatibility.

What is the plan/mechanism for maintaining Dwarf 2/3 compatibility?

My GDB Testsuite runs have been failing more tests, and now an internal test has started failing on some missing dwarf 3 records.

What is the error message or the nature of the failures you are seeing?

Pranav

The error we are getting is:
“Undefined Form Value: 25”

We have customers with a variety of debuggers (GDB, Lauterbach, etc.), and we need to be able to preserve Dwarf 3 compatibility.

The patch that caused a problem for us is the one that introduced DW_FORM_flag_present. The old DW_FORM_flag works for us.
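
For reference, DW_FORM_flag_present is form code 0x19 (25 decimal), which matches the undefined form value reported above. Below is a standalone sketch of the kind of version-gated form choice being discussed here, not the actual LLVM emitter code:

```cpp
#include <cstdint>
#include <vector>

// Form codes from the DWARF standard. DW_FORM_flag_present is 0x19 (decimal 25),
// which is why a DWARF 2/3-only reader complains "Undefined Form Value: 25".
constexpr uint8_t DW_FORM_flag         = 0x0c; // DWARF 2/3: explicit one-byte value
constexpr uint8_t DW_FORM_flag_present = 0x19; // DWARF 4: presence alone means "true"

// Hypothetical helper: append the form (and value byte, if any) for a boolean
// attribute, choosing the encoding from a -gdwarf-N style version setting.
void addFlagAttribute(std::vector<uint8_t> &AbbrevForms,
                      std::vector<uint8_t> &DieData, unsigned DwarfVersion) {
  if (DwarfVersion >= 4) {
    AbbrevForms.push_back(DW_FORM_flag_present); // no value byte follows
  } else {
    AbbrevForms.push_back(DW_FORM_flag);
    DieData.push_back(1);                        // explicit "true" byte
  }
}
```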

While the DD->useDarwinGDBCompat() option is a workaround in this case, Darwin GDB is a specific version of GDB. So useDarwinGDBCompat means Darwin GDB Compatibility, not GDB Compatibility or Dwarf xxx Compatibility, etc. Going in that direction for now means that we will hit this problem again. Eventually useDarwinGDBCompat should diverge to things that don't apply to all GDB's.

Would like to discuss:
1) What level of interest is there in Dwarf backward compatibility;
2) What levels of Debugger backward compatibility are needed.
3) What framework of flags/options would be a container for specific backward compatibility features.
4) What type of testing to ensure backward compatibility, GDB Testsuite?

Thank you!

Would like to discuss:
1) What level of interest is there in Dwarf backward compatibility;

I don't have a lot of interest in keeping dwarf backward compatibility
past my ability to test things, hence the option for darwin's gdb.
That said, if people want stricter options I can easily keep
the dwarf stuff I'm working on inside of that framework.

2) What levels of Debugger backward compatibility are needed.

Probably "old gdb" which for a lot of our platforms is pretty much the
darwin gdb. If we absolutely need to we can wall things off based on
dwarf version, but that's a bit of a pain. At that point I'd probably
want to throw something into either that compile unit metadata or the
module itself on what level of dwarf I can use. Bill's work on module
level attributes would probably come in a little handy here.
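
As a rough sketch of that idea (illustrative only: the "Dwarf Version" key name here is an example, and it assumes the module-flags API that came out of the module-level attributes work), the frontend could record the targeted DWARF level on the module and let the DWARF emitter consult it:

```cpp
#include "llvm/IR/Module.h"

using namespace llvm;

// Record the DWARF level a module was built for as a module flag, so the
// debug-info emitter can pick encodings (forms, attributes) that older
// consumers still accept. "Dwarf Version" is an illustrative key name.
void tagDwarfVersion(Module &M, unsigned Version) {
  // Warning behavior lets modules with mismatched levels still link,
  // with a diagnostic rather than a hard error.
  M.addModuleFlag(Module::Warning, "Dwarf Version", Version);
}
```

The emitter side could then read the value back via Module::getModuleFlag and fall back to a default when the flag is absent.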

3) What framework of flags/options would be a container for specific
backward compatibility features.

DWARF version is probably the easiest for now. If we ever hit the
point of emitting another form of debug info we can worry about that.

4) What type of testing to ensure backward compatibility, GDB Testsuite?

This is key, if there's no way I can test that I haven't broken a
platform (or at the least something yelling at me that I have broken a
platform) then there's not much I can do about it.

-eric

Very few people have that level of interest, and Dwarf is not an easy
target, so the cost of having fully compatible Dwarf (2 and 3) is not low.

Also, as Eric mentions, the GDB test suite is not the most forgiving
thing to set up and keep tidy for the many platforms LLVM supports.

With time, you might get to a point where Dwarf is a first-class
citizen and there will be basic compatibility checks in "make check"
and a build bot for GDB test suite on some platforms, but to get
there, you'll go through a lot of pain.

And I don't think you'll be able to recruit too many people for that
crusade, either... :wink:

Dwarf implementation in LLVM, as far as I know, is ad-hoc, so you'll
have to do some refactoring (for good) on the debug emission code.

Eric,

What about LLDB? Does it support Dwarf? Is it good enough for C *and* C++?

With time, you might get to a point where Dwarf is a first-class
citizen and there will be basic compatibility checks in "make check"
and a build bot for GDB test suite on some platforms, but to get
there, you'll go through a lot of pain.

Hopefully not too much. Paul has been helping and I add tests whenever
I fix something, but a quality suite is something else. Unfortunately,
since debuggers are the biggest consumers, their testsuites have a
tendency to be the primary quality test for debug info.

Dwarf implementation in LLVM, as far as I know, is ad-hoc, so you'll
have to do some refactoring (for good) on the debug emission code.

I've tried to formally implement a few things, but it was ad hoc for
so long that, in general, this is true.

Eric,

What about LLDB? Does it support Dwarf? Is it good enough for C *and* C++?

As a quality suite? Probably not. There are so many more tests in the
gdb suite that it's hard to top as a first cut for quality. lldb is
another good testsuite when we try to push the boundaries, though, so
using both test suites is good. Unfortunately I'm not quite sure what
the state of the lldb testsuite is on elf targets these days and
haven't had time to play with figuring out how to build it.

-eric

Rick Foos wrote:

The error we are getting is:
“Undefined Form Value: 25”
...
DW_FORM_flag_present caused the problem. The old DW_FORM_flag works for us.

I see this error from GDB 7.0 but GDB 7.2 is okay with it.
Now you know as much as I do. :slight_smile:

Eric Christopher wrote:

[in reply to what Renato Golin wrote:]
> With time, you might get to a point where Dwarf is a first-class
> citizen and there will be basic compatibility checks in "make check"
> and a build bot for GDB test suite on some platforms, but to get
> there, you'll go through a lot of pain.
>

Hopefully not too much. Paul has been helping and I add tests whenever
I fix something, but a quality suite is something else. Unfortunately,
since debuggers are the biggest consumers, their testsuites have a
tendency to be the primary quality test for debug info.

I have thought that a GDB bot would be a good idea, but then it would be
a specific version of GDB, and opinions differ on what would be the
"right" version. I suppose any version is better than nothing...

I had a "quality suite" at a previous job; it was the result of many PY
of effort. It was also debugger-based, which is a mixed blessing; you
get a lot of DWARF-parsing code for free, but then you get a lot of
debugger bugs for free too! And you don't get to test the DWARF
directly, you get to test how the debugger uses the DWARF. Not really
optimal, but still--a whole lot better than nothing.

> Dwarf implementation in LLVM, as far as I know, is ad-hoc, so you'll
> have to do some refactoring (for good) on the debug emission code.

I've tried to formally implement a few things, but it was ad hoc for
so long that, in general, this is true.

Yeah, clearly some DWARF 3 bits are creeping in, and more will come.
(I have some UTF-8 work waiting for a chance to be cleaned up and
submitted, for example. DW_AT_use_UTF8 is a DWARF 3 attribute.)
But if some people really are stuck in DWARF 2 land, we might have to
do something intelligent about it. (I'd rather not, I'd rather have
LLVM assert that DWARF 3 is where it's at. I mean really, DWARF 3
was published in 2005!)
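
As a concrete example of the kind of thing that would need gating, DW_AT_use_UTF8 is attribute code 0x53 and only exists from DWARF 3 onward. A standalone sketch in the same spirit as the form-choice example earlier, not the actual emitter code:

```cpp
#include <cstdint>
#include <vector>

constexpr uint16_t DW_AT_use_UTF8 = 0x53; // attribute introduced in DWARF 3

// Hypothetical compile-unit setup: only advertise UTF-8 strings when the
// requested DWARF version actually defines the attribute.
void addCompileUnitAttrs(std::vector<uint16_t> &CUAttrs, unsigned DwarfVersion) {
  if (DwarfVersion >= 3)
    CUAttrs.push_back(DW_AT_use_UTF8); // a strict DWARF 2 consumer would reject this
}
```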

--paulr

I'd like to summarize what we've discussed for a moment, and propose a patch
tomorrow. It will clear a problem we have, and provide a way to handle a lot
more of these that will come up as we go to Dwarf 5, and the next GDB.

The framework is a set of GCC-like gdwarf[2,3,4,5] flags. I can see cases
where compatibility is 2,3 or 2,3,4, so maybe the flag should be a number to
allow <=3 tests; I have to sleep on that. This flag must go to the backend.

Then the patch will change the conditional for DW_FORM_flag_present that
caused the problem back to the old DW_FORM_flag when -gdwarf2 or 3 is set.

In the future, there is now a place to address compatibility issues that
come up.

I had a "quality suite" at a previous job; it was the result of many PY
of effort. It was also debugger-based, which is a mixed blessing; you
get a lot of DWARF-parsing code for free, but then you get a lot of
debugger bugs for free too! And you don't get to test the DWARF
directly, you get to test how the debugger uses the DWARF. Not really
optimal, but still--a whole lot better than nothing.

The trade off also goes in the other direction. If you had a strict
Dwarf parser green, that would mean next to nothing as to what that
Dwarf would represent in the debugger(s).

AFAIK, most Dwarf compatible debuggers are also GDB compatible, which
means that even the idiotic things that GDB does will probably be
understood by other debuggers.

I mean really, DWARF 3 was published in 2005!)

Go tell that to embedded folks and their certifications! :wink:

But yeah, focusing on Dwarf 3 would be the best way forward, adding a
little bit for compatibility (rather than making Dwarf 2 a first-class
citizen).

Agreed. It should at least serve as comparison between two branches,
but hopefully, being actively monitored.

Maybe would be good to add a directory (if there isn't one yet) to the
testsuite repository, or at least the code necessary to make it run on
LLVM.

I don't think GDB testsuite should block a commit, it can vary by a few
tests, they rarely if ever all pass 100%. Tracking the results over time can
catch big regressions, as well as the ones that slowly increase the failed
tests.

Agreed. It should at least serve as comparison between two branches,
but hopefully, being actively monitored.

Maybe would be good to add a directory (if there isn't one yet) to the
testsuite repository, or at least the code necessary to make it run on
LLVM.

The clang-tests repository (
http://llvm.org/viewvc/llvm-project/clang-tests/ ) includes an Apple
GCC 4.2 compatible version of the GCC and GDB test suites that Apple
run internally. I'm working on bringing up an equivalent public
buildbot at least for the GDB suite here (
http://lab.llvm.org:8011/builders/clang-x86_64-darwin10-gdb-gcc ) -
just a few timing out tests I need to look at to get that green.
Apparently it's fairly stable.

Beyond that I'll be trying to bring up one with the latest suite (7.4
is what I've been playing with) on Linux as well.

- David

I had a "quality suite" at a previous job; it was the result of many PY
of effort. It was also debugger-based, which is a mixed blessing; you
get a lot of DWARF-parsing code for free, but then you get a lot of
debugger bugs for free too! And you don't get to test the DWARF
directly, you get to test how the debugger uses the DWARF. Not really
optimal, but still--a whole lot better than nothing.

The trade off also goes in the other direction. If you had a strict
Dwarf parser green, that would mean next to nothing as to what that
Dwarf would represent in the debugger(s).

AFAIK, most Dwarf compatible debuggers are also GDB compatible, which
means that even the idiotic things that GDB does will probably be
understood by other debuggers.

As long as we keep to standard dwarf, I'm OK.

But yeah, focusing on Dwarf 3 would be the best way forward, adding a
little bit for compatibility (rather than making Dwarf 2 a first-class
citizen).

Going to disagree here: the state of the art in debuggers isn't
stopping at dwarf 3 and we shouldn't either. I've already added
some features from dwarf 4 into clang and will be adding dwarf 5
features as we work them through standardization. That said a flag to
delineate dwarf versions is fine and we can work on making sure
features don't bleed over.

-eric

I'd like to summarize what we've discussed for a moment, and propose a patch
tomorrow. It will clear a problem we have, and provide a way to handle a lot
more of these that will come up as we go to Dwarf 5, and the next GDB.

Sounds good.

The framework is a set of GCC-like gdwarf[2,3,4,5] flags. I can see cases
where compatibility is 2,3 or 2,3,4, so maybe the flag should be a number to
allow <=3 tests; I have to sleep on that. This flag must go to the backend.

Yep. Sounds like what we were talking about yesterday. It'll need to
be in both the front end and the back end because there are features used in
the front end as well as encoding differences in the backend.

Then the patch will change the conditional for DW_FORM_flag_present that
caused the problem back to the old DW_FORM_flag when -gdwarf2 or 3 is set.

There are a few other bits that it should change as well; I'm guessing
you haven't tried compiling too much C++11 code for your restricted
dwarf set yet :slight_smile:

-eric

I don't think GDB testsuite should block a commit, it can vary by a few
tests, they rarely if ever all pass 100%. Tracking the results over time can
catch big regressions, as well as the ones that slowly increase the failed
tests.

Agreed. It should at least serve as comparison between two branches,
but hopefully, being actively monitored.

Maybe would be good to add a directory (if there isn't one yet) to the
testsuite repository, or at least the code necessary to make it run on
LLVM.

The clang-tests repository (
http://llvm.org/viewvc/llvm-project/clang-tests/ ) includes an Apple
GCC 4.2 compatible version of the GCC and GDB test suites that Apple
run internally. I'm working on bringing up an equivalent public
buildbot at least for the GDB suite here (
http://lab.llvm.org:8011/builders/clang-x86_64-darwin10-gdb-gcc ) -
just a few timing out tests I need to look at to get that green.
Apparently it's fairly stable.

Beyond that I'll be trying to bring up one with the latest suite (7.4
is what I've been playing with) on Linux as well.

Since you're going to bring a bot up in zorg, I'll stop working on bringing my testsuite runner forward. A couple of thoughts:

1) I've been running on the latest test suite, polling once a day. I think Eric and anyone working dwarf 4/5 should be running against the upstream testsuite. (I have no problems with running 7.4 too)

It's been stable to run at the tip of GDB this way, the test results aren't varying much.

2) A surprise benefit of running this way is that hundreds of obsolete tests, or broken tests are getting removed. This hasn't resulted in any broken backwards compatibility here at least. Saves tons of time debugging tests that don't work, and developing around compatible things that reasonable people have decided no longer matter.

3) Testsuite runs against two compilers at a time makes it easier to see regressions. By comparing against a known stable compiler, or GCC, regressions are visible by summary numbers.

4) I have plots of the summary numbers online with a window of a month or two. The trend allows you to see regressions occurring, and remaining as regressions. Sometimes GDB Testsuite or a compiler has a bad day. The trend lets you see a stable regression and, when you get around to it, tells you when the regression started.

<soapbox>
I've been doing this with Jenkins. It's fairly easy to set up, and does the plotting. Developers can grab a copy of the script to duplicate a run on their broken compiler. Running the testsuite under JNLP increased the number of executed tests - don't know why, it just did.
</soapbox>

I don't think GDB testsuite should block a commit, it can vary by a few
tests, they rarely if ever all pass 100%. Tracking the results over time
can
catch big regressions, as well as the ones that slowly increase the
failed
tests.

Agreed. It should at least serve as comparison between two branches,
but hopefully, being actively monitored.

Maybe would be good to add a directory (if there isn't one yet) to the
testsuite repository, or at least the code necessary to make it run on
LLVM.

The clang-tests repository (
http://llvm.org/viewvc/llvm-project/clang-tests/ ) includes an Apple
GCC 4.2 compatible version of the GCC and GDB test suites that Apple
run internally. I'm working on bringing up an equivalent public
buildbot at least for the GDB suite here (
http://lab.llvm.org:8011/builders/clang-x86_64-darwin10-gdb-gcc ) -
just a few timing out tests I need to look at to get that green.
Apparently it's fairly stable.

Beyond that I'll be trying to bring up one with the latest suite (7.4
is what I've been playing with) on Linux as well.

Since you're going to bring a bot up in zorg, I'll stop working on bringing
my testsuite runner forward.

I'm still interested in any details you have about issues you've
resolved/learnt, etc.

A couple thoughts:

1) I've been running on the latest test suite, polling once a day. I think
Eric and anyone working dwarf 4/5 should be running against the upstream
testsuite. (I have no problems with running 7.4 too)

Interesting thought. (Just so we're all on the same page: when you say
"test suite" you're talking about the GDB dejagnu test suite, the same
one, well, a more recent version of it, that's in clang-tests.) Though I
hesitate to have such a moving target, I can see how it could be
useful.

It's been stable to run at the tip of GDB this way, the test results aren't
varying much.

With the right heuristics I suppose this could be valuable, but will
require more work to find the right signal in the (even small) noise.

2) A surprise benefit of running this way is that hundreds of obsolete
tests, or broken tests are getting removed. This hasn't resulted in any
broken backwards compatibility here at least. Saves tons of time debugging
tests that don't work, and developing around compatible things that
reasonable people have decided no longer matter.

Fair point.

3) Testsuite runs against two compilers at a time makes it easier to see
regressions. By comparing against a known stable compiler, or GCC,
regressions are visible by summary numbers.

I assume GDB runs their own test suite against some version (or the
simultaneous latest) of GCC? If we can't scrape those existing results
we can reproduce them (running the full suite with both GCC & Clang
side-by-side).

4) I have plots of the summary numbers online with a window of a month or
two. The trend allows you to see regressions occurring, and remaining as
regressions. Sometimes GDB Testsuite or a compiler has a bad day. The trend
lets you see a stable regression and, when you get around to it, tells you
when the regression started.

Yep. Also, if we're trying to address all these issues, a good approach
would be to prioritize the very stable failures (where Clang fails a test
that GCC passes & does so consistently for a long time) first. Then look
at the unstable ones last: figure out which compiler's to blame, "XFAIL:
clang" them or whatever is necessary.

<soapbox>
I've been doing this with Jenkins. It's fairly easy to set up, and does the
plotting. Developers can grab a copy of the script to duplicate a run on
their broken compiler. Running the testsuite under JNLP increased the number
of executed tests - don't know why, it just did.
</soapbox>

I wouldn't mind seeing your jenkins setup/config/tweaks/etc as a
reference point, if you've got it lying around.

- David

Renato Golin wrote:

> I had a "quality suite" at a previous job; it was the result of many PY
> of effort. It was also debugger-based, which is a mixed blessing; you
> get a lot of DWARF-parsing code for free, but then you get a lot of
> debugger bugs for free too! And you don't get to test the DWARF
> directly, you get to test how the debugger uses the DWARF. Not really
> optimal, but still--a whole lot better than nothing.

The trade off also goes in the other direction. If you had a strict
Dwarf parser green, that would mean next to nothing as to what that
Dwarf would represent in the debugger(s).

Well, having IR-level testing tells you next to nothing as to what your
program would actually do when you compile and run it. But it seems
to me that we have a huge pile of IR-level tests, so _somebody_ must
think they are useful. :slight_smile:

Sure, the acid test is whether the debugger does the right thing. I'm
not saying debugger-based tests are worthless, I'm saying that _just_
having debugger-based tests is not _optimal_. DWARF-level testing
would let you do things that debugger-based tests would find anywhere
from awkward to impossible.

That said, what's easiest is probably to get some form of GDB bot up
and running, and the benefit is likely to be worth the pain.

AFAIK, most Dwarf compatible debuggers are also GDB compatible, which
means that even the idiotic things that GDB does will probably be
understood by other debuggers.

I don't think what GDB _does_ will be understood by other debuggers. :slight_smile:
And the debugger for some of the platforms I have to deal with is
certainly not GDB, so I care more about generating valid DWARF than
I do about getting GDB running smoothly. I do recognize that the
majority of LLVM targets care more about GDB, though.

Cheers,
--paulr

Renato Golin wrote:

> I had a "quality suite" at a previous job; it was the result of many PY
> of effort. It was also debugger-based, which is a mixed blessing; you
> get a lot of DWARF-parsing code for free, but then you get a lot of
> debugger bugs for free too! And you don't get to test the DWARF
> directly, you get to test how the debugger uses the DWARF. Not really
> optimal, but still--a whole lot better than nothing.

The trade off also goes in the other direction. If you had a strict
Dwarf parser green, that would mean next to nothing as to what that
Dwarf would represent in the debugger(s).

Well, having IR-level testing tells you next to nothing as to what your
program would actually do when you compile and run it. But it seems
to me that we have a huge pile of IR-level tests, so _somebody_ must
think they are useful. :slight_smile:

It's not so much an argument from correctness as convenience. IR-level
regression tests exercise features in relative isolation and with relative
speed compared to full scenario tests. That means faster dev iteration
with higher confidence (you can run all the regression tests quickly,
so you can run them on every commit without slowing down too much, and
investigate failures quickly because they point to specific issues
directly rather than "something is broken in this big sequence of steps").

They have the drawback that they can be overconstrained. The most
obvious examples of this involve ordering (there's been some discussion
recently about FileCheck allowing unordered sequences of lines), but
there are possibly other situations where emitting different information
(debug info or machine code, etc.) might not have a noticeable effect on
the output (maybe the debugger doesn't care about that particular thing
being A or B, or first or last, etc.).

Sure, the acid test is whether the debugger does the right thing. I'm
not saying debugger-based tests are worthless, I'm saying that _just_
having debugger-based tests is not _optimal_. DWARF-level testing
would let you do things that debugger-based tests would find anywhere
from awkward to impossible.

That said, what's easiest is probably to get some form of GDB bot up
and running, and the benefit is likely to be worth the pain.

This is certainly true - even if we had a great debug info regression
suite we'd still want this. Given that we don't have a great debug
info regression suite we /really/ want this. But we also want the
infrastructure to write good debug info regression tests - sooner the
better, so we can have a slightly less bad debug info regression suite
as we go along :slight_smile:

Well, having IR-level testing tells you next to nothing as to what your
program would actually do when you compile and run it. But it seems
to me that we have a huge pile of IR-level tests, so _somebody_ must
think they are useful. :slight_smile:

When creating Dwarf tests I did it at all levels: IR checking for
metadata, ELF checking for Dwarf and GDB execution checking for
correct behaviour.

All that was LIT-driven, so a "make check" would give me the results in
a few seconds, with the benefit of a good dev iteration, as David
mentioned.

Unfortunately, none of that was in the open, so I can't share... :frowning:

That said, what's easiest is probably to get some form of GDB bot up
and running, and the benefit is likely to be worth the pain.

Since GDB already has a good and standard test infrastructure, it'd
likely get a good chunk of bad Dwarf out of the way before you start
worrying about Lauterbach's specifics.

Do you have a good (maybe open) Dwarf validation suite available?
There is no such thing as too many tests... :wink:

I don't think GDB testsuite should block a commit, it can vary by a few
tests, they rarely if ever all pass 100%. Tracking the results over time
can
catch big regressions, as well as the ones that slowly increase the
failed
tests.

Agreed. It should at least serve as comparison between two branches,
but hopefully, being actively monitored.

Maybe would be good to add a directory (if there isn't one yet) to the
testsuite repository, or at least the code necessary to make it run on
LLVM.

The clang-tests repository (
http://llvm.org/viewvc/llvm-project/clang-tests/ ) includes an Apple
GCC 4.2 compatible version of the GCC and GDB test suites that Apple
run internally. I'm working on bringing up an equivalent public
buildbot at least for the GDB suite here (
http://lab.llvm.org:8011/builders/clang-x86_64-darwin10-gdb-gcc ) -
just a few timing out tests I need to look at to get that green.
Apparently it's fairly stable.

Beyond that I'll be trying to bring up one with the latest suite (7.4
is what I've been playing with) on Linux as well.

Since you're going to bring a bot up in zorg, I'll stop working on bringing
my testsuite runner forward.

I'm still interested in any details you have about issues you've
resolved/learnt, etc.

A couple thoughts:

1) I've been running on the latest test suite, polling once a day. I think
Eric and anyone working dwarf 4/5 should be running against the upstream
testsuite. (I have no problems with running 7.4 too)

Interesting thought. (Just so we're all on the same page: when you say
"test suite" you're talking about the GDB dejagnu test suite, the same
one, well, a more recent version of it, that's in clang-tests.) Though I
hesitate to have such a moving target, I can see how it could be
useful.

Yes, the sourceware.org site. I hesitated as well, but I tried it and it's OK.

It's been stable to run at the tip of GDB this way, the test results aren't
varying much.

With the right heuristics I suppose this could be valuable, but will
require more work to find the right signal in the (even small) noise.

I wrote a 10-line awk script to create a CSV file out of the test summaries to make a one-to-one comparison of tests. It's over 70 lines now... It's not that I should have used Python; everything is an exception to the rules. You can get close, but I can't say you can get perfect signals.
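
As a minimal illustration of the summary-to-CSV idea (not the original awk script; the status keywords handled and the .sum line format are assumptions), something like this reads a dejagnu summary file and emits one row per test so two runs can be joined:

```cpp
#include <fstream>
#include <iostream>
#include <map>
#include <string>

// Sketch: read a dejagnu .sum file and emit "test,result" CSV rows keyed by
// test name, so results from two compilers can be compared one-to-one.
int main(int argc, char **argv) {
  if (argc < 2) { std::cerr << "usage: sum2csv gdb.sum\n"; return 1; }
  std::ifstream In(argv[1]);
  std::map<std::string, std::string> Results;
  const std::string Statuses[] = {"PASS", "FAIL", "XFAIL", "XPASS", "KFAIL",
                                  "UNRESOLVED", "UNTESTED", "UNSUPPORTED"};
  std::string Line;
  while (std::getline(In, Line)) {
    for (const auto &S : Statuses) {
      if (Line.compare(0, S.size() + 2, S + ": ") == 0) {
        Results[Line.substr(S.size() + 2)] = S; // last result for a name wins
        break;
      }
    }
  }
  std::cout << "test,result\n";
  for (const auto &R : Results)
    std::cout << '"' << R.first << "\"," << R.second << '\n';
  return 0;
}
```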

From a compiler developer point of view, the spreadsheet was worthless. We're not testing GDB, but rather what the compiler feeds to GDB.

Take the log file, check out the suite, rerun a failing test, use dwarfdump and llvm-dwarfdump, find the "bad" dwarf records produced by the compiler.

All the eventual bugs end up being about dwarf records, together with a gdb testsuite test to reproduce them.

A bad/confused dwarf record fails multiple tests without a way to map a failure back to dwarf.

In the end, a fine grained signal doesn't do what you might want.

2) A surprise benefit of running this way is that hundreds of obsolete
tests, or broken tests are getting removed. This hasn't resulted in any
broken backwards compatibility here at least. Saves tons of time debugging
tests that don't work, and developing around compatible things that
reasonable people have decided no longer matter.

Fair point.

3) Testsuite runs against two compilers at a time makes it easier to see
regressions. By comparing against a known stable compiler, or GCC,
regressions are visible by summary numbers.

I assume GDB runs their own test suite against some version (or the
simultaneous latest) of GCC? If we can't scrape those existing results
we can reproduce them (running the full suite with both GCC & Clang
side-by-side).

gdb-testers@sourceware.org has a run every night. Yes, I reproduce. A non-x86 target has very different results, so I look for a good gcc cross compiler to establish a baseline.

In the case of clang, all the architectures share the Dwarf processing, so an x86 run covers a lot more of the dwarf processing than worrying too much about a cross compiler run. (But some worry about limiting fixes to regressions from a cross-compiled GCC, so that has to run as well.)

4) I have plots of the summary numbers online with a window of a month or
two. The trend allows you to see regressions occurring, and remaining as
regressions. Sometimes GDB Testsuite or a compiler has a bad day. The trend
lets you see a stable regression and, when you get around to it, tells you
when the regression started.

Yep. Also, if we're trying to address all these issues, a good approach
would be to prioritize the very stable failures (where Clang fails a test
that GCC passes & does so consistently for a long time) first. Then look
at the unstable ones last: figure out which compiler's to blame, "XFAIL:
clang" them or whatever is necessary.

I avoid doing the XFAIL thing. Plotting all the lines from the summary makes more sense, and it's what you see when you run the test manually.

When you move a Fail artificially to Xfail, the plot just has a few V's in it where the Fail line drops, and the Xfail line goes up. No new information.

I prefer leaving the actual summary numbers in place. All the data you need is there.

As you might have guessed, I like tests that fail, and want to get rid of the ones that pass too often :slight_smile:

<soapbox>
I've been doing this with Jenkins. It's fairly easy to set up, and does the
plotting. Developers can grab a copy of the script to duplicate a run on
their broken compiler. Running the testsuite under JNLP increased the number
of executed tests - don't know why, it just did.
</soapbox>

I wouldn't mind seeing your jenkins setup/config/tweaks/etc as a
reference point, if you've got it lying around.

I'll see what I can send, or it may be just as easy to walk through it. Jenkins isn't really like buildbot. Do you have Jenkins running there?

When creating Dwarf tests I did it at all levels: IR checking for
metadata, ELF checking for Dwarf and GDB execution checking for
correct behaviour.

Agreed. This is how I've been doing it for a while now.

All that was LIT-driven, so a "make check" would give me the results in
a few seconds, with the benefit of a good dev iteration, as David
mentioned.

Quite.

Since GDB already has a good and standard test infrastructure, it'd
likely get a good chunk of bad Dwarf out of the way before you start
worrying about Lauterbach's specifics.

The gdb testsuite is pretty good as a "what's expected" set of tests;
however, one thing to keep in mind is that a lot of the checks aren't
particularly fuzzy. That is, a test checks for what's expected, but what
it's looking for is a particular debugger behavior rather than
necessarily valid dwarf.

-eric