Validating LLVM

Back during the LLVM developers' meeting, I talked with some of you about a
proposal to "validate" llvm. Now that 2.4 is almost out the door, it seems a
good time to start that discussion.

I've written up a detailed proposal and attached it to this message. The goal
is to ease LLVM use by third parties. We've got considerable experience with
LLVM and the community development model here and see its benefits as well as
challenges. This proposal attempts to address what I feel is the main
challenge: testing and stability of LLVM between releases.

Please take a look and send feedback to the list. I'd like to get the
process moving early in the 2.5 cycle.

Thanks for your input and support.

                                         -Dave

LLVMValidationProposal.txt (8.53 KB)

Interesting proposal. I think this could be a great thing. When the system is up and running, I can scrounge up some Darwin validator machines for the machine pool.

-Chris

David Greene <dag@cray.com> writes:

Back during the LLVM developers' meeting, I talked with some of you
about a proposal to "validate" llvm. Now that 2.4 is almost out the
door, it seems a good time to start that discussion.

I applaud your initiative. Discussing this issue is badly needed. From
my point of view, LLVM still has an academic/exploratory character that
makes it incompatible with a long-term commitment from some industry
users (those for whom LLVM would be a critical component) unless those
users have enough resources to maintain their own LLVM branch.

IMO a validation process based on running test suites is not enough. As
you know very well, tests can demonstrate failures, but they cannot
demonstrate correctness. An approach based on having stable (bug-fix
only) and development branches would be more appropriate. This way, each
user can devote work to validating LLVM for their own purposes, apply
fixes to the stable branch, and then have some hope of reaching a point
where LLVM is good enough, instead of endlessly upgrading, fixing known
bugs while knowing that new ones are being introduced.

This conflicts with the current practice of moving forward at full
throttle, where it is not unusual for developers to recommend using ToT
just a few weeks after a release.

Hopefully, when clang matures, new requirements on medium-term stability
will be enforced.

IMO a validation process based on running test suites is not enough. As

Not enough for some, I agree. For others, it helps a lot. It would help us
tremendously, for example, but then, we do maintain our own branch.

you know very well, tests can demonstrate failures, but they cannot
demonstrate correctness. An approach based on having stable (bug-fix
only) and development branches would be more appropriate. This way, each
user can devote work to validating LLVM for their own purposes, apply
fixes to the stable branch, and then have some hope of reaching a point
where LLVM is good enough, instead of endlessly upgrading, fixing known
bugs while knowing that new ones are being introduced.

Stable and development branches would also help. You still need to validate
the stable branch, however. So I think the proposal still applies regardless
of how the repository is organized.

This conflicts with the current practice of moving forward at full
throttle, where it is not unusual for developers to recommend using ToT
just a few weeks after a release.

Right. It would be a shift in development process.

Hopefully, when clang matures, new requirements on medium-term stability
will be enforced.

It's hard to "enforce" anything in the open source world. That's something
that third parties just have to come to understand. So we should try to
introduce processes that can help achieve what we want without depending
on anyone else to conform to our idea of how development should happen.

                                               -Dave

Yeah, this mirrors in many ways something I've thought about for a while now. Roughly: cron (or while :; do) testers that figure out quality and create tags when that quality is met. Release branching can then just happen from the `prerelease' tag, and largely start from a known good quality. People can then figure out what naming scheme they want and which tests they want to run, and contribute by testing and creating tags. The existence of specific combinations of tags at the same versions can be used to create rollup tags. One can imagine a C on x86 tag, a C++ on x86_64 tag, an llvm-gcc on mips tag, and so on. Though, in my version, I would have the cron job move each tag forward so that developers have a stable tag to select if they just want, say, C on ppc.

If someone wanted to play on sparc (to pick a less well maintained port that, at the top of tree, might not work, but did at some point in time), they could start with the C on sparc tag and reproduce that working state. They wouldn't have to guess whether it builds or not; they'd just know it should (given the definition of the tag).

The prerelease tag could be comprised of the mips tag, the x86 tag, the llvm x86 tag, ....

One could even include things like a freebsd world build, boot, and self-validate tag. It might take a few days to run, but doing this to feed into a prerelease-style tag might be nice.

I don't know just how useful this would be.
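
As a rough illustration of the cron-driven tagging idea above, here is a minimal sketch; the repository URL, the tag name, and the run_tests.sh driver are placeholders, not existing infrastructure. On a clean run it moves a per-configuration tag forward to the tested revision.

```python
#!/usr/bin/env python
# Hypothetical cron-driven validator: check out a revision, run a test driver,
# and on success move a per-configuration tag forward to that revision.
import subprocess
import sys

REPO = "https://llvm.org/svn/llvm-project/llvm"   # assumed repository layout
TAG = "validated/c-x86"                           # per-configuration tag name

def svn(*args):
    return subprocess.check_output(["svn"] + list(args)).decode()

def head_revision():
    # 'svn info' on trunk reports the latest repository revision.
    for line in svn("info", REPO + "/trunk").splitlines():
        if line.startswith("Revision:"):
            return line.split()[1]
    raise RuntimeError("could not determine the trunk revision")

def tests_pass(rev):
    # Check out (or update to) the revision and run the local test driver.
    subprocess.check_call(["svn", "checkout", "-r", rev, REPO + "/trunk", "llvm"])
    return subprocess.call(["./run_tests.sh"], cwd="llvm") == 0

def move_tag(rev):
    # Drop the old tag (ignoring failure if it does not exist yet) and
    # recreate it pointing at the validated revision.
    subprocess.call(["svn", "rm", "-m", "retire old tag", REPO + "/tags/" + TAG])
    svn("copy", "-r", rev, "-m", "validated r" + rev,
        REPO + "/trunk", REPO + "/tags/" + TAG)

if __name__ == "__main__":
    rev = head_revision()
    if tests_pass(rev):
        move_tag(rev)
        print("tag %s now points at r%s" % (TAG, rev))
    else:
        sys.exit("r%s failed validation; tag left unchanged" % rev)
```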

Lately our random C program generator has seemed quite successful at catching regressions in llvm-gcc that the test suite misses. I'd suggest running some fixed number of random programs as part of the validation suite. On a fastish quad core I can test about 25,000 programs in 24 hours. Our hacked valgrind (which looks for volatile miscompilations) is a bottleneck; leaving it out would speed up the process considerably.

We've never tested llvm-gcc for x64 using random testing; doing this would likely turn up a nice crop of bugs.

I just started a random test run of llvm-gcc 2.0-2.4 that should provide some interesting quantitative results comparing these compilers in terms of crashes, volatile miscompilations, and regular miscompilations. However, it may take a month or so to get statistical significance, since 2.3 and 2.4 have quite low failure rates.

John Regehr

creating tags. The existence of specific combinations of tags at the
same versions can be used to create rollup tags. One can imagine a C

I'm not entirely sure what you mean here. By "versions" do you mean
svn revisions?

on x86 tag, a C++ on x86_64 tag, an llvm-gcc on mips tag, and so on.
Though, in my version, I would have the cron job move each tag forward
so that developers have a stable tag to select if they just want,
say, C on ppc.

So you're saying have one tag that keeps the same name and indicates the
highest validated revision? That makes sense to me.

The prerelease tag could be comprised of the mips tag, the x86 tag,
the llvm x86 tag, ....

Yep.

One could even include things like a freebsd world build, boot,
and self-validate tag. It might take a few days to run, but doing this
to feed into a prerelease-style tag might be nice.

llvm-gcc already does a bootstrap. Is that what you mean?

                                                  -Dave

Lately our random C program generator has seemed quite successful at
catching regressions in llvm-gcc that the test suite misses. I'd suggest
running some fixed number of random programs as part of the validation
suite. On a fastish quad core I can test about 25,000 programs in 24

The problem with random tests is that they're just that -- random. You can't
have a known suite to validate with. Now, if we generate some tests
that cause things to fail and then add those to the LLVM test suite, I'd be
all for it.

We've never tested llvm-gcc for x64 using random testing; doing this would
likely turn up a nice crop of bugs.

Definitely. Random testing is certainly useful. Once random tests are added
to a testsuite, we can use them for validation. But I wouldn't want to
require a validation to pass some set of random tests that shifts each test
cycle.

                                                -Dave

to a testsuite, we can use them for validation. But I wouldn't want to
require a validation to pass some set of random tests that shifts each test
cycle.

This is easy to fix: just specify a starting seed for the PRNG.
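
For what it's worth, pinning the seed might look something like the sketch below (the generate_program call is a stand-in for the actual generator): the same master seed always reproduces the same set of "random" programs.

```python
import random

def seeded_test_set(master_seed, count):
    """Derive a reproducible list of per-test seeds from one master seed."""
    rng = random.Random(master_seed)            # fixed seed => fixed sequence
    return [rng.randrange(2 ** 32) for _ in range(count)]

# Re-running with the same master seed regenerates the identical test set,
# so a "random" suite can still be pinned down for validation purposes.
for seed in seeded_test_set(master_seed=20081104, count=5):
    print("would generate and compile the program for seed", seed)
    # generate_program(seed)  # stand-in for a call into the program generator
```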

However, I think you should get past your prejudice against tests that shift each cycle, since changing tests have the advantage of increased test coverage. Different parts of a test suite have different purposes, and of course random programs would not replace any part of the existing collection of fixed test cases. I wouldn't be making this argument if I hadn't seen for myself how one week random testing gives you nothing, and the next week a whole pile of previously unknown failures.

Alternatively, we are working to generalize our program generator a bit so that it does a DFS or BFS to generate all programs smaller than some size bound (obviously we need to fudge on integer constants, for example by picking from a predetermined set of interesting constants). Once we do this, it may be worth adding the resulting test programs to LLVM's test suite.
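
A toy sketch of that kind of bounded enumeration is below, over integer expressions rather than whole C programs, with the fudge on constants done by drawing from a small fixed pool of interesting values; it is purely illustrative, not the actual generator.

```python
# Enumerate every expression tree up to a node-count bound, built from a
# fixed pool of "interesting" constants and a few binary operators.
INTERESTING_CONSTANTS = ["0", "1", "-1", "2147483647", "-2147483648"]
OPERATORS = ["+", "-", "*"]

def expressions(size):
    """Yield every expression that uses exactly `size` nodes."""
    if size == 1:
        for constant in INTERESTING_CONSTANTS:
            yield constant
        return
    # Split the remaining node budget between the operands of a binary op.
    for left_size in range(1, size - 1):
        right_size = size - 1 - left_size
        for op in OPERATORS:
            for lhs in expressions(left_size):
                for rhs in expressions(right_size):
                    yield "(%s %s %s)" % (lhs, op, rhs)

def all_up_to(bound):
    """Depth-first enumeration of all expressions up to the size bound."""
    for size in range(1, bound + 1):
        for expr in expressions(size):
            yield expr

for expr in all_up_to(3):
    print(expr)        # each one would be wrapped in a small C test harness
```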

John Regehr

to a testsuite, we can use them for validation. But I wouldn't want to
require a validation to pass some set of random tests that shifts each test
cycle.

This is easy to fix: just specify a starting seed for the PRNG.

...which defeats much of the point of random testing.

However, I think you should get past your prejudice against tests that
shift each cycle, since changing tests have the advantage of increased
test coverage. Different parts of a test suite have different purposes,
and of course random programs would not replace any part of the existing
collection of fixed test cases. I wouldn't be making this argument if I
hadn't seen for myself how one week random testing gives you nothing, and
the next week a whole pile of previously unknown failures.

I don't think anyone is arguing against the utility of random test
generation; the issue is that the results aren't really appropriate
for validation, where you are trying to make a comparison between
builds. A system I've used previously ran and reported randomly
generated tests alongside validation testing, but did not consider the
random test results when tagging a build as valid. Instead, when
random tests failed, the generated test case was saved, and could be
added (preferably in a reduced form) as a static test for future
validation runs. That way you get the benefits of random testing
without spurious changes in validation status dependent on randomly
generated tests.
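
A rough sketch of that arrangement, with hypothetical run_fixed_suite.sh and run_random_test.sh drivers: only the fixed suite decides validity, while failing random cases are archived for reduction and later promotion into the static suite.

```python
import os
import random
import shutil
import subprocess

def run_fixed_suite():
    # Hypothetical driver for the static validation suite; only this gates.
    return subprocess.call(["./run_fixed_suite.sh"]) == 0

def run_random_test(seed, workdir):
    # Hypothetical driver: generate one random program for `seed`, compile
    # and run it, and leave the source at the returned path.
    path = os.path.join(workdir, "random_%d.c" % seed)
    passed = subprocess.call(["./run_random_test.sh", str(seed), path]) == 0
    return passed, path

def validate(revision, random_count=1000, archive="failing-random-tests"):
    valid = run_fixed_suite()           # gating decision: fixed suite only
    os.makedirs(archive, exist_ok=True)
    for seed in random.Random(revision).sample(range(2 ** 32), random_count):
        passed, case = run_random_test(seed, workdir="/tmp")
        if not passed:
            # Keep the failing program; once reduced and added to the static
            # suite, future validations *do* gate on it.
            shutil.copy(case, archive)
    return valid    # random failures are reported but never flip the tag
```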

validation runs. That way you get the benefits of random testing
without spurious changes in validation status dependent on randomly
generated tests.

Sounds great. I'm not trying to push random testing on people who don't want it, but I think it is useful to the LLVM project and it's not clear that I have the resources to keep doing it myself indefinitely.

The tools are, I think, to the point where they can be pushed into an automated build/test loop, which is what I'm aiming for. If this testing is done continuously and for several targets then more regressions can be squashed while they're fresh.

John Regehr

I think we're all in violent agreement here.

I absolutely see great value in random test generation. I would support
running random tests and incorporating those that trigger failures into
the static LLVM testsuite. This should be done on a regular basis.

                                                           -Dave

I think we're all in violent agreement here.

Ok so basically I am volunteering to help someone integrate random testing into some kind of build/test loop :).

John Regehr

Hi Dave,

Here are my opinions:

I like the idea of regular validation tagging. However, I think that it should be as automated as possible. I'm worried that validation testing will be pushed off people's plates indefinitely -- even for people who care deeply about a particular platform.

Here is the minimal set of tests that should go into a validation test. (All of these should be done in "Release" mode.)

* Regression Tests - This catches obvious errors, but not major ones.
* Full Bootstrap of LLVM-GCC - LLVM-GCC is a complex program. It's our second indication if something has gone horribly awry.
* Nightly Testsuite - A very good suite of tests; much more extensive than a simple bootstrap.
* LLVM-GCC Testsuite - Many thousands of great tests to test the many facets of the compiler. WAY too few people run these.

As far as I know, Dale's the only one who's been slogging through the LLVM-GCC testsuite, finding and fixing errors. I think that there are still many failures that should be addressed.

All four of the above should be run on at least a nightly basis (more frequently for some, like the regression tests). Each of these is automated, making that easy. If there are no regressions from the above four, we could tag that revision as being potentially "valid".
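
A sketch of the "no regressions" check, assuming each of the four suites can be made to emit a list of failing test names (the suite names and result files below are made up): a revision is only a candidate for the "valid" tag if nothing fails tonight that passed last night.

```python
# Compare tonight's failures against last night's, per suite; the revision is
# a candidate for the "valid" tag only if no new failures appeared.
SUITES = ["regression", "llvm-gcc-bootstrap", "nightly", "llvm-gcc-testsuite"]

def read_failures(path):
    """Read one failing-test name per line; a missing file means no failures."""
    try:
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())
    except FileNotFoundError:
        return set()

def no_regressions(tonight_dir, baseline_dir):
    for suite in SUITES:
        old = read_failures("%s/%s.failures" % (baseline_dir, suite))
        new = read_failures("%s/%s.failures" % (tonight_dir, suite))
        regressions = new - old        # failing now, but not in the baseline
        if regressions:
            print("%s regressed: %s" % (suite, ", ".join(sorted(regressions))))
            return False
    return True

if no_regressions("results/tonight", "results/last-night"):
    print("candidate for the 'valid' tag")
```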

-bw

Bill Wendling <isanbard@gmail.com> writes:

[snip]

All four of the above should be run on at least a nightly basis (more
frequently for some, like the regression tests). Each of these is
automated, making that easy. If there are no regressions from the
above four, we could tag that revision as being potentially "valid".

If a new test case is created (coming from a bug report or a code
review, not from adding a new feature) and it fails for a previously
"valid" revision, is the tag removed?

There would probably be some sort of concept of "last known good". In the above case, the tag could be removed and/or a new valid revision tagged. It was "good" up until that point, at least. :)

-bw

I'd add one more item here:

* Nightly Testsuite run using an llvm-gcc-built llvm! If LLVM-GCC is a complex C program, then LLVM is a complex C++ program whose correctness can be validated using such test suite runs.

> Please take a look and send feedback to the list. I'd like to get the
> process moving early in the 2.5 cycle.

Hi Dave,

Here are my opinions:

I like the idea of regular validation tagging. However, I think that
it should be as automated as possible. I'm worried that validation
testing will be pushed off people's plates indefinitely -- even for
people who care deeply about a particular platform.

Yes, automation is key. I think it is very possible to do with this proposal.

Here is the minimal set of tests that should go into a validation
test. (All of these should be done in "Release" mode.)

* Regression Tests - This catches obvious errors, but not major ones.

For clarity, you mean "make check" on the LLVM tools, right?

* Full Bootstrap of LLVM-GCC - LLVM-GCC is a complex program. It's our
second indication if something has gone horribly awry.

Yep.

* Nightly Testsuite - A very good suite of tests; much more extensive
than a simple bootstrap.

How does this differ from "make check" or llvm-test?

* LLVM-GCC Testsuite - Many thousands of great tests to test the many
facets of the compiler. WAY too few people run these.

By this do you mean llvm-test or the testsuite that ships with gcc? To
my knowledge, LLVM has never passed the gcc testsuite ("make check"
on llvm-gcc).

As far as I know, Dale's the only one who's been slogging through the
LLVM-GCC testsuite, finding and fixing errors. I think that there are
still many failures that should be addressed.

Depending on how you're defining LLVM-GCC, I may also be running those tests
regularly.

All four of the above should be run on at least a nightly basis (more
frequently for some, like the regression tests). Each of these is
automated, making that easy. If there are no regressions from the
above four, we could tag that revision as being potentially "valid".

Right. I would add one thing: we want to run these suites with Debug,
Release, Release+Asserts, and Debug+ExpensiveChecks builds. No one
but me seems to run the Debug+ExpensiveChecks tests, because I see things
break there regularly. It's a valuable tool for finding subtle C++ errors.
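
For concreteness, a build matrix along those lines might look like the sketch below; the autoconf flags (--enable-optimized, --enable-assertions, --disable-assertions, --enable-expensive-checks) are my best guess at the era's spellings and should be checked against configure --help.

```python
import os
import subprocess

# Assumed flag spellings for each build flavor; verify against the tree
# actually being validated before relying on them.
BUILD_MATRIX = {
    "Debug":                 [],
    "Debug+ExpensiveChecks": ["--enable-expensive-checks"],
    "Release":               ["--enable-optimized", "--disable-assertions"],
    "Release+Asserts":       ["--enable-optimized", "--enable-assertions"],
}

def build_and_check(srcdir, flavor, flags):
    objdir = "build-" + flavor.replace("+", "-").lower()
    os.makedirs(objdir, exist_ok=True)
    configure = os.path.join(os.path.abspath(srcdir), "configure")
    subprocess.check_call([configure] + flags, cwd=objdir)   # out-of-tree configure
    subprocess.check_call(["make", "-j4"], cwd=objdir)       # build
    return subprocess.call(["make", "check"], cwd=objdir) == 0

for flavor, flags in BUILD_MATRIX.items():
    ok = build_and_check("llvm", flavor, flags)   # "llvm" = source checkout
    print("%-22s %s" % (flavor, "PASS" if ok else "FAIL"))
```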

                                                 -Dave

I would say that once a validation tag is created, it stays. We don't want to
be in the business of going back through lots of tag history and checking
against new tests that didn't exist when the tags were originally created.

Perhaps for a "last known good" tag we could re-check, but I'm even hesitant
to do that. It seems like a lot of extra work for little real gain.

                                                           -Dave

That's not a bad idea. To be absolutely, totally complete, we'd want to do a
third llvm build with an llvm-gcc linked against an llvm-gcc-built llvm.
Think of the gcc bootstrap applied to LLVM. But without infrastructure to
compare object files, I'm not sure how useful this is.
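
The missing piece could be modeled on the gcc bootstrap's stage2/stage3 comparison; a sketch is below. The directory names are hypothetical, and in practice one may need to strip or ignore sections that embed timestamps or paths before a byte-for-byte compare is meaningful.

```python
import filecmp
import os
import sys

def object_files(objdir):
    """Collect relative paths of all .o files under a build directory."""
    for root, _, files in os.walk(objdir):
        for name in files:
            if name.endswith(".o"):
                yield os.path.relpath(os.path.join(root, name), objdir)

def compare_stages(stage2, stage3):
    """Byte-compare matching object files from two bootstrap stages."""
    mismatches = []
    for rel in object_files(stage2):
        a, b = os.path.join(stage2, rel), os.path.join(stage3, rel)
        if not os.path.exists(b) or not filecmp.cmp(a, b, shallow=False):
            mismatches.append(rel)
    return mismatches

if __name__ == "__main__":
    bad = compare_stages("stage2-build", "stage3-build")   # hypothetical dirs
    if bad:
        sys.exit("stage2/stage3 differ: " + ", ".join(bad))
    print("bootstrap object comparison passed")
```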

I think these are things that can be added incrementally. My goal is to get
something going soon and enhance it as we go.

                                                -Dave