How does LLVM guarantee the quality of the product with such a limited test suite?

Hi all,

After searching the whole project, I can only find about ~10,000 cases in "llvm/test" run for each commit, plus a separate test suite written in a high-level language (i.e. C/C++) to verify quality and performance. As a general backend, you know, it must be strong enough to cope with all the IR generated by the frontend. I cannot believe what I see. Did I miss something? As far as I know, many commercial compilers contain hundreds of thousands of test cases.

Furthermore, I notice that, under the llvm-project repo, there is also a clang-tests repository that uses the GCC test suite (http://llvm.org/svn/llvm-project/clang-tests/trunk/gcc-4_2-testsuite/). Is that test suite used by LLVM to guarantee quality before a release?

Any input is appreciated. Thank you.

> After searching the whole project, I can only find about ~10,000 cases in
> "llvm/test" run for each commit, plus a separate test suite written in a
> high-level language (i.e. C/C++) to verify quality and performance. As a
> general backend, you know, it must be strong enough to cope with all the
> IR generated by the frontend. I cannot believe what I see. Did I miss
> something?

Hi,

You're missing all the Clang, compiler-rt and libc++ tests in their own
repositories. That adds a few more tens of thousands of tests to the
bundle. But that's just the regression / base validation; the release
has a second stage which is, I believe, less visible than GCC's. We also
run the test-suite and benchmarks on trunk validation, which is
something I believe GCC doesn't do, so releases have far fewer bugs than
GCC's to begin with. GCC focuses on fixing all bugs at release time,
while LLVM focuses on fixing them as they happen, which is much easier
and more stable for all developers.
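
For a sense of what those ~10,000 cases look like: each file in
"llvm/test" is a small "lit" test, typically LLVM IR with a RUN line
driving one tool and FileCheck verifying its output. A minimal sketch
(the function and the exact instruction matched are illustrative):

    ; RUN: llc -mtriple=armv7-linux-gnueabihf %s -o - | FileCheck %s

    ; Each test pins down one narrow behaviour; here, that a trivial
    ; addition selects the ARM 'add' instruction.
    define i32 @sum(i32 %a, i32 %b) {
    ; CHECK-LABEL: sum:
    ; CHECK: add r0, r0, r1
      %s = add i32 %a, %b
      ret i32 %s
    }

Because each case checks a single transformation rather than a whole
program, one case can stand in for a broad class of inputs.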

> Furthermore, I notice that, under the llvm-project repo, there is also a
> clang-tests repository that uses the GCC test suite
> (http://llvm.org/svn/llvm-project/clang-tests/trunk/gcc-4_2-testsuite/).
> Is that test suite used by LLVM to guarantee quality before a release?

The GCC test-suite, AFAIK, has very poor quality control over what's
considered a pass or a failure, and it's common for GCC to be released
with thousands of failures on those tests. Some people may run it, but I
honestly don't trust it myself, nor do I have the time to sift through
every single test to determine whether a failure is a compiler error or
a test error. You can't assume that just because GCC runs *more* tests,
what they're testing is more *thorough*. There are also lots of tests
with erratic behaviour, which only adds noise to the process.
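
For contrast, a GCC test is usually a C source annotated with DejaGnu
directives, and whether a run counts as PASS or FAIL is decided by the
harness interpreting those directives. A rough sketch (the directive
spellings follow the gcc/testsuite conventions; the scanned pattern is
illustrative):

    /* { dg-do compile } */
    /* { dg-options "-O2" } */

    int sum(int a, int b)
    {
      return a + b;
    }

    /* The harness greps the generated assembly for a pattern; whether a
       mismatch is a compiler bug or a stale pattern is exactly the
       triaging problem described above. */
    /* { dg-final { scan-assembler "add" } } */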

The release process also involves passing standard compiler benchmarks
on the part of the base testers, and higher-level applications (like
Chromium) from the community. Different targets may get different levels
of community interest, but most targets have an additional validation
phase inside companies like ARM, MIPS, Intel, Apple, Google, Sony,
Qualcomm, etc. They all have internal workloads that represent a larger
piece of real-world code than the test-suite can offer. Whenever those
workloads fail, we get bug reports. It's also good practice to add a
snippet to the test-suite or the regression tests in these cases.

As a separate quality control, there are a few efforts tracking
trunk/releases by building the Linux kernel, Debian, FreeBSD, Mandriva,
OpenEmbedded and other large-scale projects. Whenever something breaks
in those projects, bugs are reported and fixed in the earliest possible
stable release.

I think it's a pretty solid validation process for both trunk and releases.

cheers,
--renato

I've found the GCC test suite pretty useful for my out-of-tree research
backend, though it does require some initial work disabling tests that
exercise GCC features unsupported by Clang. I haven't yet switched to
using it, but Ed Jones at Embecosm did some work on making the GCC test
suite easier to use with Clang:
http://www.embecosm.com/2015/04/21/flexible-runtime-testing-of-llvm-on-embedded-systems/

Alex

Renato Golin via llvm-dev wrote:
> [snip]
> They all have internal workloads that represent a larger piece of
> real-world code than the test-suite can offer. Whenever those
> workloads fail, we get bug reports. It's also good practice to add a
> snippet to the test-suite or the regression tests in these cases.
Hear, hear. The llvm/test/... and clang/test/... suites are no more
than "smoke tests" in my opinion. We do a LOT more than that internally.
--paulr

> The GCC test-suite, AFAIK, has very poor quality control over what's
> considered a pass or a failure,

???
What makes you say this?

> and it's common for GCC to be released with thousands of failures on
> those tests.

Also not correct.

https://gcc.gnu.org/gcc-4.4/criteria.html

It specifies a zero-regression policy for primary platforms.

Look, I love LLVM as much as the next guy, but in the 15 years I worked
on GCC, through lots of major and minor releases, I can't remember a
single release with "thousands" of failures.

Hi Daniel,

This was not meant as a GCC vs LLVM rant. I have no affiliation and no agenda.

I was merely commenting on the quality of compiler test suites, and how
valuable it would be to use the GCC tests in LLVM (or vice versa). I
agree with Paul that the LLVM tests are little more than a smoke screen,
and from what I've seen, the GCC tests are just a bigger smoke screen.
I would first try to understand what in the GCC suite is complementary
to ours, and what's redundant, before dumping it in.

I may be wrong, and my experience is largely around Linaro (kernel,
toolchain, Android), so it may very well be biased. These are the data
points I have for my statements:

1. GCC trunk is less stable than LLVM because of the lack of general buildbots.
* Testing a new patch means comparing the test results (including
breakages) against the previous commit and checking the differences.
This is a poor definition of "pass", especially when the number of
failures is large.
* On ARM and AArch64, the number of failures is around a couple of
thousand (I don't know the exact figure). AFAIK, these are not marked
XFAIL in any way, but are known to be broken for one reason or
another.
* The set of failures is different for different sub-architectures,
and ARM developers have to know what's good and what's not based on
that. If XFAILs were used more consistently, they wouldn't have this
problem (see the lit sketch after this list for how LLVM marks expected
failures). I hear some people don't like to XFAIL because they want to
"one day fix the bug", but that's a personal opinion on the validity
of XFAILs.
* Linaro monthly releases go out with those failures, and the fact
that they keep going out means the FSF releases do, too. This is a huge
cost on the release process, since it needs complicated diff programs
and often incurs manual analysis.
* Comparing the previous release against the new one won't account for
new features/bugs that are introduced, and not all bugs make it to
Bugzilla. We have the same problem in LLVM, but our developers know
more or less what's being done. Not all of us track every new feature
introduced by GCC, so tracking their new bugs would be a major task.

2. Linux kernel and Android builds run into increasing trouble with each new GCC release.
* I heard from both kernel and Android engineers that every new GCC
release shows more failures on their code than the previous upgrade
did, i.e. GCC 4.8->4.9 had a bigger delta than 4.7->4.8.
* The LLVMLinux group reported more trouble moving between two GCC
releases than porting to LLVM.
* Most problems are due to new warnings and errors, but some are bugs
that were caught by neither the regression process nor the release process.
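
As promised above, here is how lit marks an expected failure, which
keeps the known-broken set machine-checked rather than tribal knowledge.
A minimal sketch reusing the earlier test purely for shape (the target
and body are illustrative):

    ; "XFAIL: arm" makes the harness report a failing run as XFAIL
    ; rather than FAIL on targets whose triple matches "arm", and flag
    ; it as XPASS (a hard error) if it ever unexpectedly starts passing.
    ; XFAIL: arm
    ; RUN: llc -mtriple=armv7-linux-gnueabihf %s -o - | FileCheck %s

    define i32 @sum(i32 %a, i32 %b) {
    ; CHECK: add r0, r0, r1
      %s = add i32 %a, %b
      ret i32 %s
    }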

I understand it's impossible to catch all bugs, and that both the Linux
kernel and Android are large projects, but this demonstrates that the
GCC release process is as good (or as bad) as our own, just with a
different mindset (focusing on release validation rather than trunk) and
run by a different community (which most of us don't track).
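
To make the comparison workflow from point 1 concrete, here is a naive
Python sketch of what those diff programs do: parse two DejaGnu summary
(.sum) files and report tests whose status changed. The file names are
hypothetical, and real scripts also cope with flaky tests, multilibs and
reordering:

    #!/usr/bin/env python3
    import sys

    STATUSES = ("PASS", "FAIL", "XFAIL", "XPASS", "UNRESOLVED", "UNSUPPORTED")

    def read_results(path):
        """Map test name -> status from a DejaGnu .sum file, whose
        result lines look like 'FAIL: gcc.dg/foo.c -O2 ...'."""
        results = {}
        with open(path) as f:
            for line in f:
                for status in STATUSES:
                    prefix = status + ": "
                    if line.startswith(prefix):
                        results[line[len(prefix):].strip()] = status
                        break
        return results

    # Usage: diff-sum.py baseline.sum current.sum
    baseline = read_results(sys.argv[1])
    current = read_results(sys.argv[2])

    # Print every test whose status differs from the baseline run.
    for test, status in sorted(current.items()):
        old = baseline.get(test)
        if old != status:
            print(f"{old or 'NEW'} -> {status}: {test}")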

My conclusion is that, if we're ever going to incorporate the GCC
test-suite, it'll take a lot of time fudging it into a clean pass/fail,
and for every new version of it we'll have the same amount of work.
Reiterating Paul's point, I don't believe those tests have sufficient
value to be worth the continuous effort. That means we'll have to rely
on companies to do secondary screening for LLVM, something that I
believe GCC would rather not do, but which we seem to be OK with.

Then again, I may be completely wrong.

cheers,
--renato

> I was merely commenting on the quality of compiler test suites, and how
> valuable it would be to use the GCC tests in LLVM (or vice versa). I
> agree with Paul that the LLVM tests are little more than a smoke screen,
> and from what I've seen, the GCC tests are just a bigger smoke screen.

Possible language issue here....

A "smoke screen" is something to obscure/conceal whatever is behind it,
by making it hard to see through the smoke.
A "smoke test" is the initial power-on to see if your new hardware
instantly starts billowing smoke (catches on fire).

I view the Clang/LLVM regression tests as a "smoke test" for Clang/LLVM;
the initial set of tests that tells you whether it is worth proceeding
to more thorough/expensive testing. Not a smoke screen!
Thanks,
--paulr

> A "smoke screen" is something to obscure/conceal whatever is behind it,
> by making it hard to see through the smoke.

Hmm, not this...

> A "smoke test" is the initial power-on to see if your new hardware
> instantly starts billowing smoke (catches on fire).

That's what I understood; no idea why I used the other term...

Sorry!
--renato

> 1. GCC trunk is less stable than LLVM because of the lack of general
> buildbots.

GCC has plenty of buildbots; it has no revert-on-breakage policy.

> * Testing a new patch means comparing the test results (including
> breakages) against the previous commit and checking the differences.
> This is a poor definition of "pass", especially when the number of
> failures is large.

This is an artifact of the lack of a revert-on-breakage policy.

> * On ARM and AArch64, the number of failures is around a couple of
> thousand (I don't know the exact figure). AFAIK, these are not marked
> XFAIL in any way, but are known to be broken for one reason or
> another.

That sounds like a failure on the part of the ARM developers.

> * Linaro monthly releases go out with those failures, and the fact
> that they keep going out means the FSF releases do, too.

I expect this would change if someone pushed.

Here is, for example, the failure list for i686-pc-linux-gnu for each 4.9
release:
https://gcc.gnu.org/gcc-4.9/buildstat.html

> This is a huge cost on the release process, since it needs complicated
> diff programs and often incurs manual analysis.

You say all this as if it is a GCC testsuite issue.

It sounds completely like a process issue that hasn't been raised and dealt
with.

I.e. something that could easily happen to LLVM.

> GCC has plenty of buildbots; it has no revert-on-breakage policy.

I stand corrected.

> That sounds like a failure on the part of the ARM developers.

Or in my knowledge. :)

> I expect this would change if someone pushed.

Probably.

> Here is, for example, the failure list for i686-pc-linux-gnu for each 4.9
> release: https://gcc.gnu.org/gcc-4.9/buildstat.html

These sound a lot more thorough than what we're doing at the moment. I'm
not surprised by that number of failures, but I (personally) wouldn't
care much if a sequence of arbitrary passes (not in -O<N>) failed. I'd
just mark it as XFAIL, but as I said, that's personal.

> You say all this as if it is a GCC testsuite issue.

Didn't mean to.

> It sounds completely like a process issue that hasn't been raised and
> dealt with. I.e. something that could easily happen to LLVM.

Absolutely!

My point is that we already have our own complexity, just like GCC.
Merging the two complex validation systems *may* bring more harm than
good, that's all.

It would be a worthy project if someone were willing to take on the
hard work, but I don't know many people / companies for whom it would
be justified, especially because I don't know how much of their suite
is redundant with ours, or within itself, to warrant an extra run.

Hope that's clearer... :)

cheers,
-renato

Hi Renato,

It's a non-technical concern, but please keep in mind that the GCC testsuite is GPL-licensed. You're effectively discussing a fork of their testsuite (to which I think there are a number of technical challenges), but before actually landing it, we'd want to carefully discuss whether it should be included as part of the LLVM project. If you want to pursue this, doing it on GitHub (or some similar service) would be a better place to get started.

-Chris

Hi Chris,

That's a good point, which I hadn't considered, but I wasn't myself
proposing we merge it at all (for technical reasons). Here, I used
"incorporate" as in "into the testing harness", not as in "into the
source tree". Sorry (again) for the misuse of terms.

My alternative to the original proposal would be to have bots running
DejaGnu with Clang instead of GCC, but even in that scenario, which
avoids the legal issue, the work to separate signal from noise would
still be too much for anything less than a serious effort.

cheers,
--renato