Rotten Green Tests project

This note describes the first part of the Rotten Green Tests project.

"Rotten Green Tests" is the title of a paper presented at the 2019
International Conference on Software Engineering (ICSE). Stripped to
its essentials, the paper describes a method to identify defects or
oversights in executable tests. The method has two steps:

(a) Statically identify all "test assertions" in the test program.
(b) Dynamically determine whether these assertions are actually
executed.

A test assertion that has been coded but is not executed is termed a
"rotten green" test, because it allows the test to be green (i.e.,
pass) without actually enforcing the assertion. In many cases it is
not immediately obvious, just by reading the code, that the test has a
problem; the Rotten Green Test method helps identify these.

The paper describes using this method on projects coded in Pharo
(which appears to be a Smalltalk descendant) and so the specific tools
are obviously not applicable to a C++ project such as LLVM. However,
the concept can be easily transferred.

I applied these ideas to the Clang and LLVM unittests, because these
are all executable tests that use the googletest infrastructure. In
particular, all "test assertions" are easily identified because they
make use of macros defined by googletest; by modifying these macros,
it should be feasible to keep track of all assertions, and report
whether they have been executed.

The mildly gory details can be saved for the code review and of course
an LLVM Dev Meeting talk, but the basic idea is: Each test-assertion
macro will statically allocate a struct identifying the source
location of the macro, and have an executable statement recording
whether that assertion has been executed. Then, when the test exits,
we look for any of these records that haven't been executed, and
report them.
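
As a rough sketch of how that can be arranged (this is not the actual
D97566 implementation; RGT_TRACKED, RGTRecord, and
reportRottenAssertions are hypothetical names, and the sketch assumes
GCC or Clang on an ELF target where the linker provides __start_/__stop_
symbols for a named section). The real patch wraps the existing
googletest macros rather than introducing a new one.

  #include <cstdio>

  struct RGTRecord {
    const char *File;
    int Line;
    bool Executed;
  };

  // Constant-initialize a record for this assertion site into a dedicated
  // linker section, so the record exists in the binary even if the
  // assertion is never reached; mark it executed only when control
  // actually gets here at run time.
  #define RGT_TRACKED(ASSERTION)                                          \
    do {                                                                  \
      static RGTRecord RGTRec __attribute__((section("rgt_records"))) =   \
          {__FILE__, __LINE__, false};                                    \
      RGTRec.Executed = true;                                             \
      ASSERTION;                                                          \
    } while (0)

  // The linker defines these bounds for the "rgt_records" section
  // (assuming at least one tracked assertion exists).  Walking them at
  // exit -- e.g., from an atexit handler or a gtest environment teardown,
  // not shown here -- finds assertion sites that were compiled but never
  // executed.
  extern RGTRecord __start_rgt_records[];
  extern RGTRecord __stop_rgt_records[];

  inline int reportRottenAssertions() {
    int Rotten = 0;
    for (RGTRecord *R = __start_rgt_records; R != __stop_rgt_records; ++R)
      if (!R->Executed) {
        std::fprintf(stderr, "rotten assertion at %s:%d\n", R->File, R->Line);
        ++Rotten;
      }
    return Rotten;
  }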

I've gotten this to work in three environments so far:
1) Linux, with gcc as the build compiler
2) Linux, with clang as the build compiler
3) Windows, with MSVC as the build compiler

The results are not identical across the three environments. Besides
the obvious case that some tests simply don't operate on both Linux
and Windows, there are some subtleties that cause the infrastructure
to work less well with gcc than with clang.

The infrastructure depends on certain practices in coding the tests.

First and foremost, it depends on tests being coded to use the
googletest macros (EXPECT_* and ASSERT_*) to express individual test
assertions. This is generally true in the unittests, although not as
universal as might be hoped; ClangRenameTests, for example, buries a
handful of test assertions inside helper methods, which is a plausible
coding tactic but makes the RGT infrastructure less useful (because
many separate tests funnel through the same EXPECT/ASSERT macros, and
so RGT can't discern whether any of those higher-level tests are
rotten).
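
As a hedged illustration of that pattern (the helper and the code under
test here are made up, not taken from ClangRenameTests):

  #include "gtest/gtest.h"
  #include <string>

  // Hypothetical stand-in for the code under test.
  static std::string renameAToB(const std::string &Input) {
    std::string Out = Input;
    for (char &C : Out)
      if (C == 'a')
        C = 'b';
    return Out;
  }

  // All the real checking funnels through this one EXPECT_EQ, so RGT sees
  // a single assertion site shared by every caller and cannot tell which
  // of the higher-level tests (if any) actually exercised it.
  static void expectRenamed(const std::string &Before,
                            const std::string &After) {
    EXPECT_EQ(renameAToB(Before), After);
  }

  TEST(RenameSketch, Simple)  { expectRenamed("int a;", "int b;"); }
  TEST(RenameSketch, InBlock) { expectRenamed("{ int a; }", "{ int b; }"); }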

Secondly, "source location" is constrained to filename and line number
(__FILE__ and __LINE__), therefore we can have at most one assertion
per source line. This is generally not a problem, although I did need
to recode one test that used macros to generate assertions (replacing
it with a template). In certain cases it also means gcc doesn't let
us distinguish multiple assertions, mainly in nested macros, for an
obscure reason. But those situations are not very common.
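
For example (a made-up pair, not the actual test I recoded), a macro
that expands to several assertions gives them all the __LINE__ of the
invocation, whereas the template version gives each assertion its own
line:

  #include "gtest/gtest.h"
  #include <utility>

  // Both EXPECT_EQs expand with the __LINE__ of the CHECK_PAIR
  // invocation, so RGT records only one assertion site per use.
  #define CHECK_PAIR(Value, Lo, Hi)                                       \
    EXPECT_EQ((Value).first, (Lo));                                       \
    EXPECT_EQ((Value).second, (Hi))

  // Recoded as a template, each EXPECT_EQ sits on its own source line
  // and is tracked separately.
  template <typename T>
  void checkPair(const std::pair<T, T> &Value, T Lo, T Hi) {
    EXPECT_EQ(Value.first, Lo);
    EXPECT_EQ(Value.second, Hi);
  }

  TEST(PairSketch, MacroForm)    { CHECK_PAIR(std::make_pair(1, 2), 1, 2); }
  TEST(PairSketch, TemplateForm) { checkPair(std::make_pair(1, 2), 1, 2); }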

There are a noticeable number of false positives, with two primary
sources. One is that googletest has a way to mark a test as DISABLED;
this test is still compiled, although never run, and all its
assertions will therefore show up as rotten. The other is due to the
common LLVM practice of making environmental decisions at runtime
rather than compile time; for example, using something like 'if
(isWindows())' rather than an #ifdef. I've recoded some of the easier
cases to use #ifdef, in order to reduce the noise.
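
A made-up example of that recoding (isWindows and getPathSeparator are
stand-ins, not real LLVM APIs): with the runtime check, whichever branch
is not taken leaves its assertion compiled but unexecuted, so RGT flags
it; with the #ifdef, that assertion is never compiled at all.

  #include "gtest/gtest.h"

  // Hypothetical stand-ins for the runtime query and the code under test.
  static bool isWindows() {
  #ifdef _WIN32
    return true;
  #else
    return false;
  #endif
  }
  static char getPathSeparator() { return isWindows() ? '\\' : '/'; }

  // Runtime branch: on Linux the first EXPECT_EQ is compiled but never
  // executed, so RGT reports it as rotten (a false positive).
  TEST(SeparatorSketch, RuntimeCheck) {
    if (isWindows())
      EXPECT_EQ(getPathSeparator(), '\\');
    else
      EXPECT_EQ(getPathSeparator(), '/');
  }

  // Compile-time branch: only the relevant assertion exists in the
  // binary, so there is nothing for RGT to flag.
  TEST(SeparatorSketch, IfdefCheck) {
  #ifdef _WIN32
    EXPECT_EQ(getPathSeparator(), '\\');
  #else
    EXPECT_EQ(getPathSeparator(), '/');
  #endif
  }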

Some of the noise appears to be irreducible, which means that if we
don't want bots to be constantly red, RGT reporting has to be off by
default.

Well... actually... it is ON by default; however, I turn it off in
lit. So, if you run `check-llvm` or use `llvm-lit` to run unittests,
they won't report rotten green tests. However, if you run a program
directly, it will report them (and cause the test program to exit with
a failure status). This seemed like a reasonable balance that would
make RGT useful while developing a test, without interfering with
automation.

The overall results are quite satisfying; there are many true
positives, generally representing coding errors within the tests.
A half-dozen of the unittests have been fixed, with more to come, and
the RGT patch itself is at D97566 ([WIP][RGT] Rotten Green Test
checking for LLVM and Clang unittests).

Thanks,
--paulr

Initial gut reaction would be this is perhaps a big enough patch/divergence from upstream gtest that it should go into upstream gtest first & maybe sync’ing up with a more recent gtest into LLVM? Though I realize that’s a bit of a big undertaking (either/both of those steps). How does this compare to other local patches to gtest we have?

Could one also run llvm-cov/gcov and look for unexecuted lines? What
is the advantage of this approach?

Michael

The overall concept seems interesting to me. Anything that helps reduce problems in tests that could obscure bugs etc is worth a good consideration, in my opinion.

Thanks for the feedback, David!

Initial gut reaction would be this is perhaps a big enough
patch/divergence from upstream gtest that it should go into
upstream gtest first

I was not aware that googletest itself was open source, as
opposed to something Google-internal that you guys had permission
to put into LLVM. Although it makes sense for it to be open
source on its own.

As I mentioned in the writeup, it would be great to get feedback
from people who actually understood googletest, try it on other
platforms I don't have available (e.g., Mac) and ideally help solve
a couple of the problems I have run into.

In the meantime, obviously I can keep patching up our unittests
where RGT has found something.

& maybe sync'ing up with a more recent gtest into LLVM? Though
I realize that's a bit of a big undertaking (either/both of those
steps).

A quick skim through the LLVM git log suggests that we've done
some things to connect with LLVM-ish ways of doing things, e.g.,
using raw_ostream instead of std::ostream. Other things, such
as new targets/environment (I saw mention of MinGW) probably also
should have been done upstream. Perhaps it is not obvious to
non-Googlers that there *is* an upstream project (did I mention
I was not aware?).

The last upgrade was to 1.8.0 in Jan 2017 by Chandler. I see
upstream has tagged 1.8.1 and 1.10.0 since then (no 1.9.0).
Chandler's log message does say "minimizing local changes" which
sounds somewhat hopeful. There have been changes since then.
Looking at a handful of diffs, I see some things guarded by a
conditional with LLVM in the name, and more where the change
was simply made with no effort to identify it as local. This
suggests we (LLVM) have made some effort to make future merges
not-horrible, but it would still be a chunk of work.

As a 10% project, I think I'd prefer to concentrate on fixing our
own unittests first; once those fixes are sorted out, I'd rather
integrate RGT with upstream googletest than pursue it as an
LLVM-local modification.

How does this compare to other local patches to gtest we have?

No clue. I'd have to do a diff against the upstream project, and
sort out which are due to LLVM's stripping-down procedure and
which are actual features & fixes we've done. However, as a guess
it's overall significantly bigger than anything else done locally.

I have tried to minimize RGT's intrusion into the base googletest
code (well, LLVM's base googletest code). Most of it is in a new
header and .cc file. After proper integration with the upstream
project, that might not remain true, though.

Thanks,
--paulr

Thanks, Michael!

Could one also run llvm-cov/gcov and look for unexecuted lines? What
is the advantage of this approach?

One could do that; however, it is quite clear no-one *has* done that.
The advantage of this approach is that it is automatic, and happens
while you are writing/modifying the test, instead of perhaps years
later, if ever.

--paulr

Thanks, James!

When writing googletest unit tests, I almost always run the test
executable directly. This is because it's by far the easiest way
to run the test and debug in Visual Studio ("Set startup project"
-> F5).

This is exactly the use-case I had in mind.

I wouldn't be happy if this started showing up false test
failures in some form or other, unless someone can point at an
equally simple way of doing the same thing.

None of them show up as individual *test* failures; what happens
is you get a report at the end with a list of rotten assertions.
The test *program* will have a failure exit status, but the list
of passing/failing tests is unaffected.

At the moment, the number of false positives appears to be small-ish.

A DISABLED test will show up as rotten unless you run the test program
with --gtest_also_run_disabled_tests; LLVM has 3 such test programs
out of 56, and Clang has 2 out of 22.
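
For reference, this is the kind of test involved (a made-up example):
the DISABLED_ prefix tells googletest to compile the test but skip it
at run time, so its assertion never executes unless the flag above is
given.

  #include "gtest/gtest.h"

  // Compiled into the test program but skipped by default, so RGT sees
  // the EXPECT_TRUE as an assertion that never executes.
  TEST(ExampleSuite, DISABLED_KnownBroken) {
    EXPECT_TRUE(false);
  }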

There are also some cases where EXPECT_NONFATAL_FAILURE has another
EXPECT as an argument, and that doesn't seem to work with RGT (but I
haven't figured out whether this is a case of Don't Do That, or
something else). There are two LLVM unittest programs like this.
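
For context, the pattern in question looks roughly like this (a made-up
case, not one of the two actual unittest programs):
EXPECT_NONFATAL_FAILURE, from gtest/gtest-spi.h, runs the statement and
checks that it produces exactly one nonfatal failure whose message
contains the given substring.

  #include "gtest/gtest-spi.h"
  #include "gtest/gtest.h"

  // The inner EXPECT_TRUE is itself a test assertion, and RGT does not
  // seem to track this nesting correctly yet.
  TEST(SelfCheckSketch, DetectsNonFatalFailure) {
    EXPECT_NONFATAL_FAILURE(EXPECT_TRUE(false), "");
  }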

There might be more; I haven't analyzed all the failure reports to
track down exactly what's going on. Most of the rest are likely
issues with the test programs themselves.

--paulr

Sure - that sounds good to me. Fixes to existing tests, no matter the means used to find them, seem like general goodness, but yeah - when it comes to upstreaming the infrastructure to prevent regressing this invariant, upstream to gtest and sync’d to LLVM sounds like the best path.

& generally I’d be in favor of the functionality, if the false positives can be worked out, etc. No doubt upstream gtest review will likely have some ideas about how such a thing can/should be integrated, how its results should be rendered to the user, etc.

(my ultimate favourite idea that I’d love to see more of is mutation testing - randomized and/or coverage-directed mutation of code under test that verifies tests fail if the code is mutated - this ensures it’s more than just coverage (that lines of code were executed) but that their behavior is observed and validated - but that’s a bunch of work to implement/plumb into test systems, etc)

  • Dave