Digging into Linux unexpected successes

On an Ubuntu 14.04 x86_64 system, I’m seeing the following results:

cmake/ninja/clang-3.6:

Testing: 395 test suites, 24 threads
395 out of 395 test suites processed - TestGdbRemoteKill.py
Ran 395 test suites (0 failed) (0.000000%)
Ran 478 test cases (0 failed) (0.000000%)

Unexpected Successes (6)
UNEXPECTED SUCCESS: LLDB (suite) :: TestConstVariables.py
UNEXPECTED SUCCESS: LLDB (suite) :: TestEvents.py
UNEXPECTED SUCCESS: LLDB (suite) :: TestMiBreak.py
UNEXPECTED SUCCESS: LLDB (suite) :: TestMiGdbSetShow.py
UNEXPECTED SUCCESS: LLDB (suite) :: TestMiInterpreterExec.py
UNEXPECTED SUCCESS: LLDB (suite) :: TestMiSyntax.py

cmake/ninja/gcc-4.9.2:

395 out of 395 test suites processed - TestMultithreaded.py
Ran 395 test suites (1 failed) (0.253165%)
Ran 457 test cases (1 failed) (0.218818%)
Failing Tests (1)
FAIL: LLDB (suite) :: TestRegisterVariables.py

Unexpected Successes (6)
UNEXPECTED SUCCESS: LLDB (suite) :: TestDataFormatterSynth.py
UNEXPECTED SUCCESS: LLDB (suite) :: TestMiBreak.py
UNEXPECTED SUCCESS: LLDB (suite) :: TestMiGdbSetShow.py
UNEXPECTED SUCCESS: LLDB (suite) :: TestMiInterpreterExec.py
UNEXPECTED SUCCESS: LLDB (suite) :: TestMiSyntax.py
UNEXPECTED SUCCESS: LLDB (suite) :: TestRaise.py

I will look into those. I suspect some of them are compiler-version specific, much like some of the OS X ones I dug into earlier.

Hi Todd,

I attached the statistics for the last 100 test runs on the Linux x86_64 builder (http://lab.llvm.org:8011/builders/lldb-x86_64-ubuntu-14.04-cmake). The data might be a little noisy because of actual test failures caused by a temporary regression, but it should give you a general idea of what is happening.

I will try to create statistics where the results are displayed separately for each compiler and architecture to get a more detailed view, but it will take some time. If you want, I can include the list of build numbers for every outcome, but it will be a very long list (currently it is only included for Timeout and Failure).

Tamas

test-statistics (26.5 KB)

Hi Todd,

I attached the statistics for the last 100 test runs on the Linux x86_64 builder (http://lab.llvm.org:8011/builders/lldb-x86_64-ubuntu-14.04-cmake). The data might be a little noisy because of actual test failures caused by a temporary regression, but it should give you a general idea of what is happening.

Thanks, Tamas! I'll have a look.

I will try to create statistics where the results are displayed separately for each compiler and architecture to get a more detailed view, but it will take some time. If you want, I can include the list of build numbers for every outcome, but it will be a very long list (currently it is only included for Timeout and Failure).

I'll know better when I have a look at what you provided. The gap I see right now is that we're not adequately dealing with unexpected successes across different configurations. Any reporting around that is helpful.

Thanks!

Wow Tamas, this is perfect. Thanks for pulling that together!

Don’t worry about the bigger file.

Thanks much.

-Todd

Just to make sure I’m reading these right:

========== Compiler: totclang Architecture: x86_64 ==========

UnexpectedSuccess
TestMiInterpreterExec.MiInterpreterExecTestCase.test_lldbmi_settings_set_target_run_args_before (250/250 100.000000%)
TestRaise.RaiseTestCase.test_restart_bug_with_dwarf (119/250 47.600000%)
TestMiSyntax.MiSyntaxTestCase.test_lldbmi_process_output (250/250 100.000000%)
TestInferiorAssert.AssertingInferiorTestCase.test_inferior_asserting_expr_dwarf (195/250 78.000000%)

This is saying that, running the tests with top-of-tree clang on x86_64, we see (for example):

  • test_lldbmi_settings_set_target_run_args_before() is always passing,
  • test_inferior_asserting_expr_dwarf() is always passing, and
  • test_restart_bug_with_dwarf() is failing more often than passing.

This is incredibly useful for figuring out the true disposition of a test on different configurations. What method did you use to gather that data?

Yes, you are reading it correctly (by totclang we mean the top-of-tree clang at the time the test suite was run).

The cmake builder runs in GCE and uploads all test logs to Google Cloud Storage (including full host logs and server logs). I used a Python script (also running in GCE) to download this data and parse the test output from the test traces.
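(For illustration only, here is a minimal sketch of the kind of parsing such a script might do, assuming Python 2 and that the downloaded logs contain suite-level lines like the "UNEXPECTED SUCCESS: ..." entries quoted earlier in this thread. The log location and the aggregation are assumptions, not Tamas's actual script.)

    import collections
    import glob
    import re

    # Matches lines like:
    #   UNEXPECTED SUCCESS: LLDB (suite) :: TestMiSyntax.py
    PATTERN = re.compile(r"^UNEXPECTED SUCCESS: LLDB \(suite\) :: (\S+)")

    counts = collections.Counter()
    log_files = glob.glob("logs/*.log")  # hypothetical location of the downloaded logs
    for path in log_files:
        with open(path) as f:
            for line in f:
                match = PATTERN.match(line.strip())
                if match:
                    counts[match.group(1)] += 1

    total = len(log_files)
    for test, n in counts.most_common():
        print "%s (%d/%d %.2f%%)" % (test, n, total, 100.0 * n / total)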

Are the GCE logs public? If not, do you know whether the buildbot protocol supports polling this info straight from the buildbot by some other method? (The latter is ultimately preferable so we can pull from multiple builders, e.g. macosx, freebsd, etc.) Worst case, I suspect the web interface could be scraped by a bot and the data collected, but hopefully that isn't necessary.

Thanks again for sharing the info!

You are probably looking for this: http://lab.llvm.org:8011/json/help

Yep, looks like there's a decent interface to it. Thanks, Siva!

I see there’s some docs here too:
http://docs.buildbot.net/current/index.html

IIRC, doing it from Python is straightforward and simple:

json.load(urllib2.urlopen(<…>))

It could take a little more than that, but not much.
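For example, a minimal sketch along those lines (assuming Python 2 and the builder name from the URL above; the exact endpoints and field names are described at http://lab.llvm.org:8011/json/help and may vary by buildbot version, so treat this as illustrative):

    import json
    import urllib2

    BASE = "http://lab.llvm.org:8011/json/builders"
    BUILDER = "lldb-x86_64-ubuntu-14.04-cmake"

    def get_build(number):
        # number = -1 asks the JSON API for the most recent build.
        url = "%s/%s/builds/%d" % (BASE, BUILDER, number)
        return json.load(urllib2.urlopen(url))

    latest = get_build(-1)
    print latest.get("number"), latest.get("results"), latest.get("text")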

Unfortunately the GCE logs aren't public at the moment, and their size (~30MB/build) doesn't make it easy to make them accessible in any way. They also aren't much more machine-parsable than the stdout from the build.

I think downloading data with the JSON API won't help, because it only lists the failures displayed on the web UI, which don't contain full test names and don't contain info about the UnexpectedSuccess-es. If you want to download the results from the web interface, then I am pretty sure we have to parse the stdout of the test runner and change dotest so that it displays more information about the outcome of the different tests.

I fully support making those changes to dotest. Also, it would be nice to have a stats cron job running alongside the master with a web UI, something like the Chromium Main Console. It's a tall ask, but at the very least we should have dotest.py put out machine-readable output. This could be done on request (as in, when a certain flag is set).
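(To make the "machine-readable output behind a flag" idea concrete, here is a rough sketch of the sort of thing that could be emitted, one JSON record per test outcome. This is purely illustrative; it is not an existing dotest.py flag or format, and the field names are made up.)

    import json
    import sys

    def emit_result(test_file, test_case, outcome):
        # Hypothetical record shape, invented for illustration.
        record = {
            "test_file": test_file,   # e.g. "TestMiSyntax.py"
            "test_case": test_case,   # e.g. "test_lldbmi_process_output"
            "outcome": outcome,       # "success", "failure", "unexpected_success", ...
        }
        sys.stdout.write(json.dumps(record) + "\n")

    emit_result("TestMiSyntax.py", "test_lldbmi_process_output", "unexpected_success")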

Change http://reviews.llvm.org/D12831, currently in review (waiting on Windows results), adds a test event stream that supports pluggable test event formatters. The first formatter I've added is JUnit/XUnit output, to support the typical JUnit/XUnit result handling built into most commercial and open-source CI solutions. The eventing mechanism, though, is intended to support a much wider range of applications, including outputting to different formats, displaying test results as they occur in different viewers/controllers, and so on.
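(To illustrate the pluggable-formatter idea in general terms only: the runner forwards each test event to whichever formatter was selected, and the formatter decides how to render it. This sketch reuses the hypothetical record shape from above and is not the interface from D12831; see the review itself for the real API. Python 2 is assumed.)

    class TextFormatter(object):
        # Prints each event as it arrives.
        def handle_event(self, event):
            print "%s: %s.%s" % (event["outcome"], event["test_file"], event["test_case"])

    class CountingFormatter(object):
        # Accumulates events instead of printing them, e.g. to summarize at the end.
        def __init__(self):
            self.unexpected_successes = 0

        def handle_event(self, event):
            if event["outcome"] == "unexpected_success":
                self.unexpected_successes += 1

    formatter = TextFormatter()
    formatter.handle_event({"outcome": "unexpected_success",
                            "test_file": "TestMiSyntax.py",
                            "test_case": "test_lldbmi_process_output"})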

-Todd