I think you gathered stats on failing tests for me in the past. Can you dig up the failure rate for TestRaise.py’s test_restart_bug() variants on Ubuntu 14.04 x86_64? I’d like to mark it as flakey on Linux, since it passes most of the time over here, but I want to check whether that holds across all Ubuntu 14.04 x86_64 machines. (If it passes at least some of the time, I’d prefer marking it flakey so that we don’t see unexpected successes.)
Hmm, the flakey behavior may be specific to dwo. Testing it locally as unconditionally flakey on Linux fails on dwarf; all the runs I see succeed are dwo. I wouldn’t expect a difference there, but that seems to be the case.
So the request still stands, but I won’t be surprised if we find that dwo sometimes passes while dwarf doesn’t (or at least not often enough to get through the flakey setting).
Nope, no good either when I limit the flakey marking to dwo.
So perhaps I don’t understand how the flakey marking works. I thought it meant:
- Run the test.
- If it passes, record it as a successful test; then we’re done.
- Run the test again.
- If it passes this time, record it as a successful test and we’re done. If it fails, record it as an expected failure.
But that’s definitely not the behavior I’m seeing, since a flakey marking under the above scheme should never produce a failing test.
I’ll have to revisit the flakey test marking to see what it’s really doing since my understanding is clearly flawed!
The expected flakey marking works a bit differently than you described:
- Run the test.
- If it passes, record it as a successful test and we are done.
- Run the test again.
- If it passes the 2nd time, record it as an expected failure (IMO “expected flakey” would be a better result category, but we don’t have one).
- If it fails 2 times in a row, record it as a failure, because a flakey test should pass at least once in every 2 runs (this means we need a ~95% success rate to keep the build bot green most of the time). If a test isn’t passing often enough for that, it should be marked as an expected failure instead. It is done this way so we can detect the case where a flakey test gets broken completely by a new change.
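The retry behaviour above can be sketched in a few lines. This is only an illustrative sketch; the function name and the `run_test` callback are hypothetical, not the actual LLDB test-runner API:

```python
# Hypothetical sketch of the "expected flakey" retry logic described
# above; not the real LLDB test-runner implementation.

def run_flakey_test(run_test):
    """run_test() -> True on pass. Returns 'success',
    'expected failure', or 'failure'."""
    if run_test():
        return "success"           # first attempt passed: done
    if run_test():
        return "expected failure"  # failed once, passed on the retry
    return "failure"               # failed twice in a row


# A test that fails the first run but passes the retry is recorded
# as an expected failure:
outcomes = iter([False, True])
print(run_flakey_test(lambda: next(outcomes)))  # prints: expected failure
```

Note that in this scheme a flakey-marked test can still be reported as a plain failure, which is exactly the behaviour observed above.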
I checked some stats for TestRaise on the build bot, and under the current definition of expected flakey we shouldn’t mark it as flakey, because it will often fail 2 times in a row (its passing rate is ~50%), which will be reported as a failure and make the build bot red.
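To put rough numbers on that: a flakey-marked test only turns the bot red when it fails twice in a row, so assuming independent runs with per-run pass rate p, the chance of a red result on a given build is (1 - p)^2. A quick back-of-the-envelope check:

```python
# Probability that a flakey-marked test is reported as a failure,
# assuming independent runs with per-run pass rate p: both runs must fail.
def reported_failure_prob(p):
    return (1.0 - p) ** 2

print(reported_failure_prob(0.50))  # 0.25   -> red roughly every 4th build
print(reported_failure_prob(0.95))  # ~0.0025 -> red roughly 1 in 400 builds
```

So a ~50% pass rate reds the bot about a quarter of the time, while a ~95% pass rate keeps red results rare, which matches the threshold mentioned above.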
I will send you the full stats from the last 100 builds in a separate off-list mail, as it is too big for the mailing list. If somebody else is interested in it, let me know.
I created this test to reproduce a race condition in ProcessGDBRemote. Given that it tests a race condition, it cannot fail 100% of the time, but I agree with Tamas that we should keep it as XFAIL to avoid noise on the buildbots.
Okay. I think the XFAIL makes sense for the time being. Per my previous email, though, I think we should move away from unexpected success (XPASS) being a “sometimes meaningful, sometimes meaningless” signal. In almost all cases, an unexpected success is an actionable signal. I don’t want it to become the warning that everybody lives with without fixing, only for it to hide a real issue when one surfaces.
Thanks for explaining what I was seeing!