increase timeout for tests?

Hi,
I've noticed that some of our tests trigger timeouts when running in debug.

TIMEOUT: test_NSError_p_dwarf (lang/objc/foundation/TestObjCMethods2.py)
TIMEOUT: test_expression_lookups_objc_dwarf
(lang/objc/foundation/TestObjCMethods.py)
TIMEOUT: test_expressions_in_literals_dsym
(lang/objc/objc-new-syntax/TestObjCNewSyntax.py)
TIMEOUT: test_frame_var_after_stop_at_interface_gmodules
(lang/objc/real-definition/TestRealDefinition.py)
TIMEOUT: test_frame_variable_across_modules_gmodules
(lang/objc/hidden-ivars/TestHiddenIvars.py)
TIMEOUT: test_kvo_with_run_command_gmodules
(functionalities/data-formatter/data-formatter-objc/TestDataFormatterObjC.py)
TIMEOUT: test_long_long_type_gmodules (types/TestIntegerTypesExpr.py)
TIMEOUT: test_long_type_gmodules (types/TestIntegerTypes.py)
TIMEOUT: test_rdar11106605_with_run_command_gmodules
(functionalities/data-formatter/data-formatter-objc/nsstring/TestDataFormatterNSString.py)
TIMEOUT: test_sb_api_listener_event_process_state
(api/multithreaded/TestMultithreaded.py)
TIMEOUT: test_scope_ambiguity_after_using_lookup_with_run_command_gmodules
(lang/cpp/namespace/TestNamespaceLookup.py)
TIMEOUT: test_set_auto_confirm_gmodules (settings/TestSettings.py)
TIMEOUT: test_step_over_load_gmodules
(functionalities/load_unload/TestLoadUnload.py)
TIMEOUT: test_uint32_t_gmodules (lang/cpp/enum_types/TestCPP11EnumTypes.py)

There's nothing fundamentally broken with these tests, it's just that
they take forever.
What do you think about increasing the timeout for single tests?
Alternatively, we could start splitting these tests into smaller
tests.

Thanks,

It would be great to look into these and see what is taking so long. Some tests are doing way too much work and should be split up. But it would also be good to check whether we have any tests that are not good tests and are just taking too much time for no reason (like the watchpoint tests that Pavel fixed).

Greg

I tend to agree with you. This is important. Pavel, is there a
systematic way you use to profile test cases?
Maybe we should consider having a testsuite mode where we warn about
slow tests (or revamp the profiling mechanism, if there is one).

Thanks!

I don't have any fancy mechanism for profiling this. I just hacked the process_file function in dosep.py to measure the time each test takes and report it if it exceeds a certain threshold. Then I just looked at the tests that seemed to be taking more time than they ought to.

This is the patch, in its entirety:

--- a/packages/Python/lldbsuite/test/dosep.py
+++ b/packages/Python/lldbsuite/test/dosep.py
@@ -489,8 +489,13 @@ def process_file(test_file, dotest_argv, inferior_pid_events):
     timeout = (os.getenv("LLDB_%s_TIMEOUT" % timeout_name) or
                getDefaultTimeout(dotest_options.lldb_platform_name))
 
+    import time
+    t = time.time()
     results.append(call_with_timeout(
         command, timeout, base_name, inferior_pid_events, test_file))
+    t = time.time() - t
+    if t > 20:
+        print("XXXXXXXXXXXXXXXXXX %s: %f" % (base_name, t))
 
     result = (name, status, passes, failures, unexpected_successes)
     timed_out = [name for name, status, _, _, _ in results

We could try to make this fancier, but I'm not sure it's worth it. It just occurred to me that you can achieve a similar effect by running the test suite with a reduced timeout threshold. Something like "LLDB_TEST_TIMEOUT=20s ninja check-lldb" should do the trick. At one point Todd also implemented a mechanism for recording what the test is doing before we time out and kill it. The idea was that this would help us determine what the test was doing when it was killed (and whether it was really stuck, or just slow). However, I've never used this, so I have no idea what state it's in.

I am not opposed to raising the timeout threshold, but before we do that, we should check whether it will actually help. I.e., whether the test is just slow, or it sometimes hangs (I think TestMultithreaded is a good candidate for the latter). Can you check what is the time those tests take to run individually? Current osx timeout (dosep.py:getDefaultTimeout) is set to 6 minutes, so if the tests normally take much shorter than this (less than 2 minutes?), I think there is something else going on, and increasing the timeout will not help.
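
Something like this would do for timing a single file (an untested sketch; the path and test name are just examples, and dotest.py's -p flag filters test files by name pattern):

    # Time one test file in isolation; adjust paths as needed.
    import subprocess
    import time

    start = time.time()
    subprocess.call(["python", "dotest.py", "-p", "TestMultithreaded.py"])
    print("TestMultithreaded.py: %.2f seconds" % (time.time() - start))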

Another interesting thing may be to check the system load while running the test suite. Nowadays, lldb will sometimes spawn N (where N is the number of cpus) threads to do debug info parsing, so when we run N tests in parallel, we may end up with N^2 threads competing for N cpus. I’ve never seen this to be an issue as our tests generally don’t have that much debug info to begin with, but I don’t think we’ve ever investigated this either…
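
To illustrate the arithmetic (a back-of-the-envelope sketch, not test-harness code):

    # N parallel test jobs, each of whose lldb may spawn N debug-info
    # parsing threads, can mean N*N threads competing for N cpus.
    import multiprocessing

    n = multiprocessing.cpu_count()
    print("%d jobs x %d parsing threads = %d threads on %d cpus"
          % (n, n, n * n, n))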

BTW: This is my list of tests that take over 20 seconds (I ran it on linux so it does not include any osx-specific ones):

20.729438 TestWatchpointCommands.py
21.158373 TestMiData.py
22.018853 TestRecursiveInferior.py
22.692744 TestInferiorCrashing.py
22.986511 TestLinuxCore.py
23.073790 TestInferiorAssert.py
23.193351 TestAnonymous.py
24.539430 TestMiStack.py
25.232940 TestUnwindExpression.py
26.608233 TestSetWatchlocation.py
28.977390 TestWatchLocation.py
29.057234 TestMultithreaded.py
29.813142 TestCPP11EnumTypes.py
30.830618 TestWatchLocationWithWatchSet.py
32.979785 TestThreadStates.py
33.435421 TestTargetWatchAddress.py
36.902618 TestNamespaceLookup.py
37.705945 TestMiSymbol.py
38.735631 TestValueObjectRecursion.py
46.870848 TestLldbGdbServer.py
49.594663 TestLoadUnload.py
52.186429 TestIntegerTypes.py
55.340360 TestMiExec.py
56.148259 TestIntegerTypesExpr.py
110.121061 TestMiGdbSetShow.py

The top one is an lldb-mi test, and it takes long for a very silly reason: when a test is xfailed because lldb-mi does not produce the expected output, pexpect will keep waiting, hoping that it will eventually produce a line that matches its regex. This times out after 30 seconds, so any xfailed lldb-mi test will take at least that long. I'm not sure how to fix this without revamping the way we do lldb-mi testing (which would probably be a good thing, but I don't see anyone queuing up to do that).
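
In pseudocode, an xfailed lldb-mi test ends up doing roughly this (a simplified sketch; the command line and pattern are just illustrative):

    # The expected line never appears, so expect() waits out its whole
    # timeout and then raises, adding ~30 seconds to the test.
    import pexpect

    child = pexpect.spawn("lldb-mi --interpreter")
    try:
        child.expect(r"\^done", timeout=30)  # pattern that never matches
    except pexpect.TIMEOUT:
        pass  # the test is then recorded as an expected failure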

I thought about this a little more, and I guess the best solution would
be to remove the timeout altogether.
This is, among others, what llvm does: timeouts can be opt-in. Setting
a timeout inherently means making an assumption about the speed at
which the machine will run the test.
It's therefore, by definition, impossible to set an "optimal" value.
That said, slow tests should be analyzed independently (as we don't
want people tapping their fingers while waiting for tests to complete).

Best,

The problem with having no timeouts is that you have to then be fairly careful how you write tests. You can't do:

while (1) {
   print("Set a breakpoint here and hit it a few times then stop the test");
}

because if setting the breakpoint fails, the test can run forever. And lldb internally waits forever to launch or attach to a process, so if you mess up attaching or launching in some situation, that again makes the test run forever. The timeouts are a backstop so that you still get useful results from the testsuite in the case of this sort of error.

Jim

I see this. So maybe we should set this to a ridiculously large value
(e.g. 30/60 minutes)? FWIW, I have a machine at home that's slow enough
that the testsuite times out fairly often, and that's not great (the
funny part is that there's nothing inherently wrong with the tests :).
Would you be fine with such a middle ground, Jim?

Thanks,

I think we should get some data before pulling numbers out of our sleeves. If we can get some numbers on the slowest machine that we have around, then we should be able to specify the timeout as some multiple of the slowest test. For example, for me the slowest test takes around 110 seconds. The timeout is 4 minutes (~ 2x) and I don’t get tests timing out. How long does the slowest test take for you? If we set the timeout as 3x or 4x that number, we should create a sufficiently large buffer and still avoid half-hour waits.

The longest test takes over 300 seconds. This is a late 2013 Mac Pro,
so definitely not the newest machine out there (but also something
fairly high spec).
For the archives, my configuration is something like: cmake -GNinja ../llvm &&
ninja check-lldb

Thanks!

Aha, in that case it definitely sounds like increasing the timeout is in order. I would still go for something less than 30 minutes, but I don't care about it that much. I may just export LLDB_TEST_TIMEOUT locally to lower it if tests that take a long time to time out start bugging me.

BTW, do you know which test is that? Is it one of the tests I have listed in one of the previous emails?

I'll re-run the test and send you a list.

Thank you!

Few examples:

360 out of 617 test suites processed - TestObjCMethods.py
XXXXXXXXXXXXXXXXXX TestObjCMethods.py: 363.726592
381 out of 617 test suites processed - TestHiddenIvars.py
XXXXXXXXXXXXXXXXXX TestHiddenIvars.py: 363.887766
387 out of 617 test suites processed - TestObjCNewSyntax.py
XXXXXXXXXXXXXXXXXX TestObjCNewSyntax.py: 363.842807
600 out of 617 test suites processed - TestDataFormatterObjC.py
XXXXXXXXXXXXXXXXXX TestDataFormatterObjC.py: 388.414883
617 out of 617 test suites processed - TestLoadUnload.py
XXXXXXXXXXXXXXXXXX TestLoadUnload.py: 363.372328

It sounds like we are timing out based on the whole test class, not the individual tests? If what you're worried about is test failures not hanging up the test suite, then you really want to do the latter.

These are all tests that contain 5 or more independent tests. That's probably why they are taking so long to run.

I don't object to having fairly long backstop timeouts, though I agree with Pavel that we should choose something reasonable based on the slowest running tests, just so a single error doesn't cause test runs to never complete, making analysis harder.

Jim

Vedant (cc:ed) is going to take a look at this as he's babysitting the
bots for the week. I'll defer the call to him.

As a first step, I think there's consensus on increasing the test timeout to ~3x the length of the slowest test we know of. That test appears to be TestDataFormatterObjC, which takes 388 seconds on Davide's machine; three times that is 1164 seconds, just under 20 minutes. So I propose 20 minutes as the timeout value.

Separately, regarding x-failed pexpect()-backed tests, I propose deleting them if they've been x-failed for over a year. That seems like a long enough time to wait for someone to step up and fix them given that they're a real testing/maintenance burden. For any group of to-be-deleted tests, like the 22 lldb-mi tests x-failed in all configurations, I'd file a PR about potentially bringing the tests back. Thoughts?

thanks,
vedant

It is unfortunate that we have to set really long test timeouts because we are timing the whole test class run, not the individual tests. It is really convenient to be able to group similar tests in one class, so they can reuse setup and common check methods, etc. But if we have to push the timeouts to something really long because these tests get charged incorrectly, that makes this strategy artificially less desirable.

When we spawn a dotest.py instance to run each test file, the runner that enforces the timeout hasn't ingested the test class, so it can't do something reasonable like count the number of tests and multiply that into the timeout to get the timeout for the whole file. I tried to hack around this, but I wasn't successful at getting hold of the test configuration in the runner so that you could figure out the number of tests there. If somebody more familiar with the test harness than I am can see a way to do that, that seems like a much better way to go.
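
Roughly, the idea would be something like this (an untested sketch; the function name and scaling policy are made up, not actual harness code):

    # Count "def test_..." methods textually, so the runner never has to
    # import the test class, and scale the per-file timeout accordingly.
    import re

    def scaled_timeout(test_file, per_test_timeout):
        with open(test_file) as f:
            num_tests = len(re.findall(r"^\s*def test_\w+", f.read(), re.M))
        return per_test_timeout * max(num_tests, 1)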

But if we can't do that, then we can increase the overall timeout. Though we might want to override that with LLDB_TEST_TIMEOUT and set it to something lower on the bots.

Jim

+1 On deleting the lldb-mi tests and increasing the timeout.

I don't see 22 lldb-mi tests xfailed everywhere. I see a lot of tests skipped, but
those are clearly marked as skipped on Windows, FreeBSD, Darwin, or Linux. I've got
a good chunk of the lldb-mi tests running on Hexagon. I don't want them deleted,
since I use them.

lldb-mi tests can be hard to debug, but I found that setting the lldb-mi log to
stdout helps a lot. In lldbmi_testcase.py, in spawnLldbMi, add this line:

    self.child.logfile = sys.stdout
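
This works because spawnLldbMi drives lldb-mi through pexpect, which echoes everything it reads and writes to whatever file-like object is set as the child's logfile. A simplified sketch (not the exact code from lldbmi_testcase.py):

    import sys
    import pexpect

    child = pexpect.spawn("lldb-mi --interpreter")  # roughly what spawnLldbMi does
    child.logfile = sys.stdout  # mirror the whole lldb-mi session to stdout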

I got the 22 number from a command which may have counted too much:
find . -name \*TestMi\*.py -exec grep -E "(unittest2\.)?expectedFailure(All)?" {} \; | wc -l

Some of the 'expectedFailureAll' decorators actually specified an OS list. I'm not planning on touching those.

There were a handful of lldb-mi tests that didn't appear to work at all, and I've filed bugs / deleted those in r327552. If you see something you feel really should stay in tree, we can bring it back.

vedant