Test suite itself creating failures

Hi all,

So I was looking into why TestInferiorAssert was (still) failing on my machine, and it turned out that the root cause is that tests are not run in isolation: dotest.py runs many tests against the same LLDB context, so if a test doesn't clean up after itself properly, it can cause subsequent tests to fail incorrectly.

Is this really a good idea? Wouldn't it make more sense to always run tests individually, to guarantee consistent results?

Note that the buildbots use dosep.py. Stressing lldb the way dotest.py does is useful because it also exercises lldb's robustness. We have other stress tests (e.g. running builds in parallel with test loops) that we occasionally run in private jobs on the buildbots. Just trying to strike a balance.

As long as the tests are run in the same order by dotest.py (at least by default), we'll have deterministic failures. That also makes this a good use case for pdb, since we can break in any one of hundreds of tests. Good tests should clean up after themselves, and that also helps the tests serve as examples of lldb usage.
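
For instance, a suspect test can be instrumented to drop into pdb right where it starts. A minimal sketch (the test class and method names here are hypothetical; TestBase and self.dbg are the existing lldbtest.py conventions):

    # Break into pdb when dotest.py reaches this test in its deterministic
    # order, so state left behind by earlier tests can be inspected in place.
    import pdb
    from lldbtest import TestBase

    class SuspectTestCase(TestBase):
        def test_suspect(self):
            pdb.set_trace()      # execution stops here under dotest.py
            print(self.dbg)      # self.dbg is the shared lldb.SBDebugger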

Calling these points out on lldb.org could help. What do you think?

- Ashok

It should be fine to run multiple debuggers either serially or in parallel and have them not interfere with each other. Most GUIs that use lldb will do this, so it is worthwhile to test it. Multiple targets in the same debugger should also be relatively independent of one another, as that is another pretty useful mode. The testsuite currently stresses those features of lldb, which seems to me a good thing.
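
Something like the following exercises that mode through the SB API (a sketch; SBDebugger.Create/Destroy and CreateTarget are real SB API calls, the binary paths are just placeholders):

    # Two debuggers coexisting in one process, each with its own target --
    # the usage pattern GUIs built on lldb rely on.
    import lldb

    lldb.SBDebugger.Initialize()
    dbg_a = lldb.SBDebugger.Create()
    dbg_b = lldb.SBDebugger.Create()
    target_a = dbg_a.CreateTarget("/bin/ls")     # placeholder binaries
    target_b = dbg_b.CreateTarget("/bin/cat")
    # ... drive each session independently; neither should perturb the other ...
    lldb.SBDebugger.Destroy(dbg_a)
    lldb.SBDebugger.Destroy(dbg_b)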

Note that the testsuite mostly shares a common debugger (lldb.DBG, made in dotest.py and copied over to self.dbg in setUp in lldbtest.py). You could try not creating the singleton debugger in dotest.py and allowing setUp to make a new one. If there's state in lldb that persists across debuggers and causes testsuite failures, that is a bug we should fix, not something we should work around in the testsuite.
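
That experiment could look roughly like this in setUp/tearDown (a sketch of the proposal, not the current lldbtest.py code):

    # Give each test its own debugger instead of sharing lldb.DBG.
    import unittest
    import lldb

    class TestBase(unittest.TestCase):           # simplified stand-in
        def setUp(self):
            self.dbg = lldb.SBDebugger.Create()  # fresh debugger per test
        def tearDown(self):
            lldb.SBDebugger.Destroy(self.dbg)    # nothing leaks to the next test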

What kinds of things are the tests leaving around that are causing tests to fail? It seems to me we should make it easier to clean up the state when a target goes away, rather than just cut the Gordian knot and make all the tests run independently.

Jim

This problem has happened several times in the past. It's usually related to tests that change preferences but don't change them back at the end. Figuring out exactly what's going wrong is hard and time-consuming. I've seen problems like: test A relies on option X, test B changes option X, and test A is run twice (x86_64 and i386) with test B run between those two runs.

Every time I debugged one of these, the bug was in the test suite itself (missing or incorrect cleanup). But, like Jim said, there may be real lldb bugs that these runs could uncover.

Since there's the possibility of uncovering actual lldb bugs, I would prefer to keep the suite as is (with one debugger for all) and fix the failing tests. Making it easier to reset some of lldb's options when changing targets is a good thing too, but some of the test suite's problems come from a test that sets formatting options and doesn't reset them to the defaults at the end. Those can only be fixed on a test-by-test basis.
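
For example, a test that changes a formatting option can register its own restore step. addTearDownHook and runCmd are the existing lldbtest.py mechanisms; the particular setting below is just illustrative:

    # Inside a TestBase-derived test: the hook runs however the test exits,
    # so the next test sees the default value again.
    def test_with_custom_format(self):
        self.runCmd("settings set target.max-children-count 5")
        self.addTearDownHook(
            lambda: self.runCmd("settings clear target.max-children-count"))
        # ... rest of the test runs with the modified setting ...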

  Filipe

Hi Greg,

I got around to looking at pr15415 today, and noticed that TestRecursiveInferior.py is a case where r132021 is an issue. Specifically, this code looks at the case where two subsequent frames have the same pc. It correctly distinguishes between recursion and an infinite loop by testing for unique canonical frame addresses, but it then goes on to assume that the unwind should stop if GetFP() returns a null frame pointer. recursive_inferior/Makefile can trigger exactly that by building with -fomit-frame-pointer.

The attached patch fixes the test by nixing the code block that I can't explain: the commit message for r132021 didn't say why the logic changed for UnwindLLDB, r132021 doesn't add or fix any test cases, and I can't see any regressions locally with the attached patch. Do you figure that a stronger test is required to distinguish between a good unwind and a bad one? If so, what would be required to create a test case for the existing code?

Note also that the overrides of StackUsesFrames() in trunk all return true.

Cheers,

- Ashok

pr15415.patch (1.61 KB)

Yes, in this code segment

    if (!m_frames.empty())
    {
        if (m_frames.back()->start_pc == cursor_sp->start_pc)
        {
            if (m_frames.back()->cfa == cursor_sp->cfa)
                goto unwind_done; // Infinite loop where the current cursor is the same as the previous one...
            else if (abi && abi->StackUsesFrames())
            {
                // We might have a CFA that is not using the frame pointer and
                // we want to validate that the frame pointer is valid.
                if (reg_ctx_sp->GetFP() == 0)
                    goto unwind_done;
            }
        }
    }

I agree that the reg_ctx_sp->GetFP() == 0 check is incorrect. This assumes that the code is using the fp register as a frame pointer (instead of as a gpr). A canonical frame address (CFA) of 0 obviously means that a stack walk should stop, but the CFA is not necessarily set in terms of the fp register (e.g. in -fomit-frame-pointer style code).

This code should be removed.

I'm not sure the m_frames.back()->cfa == cursor_sp->cfa bit is doing anything either. Over in RegisterContextLLDB::InitializeNonZerothFrame():

    // A couple of sanity checks..
    if (cfa_regval == LLDB_INVALID_ADDRESS || cfa_regval == 0 || cfa_regval == 1)
    {
        UnwindLogMsg ("could not find a valid cfa address");
        m_frame_type = eNotAValidFrame;
        return;
    }

    // If we have a bad stack setup, we can get the same CFA value multiple times -- or even
    // more devious, we can actually oscillate between two CFA values. Detect that here and
    // break out to avoid a possible infinite loop in lldb trying to unwind the stack.
    addr_t next_frame_cfa;
    addr_t next_next_frame_cfa = LLDB_INVALID_ADDRESS;
    if (GetNextFrame().get() && GetNextFrame()->GetCFA(next_frame_cfa))
    {
        bool repeating_frames = false;
        if (next_frame_cfa == m_cfa)
        {
            repeating_frames = true;
        }
        else
        {
            if (GetNextFrame()->GetNextFrame() && GetNextFrame()->GetNextFrame()->GetCFA(next_next_frame_cfa)
                && next_next_frame_cfa == m_cfa)
            {
                repeating_frames = true;
            }
        }
        if (repeating_frames && abi->FunctionCallsChangeCFA())
        {
            UnwindLogMsg ("same CFA address as next frame, assuming the unwind is looping - stopping");
            m_frame_type = eNotAValidFrame;
            return;
        }
    }

Thanks Jason, all good points.

I removed the code block quoted above from UnwindLLDB (see r191430), since it's neither correct, complete, nor reachable. The code block from RegisterContextLLDB that you quoted got me wondering how to test for this stack setup.

Also, if we can oscillate between two CFAs, then wouldn't a cycle of any length be possible? Perhaps the more robust defense is to stop the unwind if the new frame's CFA already appears anywhere in the thread's call stack...
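
The idea, sketched at the Python SB API level (the real check would live inside UnwindLLDB as each frame is added; SBFrame.GetCFA is used here only to illustrate the intent):

    # Reject a candidate CFA that any already-unwound frame has used,
    # catching cycles of any length rather than just two-frame oscillation.
    def cfa_seen_before(thread, candidate_cfa):
        seen = set()
        for frame in thread:             # an SBThread iterates its frames
            seen.add(frame.GetCFA())
        return candidate_cfa in seen

Keeping the already-seen CFAs in a set keeps the check linear in the depth of the stack.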

- Ashok

Any time we're getting a CFA repeat (assuming abi->FunctionCallsChangeCFA()), that would be a sign that the unwinder has made a mistake. Looking over my notes from when I was working on the patch in r166757, I don't see why I added that specific check for a two-frame cycle. To be honest, I wonder if I was just dreaming up a problem and adding a patch to address it.

I agree that if we are going to worry about this, we should check all frames to see if a given CFA has been seen already. Or, if abi->FunctionCallsChangeCFA(), check that each CFA is less than or greater than the previous frame's CFA, depending on which way the stack moves.
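
Sketched the same way for a downward-growing stack, where each older frame's CFA should be strictly greater than the newer one's (the comparison flips on upward-growing stacks):

    # Walk from the newest frame to the oldest and require the CFA to keep
    # moving up the stack; a repeat or reversal suggests a broken unwind.
    def cfas_monotonic(thread):
        prev_cfa = None
        for frame in thread:
            cfa = frame.GetCFA()
            if prev_cfa is not None and cfa <= prev_cfa:
                return False             # CFA failed to increase
            prev_cfa = cfa
        return True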

But I think we could just remove that two-cycle check altogether and be fine.

And just to add one caveat about the less-than/greater-than sanity check I mention here -- there's always one joker who implements their own threading model in userland code or whatever and has discontiguous stacks. Any less-than/greater-than sanity checks will fail when run on these types of programs.