Is anyone using the gtest Xcode project?

I mean that currently there is a gtest.xcodeproj in lldb/gtest whereas most stuff goes in lldb.xcodeproj.

Anyway, that part isn't that important. The real question is: is it important to you that it not break even for a few days, or is it OK if running the unit tests works only through the CMake build for a while? I.e., do I need to synchronize the changes for CMake and Xcode?

Thanks for the reply!

Yeah, the lldb tests are sort of expect-like, except that since they mostly use the Python APIs to drive lldb, they are really more little mini-programs that drive the debugger to certain points, and then run some checks, then go a little further and run some more checks, etc. The checks usually call some Python API and test that the result is what you intended (e.g. get the name of frame 0, get the value of Foo, call an expression & check the return type...)

Having the test harness framework in Python is a great fit for this, because each time you stop & want to do a check, you are calling the test harness "check that this test passed and fail if not" functionality. I'm not entirely sure how you'd do that if you didn't have access to the test harness functionality inside the test files. And in general, since the API that we are testing is all available to Python and we make extensive use of that in writing the validation checks, using a Python test harness just seems to me the obviously correct design. I'm not convinced we would get any benefit from trying to Frankenstein this into some other test harness.

Jim
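To make that concrete, here is a minimal sketch (not an actual lldb test) of the shape Jim describes: drive the debugger with the public SB API, and lean on the Python test harness (plain unittest here) for each check along the way. The binary "a.out", the function "main", and the variable "foo" are placeholder names.

import os
import unittest

import lldb  # assumes the lldb Python bindings are on sys.path


class StopAndCheckTest(unittest.TestCase):
    """Drive the debugger to a point, check, go a little further, check again."""

    def test_stop_at_main_and_inspect(self):
        debugger = lldb.SBDebugger.Create()
        debugger.SetAsync(False)

        target = debugger.CreateTarget("a.out")  # placeholder binary
        self.assertTrue(target.IsValid(), "couldn't create target")

        bp = target.BreakpointCreateByName("main")
        self.assertGreater(bp.GetNumLocations(), 0, "breakpoint didn't resolve")

        process = target.LaunchSimple(None, None, os.getcwd())
        self.assertEqual(process.GetState(), lldb.eStateStopped)

        # First checkpoint: get the name of frame 0 and check it.
        frame = process.GetSelectedThread().GetFrameAtIndex(0)
        self.assertEqual(frame.GetFunctionName(), "main")

        # Check a value, then go a little further and run some more checks.
        foo = frame.FindVariable("foo")  # placeholder variable
        self.assertTrue(foo.IsValid())
        process.Continue()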

I got confirmation from Vince offline that we don't need gtest in the Xcode workspace in its current form (i.e. the scheme that runs do-gtest.py). So I'm going to check in my changes, which add gtest to the CMake build and delete that xcodeproj from the repo. This may result in errors in the Xcode workspace the next time you load it up; that should be as easy to fix as removing the reference to gtest.xcodeproj.

I will try to figure out how to do that later today as well if nobody beats me to it, but I have a few things I need to get to first.

Also, after the dust settles I will go and add a gtest scheme back to the Xcode workspace, as we discussed previously (again, unless someone beats me to it).

+ddunbar

Thanks for the reply!

Yeah, the lldb tests are sort of expect-like, except that since they mostly use the Python APIs to drive lldb, they are really more little mini-programs that drive the debugger to certain points, and then run some checks, then go a little further and run some more checks, etc. The checks usually call some Python API and test that the result is what you intended (e.g. get the name of frame 0, get the value of Foo, call an expression & check the return type...)

Having the test harness framework in Python is a great fit for this, because each time you stop & want to do a check, you are calling the test harness "check that this test passed and fail if not" functionality. I'm not entirely sure how you'd do that if you didn't have access to the test harness functionality inside the test files. And in general, since the API that we are testing is all available to Python and we make extensive use of that in writing the validation checks, using a Python test harness just seems to me the obviously correct design. I'm not convinced we would get any benefit from trying to Frankenstein this into some other test harness.

The great thing is that LIT is a Python library. The "Frankensteining" it would remove is any lldb-specific infrastructure you've got in place for deciding things like which tests to run, whether to run them in parallel, collecting & marking XFAILs, filtering out tests that aren't supported on the particular platform, etc.

I don't think it helps much in terms of actually performing these "expect"-like tests, but it doesn't get in the way there either. Sounds like LLDB already has infrastructure for doing that, and LIT would not eliminate that part. From what I've heard about lldb, LIT still sounds like a good fit here.

Jim
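To make "which tests to run, parallelism, XFAILs, and platform filtering" concrete, here is a rough sketch of a lit config (hypothetical names, not an actual lldb config). Individual test files would then opt in or out with REQUIRES:/UNSUPPORTED:/XFAIL: lines keyed off the features declared here.

# lit.cfg -- sketch only; lit injects `config` when it execs this file.
import os
import sys

import lit.formats

config.name = "lldb-api-sketch"  # hypothetical suite name
config.suffixes = [".py"]
config.test_source_root = os.path.dirname(__file__)
config.test_format = lit.formats.ShTest(execute_external=True)

# Features that tests key off with lines like "# REQUIRES: darwin"
# or "# XFAIL: freebsd".
if sys.platform.startswith("darwin"):
    config.available_features.add("darwin")
elif sys.platform.startswith("linux"):
    config.available_features.add("linux")
elif sys.platform.startswith("freebsd"):
    config.available_features.add("freebsd")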

jingham@apple.com writes:

Wasn't really trying to get into an extended discussion about this,
but FWIW I definitely realize that lldb's tests are more complicated
than what lit currently supports. But that's why I said "even if it
meant extending lit". It was mostly just a general comment about
how it's nice if everyone is focused on making one thing better
instead of everyone having different things.

Depending on how different the different things are. Compiler tests
tend to have input, output and some machine that converts the input to
the output. That is one very particular model of testing. Debugger
tests need to do: get to stage 1, if that succeeded, get to stage 2,
if that succeeded, etc. Plus there's generally substantial setup code
to get somewhere interesting, so while you are there you generally try
to test a bunch of similar things. Plus, the tests often have points
where there are several success cases, but each one requires a
different "next action", stepping being the prime example of this.
These are very different models and I don't see that trying to smush
the two together would be a fruitful exercise.

I think LIT does make the assumption that one "test file" has one "test result". But this is a place where we could extend LIT a bit. I don't think it would be very painful.

For me, this would be very useful for a few of the big libc++abi tests, like the demangler one, since currently I have to #ifdef out a couple of the cases that can't possibly work on my platform. It would be much nicer if that particular test file emitted multiple test results, of which I could XFAIL the ones I know won't ever work. (For anyone who is curious, the one that comes to mind needs the C99 %a printf format, which my libc doesn't have. It's a baremetal target, and binary size is really important.)

How much actual benefit is there in having lots of results per test case, rather than having them all &&'d together to one result?

Out of curiosity, does lldb's existing testsuite allow you to run individual test results in test cases where there are more than one test result?

lit's pretty flexible. It's certainly well suited to the "input file,
shell script, output expectations" model, but I've seen it used in a
number of other contexts as well. jroelofs described another use in his
reply, and I've also personally used lit to run arbitrary python
programs and FileCheck their output. That said, I don't know a ton about
lldb's test infrastructure or needs.

I'd expect lldb tests that can't be focused into a unit test to be
somewhat "expect"-like (pardon the pun), where there's something to be
debugged and a kind of conversation of input and expected output. Is
this a reasonable assumption?

I don't know of any current lit tests that work like this, but it would

Yeah, I don't know of any either. Maybe @ddunbar does?

be pretty simple for lit to act as a driver for a program that tested
things in an interactive way. I'd imagine the interactivity itself being
a separate utility, much like we delegate looking at output to FileCheck
or clang -verify in the llvm and clang test suites.

Anyways, I don't necessarily know if it's a good fit, but if we can make
lit suit your needs I think it'd be a nice gain in terms of bots being
easy to set up to show errors consistently and for random folks who are
familiar with LLVM but not necessarily LLDB to be able to quickly figure
out and fix issues they might cause.

+1

Cheers,

Jon

I think I’m not following this line of discussion. So it’s possible you and Jim are talking about different things here.

If I understand correctly (and maybe I don’t), what Jim is saying is that a debugger test might need to do something like:

  1. Set 5 breakpoints
  2. Continue
  3. Depending on which breakpoint gets hit, take one of 5 possible “next” actions.

But I’m having trouble coming up with an example of why this might be useful. Jim, can you make this a little more concrete with a specific example of a test that does this, how the test works, and what the different success / failure cases are so we can be sure everyone is on the same page?

In the case of the libc++abi tests, I'm not sure what is meant by "multiple results per test case". Do you mean (for example) you'd like to be able to XFAIL individual run lines based on some condition? If so, LLDB definitely needs that. One example which LLDB uses almost everywhere is that of running the same test with dSYM or DWARF debug info. On Apple platforms, tests generally need to run with both dSYM and DWARF debug info (literally just repeating the same test twice), and on non-Apple platforms, only the DWARF tests ever need to be run. So there would need to be a way to express this.

There are plenty of other one-off examples. Debuggers have a lot of platform specific code, and the different platforms support different amounts of functionality (especially for things like Android / Windows that are works in progress). So we frequently have the need to have a single test file which has, say 10 tests in it. And specific tests can be XFAILed or even disabled individually based on conditions (usually which platform is running the test suite, but not always).

Upon further thought, the dSYM vs. DWARF thing could probably be sunk into the TestFormat logic (I'm just guessing, I don't really know how to write a TestFormat or how much they're capable of, but earlier you said they're pretty flexible and responsible for determining what to do and what tests to run).

There are still plenty of other examples, though. Grep the *.py files under tools/lldb/test for expectedFailureLinux and you'll find quite a few.

    +ddunbar

     >>> Depending on how different the different things are. Compiler tests
     >>> tend to have input, output and some machine that converts the input to
     >>> the output. That is one very particular model of testing. Debugger
     >>> tests need to do: get to stage 1, if that succeeded, get to stage 2,
     >>> if that succeeded, etc. Plus there's generally substantial setup code
     >>> to get somewhere interesting, so while you are there you generally try
     >>> to test a bunch of similar things. Plus, the tests often have points
     >>> where there are several success cases, but each one requires a
     >>> different "next action", stepping being the prime example of this.
     >>> These are very different models and I don't see that trying to smush
     >>> the two together would be a fruitful exercise.

    I think LIT does make the assumption that one "test file" has one "test
    result". But this is a place where we could extend LIT a bit. I don't
    think it would be very painful.

    For me, this would be very useful for a few of the big libc++abi tests,
    like the demangler one, as currently I have to #ifdef out a couple of
    the cases that can't possibly work on my platform. It would be much
    nicer if that particular test file outputted multiple test results of
    which I could XFAIL the ones I know won't ever work. (For anyone who is
    curious, the one that comes to mind needs the c99 %a printf format,
    which my libc doesn't have. It's a baremetal target, and binary size is
    really important).

    How much actual benefit is there in having lots of results per test
    case, rather than having them all &&'d together to one result?

    Out of curiosity, does lldb's existing testsuite allow you to run
    individual test results in test cases where there are more than one test
    result?

  I think I'm not following this line of discussion. So it's possible
you and Jim are talking about different things here.

I think that's the case... I was imagining the "logic of the test" to be something like this:

   1) Set 5 breakpoints
   2) Continue
   3) Assert that the debugger stopped at the first breakpoint
   4) Continue
   5) Assert that the debugger stopped at the second breakpoint
   6) etc.

Reading Jim's description again, with the help of your speculative example, it sounds like the test logic itself isn't straight-line code... that's okay too. What I was speaking to is a perceived difference in what the "results" of running such a test are.

In llvm, the assertions are CHECK lines. In libc++, the assertions are calls to `assert` from assert.h, as well as `static_assert`s. In both cases, failing any one of those checks in a test makes the whole test fail. For some reason I had the impression that in lldb there wasn't a single test result per *.py test. Perhaps that's not the case? Either way, what I want to emphasize is that LIT doesn't care about the "logic of the test", as long as there is one test result per test (and even that condition could be amended, if it would be useful for lldb).
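As a concrete illustration of the "one file, one result" model, here is a contrived lit-style test written as a Python file (it assumes a suite whose config defines a %python substitution and uses the ShTest format): lit runs the RUN line, FileCheck scans the output, and the whole file passes or fails as a single unit no matter how many CHECK lines there are.

# RUN: %python %s | FileCheck %s
# CHECK: frame #0: main
# CHECK: foo = 42

# Any CHECK above that fails to match fails this file as a whole; lit
# records exactly one result for it.
print("frame #0: main")
print("foo = 42")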

If I understand correctly (and maybe I don't), what Jim is saying is
that a debugger test might need to do something like:

1) Set 5 breakpoints
2) Continue
3) Depending on which breakpoint gets hit, take one of 5 possible "next"
actions.

But I'm having trouble coming up with an example of why this might be
useful. Jim, can you make this a little more concrete with a specific
example of a test that does this, how the test works, and what the
different success / failure cases are so we can be sure everyone is on
the same page?

In the case of the libc++ abi tests, I'm not sure what is meant by
"multiple results per test case". Do you mean (for example) you'd like
to be able to XFAIL individual run lines based on some condition? If

I think this means I should make the libc++abi example even more concrete.... In libc++/libc++abi tests, the "RUN" line is implicit (well, aside from the few ShTest tests ericwf has added recently). Every *.pass.cpp test is a file that the test harness knows it has to compile, run, and check its exit status. That being said, libcxxabi/test/test_demangle.pass.cpp has a huge array like this:

       20 const char* cases[][2] =
       21 {
       22 {"_Z1A", "A"},
       23 {"_Z1Av", "A()"},
       24 {"_Z1A1B1C", "A(B, C)"},
       25 {"_Z4testI1A1BE1Cv", "C test<A, B>()"},

    snip

    29594 {"_Zli2_xy", "operator\"\" _x(unsigned long long)"},
    29595 {"_Z1fIiEDcT_", "decltype(auto) f<int>(int)"},
    29596 };

Then there's some logic in `main()` that runs `__cxa_demangle` on `cases[i][0]` and asserts that it's the same as `cases[i][1]`. If any of those assertions fail, the entire test is marked as failing, and no further lines in that array are verified. For the sake of discussion, let's call each of the entries in `cases` a "subtest", and the entirety of test_demangle.pass.cpp a test.

The sticky issue is that there are a few subtests in this test that don't make sense on various platforms, so currently they are #ifdef'd out. If the LIT TestFormat and the tests themselves had a way to report that a subtest failed and then keep running the remaining subtests, then we could XFAIL these weird subtests individually.

Keep in mind though that I'm not really advocating we go and change test_demangle.pass.cpp to suit that model, because #ifdef's work reasonably well there, and there are relatively few subtests that have these platform differences... That's just the first example of the test/subtest relationship that I could think of.
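For comparison only (not a lit proposal), Python's unittest already has this "report a subtest failure and keep going" shape via subTest(). A self-contained sketch, with a stub lookup standing in for a real __cxa_demangle call:

import unittest

CASES = [
    ("_Z1Av", "A()"),
    ("_Z1A1B1C", "A(B, C)"),
]


def demangle(mangled):
    # Stand-in for a real __cxa_demangle call, just to keep the sketch
    # self-contained and runnable.
    return {"_Z1Av": "A()", "_Z1A1B1C": "A(B, C)"}.get(mangled)


class DemangleTest(unittest.TestCase):
    def test_cases(self):
        for mangled, expected in CASES:
            # Each pair reports its own result; a failure here does not
            # stop the remaining pairs from running.
            with self.subTest(mangled=mangled):
                self.assertEqual(demangle(mangled), expected)


if __name__ == "__main__":
    unittest.main()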

so, LLDB definitely needs that. One example which LLDB uses almost
everywhere is that of running the same test with dSYM or DWARF debug
info. On Apple platforms, tests generally need to run with both dSYM
and DWARF debug info (literally just repeat the same test twice), and on
non Apple platforms, only DWARF tests ever need to be run. So there
would need to be a way to express this.

Can you point me to an example of this?

Grep the *.py files for test_with_dsym. A random example I'll pull from the search results is lldb\test\expression_command\call-function\TestCallStdStringFunction.py.

In it you’ll see this:

@unittest2.skipUnless(sys.platform.startswith("darwin"), "requires Darwin")
@dsym_test
@expectedFailureDarwin(16361880) # rdar://problem/16361880, we get the result correctly, but fail to invoke the Summary formatter.
def test_with_dsym(self):
    """Test calling std::String member function."""
    self.buildDsym()
    self.call_function()

@dwarf_test
@expectedFailureFreeBSD('llvm.org/pr17807') # Fails on FreeBSD buildbot
@expectedFailureGcc # llvm.org/pr14437, fails with GCC 4.6.3 and 4.7.2
@expectedFailureIcc # llvm.org/pr14437, fails with ICC 13.1
@expectedFailureDarwin(16361880) # rdar://problem/16361880, we get the result correctly, but fail to invoke the Summary formatter.
def test_with_dwarf(self):
    """Test calling std::String member function."""
    self.buildDwarf()
    self.call_function()

The LLDB test runner considers any class which derives from TestBase to be a "test case" (so ExprCommandCallFunctionTestCase from this file is a test case), and, for each test case, any member function whose name starts with "test" to be a single test. So in this case we've got ExprCommandCallFunctionTestCase.test_with_dsym and ExprCommandCallFunctionTestCase.test_with_dwarf. The first only runs on Darwin; the second runs on all platforms but is xfail'ed on FreeBSD, GCC, ICC, and Darwin.

(I’m not sure what the @dsym_test and @dwarf_test annotations are for)
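For anyone unfamiliar with that runner, the discovery rules described above are essentially unittest's. A rough sketch of loading and running just those two tests by hand (lldb's real driver, dotest.py, layers its own logic on top of unittest2, so this is only an approximation, and it assumes the test file's directory and lldb's test infrastructure are importable):

import unittest

import TestCallStdStringFunction as mod

loader = unittest.TestLoader()
# Picks up every method named test* on every TestCase-derived class in
# the module -- here, test_with_dsym and test_with_dwarf.
suite = loader.loadTestsFromModule(mod)
unittest.TextTestRunner(verbosity=2).run(suite)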