Using FileCheck in lldb inline tests

Hello,

I’d like to make FileCheck available within lldb inline tests, in addition to existing helpers like ‘runCmd’ and ‘expect’.

My motivation is that several tests I’m working on can’t be made as rigorous as they need to be without FileCheck-style checks. In particular, the ‘matching’, ‘substrs’, and ‘patterns’ arguments to runCmd/expect don’t allow me to verify the ordering of checked input, to be stringent about line numbers, or to capture & reuse snippets of text from the input stream.
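
To illustrate the ordering problem: a check like the one below (it appears in the test further down) passes no matter what order those frames come back in, because each substring is matched independently against the whole output:

  self.expect("bt", substrs = ['main', 'func1', 'func2', 'func3', 'sink'])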

I’d be curious to know if anyone else is interested or would be willing to review this (https://reviews.llvm.org/D50751).

Here’s an example of an inline test which benefits from FileCheck-style checking. This test is trying to check that certain frames appear in a backtrace when stopped inside of the “sink” function. Notice that without FileCheck, it’s not possible to verify the order in which frames are printed, and that dealing with line numbers would be cumbersome.

--- a/lldb/packages/Python/lldbsuite/test/functionalities/tail_call_frames/unambiguous_sequence/main.cpp
+++ b/lldb/packages/Python/lldbsuite/test/functionalities/tail_call_frames/unambiguous_sequence/main.cpp
@@ -9,16 +9,21 @@

volatile int x;

+// CHECK: frame #0: {{.*}}sink() at main.cpp:[[@LINE+2]] [opt]
void __attribute__((noinline)) sink() {
- x++; //% self.expect("bt", substrs = ['main', 'func1', 'func2', 'func3', 'sink'])
+ x++; //% self.filecheck("bt", "main.cpp")
}

+// CHECK-NEXT: frame #1: {{.*}}func3() {{.*}}[opt] [artificial]
void __attribute__((noinline)) func3() { sink(); /* tail */ }

+// CHECK-NEXT: frame #2: {{.*}}func2() at main.cpp:[[@LINE+1]] [opt]
void __attribute__((disable_tail_calls, noinline)) func2() { func3(); /* regular */ }

+// CHECK-NEXT: frame #3: {{.*}}func1() {{.*}}[opt] [artificial]
void __attribute__((noinline)) func1() { func2(); /* tail */ }

+// CHECK-NEXT: frame #4: {{.*}}main at main.cpp:[[@LINE+2]] [opt]
int __attribute__((disable_tail_calls)) main() {
func1(); /* regular */
return 0;

For reference, here’s the output of the “bt” command:

runCmd: bt
output: * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
* frame #0: 0x000000010c6a6f64 a.out`sink() at main.cpp:14 [opt]
frame #1: 0x000000010c6a6f70 a.out`func3() at main.cpp:15 [opt] [artificial]
frame #2: 0x000000010c6a6f89 a.out`func2() at main.cpp:21 [opt]
frame #3: 0x000000010c6a6f90 a.out`func1() at main.cpp:21 [opt] [artificial]
frame #4: 0x000000010c6a6fa9 a.out`main at main.cpp:28 [opt]
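
For the curious, the proposed ‘filecheck’ helper is small. Roughly, it amounts to piping a command’s output through FileCheck with the current source file as the check file. A minimal sketch (not the exact code in D50751; in particular, how the FileCheck binary is located is glossed over here):

  import subprocess

  def filecheck(self, command, check_file):
      # Run the lldb command; runCmd stores the result in self.res
      # (an SBCommandReturnObject).
      self.runCmd(command)
      output = self.res.GetOutput()
      # Pipe the output through FileCheck, using the test source as the
      # check file. 'filecheck_binary' is a placeholder for however the
      # suite ends up locating the FileCheck executable.
      proc = subprocess.Popen([filecheck_binary, check_file],
                              stdin=subprocess.PIPE,
                              universal_newlines=True)
      proc.communicate(input=output)
      self.assertEqual(proc.returncode, 0,
                       "FileCheck failed on the output of '%s'" % command)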

thanks,
vedant

I’ve thought about this in the past but the conclusion I came to is that lldbinline tests are actually just filecheck tests in disguise. Why do we need both? I’d rather delete the lldbinline infrastructure entirely and make a new lit TestFormat that basically does what lldbinline already does

I'd argue against this approach, because this is exactly why the lit tests don't run against the lldb driver -- they hardcode the output of lldb driver commands into the testsuite, and once we've accumulated enough of this style of test, it will become much more difficult to change and improve the driver.

This is a perfect test for a normal SB API. Run to your breakpoints and check the stack frames.

  f0 = thread.GetFrameAtIndex(0)
  self.assertEqual(f0.GetFunctionName(), "sink")
  self.assertTrue(f0.IsArtificial())
  self.assertEqual(f0.GetLineEntry().GetLine(), expected_line_number)

It's more verbose, but it's also much more explicit about what it's checking, and easy to see what has changed if there is a failure.

J

I’ve thought about this in the past but the conclusion I came to is that lldbinline tests are actually just filecheck tests in disguise. Why do we need both? I’d rather delete the lldbinline infrastructure entirely and make a new lit TestFormat that basically does what lldbinline already does

An inline test does more than simply pattern-matching input. It builds a program, sets breakpoints, etc. I’d rather make this existing infrastructure easier to use than come up with something new.

vedant

It'd be easy to update FileCheck tests when changing the debugger (this happens all the time in clang/swift). OTOH, the verbosity of the python API means that fewer tests get written. I see a real need to make expressive tests easier to write.

vedant

Right, but only one specific type of lit test depends on pattern matching, and those are the SHTest format tests. You can make an arbitrary test format, including one that builds programs, sets breakpoints, etc. The format and structure of an lldbinline test need not even change at all (except that I think we could eliminate the .py file). The lldbinline tests are about as close to a drop-in fit for lit as we can get; the only thing that needs to happen is that the code in lldbinline.py moves to something called InlineTestFormat.py and then gets hooked into lit

It's more verbose, and it does mean test writers need to learn the public API, but it's also much more stable and debuggable in the future. It's a higher up-front cost, but we're paid back by being able to develop lldb more quickly in the future, with our published API behaviors (the things that must not be broken) being tested directly. The lldb driver's output isn't a contract, and treating it like one makes the debugger harder to innovate in the future.

It's also helpful when adding new features to ensure you've exposed the feature through the API sufficiently. The first thing I thought to try when writing the example below was SBFrame::IsArtificial() (see SBFrame::IsInlined()) which doesn't exist. If a driver / IDE is going to visually indicate artificial frames, they'll need that.

J

Having bugs also makes the debugger harder to innovate in the future, because not having tests leads to having bugs, and sb api tests lead to not having tests. At the end of the day, it doesn’t matter how stable the tests are if there aren’t enough of them. There should be about 10x-20x as many tests as there are currently, and that will simply never happen under the current approach. If it means we need to have multiple different styles of test, so be it. The problem we face right now has nothing to do with command output changing, and in fact I don’t think that we’ve ever had this problem. So we should be focusing on problems we have, not problems we don’t have.

Note that it is not strictly necessary for a test to check the debugger’s command output. There could be a different set of commands whose only purpose is to print information for the purposes of debugging. One idea would be to introduce the notion of a debugger object model, where you print various aspects of the debugger’s state with an object-like syntax. For example,

(lldb) p debugger.targets
~/foo (running, pid: 123)

(lldb) p debugger.targets[0].threads[0].frames[1]
int main(int argc=3, char **argv=0x12345678) + 0x72

(lldb) p debugger.targets[0].threads[0].frames[1].params[0]
int argc=3

(lldb) p debugger.targets[0].breakpoints
[1] main.cpp:72

Etc. You can get arbitrarily granular and expose every detail of the debugger’s internal state this way, and the output is so simple that you never have to worry about it changing.
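
Just to make this concrete, here’s a very rough, purely hypothetical sketch of how such a command could be prototyped as a Python command on top of the SB API (all of the names and the exact output format are made up):

  import lldb

  def print_model(debugger, command, result, internal_dict):
      # Only handles the equivalent of 'debugger.targets' in this sketch; a
      # real implementation would parse the full object path.
      for i in range(debugger.GetNumTargets()):
          target = debugger.GetTargetAtIndex(i)
          process = target.GetProcess()
          result.PutCString("%s (%s, pid: %d)" % (
              target.GetExecutable().GetFilename(),
              lldb.SBDebugger.StateAsCString(process.GetState()),
              process.GetProcessID()))

  def __lldb_init_module(debugger, internal_dict):
      # Assumes this file is loaded with 'command script import objmodel.py'.
      debugger.HandleCommand("command script add -f objmodel.print_model pmodel")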

That said, I think history has shown that limiting ourselves to sb api tests, despite all the theoretical benefits, leads to insufficient test coverage. So while it has benefits, it also has problems for which we need a better solution

Having bugs also makes the debugger harder to innovate in the future, because not having tests leads to having bugs, and sb api tests lead to not having tests. At the end of the day, it doesn’t matter how stable the tests are if there aren’t enough of them. There should be about 10x-20x as many tests as there are currently, and that will simply never happen under the current approach. If it means we need to have multiple different styles of test, so be it. The problem we face right now has nothing to do with command output changing, and in fact I don’t think that we’ve *ever* had this problem. So we should be focusing on problems we have, not problems we don’t have.

I think we've had this discussion many times over the years, so I apologize for not reiterating what I've said in the past. I worked on gdb for a decade before this, where the entire testsuite was filecheck-style tests based on gdb's output. That made tests easy to write, but after a couple decades of tests had accumulated, it became a nightmare to change or improve any of gdb's commands; we all avoided it like the plague because it was so onerous. The tests themselves would accumulate more and more complicated regular expressions to handle different output that happened to be seen, so debugging WHY a given test was failing required an amazing amount of work.

Yes, lldb does not have these problems -- because we learned from our decades working on gdb, and did not repeat that mistake. To be honest, lldb is such a young debugger - barely a decade old, depending on how you count it, that ANY testsuite approach would be fine at this point. Add a couple more decades and we'd be back into the hole that gdb was in. {I have not worked on gdb in over a decade, so I don't know how their testing methodology may be today}

It's always important to remember that lldb is first and foremost a debugger library. It also includes a driver program, lldb, but it is designed as a library and it should be tested as a library.

Note that it is not strictly necessary for a test to check the debugger’s command output. There could be a different set of commands whose only purpose is to print information for the purposes of debugging. One idea would be to introduce the notion of a debugger object model, where you print various aspects of the debugger’s state with an object-like syntax. For example,

This was the whole point of the lit tests, wasn't it? To have a driver program, or driver programs, designed explicitly for filecheck, where the output IS API and can be relied on. There hasn't been disagreement about this.

That doesn’t mean that the current approach is the final word. As new people come onto the project, new ideas come forth and we should entertain them rather than deciding that all decisions are set in stone forever.

For example, the object model based approach I mentioned earlier would not have any of the problems that you’ve described from gdb. Just because one set of problems has been solved doesn’t mean we should declare victory and say there’s no point in trying to solve the remaining problems too. And right now, the problem is that we need to be coming up with a way to make tests easier to write so that people will actually write them

Having bugs also makes the debugger harder to innovate in the future, because not having tests leads to having bugs, and sb api tests lead to not having tests.

Yes, lldb does not have these problems -- because we learned from our decades working on gdb, and did not repeat that mistake. To be honest, lldb is such a young debugger - barely a decade old, depending on how you count it, that ANY testsuite approach would be fine at this point. Add a couple more decades and we'd be back into the hole that gdb was in. {I have not worked on gdb in over a decade, so I don't know how their testing methodology may be today}
That doesn’t mean that the current approach is the final word. As new people come onto the project, new ideas come forth and we should entertain them rather than deciding that all decisions are set in stone forever.

For example, the object model based approach I mentioned earlier would not have any of the problems that you’ve described from gdb. Just because one set of problems has been solved doesn’t mean we should declare victory and say there’s no point in trying to solve the remaining problems too. And right now, the problem is that we need to be coming up with a way to make tests easier to write so that people will actually write them

Just historically, it is not true that we didn't have problems with command output scraping vs. the testsuite. The major offender in this regard was breakpoint setting/checking. But the first time I had to go adjust a whole bunch of tests when I wanted to change the breakpoint output, I added test routines to do all this business, which mitigated the need for these changes. It is true I haven't had to fiddle with this much since then - though that is mostly because I was careful in how I checked output, and didn't check inessential aspects. That's easy to do when you sit down and do it once, but if everybody just wings some regex every time they do a check, you will end up with lots of over-eager checks that cause problems down the road.

I also want to reiterate another point Jason made, which is that LLDB not only provides a command line interface, but it also provides an API. Again, we've gone over this before but I'm pretty sure that at this point it's still true that in terms of number of users, far more people interact with lldb through the API (Xcode, VSCode through the gdb-mi or through Greg's new adaptor when that's done...) than interact with it through the command line. And writing SB API tests when you add a new feature is a great way to ensure that you have thought about exposing your new feature to the SB API's correctly. In that example, a GUI needs to have a way to indicate that these "tail call" frames are special and you shouldn't be worried that they have no local variables, even though we show you source for them that obviously has variables...

So we need ways to encourage lldb developers to be aware that the SB API's need to have (tested) access to new features in the debugger and not just the command line. SB API tests are a good way to encourage this. We can also do it as part of reviews and the like, but I think it's better done as part of your development than after the fact in reviews... Again, that doesn't argue for their being the ONLY way to test things in lldb, but it does need to be part of an lldb developer's practice.

I like your idea of some kind of object model based output for test checking. I somehow had the impression that that was your intent for lldb-test once it was a real boy, but maybe I was just projecting... Relying on ad hoc dumping routines which are mostly for debugging purposes seemed okay as a boot-strap method, but not a great design long-term. Teaching either the SB or the lldb_private objects to describe themselves in a structured way and using that to do testing seems great.

Jim

Creating a new lit test format compatible with inline tests sounds like a nice follow-up step. The immediate problem I’m trying to address is that it’s too cumbersome to write the tests I need without FileCheck-like checks in inline tests. I think introducing this functionality is a necessary prerequisite.

vedant

It's more verbose, and it does mean test writers need to learn the public API, but it's also much more stable and debuggable in the future.

I'm not sure about this. Having looked at failing sb api tests for a while now, I find them about as easy to navigate and fix as FileCheck tests in llvm.

It's a higher up-front cost, but we're paid back by being able to develop lldb more quickly in the future, with our published API behaviors (the things that must not be broken) being tested directly.

I think the right solution here is to require API tests when new functionality is introduced. We can enforce this during code review. Making it impossible to write tests against the driver's output doesn't seem like the best solution. It means that far fewer tests will be written (note that a test suite run of lldb gives less than 60% code coverage). It also means that the driver's output isn't tested as much as it should be.

The lldb driver's output isn't a contract, and treating it like one makes the debugger harder to innovate in the future.

I appreciate your experience with this (pattern matching on driver input) in gdb. That said, I think there are reliable/maintainable ways to do this, and proven examples we can learn from in llvm/clang/etc.

It's also helpful when adding new features to ensure you've exposed the feature through the API sufficiently. The first thing I thought to try when writing the example below was SBFrame::IsArtificial() (see SBFrame::IsInlined()) which doesn't exist. If a driver / IDE is going to visually indicate artificial frames, they'll need that.

Sure. That's true, we do need API exposure for new features, and again we can enforce that during code review. The reason you didn't find IsArtificial() is because it's sitting on my disk :). Haven't shared the patch yet.

vedant

Having bugs also makes the debugger harder to innovate in the future, because not having tests leads to having bugs, and sb api tests lead to not having tests.

Yeah, this might be a bit of an under-appreciated point.

If there’s a high up front cost to writing tests, not as many of them will be written.

I’m having a hard time imagining writing sb api tests which can exercise every tricky path through a new patchset I’m working on. Without being able to write tests against more compact input (a la FileCheck), I really don’t see a way forward.

At the end of the day, it doesn’t matter how stable the tests are if there aren’t enough of them. There should be about 10x-20x as many tests as there are currently, and that will simply never happen under the current approach. If it means we need to have multiple different styles of test, so be it. The problem we face right now has nothing to do with command output changing, and in fact I don’t think that we’ve ever had this problem. So we should be focusing on problems we have, not problems we don’t have.

Note that it is not strictly necessary for a test to check the debugger’s command output. There could be a different set of commands whose only purpose is to print information for the purposes of debugging. One idea would be to introduce the notion of a debugger object model, where you print various aspects of the debugger’s state with an object-like syntax. For example,

(lldb) p debugger.targets
~/foo (running, pid: 123)

(lldb) p debugger.targets[0].threads[0].frames[1]
int main(int argc=3, char **argv=0x12345678) + 0x72

(lldb) p debugger.targets[0].threads[0].frames[1].params[0]
int argc=3

(lldb) p debugger.targets[0].breakpoints
[1] main.cpp:72

Etc. You can get arbitrarily granular and expose every detail of the debugger’s internal state this way, and the output is so simple that you never have to worry about it changing.

That said, I think history has shown that limiting ourselves to sb api tests, despite all the theoretical benefits, leads to insufficient test coverage. So while it has benefits, it also has problems for which we need a better solution

Yeah.

vedant

Having bugs also makes the debugger harder to innovate in the future, because not having tests leads to having bugs, and sb api tests lead to not having tests. At the end of the day, it doesn’t matter how stable the tests are if there aren’t enough of them. There should be about 10x-20x as many tests as there are currently, and that will simply never happen under the current approach. If it means we need to have multiple different styles of test, so be it. The problem we face right now has nothing to do with command output changing, and in fact I don’t think that we’ve *ever* had this problem. So we should be focusing on problems we have, not problems we don’t have.

I think we've had this discussion many times over the years, so I apologize for not reiterating what I've said in the past. I worked on gdb for a decade before this, where the entire testsuite was filecheck-style tests based on gdb's output. That made tests easy to write, but after a couple decades of tests had accumulated, it became a nightmare to change or improve any of gdb's commands; we all avoided it like the plague because it was so onerous. The tests themselves would accumulate more and more complicated regular expressions to handle different output that happened to be seen, so debugging WHY a given test was failing required an amazing amount of work.

Yep, I definitely sympathize with this issue. We have this problem of having large, hard-to-parse pattern-matching tests in the frontends (some specific IRGen tests come to mind...). Again, I think the solution here is more active code review and using alternative representations of the input of interest. It's usually possible to simplify these tests substantially by having an even more compact, simplified dump of the internal representation to check against. This is what we did for clang/swift's code coverage IRGen tests and it's worked really well.

Yes, lldb does not have these problems -- because we learned from our decades working on gdb, and did not repeat that mistake. To be honest, lldb is such a young debugger - barely a decade old, depending on how you count it, that ANY testsuite approach would be fine at this point. Add a couple more decades and we'd be back into the hole that gdb was in. {I have not worked on gdb in over a decade, so I don't know how their testing methodology may be today}

It's always important to remember that lldb is first and foremost a debugger library. It also includes a driver program, lldb, but it is designed as a library and it should be tested as a library.

I broadly agree with your points here, but I'm not convinced that the only acceptable form of testing for lldb is based on sb api tests. I think the only way to an acceptable level of test coverage is to enable expressive testing on compact input. Moreover I don't think that's incompatible with having sufficient API test coverage.

vedant

Having bugs also makes the debugger harder to innovate in the future, because not having tests leads to having bugs, and sb api tests lead to not having tests.

Yes, lldb does not have these problems -- because we learned from our decades working on gdb, and did not repeat that mistake. To be honest, lldb is such a young debugger - barely a decade old, depending on how you count it, that ANY testsuite approach would be fine at this point. Add a couple more decades and we'd be back into the hole that gdb was in. {I have not worked on gdb in over a decade, so I don't know how their testing methodology may be today}
That doesn’t mean that the current approach is the final word. As new people come onto the project, new ideas come forth and we should entertain them rather than deciding that all decisions are set in stone forever.

For example, the object model based approach I mentioned earlier would not have any of the problems that you’ve described from gdb. Just because one set of problems has been solved doesn’t mean we should declare victory and say there’s no point in trying to solve the remaining problems too. And right now, the problem is that we need to be coming up with a way to make tests easier to write so that people will actually write them

Just historically, it is not true that we didn't have problems with command output scraping vs. the testsuite. The major offender in this regard was breakpoint setting/checking. But the first time I had to go adjust a whole bunch of tests when I wanted to change the breakpoint output, I added test routines to do all this business, which mitigated the need for these changes. It is true I haven't had to fiddle with this much since then - though that is mostly because I was careful in how I checked output, and didn't check inessential aspects. That's easy to do when you sit down and do it once, but if everybody just wings some regex every time they do a check, you will end up with lots of over-eager checks that cause problems down the road.

I definitely sympathize with this, but don't think the problems that arise in some tests which pattern-match against input are characteristic of all such tests. The lesson you're describing -- the need to be careful to only examine the interesting parts of the API/driver output -- applies generally.

I also want to reiterate another point Jason made, which is that LLDB not only provides a command line interface, but it also provides an API. Again, we've gone over this before but I'm pretty sure that at this point it's still true that in terms of number of users, far more people interact with lldb through the API (Xcode, VSCode through the gdb-mi or through Greg's new adaptor when that's done...) than interact with it through the command line. And writing SB API tests when you add a new feature is a great way to ensure that you have thought about exposing your new feature to the SB API's correctly. In that example, a GUI needs to have a way to indicate that these "tail call" frames are special and you shouldn't be worried that they have no local variables, even though we show you source for them that obviously has variables...

I agree with all of this. However, I don't take it to mean that there ought not be tests for the driver output.

So we need ways to encourage lldb developers to be aware that the SB API's need to have (tested) access to new features in the debugger and not just the command line. SB API tests are a good way to encourage this. We can also do it as part of reviews and the like, but I think it's better done as part of your development than after the fact in reviews... Again, that doesn't argue for their being the ONLY way to test things in lldb, but it does need to be part of an lldb developer's practice.

Totally agree.

vedant

It's more verbose, and it does mean test writers need to learn the public API, but it's also much more stable and debuggable in the future.

I'm not sure about this. Having looked at failing sb api tests for a while now, I find them about as easy to navigate and fix as FileCheck tests in llvm.

I don't find that to be true. I see a failing test on line 79 or whatever, and depending on what line 79 is doing, I'll throw in some self.runCmd("bt")'s or self.runCmd("fr v") to the test, re-run, and see what the relevant context is quickly. For most simple tests, I can usually spot the issue in under a minute. dotest.py likes to eat output when it's run in multiprocess mode these days, so I have to remember to add --no-multiprocess. If I'm adding something that I think is generally useful to debug the test case, I'll add a conditional block testing against self.TraceOn() and print things that may help people who are running dotest.py with -t trace mode enabled.

Sometimes there is a test written so it has a "verify this value" function that is run over a variety of different variables during the test timeframe, and debugging that can take a little more work to understand the context that is failing. But that kind of test would be harder (or at least much more redundant) to express in a FileCheck style system anyway, so I can't ding it.

As for the difficulty of writing SB API tests, you do need to know the general architecture of lldb (a target has a process, a process has threads, a thread has frames, a frame has variables), the public API which quickly becomes second nature because it is so regular, and then there's the testsuite specific setup and template code. But is that that intimidating to anyone familiar with lldb? packages/Python/lldbsuite/test/sample_test/TestSampleTest.py is 50 lines including comments; there's about ten lines of source related to initializing / setting up the testsuite, and then 6 lines is what's needed to run to a breakpoint, get a local variable, check the value.
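
For reference, the interesting part of such a test is roughly the following (a sketch along the lines of that sample test; the breakpoint marker, variable name, and expected value are placeholders):

  import lldb
  from lldbsuite.test.lldbtest import TestBase
  from lldbsuite.test import lldbutil

  class MySampleTestCase(TestBase):
      mydir = TestBase.compute_mydir(__file__)

      def test_sample(self):
          # Build the inferior, then run to a breakpoint set on a comment
          # marker in main.c.
          self.build()
          (target, process, thread, bkpt) = lldbutil.run_to_source_breakpoint(
              self, "// Set a breakpoint here", lldb.SBFileSpec("main.c"))
          # Grab frame 0, find a local variable, and check its value.
          frame = thread.GetFrameAtIndex(0)
          test_var = frame.FindVariable("test_var")
          self.assertTrue(test_var.IsValid(), "Failed to find 'test_var'")
          self.assertEqual(test_var.GetValueAsUnsigned(), 10)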

J

It's more verbose, and it does mean test writers need to learn the public API, but it's also much more stable and debuggable in the future.

I'm not sure about this. Having looked at failing sb api tests for a while now, I find them about as easy to navigate and fix as FileCheck tests in llvm.

I don't find that to be true. I see a failing test on line 79 or whatever, and depending on what line 79 is doing, I'll throw in some self.runCmd("bt")'s or self.runCmd("fr v") to the test, re-run, and see what the relevant context is quickly. For most simple tests, I can usually spot the issue in under a minute. dotest.py likes to eat output when it's run in multiprocess mode these days, so I have to remember to add --no-multiprocess. If I'm adding something that I think is generally useful to debug the test case, I'll add a conditional block testing against self.TraceOn() and print things that may help people who are running dotest.py with -t trace mode enabled.

I do agree that there are effective ways of debugging sb api tests. Having worked with plenty of filecheck-based tests in llvm/clang/swift, I find them to be as easy (or easier for me personally) to debug.

Sometimes there is a test written so it has a "verify this value" function that is run over a variety of different variables during the test timeframe, and debugging that can take a little more work to understand the context that is failing. But that kind of test would be harder (or at least much more redundant) to express in a FileCheck style system anyway, so I can't ding it.

Yep, sounds like a great candidate for a unit test or an SB API test.

As for the difficulty of writing SB API tests, you do need to know the general architecture of lldb (a target has a process, a process has threads, a thread has frames, a frame has variables), the public API which quickly becomes second nature because it is so regular, and then there's the testsuite specific setup and template code. But is that that intimidating to anyone familiar with lldb?

Not intimidating, no. Cumbersome and slow, absolutely. So much so that I don't see a way of adequately testing my patches this way. It would just take too much time.

vedant

Back to the original proposal, my biggest concern is that a single inline test could generate many FileCheck invocations. This could cause measurable performance impact on the test suite. Have you considered this?

Another possible solution is what I mentioned earlier, basically to expose a debugger object model. This would allow you to accomplish what you want without FileCheck, while simultaneously making many other types of tests easier to write. On the other hand, it’s a larger effort to create this system, but I think long term it would pay back enormously (it’s even useful as a general purpose debugger feature, not limited to testing)

Back to the original proposal, my biggest concern is that a single inline test could generate many FileCheck invocations. This could cause measurable performance impact on the test suite. Have you considered this?

That’s a good point. I hadn’t considered that. My thoughts on that are:

  • It’s relatively cheap to create a FileCheck process. If the build is (A|T)sanified, we can copy in a non-sanitized FileCheck to speed things up.

  • Based on the time it takes to run check-{llvm,clang} locally, which have ~56,000 FileCheck invocations, my intuition is that the overhead ought to be manageable.

  • The status quo is doing Python’s re.search over a chunk of command output. My (unverified) intuition is that FileCheck won’t be slower than that. Actually, FileCheck has an algorithmic advantage, because it doesn’t re-scan the input from the beginning each time it tries to match a substring, whereas self.expect does (see the sketch below).
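
Here’s a toy model of that last point (this is not the actual implementation of either self.expect or FileCheck, just an illustration of the scanning behavior):

  import re

  def expect_style(output, patterns):
      # Each pattern is searched against the whole output from the start, so
      # checking N patterns means N full scans, and ordering isn't verified.
      return all(re.search(p, output) for p in patterns)

  def filecheck_style(output, patterns):
      # Each match resumes where the previous one left off, so the output is
      # scanned roughly once and the patterns must appear in order.
      pos = 0
      for p in patterns:
          m = re.search(p, output[pos:])
          if m is None:
              return False
          pos += m.end()
      return True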

Another possible solution is what I mentioned earlier, basically to expose a debugger object model. This would allow you to accomplish what you want without FileCheck, while simultaneously making many other types of tests easier to write. On the other hand, it’s a larger effort to create this system, but I think long term it would pay back enormously (it’s even useful as a general purpose debugger feature, not limited to testing)

I’d volunteer to work on that, but at the moment I really need to get some form of testing put together for my patches soon.

vedant