[RFC] Upstreaming Reproducer Capture/Replay for the API Test Suite

Hi everyone,

Reproducers in LLDB are currently tested through (1) unit tests, (2) dedicated end-to-end shell tests and (3) the lldb-check-repro suite which runs all the shell tests against a replayed reproducer. While this already provides great coverage, we’re still missing out on about 800 API tests. These tests are particularly interesting to the reproducers, because as opposed to the shell tests, which only exercises a subset of SB API calls used to implement the driver, they cover the majority of the API surface.

To further qualify reproducer and to improve test coverage, I want to capture and replay the API test suite as well. Conceptually, this can be split up in two stages:

  1. Capture a reproducer and replay it with the driver. This exercises the reproducer instrumentation (serialization and deserialization) for all the APIs used in our test suite. While a bunch of issues with the reproducer instrumentation can be detected at compile time, a large subset only triggers through assertions at runtime. However, this approach by itself only verifies that we can (de)serialize API calls and their arguments. It has no knowledge of the expected results and therefore cannot verify the results of the API calls.

  2. Capture a reproducer and replay it with dotest.py. Rather than having the command line driver execute every API call one after another, we can have dotest.py call the Python API as it normally would, intercept the call, replay it from the reproducer, and return the replayed result. The interception can be hidden behind the existing LLDB_RECORD_* macros, which contains sufficient type info to drive replay. It then simply re-invokes itself with the arguments deserialized from the reproducer and returns that result. Just as with the shell tests, this approach allows us to reuse the existing API tests, completely transparently, to check the reproducer output.

I have worked on this over the past month and have shown that it is possible to achieve both stages. I have a downstream fork that contains the necessary changes.

All the runtime issues found in stage 1 have been fixed upstream. With the exception of about 30 tests that fail because the GDB packets diverge during replay, all the tests can be replayed with the driver.

About 120 tests, which include the 30 mentioned earlier, fail to replay for stage 2. This isn’t entirely unexpected, just like the shell tests, there are tests that simply are not expected to work. The reproducers don’t currently capture the output of the inferior and synchronization through external files won’t work either, as those paths will get remapped by the VFS. This requires manually triage.

I would like to start upstreaming this work so we can start running this in CI. The majority of the changes are limited to the reproducer instrumentation, but some changes are needed in the test suite as well, and there would be a new decorator to skip the unsupported tests. I’m splitting up the changes in self-contained patches, but wanted to send out this RFC with the bigger picture first.

Please let me know what you think!


I personally believe this is a required step to make sure:
a) Reproducers can jump from being a prototype idea to something that can actually run in production
b) Whenever we add a new test [or presumably a new API] we get coverage for-free.
c) We have a verification mechanism to make sure we don’t regress across the large surface API and not only what the unittests & shell tests cover.

I personally would be really glad to see this being upstreamed. I also would like to thank you for doing the work in a downstream branch until you proved this was achievable.

Hi Jonas, Davide,

I am not exactly thrilled by the ever-growing number of "modes" our test
suite can be run in. However, it seems that's a battle I am destined to
loose, so I'll just repeat what I've been saying for some time now.

I don't believe that either of these funny "modes" should be the _only_
way to test a given piece of code. Using the extra modes to increase
test coverage is fine, and I can certainly appreciate the value of this
kind of exploratory testing (I've added some temporary modes locally
myself when working on various patches), but I still believe that every
patch should have an accompanying test(s) which can run in the default
"mode" and with as few dependencies as possible.

I believe Jonas is aware of that, and his existing work on reproducers
reflects that philosophy, but I think it's still important to spell this