distributed lit testing

Hello, I am Victor Kukshiev (cetjs2 on IRC), a second-year student at PetrSU.
The distributed lit testing idea is interesting and feasible for me, I think.

Could you tell us more about this project?
What is the lit test suite?
I know the Python language.
How can I participate in this project?

Hi Victor,

The lit test framework is the main testing framework used by LLVM. You can find the source code for it in the LLVM GitHub repository (see in particular https://github.com/llvm/llvm-project/tree/main/llvm/utils/lit), and there is documentation available for it on the LLVM website - https://llvm.org/docs/TestingGuide.html gives the high-level picture of how LLVM is tested, whilst https://llvm.org/docs/CommandGuide/lit.html is more focused on lit specifically.

Examples of where lit is used include the individual test files located in places like llvm/test, clang/test and lld/test within the github tree. These test directories include additional configuration files, some of which are configured when CMake is used to generate the build files for the LLVM project. If you aren’t already familiar with LLVM, I highly recommend reading up on https://llvm.org/docs/GettingStarted.html, and following the steps to make sure you can build and run LLVM components locally.

Lit works as a Python process which spawns many child processes, each of which runs one or more of the tests located in the directory under test. These tests are typically a sequence of commands that use components of LLVM that have already been built. You can build the test dependencies and run the tests by building one of the CMake-generated targets called check-* (where * might be llvm, lld, clang, etc. to run a test subset, or “check-all” to run all known tests). Currently, the tests run in parallel on the user’s machine, using the Python multiprocessing library to do this. There are also the --num-shards and related options, which allow multiple computers to each run a subset of the tests. I am not too familiar with how these options are used in practice, but I believe they require the computers to all have access to some shared filesystem which contains the tests and build artifacts, or to each have the same version checked out and to have been sent the full set of build artifacts to use. Others on this list might be able to clarify further.
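To make the sharding idea concrete, here is a minimal Python sketch of one way a test list can be split deterministically across machines. It is an illustration only and is not taken from lit's source; the function name `select_shard` and the round-robin scheme are assumptions.

```python
def select_shard(tests, num_shards, run_shard):
    """Return the tests assigned to shard `run_shard` (1-based) of `num_shards`.

    Hypothetical helper: lit's real selection logic may differ.
    """
    if not 1 <= run_shard <= num_shards:
        raise ValueError("run_shard must be between 1 and num_shards")
    # Sort first so every machine sees the tests in the same order.
    ordered = sorted(tests)
    # Round-robin: shard 1 takes items 0, N, 2N, ...; shard 2 takes 1, N+1, ...
    return ordered[run_shard - 1::num_shards]


tests = ["a.ll", "b.ll", "c.ll", "d.ll", "e.ll"]
shards = [select_shard(tests, 2, n) for n in (1, 2)]
```

Because each machine computes its own slice from the same sorted list, no coordination is needed at run time, but every machine must see an identical set of tests.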

The project goal is to provide a framework for distributing these tests across multiple computers in a more flexible manner than the existing sharding mechanism. I can think of two different high-level options - either a layer on top of lit which uses the existing sharding mechanism somehow, or something built into the existing lit code that goes wide with the tests across the machines. It would be up to you to identify and implement a way forward doing this. The hope would be that this framework could be used for multiple different distributed systems, as described in the original project description on the Open Projects page.
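As a rough sketch of the first option (a layer on top of lit that reuses the sharding options), a driver could build one command per machine. The host names, the use of ssh, and the paths below are purely hypothetical; only the `--num-shards` and `--run-shard` options come from lit itself, and the sketch assumes every host already has the same checkout and build artifacts.

```python
import shlex


def shard_commands(hosts, lit_path, test_dir):
    """Build one ssh command line per host, each running a distinct lit shard.

    Hypothetical helper: nothing here is executed, it only builds argv lists.
    """
    num_shards = len(hosts)
    commands = []
    for index, host in enumerate(hosts, start=1):
        lit_cmd = [
            lit_path,
            "--num-shards", str(num_shards),
            "--run-shard", str(index),
            test_dir,
        ]
        # ssh takes the remote command as a single string, so quote it.
        commands.append(["ssh", host, shlex.join(lit_cmd)])
    return commands


# Placeholder host names and paths, for illustration only.
commands = shard_commands(["worker1", "worker2"], "bin/llvm-lit", "test")
```

A real driver would also need to collect results and handle machines that fail, which is where most of the project's design work would lie.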

This project is intended to be a possible Google Summer of Code project. As such, to participate in it, you’d need to sign up on the GSOC website, and provide a project proposal there which details how you plan to solve the challenge. It would help your proposal get accepted if you can show some understanding of the lit testsuite, and some evidence of contributions to LLVM (perhaps in the form of additional testing you might identify that is missing in some tests, or by fixing one or more bugs from the LLVM bugzilla page, perhaps labelled with the “beginner” keyword). I am happy to work with you on your proposal if you are uncertain about anything, but the core of the proposal needs to come from you.

I hope that gives you the information you are looking for. Please feel free to ask any further questions that you may have.

James

+iwg@llvm.org

Hi James,

We run lit tests at Google using a custom runner on a distributed build system similar to Bazel.
In particular we run most of the llvm-project tests both when pulling in upstream revisions, and for any change to our internal repository that touches nearby files.

I wanted to share some of our experiences in case they’re useful, and in the hope that this project may result in something we can use too :)
I’m being brief here, but happy to provide more details.

Our build system wants to run each test in isolation (separate process, sandboxed).
Making each test hermetic separates concerns nicely (the same distributed runner is used for all kinds of testing, not just lit).
This model is also easier to fit into other containers (e.g. I imagine Ninja could make a good local test driver).
Compared to e.g. a custom driver that talks to a custom worker server that runs many tests per subprocess… there’s not very much of that we would be able to reuse.
I know there are OSS Bazel projects that want to run lit tests that would struggle with this model too.

The biggest problem with using the standard lit tool for hermetic tests is that it was too slow to start up just to run a single test.
Fundamentally the slow parts are the config system and the startup of Python programs.

We had a greatly simplified time with the config system, because we test (mostly) in a single config, so we could flatten it out into a list of features and substitutions.
But in a more general system, if we can produce the config data from config logic as a build step, then it can be cached in the usual way and simply fed into each test.
You’ll need to untangle config specific to the machine running the test from config specific to the machine driving the tests.
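To illustrate what "flattening" could look like, here is a small Python sketch that serializes features and substitutions once and then applies them to a RUN line. The JSON layout and the function names are assumptions made for this example, and the plain string replacement is a simplification of what lit really does.

```python
import json


def flatten_config(features, substitutions):
    """Serialize the output of the config logic so it can be cached.

    Hypothetical format: a JSON object with sorted features and a list of
    [pattern, replacement] pairs.
    """
    return json.dumps(
        {"features": sorted(features), "substitutions": substitutions},
        sort_keys=True,
    )


def expand(line, substitutions):
    """Apply substitutions to a RUN line, longest pattern first."""
    ordered = sorted(substitutions, key=lambda pair: len(pair[0]), reverse=True)
    for pattern, replacement in ordered:
        line = line.replace(pattern, replacement)
    return line


# Produced once as a build step (the paths are placeholders).
blob = flatten_config(
    {"x86", "shell"},
    [["%s", "/src/test/a.ll"], ["%t", "/tmp/a.tmp"]],
)

# Consumed by each test without re-running any config logic.
config = json.loads(blob)
command = expand("opt -S %s -o %t", config["substitutions"])
```

Per-test values such as `%s` and `%t` depend on the machine running the test, so in practice they would probably have to be filled in on the worker rather than baked into the cached blob.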

I wrote a hermetic test runner in Go - not my favorite language but it starts up fast and has good subprocess support.
It’s greatly simplifying to be able to assume you can fork a real shell and that only limited state (CWD, exported vars) can leak from one RUN line to the next. This works fine for us in practice (but we don’t test on Windows).
It has some nice features like printing a transcript of the test run, highlighting directives and stderr output, showing pre/post expansion lines, annotating each line with the result.
I should be able to share the code of this, it’s nothing terribly surprising.
It’s less than 1000 LOC and runs almost all LLVM tests - IMO it would be worthwhile to keep the lit spec very simple and remove some of the marginal features that have crept in over the years. We chose to simply drop some tests rather than deal with all the corner cases.
(Before this existed, we ran sed over the lit tests to turn them into shell scripts, which worked but was hard to maintain and to read the output on failure… actually the upstream lit runner has the latter problem too!)
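For readers unfamiliar with the approach, the core of such a runner can be sketched in a few lines of Python. This is not the Go runner described above; it is a simplified illustration that assumes a POSIX `sh` is available, ignores substitutions and directives other than RUN, and runs all RUN lines in one shell (so more state leaks between lines than in the design described above).

```python
import re
import subprocess

RUN_RE = re.compile(r"RUN:\s?(.*)$")


def parse_run_lines(text):
    """Collect RUN: directives, joining lines that end in a backslash."""
    commands, pending = [], ""
    for line in text.splitlines():
        match = RUN_RE.search(line)
        if not match:
            continue
        part = match.group(1).strip()
        if part.endswith("\\"):
            # Continuation: keep accumulating until a line without "\".
            pending += part[:-1].rstrip() + " "
        else:
            commands.append(pending + part)
            pending = ""
    if pending:
        commands.append(pending.strip())
    return commands


def run_test(text, shell="sh"):
    """Run every RUN line in one shell and stop at the first failure."""
    script = "set -e\n" + "\n".join(parse_run_lines(text)) + "\n"
    result = subprocess.run([shell, "-c", script], capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


example = """\
; RUN: echo hello \\
; RUN:   world
; RUN: echo done
"""
run_commands = parse_run_lines(example)
returncode, stdout, stderr = run_test(example)
```

Printing a per-line transcript, as described above, would mean running each command separately and recording its output, which this sketch leaves out.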

I’m sure I’ve forgotten things, but I think those were my biggest takeaways. Needing to solve the config problem + the Go dependency were the main reasons I didn’t push to make these changes upstream :(
Hope this is useful or maybe at least interesting :)

Cheers, Sam

Thank you for sharing your experience Sam! I’d be interested in taking a look at your test runner if it’s something you could publish.

I started looking into this topic recently since we’re now looking into a way to run lit tests on Fuchsia. I started experimenting with the remote execution support in libc++ but using SCP and SSH for each test doesn’t really scale.

In Fuchsia, the unit of distribution is a package that’s completely hermetic. We then run these packages as components, where each component has its own filesystem and doesn’t have any unnecessary privileges. It’s similar to containers in many ways.

The idea I got was to extend lit to separate configuration from execution, which would allow us to package up all tests on the host, push them to the target and run each of them as a separate component using our test runner (we already have a Fuchsia test runner that runs tests as components).

It sounds very similar to what you already did and I’d be interested in seeing if we could reuse some of your tooling. Furthermore, it’d be great if we could come up with a way to support this workflow directly in lit and LLVM.

Nico also looked into this area in the past, experimenting with a custom test runner written in Go (github.com/nico/glitch) and using Ninja as a test runner (reviews.llvm.org/D47506) which may be worth checking.

I also have https://github.com/nico/llvm-project/commit/7246393c6bbc270044641415ffb0db93ffee3e29 in a local branch, which makes it possible and easy to zip up all build artifacts and test inputs needed to run tests on a remote machine.

With this, you can run check-llvm, check-clang, check-lld etc in parallel (sharded per test suite too) – but you’re limited by your uplink speed.

(Also, this assumes a GN build, but the idea should transfer to CMake fine.)
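The packaging step itself is simple to sketch. The Python below is not the code from the commit linked above; it only illustrates the idea of bundling a known list of build artifacts and test inputs into one archive, and the file names and the helper `package_tests` are placeholders.

```python
import os
import tempfile
import zipfile


def package_tests(root, relative_paths, archive_path):
    """Zip the listed files, storing each under its path relative to root."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for rel in sorted(relative_paths):
            archive.write(os.path.join(root, rel), arcname=rel)
    return archive_path


# Demo with placeholder files standing in for real binaries and tests.
with tempfile.TemporaryDirectory() as root:
    for rel in ("bin/opt", "test/a.ll"):
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as handle:
            handle.write("placeholder\n")
    archive = package_tests(
        root, ["test/a.ll", "bin/opt"], os.path.join(root, "tests.zip")
    )
    with zipfile.ZipFile(archive) as bundle:
        names = bundle.namelist()
```

The hard part in practice is knowing the complete list of inputs each test suite needs, which is what the build system has to provide.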