RFC: End-to-end testing

To check that instructions are generated from source, a two-step test
is the best approach:
- Verify that Clang emits different IR for different options, or the
right IR for a new functionality
- Verify that the affected targets (or at least two of the main ones)
can take that IR and generate the right asm
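As a sketch of what the first step might look like (hedged: the flags,
the fused-multiply-add case, and the CHECK lines are illustrative, not
lifted from the tree), step one is an ordinary lit/FileCheck test under
clang/test:

```c
// Step 1: verify Clang emits the expected IR for the option.
// The -ffp-contract=on case (contracting a*b+c into llvm.fmuladd)
// is used purely as an illustration here.
// REQUIRES: x86-registered-target
// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -O2 \
// RUN:   -ffp-contract=on -emit-llvm %s -o - | FileCheck %s

double mad(double a, double b, double c) {
  // CHECK-LABEL: @mad(
  // CHECK: call {{.*}}double @llvm.fmuladd.f64(
  return a * b + c;
}
```

Step two would then be a separate .ll test under llvm/test/CodeGen for
each target of interest, feeding equivalent IR to llc and FileChecking
for the expected instruction, so neither half depends on the other
component being built.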

Clang can emit LLVM IR for any target, but you don't necessarily need
to build the back-ends.

If you want to do the test in Clang all the way to asm, you need to
make sure the back-end is built. Clang is not always built with all
back-ends, possibly even with none.

To do that in the back-end, you'd have to rely on Clang being built,
which is not always true.

Hacking our test infrastructure to test different things depending on
which combination of components is built, especially as the components
merge into the monorepo, will complicate tests and increase the
likelihood that some tests will never be run by CI and will bit-rot.

On the test-suite, you can guarantee that the whole toolchain is
available: Front and back end of the compilers, assemblers (if
necessary), linkers, libraries, etc.

Writing a small source file per test, as you would in Clang/LLVM,
running LIT and FileCheck, and *always* running it in the TS would be
trivial.

--renato

David Blaikie via llvm-dev <llvm-dev@lists.llvm.org> writes:

I generally agree that end-to-end testing should be very limited - but
there are already some end-to-end-ish tests in clang and I don't think
they're entirely wrong there. I don't know much about the vectorization
tests - but any test that requires a tool to maintain/generate makes me a
bit skeptical and doubly-so if we were testing all of those end-to-end too.
(I'd expect maybe one or two sample/example end-to-end tests, to test
certain integration points, but exhaustive testing would usually be left
to narrower tests: if you have one subsystem with three codepaths
{1, 2, 3} and another subsystem with three codepaths {A, B, C}, you
don't test the full combination of {1, 2, 3} X {A, B, C} (9 tests); you
test each set separately, plus maybe one representative end-to-end
sample, so you end up with maybe 7-8 tests.)

That sounds reasonable. End-to-end tests are probably going to be very
much a case-by-case thing. I imagine we'd start with the component
tests as is done today and then if we see some failure in end-to-end
operation that isn't covered by the existing component tests we'd add an
end-to-end test. Or maybe we create some new component tests to cover
it.

It's possible I know so little about the vectorization issues in
particular that my thoughts on testing don't line up with the realities
of that particular domain.

Vectorization is only one small part of what I imagine we'd want to test
in an end-to-end fashion. There are lots of examples of "we want this
code generated" beyond vectorization.

                           -David

Mehdi AMINI via llvm-dev <llvm-dev@lists.llvm.org> writes:

The main thing I see that will justify push-back on such test is the
maintenance: you need to convince everyone that every component in LLVM
must also maintain (update, fix, etc.) the tests that are in other
components (clang, flang, other future subprojects, etc.). Changing the
vectorizer in the middle-end may now require understanding what kind of
update is needed for a test written in Fortran (or Haskell?) that checks
some Hexagon assembly. This is a non-trivial burden when you compute the
full matrix of possible frontends and backends.

That's true. But don't we want to make sure the complete compiler works
as expected? And don't we want to be alerted as soon as possible if
something breaks? To my knowledge we have very few end-to-end tests of
the type I've been thinking about. That worries me.

Even if you write very small tests for checking vectorization, what is
next? What about unrolling, inlining, loop-fusion, etc.? Why would we
limit end-to-end FileCheck testing to vectorization?

I actually think vectorization is probably lower on the concern list for
end-to-end testing than more focused things like FMA generation,
prefetching and so on. This is because there isn't a lot after the
vectorization pass that can mess up vectorization. Once something is
vectorized, it is likely to stay vectorized. On the other hand, I have,
for example, frequently seen prefetches dropped or poorly scheduled
long after the prefetch got inserted into the IR.
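A minimal sketch of the kind of focused source-to-asm check meant here
(hedged: the builtin and the expected mnemonic are illustrative; on
x86, __builtin_prefetch with high locality is normally lowered to
prefetcht0):

```c
// A focused end-to-end check: assert only that the prefetch survives
// all the way to the final assembly; don't match the whole output.
// REQUIRES: x86-registered-target
// RUN: %clang --target=x86_64-unknown-linux-gnu -O2 -S %s -o - \
// RUN:   | FileCheck %s

void touch(char *p) {
  // CHECK-LABEL: touch:
  // CHECK: prefetcht0
  __builtin_prefetch(p, 0, 3);
}
```

Checking a single mnemonic keeps the test robust against unrelated
scheduling and register-allocation changes.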

So the monorepo vs the test-suite seems like a false dichotomy: if such
tests don't make it into the monorepo it will be (I believe) because
folks won't want to maintain them. Putting them "elsewhere" is fine but
it does not solve the question of maintaining the tests.

Agree 100%.

                      -David

Renato wrote:

If you want to do the test in Clang all the way to asm, you need to
make sure the back-end is built. Clang is not always built with all
back-ends, possibly even with none.

This is no different than today. Many tests in Clang require a specific
target to exist. Grep clang/test for "registered-target" for example;
I get 577 hits. Integration tests (here called "end-to-end" tests)
clearly need to specify their REQUIRES conditions correctly.

To do that in the back-end, you'd have to rely on Clang being built,
which is not always true.

A frontend-based test in the backend would be a layering violation.
Nobody is suggesting that.

Hacking our test infrastructure to test different things depending on
which combination of components is built, especially as the components
merge into the monorepo, will complicate tests and increase the
likelihood that some tests will never be run by CI and will bit-rot.

Monorepo isn't the relevant thing. It's all about the build config.

Any test introduced by any patch today is expected to be run by CI.
This expectation would not be any different for these integration tests.

On the test-suite, you can guarantee that the whole toolchain is
available: Front and back end of the compilers, assemblers (if
necessary), linkers, libraries, etc.

Writing a small source file per test, as you would in Clang/LLVM,
running LIT and FileCheck, and *always* running it in the TS would be
trivial.

I have to say, it's highly unusual for me to make a commit that
does *not* produce blame mail from some bot running lit tests.
Thankfully it's rare to get one that is actually my fault.

I can't remember *ever* getting blame mail related to test-suite.
Do they actually run? Do they ever catch anything? Do they ever
send blame mail? I have to wonder about that.

Mehdi wrote:

David Greene wrote:

Personally, I still find source-to-asm tests to be highly valuable and I
don't think we need test-suite for that. Such tests don't (usually)
depend on system libraries (headers may occasionally be an issue but I
would argue that the test is too fragile in that case).

So maybe we separate concerns. Use test-suite to do the kind of
system-level testing you've discussed but still allow some tests in a
monorepo top-level directory that test across components but don't
depend on system configurations.

If people really object to a top-level monorepo test directory I guess
they could go into test-suite but that makes it much more cumbersome to
run what really should be very simple tests.

The main thing I see that will justify push-back on such test is the
maintenance: you need to convince everyone that every component in LLVM
must also maintain (update, fix, etc.) the tests that are in other
components (clang, flang, other future subprojects, etc.). Changing the
vectorizer in the middle-end may now require understanding what kind of
update is needed for a test written in Fortran (or Haskell?) that checks
some Hexagon assembly. This is a non-trivial burden when you compute the
full matrix of possible frontends and backends.

So how is this different from today? If I put in a patch that breaks
Hexagon, or compiler-rt, or LLDB, none of which I really understand...
or omg Chrome, which isn't even an LLVM project... it's still my job to
fix whatever is broken. If it's some component where I am genuinely
clueless, I'm expected to ask for help. Integration tests would not be
any different.

Flaky or fragile tests that constantly break for no good reason would
need to be replaced or made more robust. Again this is no different
from any other flaky or fragile test.

I can understand people being worried that because an integration test
depends on more components, it has a wider "surface area" of potential
breakage points. This, I claim, is exactly the *value* of such tests.
And I see them breaking primarily under two conditions.

1) Something is broken that causes other component-level failures.
   Fixing that component-level problem will likely fix the integration
   test as well; or, the integration test must be fixed the same way
   as the component-level tests.

2) Something is broken that does *not* cause other component-level
   failures. That's exactly what integration tests are for! They
   verify *interactions* that are hard or maybe impossible to test in
   a component-level way.

The worry I'm hearing is about a third category:

3) Integration tests fail due to fragility or overly-specific checks.

...which should be addressed in exactly the same way as our overly
fragile or overly specific component-level tests. Is there some
reason they wouldn't be?

--paulr

Mehdi AMINI via llvm-dev <llvm-dev@lists.llvm.org> writes:

The main thing I see that will justify push-back on such test is the
maintenance: you need to convince everyone that every component in LLVM
must also maintain (update, fix, etc.) the tests that are in other
components (clang, flang, other future subprojects, etc.). Changing the
vectorizer in the middle-end may now require understanding what kind of
update is needed for a test written in Fortran (or Haskell?) that checks
some Hexagon assembly. This is a non-trivial burden when you compute the
full matrix of possible frontends and backends.

That's true, but at some point we really do just need to work together
to make changes. If some necessary group of people become unresponsive,
then we'll need to deal with that, but just not knowing whether the
compiler works as intended seems worse.

That's true. But don't we want to make sure the complete compiler works
as expected? And don't we want to be alerted as soon as possible if
something breaks? To my knowledge we have very few end-to-end tests of
the type I've been thinking about. That worries me.

I agree. We really should have more end-to-end testing for cases where
we have end-to-end contracts. If we provide a pragma to ask for
vectorization, or loop unrolling, or whatever, then we should test "end
to end" for whatever that means from the beginning of the contract
(i.e., the place where the request is asserted) to the end (i.e., the
place where we can confirm that the user will observe the intended
behavior) - this might mean checking assembly or it might mean checking
end-stage IR, etc. There are other cases where, even if there's no
pragma, we know what the optimal output is and we can test for it. We've
had plenty of cases where changes to the pass pipeline, instcombine,
etc. have caused otherwise reasonably-well-covered components to stop
behaving as expected in the context of the complete pipeline.
Vectorization is a good example of this, but is not the only such
example. As I recall, other loop optimizations (unrolling, idiom
recognition, etc.) have also had these problems over time.

Even if you write very small tests for checking vectorization, what is
next? What about unrolling, inlining, loop-fusion, etc.? Why would we
limit end-to-end FileCheck testing to vectorization?

I actually think vectorization is probably lower on the concern list for
end-to-end testing than more focused things like FMA generation,
prefetching and so on.

In my experience, these are about equal. Vectorization being later means
that fewer things can mess things up afterwards (although there still is
all of codegen), but more things can mess things up beforehand.

-Hal

This is no different than today. Many tests in Clang require a specific
target to exist. Grep clang/test for "registered-target" for example;
I get 577 hits. Integration tests (here called "end-to-end" tests)
clearly need to specify their REQUIRES conditions correctly.

Right, which is why I wrote at the beginning that Clang already has tests like that.

So, if all David wants is to extend those tests, then I think this
thread was a heck of a time-wasting exercise. :)

It's nothing new, nothing deeply controversial and it's in the list of
things we know are not great, but accept anyway.

I personally don't think it's a good idea (for reasons already
expressed in this thread), and it caused me trouble when I was
setting up the Arm bots: I had to build the x86 target, even though I
never used it, just because of some tests.

Today, Arm bots are faster, so it doesn't matter much, but new
hardware will still have that problem. I would like, long term, to
have the right tests in the right places.

Monorepo isn't the relevant thing. It's all about the build config.

I didn't mean it would, per se. Yes, it's about build config, but
setting up CI with SVN means you have to actively check out each repo,
while with the monorepo they all come together, so it's easier to
forget they are tangled, or to hack around build issues (like I did
when I enabled the x86 build) and never look back (that was 7 years
ago).

I have to say, it's highly unusual for me to make a commit that
does *not* produce blame mail from some bot running lit tests.
Thankfully it's rare to get one that is actually my fault.

I was hoping to reduce that. :)

I can't remember *ever* getting blame mail related to test-suite.
Do they actually run? Do they ever catch anything? Do they ever
send blame mail? I have to wonder about that.

They do run, on both x86 and Arm at least, in different
configurations, including correctness and benchmark mode, on anything
between 5 and 100 commits, continuously.

They rarely catch much nowadays because the toolchain is stable and no
new tests are being added. They work very well, though, for external
system tests and benchmarks, and people use them downstream a lot.

They do send blame mail occasionally, but only after all the others,
and people generally ignore them. Bot owners usually have to pressure
people, create bugs, revert patches or just fix the issues themselves.

David Blaikie <dblaikie@gmail.com> writes:

Renato Golin <rengolin@gmail.com> writes:

Can you elaborate? I'm talking about very small tests targeted to
generate a specific instruction or small number of instructions.
Vectorization isn't the best example. Something like verifying FMA
generation is a better example.

To check that instructions are generated from source, a two-step test
is the best approach:
- Verify that Clang emits different IR for different options, or the
right IR for a new functionality
- Verify that the affected targets (or at least two of the main ones)
can take that IR and generate the right asm

Yes, of course we have tests like that. We have found they are not
always sufficient.

If you want to do the test in Clang all the way to asm, you need to
make sure the back-end is built. Clang is not always built with all
back-ends, possibly even with none.

Right, which is why we have things like REQUIRES: x86-registered-target.

To do that in the back-end, you'd have to rely on Clang being built,
which is not always true.

Sure.

Hacking our test infrastructure to test different things depending on
which combination of components is built, especially as the components
merge into the monorepo, will complicate tests and increase the
likelihood that some tests will never be run by CI and will bit-rot.

From other discussion, it sounds like at least some people are open to
asm tests under clang. I think that should be fine. But there are
probably other kinds of end-to-end tests that should not live under
clang.

On the test-suite, you can guarantee that the whole toolchain is
available: Front and back end of the compilers, assemblers (if
necessary), linkers, libraries, etc.

Writing a small source file per test, as you would in Clang/LLVM,
running LIT and FileCheck, and *always* running it in the TS would be
trivial.

How often would such tests be run as part of test-suite?

Honestly, it's not really clear to me exactly which bots cover what, how
often they run and so on. Is there a document somewhere describing the
setup?

                     -David

From other discussion, it sounds like at least some people are open to
asm tests under clang. I think that should be fine. But there are
probably other kinds of end-to-end tests that should not live under
clang.

That is my position as well. Some tests, especially similar to
existing ones, are fine.

But if we really want to do complete tests and stress more than just
grepping for a couple of instructions, they should be in a
better-suited place.

How often would such tests be run as part of test-suite?

Every time the TS is executed. Some good work has been put into
running it with CMake etc., so it should be trivial to run it before
commits, but it *does* require more than just "make check-all".

On CI, a number of bots run those as often as they can, non-stop.

Honestly, it's not really clear to me exactly which bots cover what, how
often they run and so on. Is there a document somewhere describing the
setup?

Not really. The main Buildbot page is a mess and the system is very
old. There is a round table at the dev meeting to discuss the path
forward.

This is not the first such discussion, though. We have been discussing
this for a number of years, but getting people / companies to commit
to testing is not trivial.

I created a page for the Arm bots (after many incarnations, it ended
up here: http://ex40-01.tcwglab.linaro.org/) to make that simpler. But
that wouldn't scale, nor does it fix the real problems.

Renato Golin <rengolin@gmail.com> writes:

From other discussion, it sounds like at least some people are open to
asm tests under clang. I think that should be fine. But there are
probably other kinds of end-to-end tests that should not live under
clang.

That is my position as well. Some tests, especially similar to
existing ones, are fine.

Ok.

But if we really want to do complete tests and stress more than just
grepping for a couple of instructions, they should be in a
better-suited place.

That's probably true.

How often would such tests be run as part of test-suite?

Every time the TS is executed. Some good work has been put into
running it with CMake etc., so it should be trivial to run it before
commits, but it *does* require more than just "make check-all".

I have been viewing test-suite as a kind of second-level/backup testing
that catches things not flagged by "make check-all." Is that a
reasonable interpretation? I was hoping to get some end-to-end tests
under "make check-all" because that's easier for developers to run in
their workflows.

On CI, a number of bots run those as often as they can, non-stop.

Honestly, it's not really clear to me exactly which bots cover what, how
often they run and so on. Is there a document somewhere describing the
setup?

Not really. The main Buildbot page is a mess and the system is very
old. There is a round table at the dev meeting to discuss the path
forward.

Yeah, I saw that. I will see if I can attend. There are some conflicts
we have to work out here.

This is not the first such discussion, though. We have been discussing
this for a number of years, but getting people / companies to commit
to testing is not trivial.

Is there a proposal somewhere of what companies would be expected to do?
It's difficult for us engineers to talk to management without a concrete
set of expectations, resource requirements, etc.

I created a page for the Arm bots (after many incarnations, it ended
up here: http://ex40-01.tcwglab.linaro.org/) to make that simpler. But
that wouldn't scale, nor does it fix the real problems.

Nice! That's much better. Yes, it won't scale but it's much clearer
about what is being run.

                         -David

I have been viewing test-suite as a kind of second-level/backup testing
that catches things not flagged by "make check-all." Is that a
reasonable interpretation? I was hoping to get some end-to-end tests
under "make check-all" because that's easier for developers to run in
their workflows.

It is a common understanding, which makes the test-suite less useful,
but that's not really relevant.

No one needs to run the test-suite as part of their development
process, because we have bots for that.

If you have decent piece-wise tests in Clang/LLVM, you really don't
need end-to-end tests in Clang/LLVM, because the test-suite will run
on bots and you will be told if you break them.

Most errors will be picked up by piece-wise tests, and for the
minority where end-to-end tests make a difference, the fixing can be
reactive rather than proactive.

Is there a proposal somewhere of what companies would be expected to do?
It's difficult for us engineers to talk to management without a concrete
set of expectations, resource requirements, etc.

There were talks about upgrading Buildbot (the service), or moving to
Jenkins or something else (Travis?). None of them has the majority of
the community behind it, and that's the main problem.

IIRC, the arguments (definitely outdated, probably wrong) were:

Buildbot:
- Pros: we already have a big infra based on it, it's passable, an
upgrade could ameliorate a lot of problems without creating many new
ones.
- Cons: it's old tech and requires extensive coding to make it work

Jenkins:
- Pros: Most companies use it already, it's more modern, Apple's
GreenBot is based on it, and there are lots of plugins and expertise
in the community
- Cons: It requires Java running on the client, which not all targets
like. Alternatives require a separate server to run as a slave and
connect to targets.

Travis:
- Pros: It's natively compatible with GitHub (is this still the case?)
and it could be the easiest to connect with our new repo for CI
- Cons: less expertise, I guess, and other things I don't really know.

Nice! That's much better. Yes, it won't scale but it's much clearer
about what is being run.

Perhaps adding a new column showing which components we test on each
one would be nice.

--renato

This is a good summary! I just want to add some newer options like BuildKite (we've been experimenting with this on MLIR: https://buildkite.com/mlir/ ) and GitHub Actions (still in beta: https://github.com/features/actions ; I haven't had time to play with it). There are also paid options like CircleCI and TeamCity, but they seem out of scope for us, I think?
It'd be interesting to collect a more complete list of CI tools that are free for open source. :)

Philip Reames via cfe-dev <cfe-dev@lists.llvm.org> writes:

A challenge we already have - as in, I've broken these tests and had to
fix them - is that an end-to-end test which checks either IR or assembly
ends up being extraordinarily fragile. Completely unrelated profitable
transforms create small differences which cause spurious test failures.
This is a very real issue today with the few end-to-end clang tests we
have, and I am extremely hesitant to expand those tests without giving
this workflow problem serious thought. If we don't, this could bring
development on middle end transforms to a complete stop. (Not kidding.)

Do you have a pointer to these tests? We literally have tens of
thousands of end-to-end tests downstream and while some are fragile, the
vast majority are not. A test that, for example, checks the entire
generated asm for a match is indeed very fragile. A test that checks
whether a specific instruction/mnemonic was emitted is generally not, at
least in my experience. End-to-end tests require some care in
construction. I don't think update_llc_test_checks.py-type operation is
desirable.

The couple I remember offhand were mostly vectorization tests, but it's been a while, so I might be misremembering.

Still, you raise a valid point and I think present some good options
below.

A couple of approaches we could consider:

  1. Simply restrict end to end tests to crash/assert cases. (i.e. no
     property of the generated code is checked, other than that it is
     generated) This isn't as restrictive as it sounds when combined
     w/coverage guided fuzzer corpuses.

I would be pretty hesitant to do this but I'd like to hear more about
how you see this working with coverage/fuzzing.

We've found end-to-end fuzzing from Java (which guarantees single-threaded determinism and lack of UB), comparing two implementations, to be extremely effective at catching regressions. A big chunk of the regressions are assertion failures. Our ability to detect miscompiles by comparing the output of two implementations (well, 2 or more, for tie-breaking purposes) has worked extremely well. However, once a problem is identified, we're stuck manually reducing and reacting, which is a very major time sink. The key thing here, in the context of this discussion, is that there are no IR checks of any form; we just check the end-to-end correctness of the system and then reduce from there.

  2. Auto-update all diffs, but report them to a human user for
     inspection. This ends up meaning that tests never "fail" per se,
     but that individuals who have expressed interest in particular tests
     get an automated notification and a chance to respond on list with a
     reduced example.

That's certainly workable.

  3. As a variant on the former, don't auto-update tests, but only inform
     the *contributor* of an end-to-end test of a failure. Responsibility
     for determining failure vs false positive lies solely with them, and
     normal channels are used to report a failure after it has been
     confirmed/analyzed/explained.

I think I like this best of the three but it raises the question of what
happens when the contributor is no longer contributing. Who's
responsible for the test? Maybe it just sits there until someone else
claims it.

I'd argue it should be deleted if no one is willing to actively step up. It is not in the community's interest to assume unending responsibility for any third party test suite given the high burden involved here.