RFC: End-to-end testing

David_A_Greene · October 8, 2019, 4:49pm

[ I am initially copying only a few lists since they seem like
the most impacted projects and I didn't want to spam all the mailing
lists. Please let me know if other lists should be included. ]

I submitted D68230 for review but this is not about that patch per se.
The patch allows update_cc_test_checks.py to process tests that should
check target asm rather than LLVM IR. We use this facility downstream
for our end-to-end tests. It strikes me that it might be useful for
upstream to do similar end-to-end testing.

Now that the monorepo is about to become the canonical source of truth,
we have an opportunity for convenient end-to-end testing that we didn't
easily have before with svn (yes, it could be done but in an ugly way).
AFAIK the only upstream end-to-end testing we have is in test-suite and
many of those codes are very large and/or unfocused tests.

With the monorepo we have a place to put lit-style tests that exercise
multiple subprojects, for example tests that ensure the entire clang
compilation pipeline executes correctly. We could, for example, create
a top-level "test" directory and put end-to-end tests there. Some of
the things that could be tested include:

- Pipeline execution (debug-pass=Executions)
- Optimization warnings/messages
- Specific asm code sequences out of clang (e.g. ensure certain loops
are vectorized)
- Pragma effects (e.g. ensure loop optimizations are honored)
- Complete end-to-end PGO (generate a profile and re-compile)
- GPU/accelerator offloading
- Debuggability of clang-generated code

Each of these things is tested to some degree within their own
subprojects, but AFAIK there are currently no dedicated tests ensuring
such things work through the entire clang pipeline flow and with other
tools that make use of the results (debuggers, etc.). It is relatively
easy to break the pipeline while the individual subproject tests
continue to pass.

I realize that some folks prefer to work on only a portion of the
monorepo (for example, they just hack on LLVM). I am not sure how to
address those developers WRT end-to-end testing. On the one hand,
requiring them to run end-to-end testing means they will have to at
least check out and build the monorepo. On the other hand, it seems
less than ideal to have people developing core infrastructure and not
running tests.

I don't yet have a formal proposal but wanted to put this out to spur
discussion and gather feedback and ideas. Thank you for your interest
and participation!

-David

pogo59 · October 8, 2019, 6:51pm

From: cfe-dev <cfe-dev-bounces@lists.llvm.org> On Behalf Of David Greene
via cfe-dev
Sent: Tuesday, October 08, 2019 12:50 PM
To: llvm-dev@lists.llvm.org; cfe-dev@lists.llvm.org; openmp-
dev@lists.llvm.org; lldb-dev@lists.llvm.org
Subject: [cfe-dev] RFC: End-to-end testing

[ I am initially copying only a few lists since they seem like
  the most impacted projects and I didn't want to spam all the mailing
  lists. Please let me know if other lists should be included. ]

I submitted D68230 for review but this is not about that patch per se.
The patch allows update_cc_test_checks.py to process tests that should
check target asm rather than LLVM IR. We use this facility downstream
for our end-to-end tests. It strikes me that it might be useful for
upstream to do similar end-to-end testing.

Now that the monorepo is about to become the canonical source of truth,
we have an opportunity for convenient end-to-end testing that we didn't
easily have before with svn (yes, it could be done but in an ugly way).
AFAIK the only upstream end-to-end testing we have is in test-suite and
many of those codes are very large and/or unfocused tests.

With the monorepo we have a place to put lit-style tests that exercise
multiple subprojects, for example tests that ensure the entire clang
compilation pipeline executes correctly. We could, for example, create
a top-level "test" directory and put end-to-end tests there. Some of
the things that could be tested include:

- Pipeline execution (debug-pass=Executions)
- Optimization warnings/messages
- Specific asm code sequences out of clang (e.g. ensure certain loops
  are vectorized)
- Pragma effects (e.g. ensure loop optimizations are honored)
- Complete end-to-end PGO (generate a profile and re-compile)
- GPU/accelerator offloading
- Debuggability of clang-generated code

Each of these things is tested to some degree within their own
subprojects, but AFAIK there are currently no dedicated tests ensuring
such things work through the entire clang pipeline flow and with other
tools that make use of the results (debuggers, etc.). It is relatively
easy to break the pipeline while the individual subproject tests
continue to pass.

I agree with all your points. End-to-end testing is a major hole in the
project infrastructure; it has been largely left up to the individual
vendors/packagers/distributors. The Clang tests verify that Clang will
produce some sort of not-unreasonable IR for given situations; the LLVM
tests verify that some (other) set of input IR will produce something
that looks not-unreasonable on the target side. Very little connects
the two.

There is more than nothing:
- test-suite has some quantity of code that is compiled end-to-end for
  some targets.
- lldb can be set up to use the just-built Clang to compile its tests,
  but those are focused on debug info and are nothing like comprehensive.
- libcxx likely also can use the just-built Clang, although I've never
  tried it so I don't know for sure. It obviously exercises just the
  runtime side of things.
- compiler-rt likewise. The sanitizer tests in particular I'd expect to
  be using the just-built Clang.
- debuginfo-tests also uses the just-built Clang but is a pathetically
  small set, and again focused on debug info.

I'm not saying the LLVM Project should invest in a commercial suite
(although I'd expect vendors to do so; Sony does). But a place to *put*
end-to-end tests seems entirely reasonable and useful. Although I would
resist calling it simply "tests" (we have too many directories with that
name already).

I realize that some folks prefer to work on only a portion of the
monorepo (for example, they just hack on LLVM). I am not sure how to
address those developers WRT end-to-end testing. On the one hand,
requiring them to run end-to-end testing means they will have to at
least check out and build the monorepo. On the other hand, it seems
less than ideal to have people developing core infrastructure and not
running tests.

People should obviously be running the tests for the project(s) they're
modifying. People aren't expected to run everything. That's why...

Bots. "Don't argue with the bots." I don't checkout and build and test
everything, and I've broken LLDB, compiler-rt, and probably others from
time to time. Probably everybody has broken other projects unexpectedly.
That's what bots are for, to run the combinations and the projects that
I don't have the infrastructure or resources to do myself. It's not up
to me to run everything possible before committing; it IS up to me to
respond promptly to bot failures for my changes. I don't see a new
end-to-end test project being any different in that respect.

I don't yet have a formal proposal but wanted to put this out to spur
discussion and gather feedback and ideas. Thank you for your interest
and participation!

Thanks for bringing it up! It's been a pebble in my shoe for a long time.
--paulr

dblaikie · October 8, 2019, 7:22pm

I have a bit of concern about this sort of thing - worrying it’ll lead to people being less cautious about writing the more isolated tests. That said, clearly there’s value in end-to-end testing for all the reasons you’ve mentioned (& we do see these problems in practice - recently DWARF indexing broke when support for more nuanced language codes were added to Clang).

Dunno if they need a new place or should just be more stuff in test-suite, though.

David_A_Greene · October 8, 2019, 7:46pm

David Blaikie via Openmp-dev <openmp-dev@lists.llvm.org> writes:

I have a bit of concern about this sort of thing - worrying it'll lead to
people being less cautious about writing the more isolated tests.

That's a fair concern. Reviewers will still need to insist on small
component-level tests to go along with patches. We don't have to
sacrifice one to get the other.

Dunno if they need a new place or should just be more stuff in test-suite,
though.

There are at least two problems I see with using test-suite for this:

- It is a separate repository and thus is not as convenient as tests
that live with the code. One cannot commit an end-to-end test
atomically with the change meant to be tested.

- It is full of large codes which is not the kind of testing I'm talking
about.

Let me describe how I recently added some testing in our downstream
fork.

- I implemented a new feature along with a C source test.

- I used clang to generate asm from that test and captured the small
piece of it I wanted to check in an end-to-end test.

- I used clang to generate IR just before the feature kicked in and
  created an opt-style test for it. Generating this IR is not always
  straightfoward and it would be great to have better tools to do this,
  but that's another discussion.

- I took the IR out of opt (after running my feature) and created an
llc-style test out of it to check the generated asm. The checks are
the same as in the original C end-to-end test.

So the tests are checking at each stage that the expected input is
generating the expected output and the end-to-end test checks that we go
from source to asm correctly.

These are all really small tests, easily runnable as part of the normal
"make check" process.

-David

dblaikie · October 8, 2019, 8:05pm

David Blaikie via Openmp-dev <openmp-dev@lists.llvm.org> writes:

I have a bit of concern about this sort of thing - worrying it’ll lead to
people being less cautious about writing the more isolated tests.

That’s a fair concern. Reviewers will still need to insist on small
component-level tests to go along with patches. We don’t have to
sacrifice one to get the other.

Dunno if they need a new place or should just be more stuff in test-suite,
though.

There are at least two problems I see with using test-suite for this:

It is a separate repository and thus is not as convenient as tests
that live with the code. One cannot commit an end-to-end test
atomically with the change meant to be tested.

It is full of large codes which is not the kind of testing I’m talking
about.

Oh, right - I’d forgotten that the test-suite wasn’t part of the monorepo (due to size, I can understand why) - fair enough. Makes sense to me to have lit-style lightweight, targeted, but intentionally end-to-end tests.

mehdi_amini · October 9, 2019, 2:14am

I have a bit of concern about this sort of thing - worrying it’ll lead to people being less cautious about writing the more isolated tests.

I have the same concern. I really believe we need to be careful about testing at the right granularity to keep things both modular and the testing maintainable (for instance checking vectorized ASM from a C++ source through clang has always been considered a bad FileCheck practice).
(Not saying that there is no space for better integration testing in some areas).

That said, clearly there’s value in end-to-end testing for all the reasons you’ve mentioned (& we do see these problems in practice - recently DWARF indexing broke when support for more nuanced language codes were added to Clang).

Dunno if they need a new place or should just be more stuff in test-suite, though.

[ I am initially copying only a few lists since they seem like
the most impacted projects and I didn’t want to spam all the mailing
lists. Please let me know if other lists should be included. ]

I submitted D68230 for review but this is not about that patch per se.
The patch allows update_cc_test_checks.py to process tests that should
check target asm rather than LLVM IR. We use this facility downstream
for our end-to-end tests. It strikes me that it might be useful for
upstream to do similar end-to-end testing.

Now that the monorepo is about to become the canonical source of truth,
we have an opportunity for convenient end-to-end testing that we didn’t
easily have before with svn (yes, it could be done but in an ugly way).
AFAIK the only upstream end-to-end testing we have is in test-suite and
many of those codes are very large and/or unfocused tests.

With the monorepo we have a place to put lit-style tests that exercise
multiple subprojects, for example tests that ensure the entire clang
compilation pipeline executes correctly.

I don’t think I agree with the relationship to the monorepo: there was nothing that prevented tests inside the clang project to exercise the full pipeline already. I don’t believe that the SVN repo structure was really a factor in the way the testing was setup, but instead it was a deliberate choice in the way we structure our testing.
For instance I remember asking about implementing test based on checking if some loops written in C source file were properly vectorized by the -O2 / -O3 pipeline and it was deemed like the kind of test that we don’t want to maintain: instead I was pointed at the test-suite to add better benchmarks there for the end-to-end story. What is interesting is that the test-suite is not gonna be part of the monorepo!

To be clear: I’m not saying here we can’t change our way of testing, I just don’t think the monorepo has anything to do with it and that it should carefully motivated and scoped into what belongs/doesn’t belong to integration tests.

We could, for example, create
a top-level “test” directory and put end-to-end tests there. Some of
the things that could be tested include:

Pipeline execution (debug-pass=Executions)

Optimization warnings/messages

Specific asm code sequences out of clang (e.g. ensure certain loops
are vectorized)

Pragma effects (e.g. ensure loop optimizations are honored)

Complete end-to-end PGO (generate a profile and re-compile)

GPU/accelerator offloading

Debuggability of clang-generated code

Each of these things is tested to some degree within their own
subprojects, but AFAIK there are currently no dedicated tests ensuring
such things work through the entire clang pipeline flow and with other
tools that make use of the results (debuggers, etc.). It is relatively
easy to break the pipeline while the individual subproject tests
continue to pass.

I’m not sure I really see much in your list that isn’t purely about testing clang itself here?
Actually the first one seems more of a pure LLVM test.

I realize that some folks prefer to work on only a portion of the
monorepo (for example, they just hack on LLVM). I am not sure how to
address those developers WRT end-to-end testing. On the one hand,
requiring them to run end-to-end testing means they will have to at
least check out and build the monorepo. On the other hand, it seems
less than ideal to have people developing core infrastructure and not
running tests.

I think we already expect LLVM developers to update clang APIs? And we revert LLVM patches when clang testing is broken. So I believe the acknowledgment to maintain the other in-tree projects isn’t really new, it is true that the monorepo will make this easy for everyone to reproduce locally most failure, and find all the use of an API across projects (which was provided as a motivation to move to a monorepo model: https://llvm.org/docs/Proposals/GitHubMove.html#monorepo ).

David_A_Greene · October 9, 2019, 3:12pm

Mehdi AMINI via cfe-dev <cfe-dev@lists.llvm.org> writes:

I have a bit of concern about this sort of thing - worrying it'll lead to
people being less cautious about writing the more isolated tests.

I have the same concern. I really believe we need to be careful about
testing at the right granularity to keep things both modular and the
testing maintainable (for instance checking vectorized ASM from a C++
source through clang has always been considered a bad FileCheck practice).
(Not saying that there is no space for better integration testing in some
areas).

I absolutely disagree about vectorization tests. We have seen
vectorization loss in clang even though related LLVM lit tests pass,
because something else in the clang pipeline changed that caused the
vectorizer to not do its job. We need both kinds of tests. There are
many asm tests of value beyond vectorization and they should include
component and well as end-to-end tests.

For instance I remember asking about implementing test based on checking if
some loops written in C source file were properly vectorized by the -O2 /
-O3 pipeline and it was deemed like the kind of test that we don't want to
maintain: instead I was pointed at the test-suite to add better benchmarks
there for the end-to-end story. What is interesting is that the test-suite
is not gonna be part of the monorepo!

And it shouldn't be. It's much too big. But there is a place for small
end-to-end tests that live alongside the code.

We could, for example, create
a top-level "test" directory and put end-to-end tests there. Some of
the things that could be tested include:

- Pipeline execution (debug-pass=Executions)

- Optimization warnings/messages
- Specific asm code sequences out of clang (e.g. ensure certain loops
are vectorized)
- Pragma effects (e.g. ensure loop optimizations are honored)
- Complete end-to-end PGO (generate a profile and re-compile)
- GPU/accelerator offloading
- Debuggability of clang-generated code

Each of these things is tested to some degree within their own
subprojects, but AFAIK there are currently no dedicated tests ensuring
such things work through the entire clang pipeline flow and with other
tools that make use of the results (debuggers, etc.). It is relatively
easy to break the pipeline while the individual subproject tests
continue to pass.

I'm not sure I really see much in your list that isn't purely about testing
clang itself here?

Debugging and PGO involve other components, no? If we want to put clang
end-to-end tests in the clang subdirectory, that's fine with me. But we
need a place for tests that cut across components.

I could also imagine llvm-mca end-to-end tests through clang.

Actually the first one seems more of a pure LLVM test.

Definitely not. It would test the pipeline as constructed by clang,
which is very different from the default pipeline constructed by
opt/llc. The old and new pass managers also construct different
pipelines. As we have seen with various mailing list messages, this is
surprising to users. Best to document and check it with testing.

-David

mehdi_amini · October 9, 2019, 7:38pm

Mehdi AMINI via cfe-dev <cfe-dev@lists.llvm.org> writes:

I have a bit of concern about this sort of thing - worrying it’ll lead to
people being less cautious about writing the more isolated tests.

I have the same concern. I really believe we need to be careful about
testing at the right granularity to keep things both modular and the
testing maintainable (for instance checking vectorized ASM from a C++
source through clang has always been considered a bad FileCheck practice).
(Not saying that there is no space for better integration testing in some
areas).

I absolutely disagree about vectorization tests. We have seen
vectorization loss in clang even though related LLVM lit tests pass,
because something else in the clang pipeline changed that caused the
vectorizer to not do its job.

Of course, and as I mentioned I tried to add these tests (probably 4 or 5 years ago), but someone (I think Chandler?) was asking me at the time: does it affect a benchmark performance? If so why isn’t it tracked there? And if not does it matter?
The benchmark was presented as the actual way to check this invariant (because you’re only vectoring to get performance, not for the sake of it).
So I never pursued, even if I’m a bit puzzled that we don’t have such tests.

We need both kinds of tests. There are
many asm tests of value beyond vectorization and they should include
component and well as end-to-end tests.

For instance I remember asking about implementing test based on checking if
some loops written in C source file were properly vectorized by the -O2 /
-O3 pipeline and it was deemed like the kind of test that we don’t want to
maintain: instead I was pointed at the test-suite to add better benchmarks
there for the end-to-end story. What is interesting is that the test-suite
is not gonna be part of the monorepo!

And it shouldn’t be. It’s much too big. But there is a place for small
end-to-end tests that live alongside the code.

We could, for example, create
a top-level “test” directory and put end-to-end tests there. Some of
the things that could be tested include:

Pipeline execution (debug-pass=Executions)

Optimization warnings/messages

Specific asm code sequences out of clang (e.g. ensure certain loops
are vectorized)

Pragma effects (e.g. ensure loop optimizations are honored)

Complete end-to-end PGO (generate a profile and re-compile)

GPU/accelerator offloading

Debuggability of clang-generated code

Each of these things is tested to some degree within their own
subprojects, but AFAIK there are currently no dedicated tests ensuring
such things work through the entire clang pipeline flow and with other
tools that make use of the results (debuggers, etc.). It is relatively
easy to break the pipeline while the individual subproject tests
continue to pass.

I’m not sure I really see much in your list that isn’t purely about testing
clang itself here?

Debugging and PGO involve other components, no?

I don’t think that you need anything else than LLVM core (which is a dependency of clang) itself?

Things like PGO (unless you’re using frontend instrumentation) don’t even have anything to do with clang, so we may get into the situation David mentioned where we would rely on clang to test LLVM features, which I find non-desirable.

If we want to put clang
end-to-end tests in the clang subdirectory, that’s fine with me. But we
need a place for tests that cut across components.

I could also imagine llvm-mca end-to-end tests through clang.

Actually the first one seems more of a pure LLVM test.

Definitely not. It would test the pipeline as constructed by clang,
which is very different from the default pipeline constructed by
opt/llc.

I am not convinced it is “very” difference (they are using the PassManagerBuilder AFAIK), I am only aware of very subtle difference.
But more fundamentally: should they be different? I would want opt -O3 to be able to reproduce 1-1 the clang pipeline.
Isn’t it the role of LLVM PassManagerBuilder to expose what is the “-O3” pipeline?
If we see the PassManagerBuilder as the abstraction for the pipeline, then I don’t see what testing belongs to clang here, this seems like a layering violation (and maintaining the PassManagerBuilder in LLVM I wouldn’t want to have to update the tests of all the subproject using it because they retest the same feature).

The old and new pass managers also construct different
pipelines. As we have seen with various mailing list messages, this is
surprising to users. Best to document and check it with testing.

Yes: both old and new pass managers are LLVM components, so hopefully that are documented and tested in LLVM

David_A_Greene · October 9, 2019, 9:31pm

Mehdi AMINI via llvm-dev <llvm-dev@lists.llvm.org> writes:

I absolutely disagree about vectorization tests. We have seen
vectorization loss in clang even though related LLVM lit tests pass,
because something else in the clang pipeline changed that caused the
vectorizer to not do its job.

Of course, and as I mentioned I tried to add these tests (probably 4 or 5
years ago), but someone (I think Chandler?) was asking me at the time: does
it affect a benchmark performance? If so why isn't it tracked there? And if
not does it matter?
The benchmark was presented as the actual way to check this invariant
(because you're only vectoring to get performance, not for the sake of it).
So I never pursued, even if I'm a bit puzzled that we don't have such tests.

Thanks for explaining.

Our experience is that relying solely on performance tests to uncover
such issues is problematic for several reasons:

- Performance varies from implementation to implementation. It is
  difficult to keep tests up-to-date for all possible targets and
  subtargets.

- Partially as a result, but also for other reasons, performance tests
  tend to be complicated, either in code size or in the numerous code
  paths tested. This makes such tests hard to debug when there is a
  regression.

- Performance tests don't focus on the why/how of vectorization. They
  just check, "did it run fast enough?" Maybe the test ran fast enough
  for some other reason but we still lost desired vectorization and
  could have run even faster.

With a small asm test one can documented why vectorization is desired
and how it comes about right in the test. Then when it breaks it's
usually pretty obvious what the problem is.

They don't replace performance tests, they complement each other, just
as end-to-end and component tests complement each other.

Debugging and PGO involve other components, no?

I don't think that you need anything else than LLVM core (which is a
dependency of clang) itself?

What about testing that what clang produces is debuggable with lldb?
debuginfo tests do that now but AFAIK they are not end-to-end.

Things like PGO (unless you're using frontend instrumentation) don't even
have anything to do with clang, so we may get into the situation David
mentioned where we would rely on clang to test LLVM features, which I find
non-desirable.

We would still expect component-level tests. This would be additional
end-to-end testing, not replacing existing testing methodology. I agree
the concern is valid but shouldn't code review discover such problems?

> Actually the first one seems more of a pure LLVM test.

Definitely not. It would test the pipeline as constructed by clang,
which is very different from the default pipeline constructed by
opt/llc.

I am not convinced it is "very" difference (they are using the
PassManagerBuilder AFAIK), I am only aware of very subtle difference.

I don't think clang exclusively uses PassManagerBuilder but it's been a
while since I looked at that code.

But more fundamentally: *should* they be different? I would want `opt -O3`
to be able to reproduce 1-1 the clang pipeline.

Which clang pipeline? -O3? -Ofast? opt currently can't do -Ofast. I
don't *think* -Ofast affects the pipeline itself but I am not 100%
certain.

Isn't it the role of LLVM PassManagerBuilder to expose what is the "-O3"
pipeline?

Ideally, yes. In practice, it's not.

If we see the PassManagerBuilder as the abstraction for the pipeline, then
I don't see what testing belongs to clang here, this seems like a layering
violation (and maintaining the PassManagerBuilder in LLVM I wouldn't want
to have to update the tests of all the subproject using it because they
retest the same feature).

If nothing else, end-to-end testing of the pipeline would uncover
layering violations.

The old and new pass managers also construct different
pipelines. As we have seen with various mailing list messages, this is
surprising to users. Best to document and check it with testing.

Yes: both old and new pass managers are LLVM components, so hopefully that
are documented and tested in LLVM

But we have nothing to guarantee that what clang does matches what opt
does. Currently they do different things.

-David

preames · October 9, 2019, 11:35pm

The two major concerns I see are a potential decay in component test quality, and an increase in difficulty changing components. The former has already been discussed a bit downstream, so let me focus on the later.

A challenge we already have - as in, I've broken these tests and had to fix them - is that an end to end test which checks either IR or assembly ends up being extraordinarily fragile. Completely unrelated profitable transforms create small differences which cause spurious test failures. This is a very real issue today with the few end-to-end clang tests we have, and I am extremely hesitant to expand those tests without giving this workflow problem serious thought. If we don't, this could bring development on middle end transforms to a complete stop. (Not kidding.)

A couple of approaches we could consider:

1. Simply restrict end to end tests to crash/assert cases. (i.e. no
    property of the generated code is checked, other than that it is
    generated) This isn't as restrictive as it sounds when combined
    w/coverage guided fuzzer corpuses.
2. Auto-update all diffs, but report them to a human user for
    inspection. This ends up meaning that tests never "fail" per se,
    but that individuals who have expressed interest in particular tests
    get an automated notification and a chance to respond on list with a
    reduced example.
3. As a variant on the former, don't auto-update tests, but only inform
    the *contributor* of an end-to-end test of a failure. Responsibility
    for determining failure vs false positive lies solely with them, and
    normal channels are used to report a failure after it has been
    confirmed/analyzed/explained.

I really think this is a problem we need to have thought through and found a workable solution before end-to-end testing as proposed becomes a practically workable option.

Philip

mehdi_amini · October 9, 2019, 11:43pm

Mehdi AMINI via llvm-dev <llvm-dev@lists.llvm.org> writes:

I absolutely disagree about vectorization tests. We have seen
vectorization loss in clang even though related LLVM lit tests pass,
because something else in the clang pipeline changed that caused the
vectorizer to not do its job.

Of course, and as I mentioned I tried to add these tests (probably 4 or 5
years ago), but someone (I think Chandler?) was asking me at the time: does
it affect a benchmark performance? If so why isn’t it tracked there? And if
not does it matter?
The benchmark was presented as the actual way to check this invariant
(because you’re only vectoring to get performance, not for the sake of it).
So I never pursued, even if I’m a bit puzzled that we don’t have such tests.

Thanks for explaining.

Our experience is that relying solely on performance tests to uncover
such issues is problematic for several reasons:

Performance varies from implementation to implementation. It is
difficult to keep tests up-to-date for all possible targets and
subtargets.

Partially as a result, but also for other reasons, performance tests
tend to be complicated, either in code size or in the numerous code
paths tested. This makes such tests hard to debug when there is a
regression.

Performance tests don’t focus on the why/how of vectorization. They
just check, “did it run fast enough?” Maybe the test ran fast enough
for some other reason but we still lost desired vectorization and
could have run even faster.

With a small asm test one can documented why vectorization is desired
and how it comes about right in the test. Then when it breaks it’s
usually pretty obvious what the problem is.

They don’t replace performance tests, they complement each other, just
as end-to-end and component tests complement each other.

Debugging and PGO involve other components, no?

I don’t think that you need anything else than LLVM core (which is a
dependency of clang) itself?

What about testing that what clang produces is debuggable with lldb?
debuginfo tests do that now but AFAIK they are not end-to-end.

Things like PGO (unless you’re using frontend instrumentation) don’t even
have anything to do with clang, so we may get into the situation David
mentioned where we would rely on clang to test LLVM features, which I find
non-desirable.

We would still expect component-level tests. This would be additional
end-to-end testing, not replacing existing testing methodology. I agree
the concern is valid but shouldn’t code review discover such problems?

Yes I agree, this concern is not a blocker for doing end-to-end testing, but more a “we will need to be careful about scoping the role of the end-to-end testing versus component level testing”.

Actually the first one seems more of a pure LLVM test.

Definitely not. It would test the pipeline as constructed by clang,
which is very different from the default pipeline constructed by
opt/llc.

I am not convinced it is “very” difference (they are using the
PassManagerBuilder AFAIK), I am only aware of very subtle difference.

I don’t think clang exclusively uses PassManagerBuilder but it’s been a
while since I looked at that code.

Here is the code: https://github.com/llvm/llvm-project/blob/master/clang/lib/CodeGen/BackendUtil.cpp#L545

All the extension point where passes are hooked in are likely things where the pipeline would differ from LLVM.

But more fundamentally: should they be different? I would want opt -O3
to be able to reproduce 1-1 the clang pipeline.

Which clang pipeline? -O3? -Ofast? opt currently can’t do -Ofast. I
don’t think -Ofast affects the pipeline itself but I am not 100%
certain.

If -Ofast affects the pipeline, I would expose it on the PassManagerBuilder I think.

Isn’t it the role of LLVM PassManagerBuilder to expose what is the “-O3”
pipeline?

Ideally, yes. In practice, it’s not.

If we see the PassManagerBuilder as the abstraction for the pipeline, then
I don’t see what testing belongs to clang here, this seems like a layering
violation (and maintaining the PassManagerBuilder in LLVM I wouldn’t want
to have to update the tests of all the subproject using it because they
retest the same feature).

If nothing else, end-to-end testing of the pipeline would uncover
layering violations.

The old and new pass managers also construct different
pipelines. As we have seen with various mailing list messages, this is
surprising to users. Best to document and check it with testing.

Yes: both old and new pass managers are LLVM components, so hopefully that
are documented and tested in LLVM

But we have nothing to guarantee that what clang does matches what opt
does. Currently they do different things.

My point is that this should be guaranteed by refactoring and using the right APIs, not duplicate the testing. But I can also misunderstand what it is that you would test on the clang side. For instance I wouldn’t want to duplicate testing the O3 pass pipeline which is covered here: https://github.com/llvm/llvm-project/blob/master/llvm/test/Other/opt-O3-pipeline.ll
But testing that a specific pass is added with respect to a particular clang option is fair, and actually this is already what we do I believe, like here: https://github.com/llvm/llvm-project/blob/master/clang/test/CodeGen/thinlto-debug-pm.c#L11-L14

I don’t think these particular tests are the most controversial though, and it is really still fairly “focused” testing. I’m much more curious about larger end-to-end scope: for instance since you mention debug info and LLDB, what about a test that would verify that LLDB can print a particular variable content from a test that would come as a source program for instance. Such test are valuable in the absolute, it isn’t clear to me that we could in practice block any commit that would break such test though: this is because a bug fix or an improvement in one of the pass may be perfectly correct in isolation but make the test fail by exposing a bug where we are already losing some debug info precision in a totally unrelated part of the codebase.
I wonder how you see this managed in practice: would you gate any change on InstCombine (or other mid-level pass) on not regressing any of the debug-info quality test on any of the backend, and from any frontend (not only clang)? Or worse: a middle-end change that would end-up with a slightly different Dwarf construct on this particular test, which would trip LLDB but not GDB (basically expose a bug in LLDB). Should we require the contributor of inst-combine to debug LLDB and fix it first?

Best,

David_A_Greene · October 10, 2019, 1:16am

Mehdi AMINI via cfe-dev <cfe-dev@lists.llvm.org> writes:

I don't think these particular tests are the most controversial though, and
it is really still fairly "focused" testing. I'm much more curious about
larger end-to-end scope: for instance since you mention debug info and
LLDB, what about a test that would verify that LLDB can print a particular
variable content from a test that would come as a source program for
instance. Such test are valuable in the absolute, it isn't clear to me that
we could in practice block any commit that would break such test though:
this is because a bug fix or an improvement in one of the pass may be
perfectly correct in isolation but make the test fail by exposing a bug
where we are already losing some debug info precision in a totally
unrelated part of the codebase.
I wonder how you see this managed in practice: would you gate any change on
InstCombine (or other mid-level pass) on not regressing any of the
debug-info quality test on any of the backend, and from any frontend (not
only clang)? Or worse: a middle-end change that would end-up with a
slightly different Dwarf construct on this particular test, which would
trip LLDB but not GDB (basically expose a bug in LLDB). Should we require
the contributor of inst-combine to debug LLDB and fix it first?

Good questions! I think for situations like this I would tend toward
allowing the change and the test would alert us that something else is
wrong. At that point it's probably a case-by-case decision. Maybe we
XFAIL the test. Maybe the fix is easy enough that we just do it and the
test starts passing again. What's the policy for breaking current tests
when the change itself is fine but exposes a problem elsewhere (adding
an assert, for example)?

-David

David_A_Greene · October 10, 2019, 1:25am

Philip Reames via cfe-dev <cfe-dev@lists.llvm.org> writes:

A challenge we already have - as in, I've broken these tests and had to
fix them - is that an end to end test which checks either IR or assembly
ends up being extraordinarily fragile. Completely unrelated profitable
transforms create small differences which cause spurious test failures.
This is a very real issue today with the few end-to-end clang tests we
have, and I am extremely hesitant to expand those tests without giving
this workflow problem serious thought. If we don't, this could bring
development on middle end transforms to a complete stop. (Not kidding.)

Do you have a pointer to these tests? We literally have tens of
thousands of end-to-end tests downstream and while some are fragile, the
vast majority are not. A test that, for example, checks the entire
generated asm for a match is indeed very fragile. A test that checks
whether a specific instruction/mnemonic was emitted is generally not, at
least in my experience. End-to-end tests require some care in
construction. I don't think update_llc_test_checks.py-type operation is
desirable.

Still, you raise a valid point and I think present some good options
below.

A couple of approaches we could consider:

1. Simply restrict end to end tests to crash/assert cases. (i.e. no
    property of the generated code is checked, other than that it is
    generated) This isn't as restrictive as it sounds when combined
    w/coverage guided fuzzer corpuses.

I would be pretty hesitant to do this but I'd like to hear more about
how you see this working with coverage/fuzzing.

2. Auto-update all diffs, but report them to a human user for
    inspection. This ends up meaning that tests never "fail" per se,
    but that individuals who have expressed interest in particular tests
    get an automated notification and a chance to respond on list with a
    reduced example.

That's certainly workable.

3. As a variant on the former, don't auto-update tests, but only inform
    the *contributor* of an end-to-end test of a failure. Responsibility
    for determining failure vs false positive lies solely with them, and
    normal channels are used to report a failure after it has been
    confirmed/analyzed/explained.

I think I like this best of the three but it raises the question of what
happens when the contributor is no longer contributing. Who's
responsible for the test? Maybe it just sits there until someone else
claims it.

I really think this is a problem we need to have thought through and
found a workable solution before end-to-end testing as proposed becomes
a practically workable option.

Noted. I'm very happy to have this discussion and work the problem.

-David

fhahn · October 10, 2019, 9:34am

Hi David,

Thanks for kicking off a discussion on this topic!

Mehdi AMINI via llvm-dev <llvm-dev@lists.llvm.org> writes:

I absolutely disagree about vectorization tests. We have seen
vectorization loss in clang even though related LLVM lit tests pass,
because something else in the clang pipeline changed that caused the
vectorizer to not do its job.

Of course, and as I mentioned I tried to add these tests (probably 4 or 5
years ago), but someone (I think Chandler?) was asking me at the time: does
it affect a benchmark performance? If so why isn't it tracked there? And if
not does it matter?
The benchmark was presented as the actual way to check this invariant
(because you're only vectoring to get performance, not for the sake of it).
So I never pursued, even if I'm a bit puzzled that we don't have such tests.

Thanks for explaining.

Our experience is that relying solely on performance tests to uncover
such issues is problematic for several reasons:

- Performance varies from implementation to implementation. It is
difficult to keep tests up-to-date for all possible targets and
subtargets.

Could you expand a bit more what you mean here? Are you concerned about having to run the performance tests on different kinds of hardware? In what way do the existing benchmarks require keeping up-to-date?

With tests checking ASM, wouldn’t we end up with lots of checks for various targets/subtargets that we need to keep up to date? Just considering AArch64 as an example, people might want to check the ASM for different architecture versions and different vector extensions and different vendors might want to make sure that the ASM on their specific cores does not regress.

- Partially as a result, but also for other reasons, performance tests
tend to be complicated, either in code size or in the numerous code
paths tested. This makes such tests hard to debug when there is a
regression.

I am not sure they have to. Have you considered adding the small test functions/loops as micro-benchmarks using the existing google benchmark infrastructure in test-suite?

I think that might be able to address the points here relatively adequately. The separate micro benchmarks would be relatively small and we should be able to track down regressions in a similar fashion as if it would be a stand-alone file we compile and then analyze the ASM. Plus, we can easily run it and verify the performance on actual hardware.

- Performance tests don't focus on the why/how of vectorization. They
just check, "did it run fast enough?" Maybe the test ran fast enough
for some other reason but we still lost desired vectorization and
could have run even faster.

If you would add a new micro-benchmark, you could check that it produces the desired result when adding it. The runtime-tracking should cover cases where we lost optimizations. I guess if the benchmarks are too big, additional optimizations in one part could hide lost optimizations somewhere else. But I would assume this to be relatively unlikely, as long as the benchmarks are isolated.

Also, checking the assembly for vector code does also not guarantee that the vector code will be actually executed. So for example by just checking the assembly for certain vector instructions, we might miss that we regressed performance, because we messed up the runtime checks guarding the vector loop.

Cheers,
Florian

fhahn · October 10, 2019, 9:55am

Have you considered alternatives to checking the assembly for ensuring vectorization or other transformations? For example, instead of checking the assembly, we could check LLVM’s statistics or optimization remarks. If you want to ensure a loop got vectorized, you could check the loop-vectorize remarks, which should give you the position of the loop in the source and vectorization/interleave factor used. There are few other things that could go wrong later on that would prevent vector instruction selection, but I think it should be sufficient to guard against most cases where we loose vectorization and should be much more robust to unrelated changes. If there are additional properties you want to ensure, they potentially could be added to the remark as well.

This idea of leveraging statistics and optimization remarks to track the impact of changes on overall optimization results is nothing new and I think several people already discussed it in various forms. For regular benchmark runs, in addition to tracking the existing benchmarks, we could also track selected optimization remarks (e.g. loop-vectorize, but not necessarily noisy ones like gvn) and statistics. Comparing those run-to-run could potentially highlight new end-to-end issues on a much larger scale, across all existing benchmarks integrated in test-suite. We might be able to detect loss in vectorization pro-actively, instead of requiring someone to file a bug report and then we add an isolated test after the fact.

But building something like this would be much more work of course….

Cheers,
Florian

Renato_Golin2 · October 10, 2019, 10:39am

We used to have lots of them, at least in the initial implementation
of the loop vectoriser (I know, many years ago).

The thread has enough points not to repeat here, but I guess the main
point is to make sure we don't duplicate tests, increasing CI cost
(especially on slower hardware).

I'd recommend trying to move any e2e tests into the test-suite and
make it easier to run, and leave specific tests only in the repo (to
guarantee independence of components).

The last thing we want is to create direct paths from front-ends to
back-ends and make LLVM IR transformation less flexible.

cheers,
--renato

pogo59 · October 10, 2019, 3:01pm

For debug info in particular, we already have the debuginfo-tests project,
which is separate because it requires executing the test program; this is
something the clang/llvm test suites specifically do NOT require. There
is of course also the LLDB test suite, which I believe can be configured
to use the just-built clang to compile its test programs.

Regarding breakage policy, it's just like anything else: do what's needed
to make the bots happy. What exactly that means will depend on the exact
situation. I can cite a small patch that was held off for a ridiculously
long time, like around a year, because Chromium had some environmental
problem that they were slow to address. That wasn't even an LLVM bot!
But eventually it got sorted out and our patch went in.

My point here is, that kind of thing happens already, adding a new e2e
test project won't inherently change any policy or how the community
responds to breakage.
--paulr

pogo59 · October 10, 2019, 3:46pm

David Greene, will you be at the LLVM Dev Meeting? If so, could you sign
up for a Round Table session on this topic? Obviously lots to discuss
and concerns to be addressed.

In particular I think there are two broad categories of tests that would
have to be segregated just by the nature of their requirements:

(1) Executable tests. These obviously require an execution platform; for
feasibility reasons this means host==target and the guarantee of having
a linker (possibly but not necessarily LLD) and a runtime (possibly but
not necessarily including libcxx). Note that the LLDB tests and the
debuginfo-tests project already have this kind of dependency, and in the
case of debuginfo-tests, this is exactly why it's a separate project.

(2) Non-executable tests. These are near-identical in character to the
existing clang/llvm test suites and I'd expect lit to drive them. The
only material difference from the majority(*) of existing clang tests is
that they are free to depend on LLVM features/passes. The only difference
from the majority of existing LLVM tests is that they have [Obj]{C,C++} as
their input source language.
(*) I've encountered clang tests that I feel depend on too much within LLVM,
and it's common for new contributors to provide a C/C++ test that needs to
be converted to a .ll test. Some of them go in anyway.

More comments/notes below.

From: lldb-dev <lldb-dev-bounces@lists.llvm.org> On Behalf Of David Greene
via lldb-dev
Sent: Wednesday, October 09, 2019 9:25 PM
To: Philip Reames <listmail@philipreames.com>; llvm-dev@lists.llvm.org;
cfe-dev@lists.llvm.org; openmp-dev@lists.llvm.org; lldb-dev@lists.llvm.org
Subject: Re: [lldb-dev] [cfe-dev] [llvm-dev] RFC: End-to-end testing

Philip Reames via cfe-dev <cfe-dev@lists.llvm.org> writes:

> A challenge we already have - as in, I've broken these tests and had to
> fix them - is that an end to end test which checks either IR or assembly
> ends up being extraordinarily fragile. Completely unrelated profitable
> transforms create small differences which cause spurious test failures.
> This is a very real issue today with the few end-to-end clang tests we
> have, and I am extremely hesitant to expand those tests without giving
> this workflow problem serious thought. If we don't, this could bring
> development on middle end transforms to a complete stop. (Not kidding.)

Do you have a pointer to these tests? We literally have tens of
thousands of end-to-end tests downstream and while some are fragile, the
vast majority are not. A test that, for example, checks the entire
generated asm for a match is indeed very fragile. A test that checks
whether a specific instruction/mnemonic was emitted is generally not, at
least in my experience. End-to-end tests require some care in
construction. I don't think update_llc_test_checks.py-type operation is
desirable.

Sony likewise has a rather large corpus of end-to-end tests. I expect any
vendor would. When they break, we fix them or report/fix the compiler bug.
It has not been an intolerable burden on us, and I daresay if it were at
all feasible to put these upstream, it would not be an intolerable burden
on the community. (It's not feasible because host!=target and we'd need
to provide test kits to the community and our remote-execution tools. We'd
rather just run them internally.)

Philip, what I'm actually hearing from your statement is along the lines,
"Our end-to-end tests are really fragile, therefore any end-to-end test
will be fragile, and that will be an intolerable burden."

That's an understandable reaction, but I think the community literally
would not tolerate too-fragile tests. Tests that are too fragile will
be made more robust or removed. This has been community practice for a
long time. There's even an entire category of "noisy bots" that certain
people take care of and don't bother the rest of the community. The
LLVM Project as a whole would not tolerate a test suite that "could
bring development ... to a complete stop" and I hope we can ease your
concerns.

More comments/notes/opinions below.

Still, you raise a valid point and I think present some good options
below.

> A couple of approaches we could consider:
>
> 1. Simply restrict end to end tests to crash/assert cases. (i.e. no
> property of the generated code is checked, other than that it is
> generated) This isn't as restrictive as it sounds when combined
> w/coverage guided fuzzer corpuses.

I would be pretty hesitant to do this but I'd like to hear more about
how you see this working with coverage/fuzzing.

I think this is way too restrictive.

> 2. Auto-update all diffs, but report them to a human user for
> inspection. This ends up meaning that tests never "fail" per se,
> but that individuals who have expressed interest in particular tests
> get an automated notification and a chance to respond on list with a
> reduced example.

That's certainly workable.

This is not different in principle from the "noisy bot" category, and if
it's a significant concern, the e2e tests can start out in that category.
Experience will tell us whether they are inherently fragile. I would not
want to auto-update tests.

> 3. As a variant on the former, don't auto-update tests, but only inform
> the *contributor* of an end-to-end test of a failure. Responsibility
> for determining failure vs false positive lies solely with them, and
> normal channels are used to report a failure after it has been
> confirmed/analyzed/explained.

I think I like this best of the three but it raises the question of what
happens when the contributor is no longer contributing. Who's
responsible for the test? Maybe it just sits there until someone else
claims it.

This is *exactly* the "noisy bot" tactic, and bots are supposed to have
owners who are active.

David_A_Greene · October 10, 2019, 9:21pm

Florian Hahn via llvm-dev <llvm-dev@lists.llvm.org> writes:

- Performance varies from implementation to implementation. It is
difficult to keep tests up-to-date for all possible targets and
subtargets.

Could you expand a bit more what you mean here? Are you concerned
about having to run the performance tests on different kinds of
hardware? In what way do the existing benchmarks require keeping
up-to-date?

We have to support many different systems and those systems are always
changing (new processors, new BIOS, new OS, etc.). Performance can vary
widely day to day from factors completely outside the compiler's
control. As the performance changes you have to keep updating the tests
to expect the new performance numbers. Relying on performance
measurements to ensure something like vectorization is happening just
isn't reliable in our experience.

With tests checking ASM, wouldn’t we end up with lots of checks for
various targets/subtargets that we need to keep up to date?

Yes, that's true. But the only thing that changes the asm generated is
the compiler.

Just considering AArch64 as an example, people might want to check the
ASM for different architecture versions and different vector
extensions and different vendors might want to make sure that the ASM
on their specific cores does not regress.

Absolutely. We do a lot of that sort of thing downstream.

- Partially as a result, but also for other reasons, performance tests
tend to be complicated, either in code size or in the numerous code
paths tested. This makes such tests hard to debug when there is a
regression.

I am not sure they have to. Have you considered adding the small test
functions/loops as micro-benchmarks using the existing google
benchmark infrastructure in test-suite?

We have tried nightly performance runs using LNT/test-suite and have
found it to be very unreliable, especially the microbenchmarks.

I think that might be able to address the points here relatively
adequately. The separate micro benchmarks would be relatively small
and we should be able to track down regressions in a similar fashion
as if it would be a stand-alone file we compile and then analyze the
ASM. Plus, we can easily run it and verify the performance on actual
hardware.

A few of my colleagues really struggled to get consistent results out of
LNT. They asked for help and discussed with a few upstream folks, but
in the end were not able to get something reliable working. I've talked
to a couple of other people off-list and they've had similar
experiences. It would be great if we have a reliable performance suite.
Please tell us how to get it working!

But even then, I still maintain there is a place for the kind of
end-to-end testing I describe. Performance testing would complement it.
Neither is a replacement for the other.

- Performance tests don't focus on the why/how of vectorization. They
just check, "did it run fast enough?" Maybe the test ran fast enough
for some other reason but we still lost desired vectorization and
could have run even faster.

If you would add a new micro-benchmark, you could check that it
produces the desired result when adding it. The runtime-tracking
should cover cases where we lost optimizations. I guess if the
benchmarks are too big, additional optimizations in one part could
hide lost optimizations somewhere else. But I would assume this to be
relatively unlikely, as long as the benchmarks are isolated.

Even then I have seen small performance tests vary widely in performance
due to system issues (see above). Again, there is a place for them but
they are not sufficient.

Also, checking the assembly for vector code does also not guarantee
that the vector code will be actually executed. So for example by just
checking the assembly for certain vector instructions, we might miss
that we regressed performance, because we messed up the runtime checks
guarding the vector loop.

Oh absolutely. Presumably such checks would be included in the test or
would be checked by a different test. As always, tests have to be
constructed intelligently.

-David

David_A_Greene · October 10, 2019, 9:24pm

Florian Hahn via cfe-dev <cfe-dev@lists.llvm.org> writes:

Have you considered alternatives to checking the assembly for ensuring
vectorization or other transformations? For example, instead of
checking the assembly, we could check LLVM’s statistics or
optimization remarks.

Yes, absolutely. We have tests that do things like that. I don't want
to focus on the asm bit, that's just one type of test. The larger issue
is end-to-end tests that ensure the compiler and other tools are working
correctly, be it from checking messages, statistics, asm or something
else.

This idea of leveraging statistics and optimization remarks to track
the impact of changes on overall optimization results is nothing new
and I think several people already discussed it in various forms. For
regular benchmark runs, in addition to tracking the existing
benchmarks, we could also track selected optimization remarks
(e.g. loop-vectorize, but not necessarily noisy ones like gvn) and
statistics. Comparing those run-to-run could potentially highlight new
end-to-end issues on a much larger scale, across all existing
benchmarks integrated in test-suite. We might be able to detect loss
in vectorization pro-actively, instead of requiring someone to file a
bug report and then we add an isolated test after the fact.

That's an interesting idea! I would love to get more use out of
test-suite.

-David

Topic		Replies	Views
[RFC] Cross-project lit test suite LLVM Dev List Archives	27	209	May 4, 2021
Testing Best Practices/Goals (in the context of compiler-rt) LLVM Dev List Archives	40	180	March 1, 2016
Proposal for GSoC project for improving llvm-test testsuite LLVM Dev List Archives	11	68	March 25, 2008
Adding ffmpeg in LLVM test suite? Project Infrastructure	23	697	December 19, 2024
Proposal for an ABI testsuite for clang Clang Frontend	25	95	July 22, 2014

RFC: End-to-end testing

Related topics