LLVMdev Digest, Vol 88, Issue 29

I am very interested in seeing a qualification plan for ARM, given that it
is a widely used target with several combinations of options/modes to be
tested. My team and I use ARM hardware for running tests, and we run the
full LLVM test suite as part of our qualification process. I had started a
similar conversation on llvm-commits, but this is probably the right forum.
It will save everyone a lot of time if we can all agree on qualification
tests and options for ARM, and I would be happy to initiate the discussion
and process for this.

--Raja

Hi Raja,

I'm open to suggestions. Our current release qualification is to bootstrap the compiler (similar to how GCC does its bootstrapping), run the test suites, and verify that there are no regressions. Improving the test suite is always welcome. In addition, we send out pre-release tarballs and have people in the community build and test their programs with it. This is not a perfect system, but it's one that works for us given the number of testers available, the amount of time and resources they have, and whatever fixes need to be merged into the release.
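
If it helps, here is roughly the shape of that flow as a script; the
paths, configure flags, and build-directory layout below are illustrative,
not our exact setup:

    import os, subprocess

    def run(cmd, cwd, env=None):
        # Stop the whole qualification run at the first failure.
        subprocess.check_call(cmd, cwd=cwd, env=env)

    # Placeholder layout: ./llvm is the checkout, stage dirs are siblings.
    for d in ("stage1", "stage2"):
        if not os.path.isdir(d):
            os.makedirs(d)

    # Stage 1: build LLVM/Clang with the system compiler.
    run(["../llvm/configure", "--enable-optimized"], cwd="stage1")
    run(["make", "-j8"], cwd="stage1")

    # Stage 2: rebuild with the stage-1 clang; a compiler that
    # miscompiles itself usually falls over here. The bin path varies
    # with the build flavor.
    stage1_bin = os.path.abspath("stage1/Release/bin")
    env = dict(os.environ,
               CC=os.path.join(stage1_bin, "clang"),
               CXX=os.path.join(stage1_bin, "clang++"))
    run(["../llvm/configure", "--enable-optimized"], cwd="stage2", env=env)
    run(["make", "-j8"], cwd="stage2")

    # Finally, run the regression tests; any new failure relative to
    # the previous release is a potential blocker.
    run(["make", "check"], cwd="stage2")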

ARM qualification is a bit trickier, because of the many different chips out there, different OSes, and having to verify ARM, Thumb1, and Thumb2 for the same configurations. And the tests tend to run a bit slower than on, say, an x86 chip. So it's mostly a matter of time and resources. Unless we can get people who are willing to perform these tests, we won't be able to release ARM as an officially supported platform.

-bw

Bill Wendling <wendling@apple.com> writes:

> Improving the test suite is always welcome.

Do we have an idea of what sorts of improvements we'd like? Any codes
that we want to add, for example? What would be useful for ARM?

> In addition, we send out pre-release tarballs and have people in the
> community build and test their programs with it. This is not a perfect
> system, but it's one that works for us given the number of testers
> available, the amount of time and resources they have, and whatever
> fixes need to be merged into the release.

> ARM qualification is a bit trickier, because of the many different
> chips out there, different OSes, and having to verify ARM, Thumb1, and
> Thumb2 for the same configurations. And the tests tend to run a bit
> slower than on, say, an x86 chip. So it's mostly a matter of time and
> resources. Unless we can get people who are willing to perform these
> tests, we won't be able to release ARM as an officially supported
> platform.

Resources aren't the only problem. I've asked several times about adding
my personal machines to the testing pool but I never get a reply. So
there they sit idle every day, processing the occasional e-mail when
they could be chewing LLVM tests.

It is in fact highly in my own interest to get them running. I just
need to be pointed to the document that tells me what the buildbot
master expects to see and defines a procedure for adding them as slaves.
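
For what it's worth, the generic buildbot mechanics look roughly like the
following; what I'm missing is the LLVM-specific part (master host, port,
naming policy), so everything below is made up:

    # Slave side, buildbot 0.8.x command line (run on my machine):
    #   buildslave create-slave ./slave <master-host>:<port> <slave-name> <password>
    #   buildslave start ./slave
    #
    # Master side, in the master.cfg the master admin maintains:
    from buildbot.buildslave import BuildSlave

    c = BuildmasterConfig = {}
    c['slaves'] = [BuildSlave("dave-x86_64-linux", "made-up-password")]
    c['slavePortnum'] = 9990    # illustrative port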

One thing that could help these situations is virtualization.

I've toyed with the idea of setting up various virtual machines to test
various OS/architecture combinations. With QEMU I can imagine testing
various ISAs as well.

If there are any ARM full-system simulators, we could use those as well.
I'd be happy to run them.
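
For instance, with user-mode QEMU something like the following may
already be enough to run individual cross-compiled tests; the target
triple, sysroot path, and clang flag spelling are guesses on my part:

    import subprocess

    # Cross-compile a single test for ARM/Linux (flag spelling varies
    # by clang version; the triple and sysroot here are placeholders).
    subprocess.check_call(
        ["clang", "-target", "arm-linux-gnueabi", "-O2",
         "test.c", "-o", "test-arm"])

    # Run it under user-mode QEMU; -L points at the ARM sysroot so the
    # dynamic linker and libc can be found.
    subprocess.check_call(
        ["qemu-arm", "-L", "/usr/arm-linux-gnueabi", "./test-arm"])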

                              -Dave

Dave writes:

> It is in fact highly in my own interest to get them running. I just
> need to be pointed to the document that tells me what the buildbot
> master expects to see and defines a procedure for adding them as slaves.

Daniel may have the details you need to get this up and running.

-bw

I think we need to think along two dimensions: breadth of testing and
depth of testing.

1. Breadth: What are the best-supported ARM ISA versions in LLVM? Say
it's armv6 and armv7. We need to:
  - regression-test ARM mode, Thumb-2, and Thumb-1 (armv6) modes
  - performance/code-size-test ARM mode, Thumb-2, and Thumb-1 modes

    We need to agree on an optimization level for regression testing as
well as for performance (such as -O3 for performance and -Os for code size).
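
As a strawman, such a matrix could be driven mechanically, e.g. (the
triple and flag spellings below are illustrative only):

    import itertools, subprocess

    # Strawman matrix: ISA version x instruction-set mode x opt level.
    archs  = ["armv6", "armv7"]
    modes  = ["-marm", "-mthumb"]  # -mthumb: Thumb-1 on armv6, Thumb-2 on armv7
    levels = ["-O3", "-Os"]        # performance vs. code size

    for arch, mode, level in itertools.product(archs, modes, levels):
        subprocess.check_call(
            ["clang", "-target", arch + "-linux-gnueabi", mode, level,
             "test.c", "-o", "test-%s%s%s" % (arch, mode, level)])
        # ...then run the binary (on hardware or under an emulator) and
        # record pass/fail, runtime, and code size per configuration.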

2. Depth:
  (a) Adding more regression tests: Every new commit comes with a set of
tests, but these are just regression tests. We need global access to
validation suites; unfortunately, most validation suites are commercial and
their licensing prohibits even proxy public use. What about leveraging some
other open-source test suites?
  (b) Adding more performance tests: We need to identify performance and
code-size regressions before committing. Currently there are wrappers for
SPEC. What other performance/code-size suites can we get? Should there be
guidelines for performance reporting on SPEC and/or other suites? A lot of
users depend on LLVM ARM performance/code size remaining stable or getting
better, so any degradation will trigger extra work for all consumers.
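
For code size at least, a cheap first check could just diff section sizes
between a baseline and a patched compiler; the compiler paths and the 1%
threshold below are placeholders:

    import subprocess

    def text_size(compiler, source):
        # Compile one file and read its .text size via binutils' "size".
        subprocess.check_call([compiler, "-Os", "-c", source, "-o", "tmp.o"])
        out = subprocess.check_output(["size", "tmp.o"]).decode()
        # Output: a header line, then "text data bss dec hex filename".
        return int(out.splitlines()[1].split()[0])

    old = text_size("/opt/llvm-baseline/bin/clang", "boot.c")
    new = text_size("/opt/llvm-patched/bin/clang", "boot.c")
    if new > old * 1.01:
        print("code-size regression: .text grew from %d to %d bytes"
              % (old, new))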

3. Reporting:
   We need a more formal reporting process for the validation done for a
commit. Currently, the validation process for ARM is the same as for x86
(just run the tests and make sure they pass). We need to expand reporting
to cover the breadth and depth above, to reduce the community's work in
tracking down regressions.

Of course, all this is going to raise the threshold for committing.
Either the committer pays early by running all these tests, or the
community pays late by fixing regressions. The risk of paying late is
also an unstable LLVM tip for ARM.

Availability of ARM hardware is certainly an issue. I can make ARM
hardware available to run regressions through buildbot (or some other bot
mechanism), but making login access to ARM hardware available (for
debugging) raises firewall and security issues. I would like to hear the
community's thoughts on these.

Thanks
--Raja

Bill Wendling <wendling@apple.com> writes:

> Daniel may have the details you need to get this up and running.

Great! It would also help if we put instructions on the web site and
publicized them widely. People are more likely to volunteer if doing so
is dirt simple.

                            -Dave

+100 ;-)

-- Marshall

Marshall Clow Idio Software <mclow.lists@gmail.com>

A.D. 1517: Martin Luther nails his 95 Theses to the church door and is promptly moderated down to (-1, Flamebait).
        -- Yu Suzuki

Hello Raja,

> 1. Breadth: What are the best-supported ARM ISA versions in LLVM? Say
> it's armv6 and armv7. We need to:
>   - regression-test ARM mode, Thumb-2, and Thumb-1 (armv6) modes
>   - performance/code-size-test ARM mode, Thumb-2, and Thumb-1 modes

You're forgetting about different platforms, e.g. arm/linux vs. arm/darwin.
Also, even for ARMv7 we can get different results, since we have
different schedulers for Cortex-A8 and Cortex-A9 :-)

> 2. Depth:
>   (a) Adding more regression tests: Every new commit comes with a set of
> tests, but these are just regression tests. We need global access to
> validation suites; unfortunately, most validation suites are commercial and
> their licensing prohibits even proxy public use. What about leveraging some
> other open-source test suites?

LLVM has its own test-suite, which should be enough for a first step.

"Raja Venkateswaran" <rajav@codeaurora.org> writes:

> [...]
>   (b) Adding more performance tests: We need to identify performance and
> code-size regressions before committing. Currently there are wrappers for
> SPEC. What other performance/code-size suites can we get? Should there be
> guidelines for performance reporting on SPEC and/or other suites? A lot of
> users depend on LLVM ARM performance/code size remaining stable or getting
> better, so any degradation will trigger extra work for all consumers.

This seems excessive and unrealistic. We're never going to come up with
a test suite that satisfies everyone's needs, and trying to could well be
counter-productive. If no one can commit anything unless it passes
every test (including performance tests) for every target under multiple
option combinations, nothing will ever get committed, especially if no
one has access to systems to debug on.

I think it's reasonable for the LLVM community to expect that LLVM users
who have such rigorous testing needs develop their own systems. Testing
is an extremely costly process in terms of dollars, work hours and
equipment expenditures. There's no way the LLVM community can support
such things.

We have our own test suites with tens of thousands of tests that get
run every night. When we find a problem in LLVM, we fix it and report
it upstream when feasible. If we don't report it upstream or commit the
fix, we have implicitly accepted responsibility for maintaining the fix.
This has worked well for us for years and I don't see any need to push
that cost and responsibility onto the community.

                             -Dave

As I see it, there are regular commits that introduce performance and
code size regressions. There doesn't seem to be any formal testing in
place. Not for X86, not for ARM. Hunting down regressions like
enable-iv-rewrite=false, which added 130 bytes to a piece of code that
can be at most 8KB in total, is painful and slow. From my point of
view, the only way to ensure that the compiler does a good job is to
provide a test infrastructure to monitor this. This is not about forcing
pre-commit tests; it is about ensuring that the testing is done at all,
in a timely manner.

Joerg

The need for ARM hardware can be partially satisfied by using ARM
emulators like softgun and QEMU, and I think there is an ARM emulator
that can be built as part of GDB. Of course, the ARM Holdings
development system comes with an emulator.

You could run several emulator processes on an x86 or x86_64 server
that has more RAM than your typical ARM boards do and get a full test
run in a short amount of time.
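
A rough sketch of that idea, with placeholder test names and sysroot
path:

    import multiprocessing, subprocess

    def run_under_qemu(test_binary):
        # One user-mode QEMU process per test; the sysroot path is a
        # placeholder for wherever your ARM root filesystem lives.
        rc = subprocess.call(
            ["qemu-arm", "-L", "/srv/arm-rootfs", test_binary])
        return test_binary, rc

    if __name__ == "__main__":
        tests = ["./tests/t%d.bin" % i for i in range(32)]  # placeholders
        pool = multiprocessing.Pool(8)  # roughly one emulator per core
        for name, rc in pool.map(run_under_qemu, tests):
            print("%-20s %s" % (name, "ok" if rc == 0 else "FAILED"))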

Of course, this risks that some test failures may actually be bugs in
the emulators, but I'd argue those would still be useful results: if we
can't explain a test failure as a bug in LLVM, then maybe we have a
useful bug to report to the emulator developers.

Real ARM boards can be had cheap these days. The lower-end Raspberry Pi
model will be just $25.00. I own a Gumstix Overo Fire COM that cost me
about $200, plus about $200 for a Tobi add-on board for I/O. Gumstix
also sells boards for building distributed computers connected via
100 Mbps Ethernet.

Don Quixote

I'm sorry you feel that way. Perhaps you could elaborate on what
testing you want, how to make it work, who would do the work,
and how to integrate it with everyone's disparate desires?

-eric

In a world of multiple developers with conflicting priorities, this simply isn’t realistic. I know that those 130 bytes are very important to those concerned with the NetBSD bootloader, but the patch that added them was worth significant performance improvements on important benchmarks (see Jack Howarth’s posting for 9/6/11, for instance), which lots of other developers consider an obviously good tradeoff.

A policy of “never regress anything” is not tenable, because ANY change in code generation has the possibility to regress something. We end up in a world where either we never make any forward progress, or where developers hoard up trivial improvements they can use to “negate” the regressions caused by real development work. Neither of these is a desirable direction.

The existing modus operandi on X86 and other targets has been that there is a core of functionality (what is represented by the LLVM regression tests and test-suite) that all developers implicitly agree to avoid regressing on a set of “blessed” configurations. We are deliberately cautious in expanding the range of functionality that cannot be regressed, or in widening the set of configurations (beyond those easily accessible to all developers) on which those regressions must not occur. This allows us to improve quality over time without preventing forward progress.

While I do think it would be a good idea to consider expanding the blessed configurations to include some ARM targets, the heterogeneity of ARM targets makes defining a configuration that is easily accessible to all developers quite difficult. Apple developers obviously care strongly about the processors on which Darwin runs, and those targets are easily the best supported, but other developers can’t easily replicate that for pre-commit testing. Blessing a target whose support is “fragile” may create problems down the road if it needs significant, possibly-regression-causing work to improve the target in the future.

In summary, we can only commit to a no-regressions policy on targets that are already well-supported (unlikely to need drastic breaking work in the future), easily testable by all developers, and on a controlled body of testcases that are universally acceptable. Defining those targets and those testcases is a hard but necessary job to ensure quality while continuing to improve the compiler. Simply freezing code generation as-is is not an acceptable solution.

–Owen

> As I see it, there are regular commits that introduce performance and
> code size regressions. There doesn't seem to be any formal testing in
> place. Not for X86, not for ARM. Hunting down regressions like
> enable-iv-rewrite=false, which added 130 bytes to a piece of code that
> can be at most 8KB in total, is painful and slow. From my point of
> view, the only way to ensure that the compiler does a good job is to
> provide a test infrastructure to monitor this. This is not about
> forcing pre-commit tests; it is about ensuring that the testing is done
> at all, in a timely manner.

> In a world of multiple developers with conflicting priorities, this
> simply isn't realistic. I know that those 130 bytes are very important
> to those concerned with the NetBSD bootloader, but the patch that added
> them was worth significant performance improvements on important
> benchmarks (see Jack Howarth's posting for 9/6/11, for instance), which
> lots of other developers consider an obviously good tradeoff.

Don't get me wrong, my problem is not the patch by itself. LLVM at the
moment is relatively bad at creating compact code on x86. I'm not sure
what the status is on ARM for that, but there are use cases where it
matters a lot. Boot loaders are one of them. So disabling some
optimisations when using -Os or -Oz is fine.

The bigger issue is that accepting a size/performance trade-off here and
another one there and yet another trade-off in that corner adds up. It
can get to the point where each of the trade-offs by itself is fine, but
the total result overflows the CPU instruction cache and completely
kills performance. More importantly, at some point that will happen
with a completely harmless-looking change.

> A policy of "never regress anything" is not tenable, because ANY change
> in code generation has the possibility to regress something. We end up
> in a world where either we never make any forward progress, or where
> developers hoard up trivial improvements they can use to "negate" the
> regressions caused by real development work. Neither of these is a
> desirable direction.

This is not what I was asking for. For GCC there are not only build bots
and functional regression tests, but also regular runs of benchmarks
like SPEC etc. Consider it a call for the community to identify useful
real-world test cases to measure:

(1) Changes in the performance of compiled code, both with and without
LTO.

(2) Changes in the size of compiled code, both with and without
explicitly optimising for it.

(3) Changes in compilation time.

I know that for many bigger changes at least (1) and (3) are often
checked. This is about doing general testing over a long time. When a
regression on one of the metrics occurs, it can be evaluated. But that's
a separate discussion, e.g. whether to disable an optimisation for
-Os/-Oz or move it to a higher optimisation level.
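
Such a harness doesn't need to be fancy. Schematically (the compiler,
flags, and benchmark below are placeholders):

    import os, subprocess, time

    def measure(compiler, flags, source):
        # (3) compile time
        t0 = time.time()
        subprocess.check_call([compiler] + flags + [source, "-o", "a.out"])
        compile_time = time.time() - t0
        # (2) code size (file size here; section sizes would be more precise)
        code_size = os.path.getsize("a.out")
        # (1) runtime of the generated code
        t0 = time.time()
        subprocess.check_call(["./a.out"])
        run_time = time.time() - t0
        return compile_time, code_size, run_time

    # One data point per configuration per day; regressions show up as
    # steps in the resulting time series.
    print(measure("clang", ["-O2"], "bench.c"))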

> The existing modus operandi on X86 and other targets has been that
> there is a core of functionality (what is represented by the LLVM
> regression tests and test-suite) that all developers implicitly agree
> to avoid regressing on a set of "blessed" configurations. We are
> deliberately cautious in expanding the range of functionality that
> cannot be regressed, or in widening the set of configurations (beyond
> those easily accessible to all developers) on which those regressions
> must not occur. This allows us to improve quality over time without
> preventing forward progress.

As I see it, the current regression test suite is aimed at preventing
bad compilation. It's not that useful for the other cases above. Of
course, checking for compile-time or runtime regressions is a lot harder
to do, as they require a reproducible environment. So my request can't
replace the existing tests, and it isn't meant to.

I hope I made myself a bit clearer.

Joerg

Joerg Sonnenberger <joerg@britannica.bec.de> writes:

> This is not what I was asking for. For GCC there are not only build bots
> and functional regression tests, but also regular runs of benchmarks
> like SPEC etc. Consider it a call for the community to identify useful
> real-world test cases to measure:

I don't disagree that some additional community testing like this would
be useful. SPEC in particular would be nice, as most people don't have
the resources to acquire their own copy, and it would be helpful for
those folks to know when something has gone wrong. SPEC is, justifiably
or not, pretty important to a lot of people.

If I had a personal copy of SPEC, I'd set this up for people to use. If
anyone wants to donate $$$ to make that happen, I'm all ears. :-)

However, the additional testing feasible for the LLVM community will
never come close to what's required for a production-level compiler.
The existing testbase does a very good job of keeping things pretty
stable.

                             -Dave

The change you refer to was not intended to improve performance at the expense of code size.
Has this been fixed on trunk, or did you discover the workaround and move on without filing a bug?

Thanks,
-Andy

I haven't filed a bug yet, since I haven't had time to produce a proper
test case. For now, I have explicitly disabled the option.

Joerg

Dave writes:

> Resources aren't the only problem. I've asked several times about adding
> my personal machines to the testing pool but I never get a reply. So
> there they sit idle every day, processing the occasional e-mail when
> they could be chewing LLVM tests.

I'm pretty sure this has been answered:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-September/034555.html

But now the master has moved (this week), so it's no longer at osuosl.

However, I do agree that it should be put into an appropriate place on the website (if it's there, I can't easily find it).

-Tanya

Tanya Lattner <lattner@apple.com> writes:

> I'm pretty sure this has been answered:
> http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-September/034555.html

Oh! I missed that one. Thanks for the pointer!

> But now the master has moved (this week), so it's no longer at osuosl.

Right. I'll try to figure that bit out. :-)

> However, I do agree that it should be put into an appropriate place on
> the website (if it's there, I can't easily find it).

Who owns the website? If I want to make such changes, who do I go
through?

                             -Dave

I'm not sure anyone really owns it, but only a few people have access to the actual web server. However, the website is in svn (the www repo). If it's a small fix (a typo, etc.), just check it in and people will review it post-commit. For anything larger (a new page, restructuring, etc.), it's best to send a patch first just to get a few eyes on it.

Thanks,
Tanya