RFC: Change tests to run with fixed (not-host dependent) triple

Hi all,

We consistently see test failures arising because by default many of our tests run in a mode where the tool (clang or llc) pick host-dependent behavior. This makes it easy for developers to write tests that pass on their system, but fail for other developers.

There is some utility in this behavior, as it gives us (unintended) testing coverage of some things, but overall I think it is a net loss for productivity.

I propose:

a. We change the test suite to run in such a way that all tools default to an “unknown” host triple.

b. If someone feels there is missing coverage in some area, we add increased tests for that area (which get run on all platforms).

c. If there is some reason that running with an “unknown” host triple is undesirable, I propose that we set the default test triple to be “x86_64-pc-linux-gnu”, and require deviations to be specified.

Comments?

  • Daniel

I'm ok with this in principle, but how about with the nuance that some tests (eg test/codegen) explicitly opt into march=native?

-Chris

I'm ok with this in principle, but how about with the nuance that some
tests (eg test/codegen) explicitly opt into march=native?

I'd really like the default behavior to be something that forces the test
to either be independent of the targeted triple, or explicitly set a
target. I like the default being unknown.

I wonder, would the ability to run the entire test suite with all of the
'default' triples (that lit sets to unknown in normal runs) instead set to
the host, or to a specific triple maybe be a useful extra form of checking?
This would let both humans and build bots find bugs and discrepancies
specific to a particular target.

We could even have a common test target that build bots use which runs all
the tests both in the default, and in the host-triple mode so that we force
people to converge on target independent tests or explicit triples.

My thoughts:

To state the obvious, there’s two different things that go on in tests: the specific thing being tested and things that aren’t being tested but just need to be done to provide enough “context” for what’s being tested. I was involved in a long slog trying to fix up a lot of the ARM regression test failues (using my work email address). Here’s some “roughly right” statistics:

There were probably about 25–30 bugs where the issue was behaviour that FileCheck regexps didn’t account for (mostly due to ABI issues). There the prevailing opinion seemed to be keep simple FileCheck tests but that tests should be run with a specific triple; however that triple shouldn’t always be x86_64 (because that’s a bit special). There’ve been about 5-10 tests where the test was testing something that was architecture specific without secifying they needed it (eg, testing for specific x86_64 machine optimizations without doing that); again the upshot has been to have these requirements specified explicitly. There have been 5-10 JIT code tests where “support” code pasted from, eg, lli into a test hadn’t been updated when the JIT core was changed. The one definite bug that was there was in devirtualisation (which probably lowered ok but failed the module verifier). This bug was definitely not visible due to the general set of ARM failures that were basically issues with the tests.

So on the one hand, I’d love it if the tests were constructed in such a way that the “fuzz” of ABI differences didn’t need to be considered. On the other hand, if the devirtualization test had been run only using an x86_64 triple the issue wouldn’t have come to light as quickly. That seems to me to be the crux of the problem: LLVM (and especially Clang) is only mostly target independent, and getting the smaller set of target dependent elements wrong breaks compilation just as much as a generic bug so finding these things as early as possible seems desirable.

Dreaming here: I wonder if one could come up with some set of meta-regexps that describe the annoying stuff like ABI differences so that if the above were done, one could try t separate the test failures into “test regreps written implicitly assuming different system, failures look due to correct system dependent stuff being generated” and “test is failing on this system in a non-understood way”. That sounds too tricky to be reliable, but I don’t know…

A reasonable idea, except I’ll reiterate: it assumes there is completely target independent behaviour in non-trivial test code. If it’s the case that there’s not really it might be a situation where biting the bullet and trying to put a wide ranging set of triples randomly throughout the test suite and hoping to catch stuff that way is the best idea.

cheers, dave tweed__________________________
high-performance computing and machine vision expert: david.tweed@gmail.com
“while having code so boring anyone can maintain it, use Python.” – attempted insult seen on slashdot

My thoughts:

I'm ok with this in principle, but how about with the nuance that some
tests (eg test/codegen) explicitly opt into march=native?

I'd really like the default behavior to be something that forces the test
to either be independent of the targeted triple, or explicitly set a target.
I like the default being unknown.

To state the obvious, there's two different things that go on in tests: the
specific thing being tested and things that aren't being tested but just
need to be done to provide enough "context" for what's being tested. I was
involved in a long slog trying to fix up a lot of the ARM regression test
failues (using my work email address). Here's some "roughly right"
statistics:

There were probably about 25--30 bugs where the issue was behaviour that
FileCheck regexps didn't account for (mostly due to ABI issues).

These were mostly in Clang tests, right?

I expect Clang IRGen tests will generally be more ABI-dependent, but
we can probably still find some number of tests there that don't
specify triples & pass on both ARM and Itanium ABIs which shows some
degree of portability.

For backend tests I expect many IR pass tests would be able to be
written in a fairly target-neutral fashion (though admittedly I
haven't worked much below Clang so my intuition here is not very
strong)

There the
prevailing opinion seemed to be keep simple FileCheck tests but that tests
should be run with a specific triple; however that triple shouldn't always
be x86_64 (because that's a bit special). There've been about 5-10 tests
where the test was testing something that was architecture specific without
secifying they needed it (eg, testing for specific x86_64 machine
optimizations without doing that); again the upshot has been to have these
requirements specified explicitly. There have been 5-10 JIT code tests where
"support" code pasted from, eg, lli into a test hadn't been updated when the
JIT core was changed. The one definite bug that was there was in
devirtualisation (which probably lowered ok but failed the module verifier).
This bug was definitely not visible due to the general set of ARM failures
that were basically issues with the tests.

So on the one hand, I'd love it if the tests were constructed in such a way
that the "fuzz" of ABI differences didn't need to be considered. On the
other hand, if the devirtualization test had been run only using an x86_64
triple the issue wouldn't have come to light as quickly.

This would be exposed by Chandler's suggestion of having the ability
to run target-unspecified tests with a range of (up to & including
"all targets") triples.

I think this gives the best of both worlds, really. Developers would
be expected to run the basic test suite (using the default/agnostic
triple) and have a consistent experience/expectation that they pass.
They could optionally run with specific target triples or "all
targets" which would take longer but provide more confidence when
making a test or change that they're concerned might be
target-dependent. And bots would be responsible for testing the "all
targets" case regularly (& as is the case today with other bot
workloads, we don't necessarily expect developers to always run all
these heavier workloads so when they break it's "no harm, no foul":
the developer is notified & fixes up the test based on bot results).
We don't unduly block any development from happening by accidentally
introducing a test break for Apple developers while developing on
Linux, for example.

Apart from what has already being said, which I agree mostly, I have
some specific comments:

a. We change the test suite to run in such a way that all tools default to
an "unknown" host triple.

The assumptions will be different and I believe the refactoring of the
tests to make them pass on all currently passing architectures will
not be trivial. Good luck! :wink:

b. If someone feels there is missing coverage in some area, we add increased
tests for that area (which get run on all platforms).

If you mean adding "unknown" triple tests as much as possible, I agree.

c. If there is some reason that running with an "unknown" host triple is
undesirable, I propose that we set the default test triple to be
"x86_64-pc-linux-gnu", and require deviations to be specified.

Here, I don't agree. I don't see why one platform should be the
default over another.

A similar concept, but less prejudicial is: If the tests require
platform-specific features, platform-specific tests should be added.
*Only* those added will be assumed tested (for the sakes of
validation). Other architectures can add similar tests on their own
triples, if necessary / desired.

In a nutshell, "unknown" defaults to nothing at all, whether running
on Intel, ARM, MIPS or whatever. If you default "unknown" to the host
architecture, or to a specific architecture, the benefits of having an
"unknown" target is void, IMO.

Apart from what has already being said, which I agree mostly, I have
some specific comments:

a. We change the test suite to run in such a way that all tools default to
an "unknown" host triple.

The assumptions will be different and I believe the refactoring of the
tests to make them pass on all currently passing architectures will
not be trivial. Good luck! :wink:

I'm not sure what you mean by "refactoring of the tests to make them
pass on all currently passing architectures" - if we force tests
without specified triples to use an "unknown" triple, the tests would
need to be refactored, but why would they need to be refactored to
account for all currently passing architectures? They should behave
exactly the same (since they won't inherit the host triple that they
do now) on all test machines/architectures, right?

b. If someone feels there is missing coverage in some area, we add increased
tests for that area (which get run on all platforms).

If you mean adding "unknown" triple tests as much as possible, I agree.

I believe this statement was about the reverse: if certain
target-specific coverage is required it would have to be done by
writing a test that specifies /that/ target triple, not by (as is
sometimes the case currently) writing a test that picks up the host
triple & then running it on a machine with the desired triple to get
coverage for.

c. If there is some reason that running with an "unknown" host triple is
undesirable, I propose that we set the default test triple to be
"x86_64-pc-linux-gnu", and require deviations to be specified.

Here, I don't agree. I don't see why one platform should be the
default over another.

Because we would need/want a default of some kind. The argument here
is: "If we can't choose some agnostic default for all tests, we should
choose a non-agnostic default" - the only alternative position is that
we don't choose a default but instead force every test to specify an
arbitrary triple. I don't think this is substantially more valuable,
though it is the current state of affairs among the tests that do have
triples specified (that they are "random" in the sense that they're
usually whatever architecture the developer is working on at the time
- so we have lots of linux ones, lots of darwin ones, and a smattering
of ARM)

A similar concept, but less prejudicial is: If the tests require
platform-specific features, platform-specific tests should be added.
*Only* those added will be assumed tested (for the sakes of
validation). Other architectures can add similar tests on their own
triples, if necessary / desired.

Yes, that's the intent. This clause (c) was only applicable to the
approach overall (ie: if there are no (or no substantial) tests that
can be expressed with an unknown triple then we can't really use that
as a default - we'd have to pick some other default, or have no
default at all & force every test to specify a triple). For specific
tests that need to test platform-specific things, that is what Daniel
was saying in (b) - add coverage for that area by writing a test with
an explicit triple.

I see. Apart from the communication breakdown, I think we're all
talking about the same things... :wink:

Behalf Of David Blaikie

I suppose I had that feeling, but couldn't put my finger on it, and to
be honest, I still can't. It feels like we're jumping from one
undefined state to another, but I think the new state is more strict
than the previous, so it is better than what we've got.

However, it'd be good it whoever is changing this could test on a few
other architectures before committing. Or, maybe even preferred, pass
the changes around so that people that work every day on other
architectures (like David Tweed) can have a look and see if it makes
sense on them, too.

I'm ok with this in principle, but how about with the nuance that some
tests (eg test/codegen) explicitly opt into march=native?

I'd really like the default behavior to be something that forces the test
to either be independent of the targeted triple, or explicitly set a
target. I like the default being unknown.

I wonder, would the ability to run the entire test suite with all of the
'default' triples (that lit sets to unknown in normal runs) instead set to
the host, or to a specific triple maybe be a useful extra form of checking?
This would let both humans and build bots find bugs and discrepancies
specific to a particular target.

I'd prefer not to do this universally (i.e., run the whole test suite that
way), but what I do think would be useful is to add enough test suite
support to be able to easily run the same test on multiple triples (or even
all configured ones).

My primary goal is to have the tests that individual developers be
equivalent (independent of the target they are running on).

- Daniel

We could even have a common test target that build bots use which runs all

I'm ok with this in principle, but how about with the nuance that some
tests (eg test/codegen) explicitly opt into march=native?

I'd really like the default behavior to be something that forces the test
to either be independent of the targeted triple, or explicitly set a target.
I like the default being unknown.

I wonder, would the ability to run the entire test suite with all of the
'default' triples (that lit sets to unknown in normal runs) instead set to
the host, or to a specific triple maybe be a useful extra form of checking?
This would let both humans and build bots find bugs and discrepancies
specific to a particular target.

I'd prefer not to do this universally (i.e., run the whole test suite that
way),

I'm not entirely sure which distinction you're drawing here when you
say "universally". I assume we're specifically talking about only
those tests in the lit-based regression suite that don't already have
an explicit triple. Only these tests would get the default ("unknown")
triple and be affected by some lit-parameter to override the default
with any other triple or possibly a list of triples.

Icing would be to have that list be autodetected from the supported
targets in the current build.

This would not be the default (the default would just be to run them
once, with the "unknown" triple) but would be an easily accessible
target for developers and would be the target that any respectable
buildbot would use. (without enforcement by bots the feature would be
useless to users since the results would not be clean).

but what I do think would be useful is to add enough test suite
support to be able to easily run the same test on multiple triples (or even
all configured ones).

My primary goal is to have the tests that individual developers be
equivalent (independent of the target they are running on).

Did you a word? ("have the tests that individual developers <write?
run?> be equivalent")

- David

From: cfe-dev-bounces@cs.uiuc.edu [mailto:cfe-dev-bounces@cs.uiuc.edu] On
Behalf Of David Blaikie
Sent: 03 December 2012 16:41
To: Renato Golin
Cc: LLVM Developers Mailing List; cfe-dev
Subject: Re: [cfe-dev] [LLVMdev] RFC: Change tests to run with fixed
(not-host dependent) triple

c. If there is some reason that running with an "unknown" host triple is
undesirable, I propose that we set the default test triple to be
"x86_64-pc-linux-gnu", and require deviations to be specified.

Here, I don't agree. I don't see why one platform should be the
default over another.

> Because we would need/want a default of some kind. The argument here
> is: "If we can't choose some agnostic default for all tests, we should
> choose a non-agnostic default" - the only alternative position is that
> we don't choose a default but instead force every test to specify an
> arbitrary triple. I don't think this is substantially more valuable,
> though it is the current state of affairs among the tests that do have
> triples specified (that they are "random" in the sense that they're
> usually whatever architecture the developer is working on at the time
> - so we have lots of linux ones, lots of darwin ones, and a smattering
> of ARM)

Just a point here: the reason I'd mildly prefer not to have a default that
avoids as much target dependent stuff as possible is that it's generally
going to have a higher probability of passing even if something is "wrong"
in the sense that, eg, if the return type of some thing is ABI mandated to
be void, then you could be getting the type from the wrong place and still
pass since all places give the same result;

Sorry, I think I lost track of this example around here. What do you
mean by "getting the type from the wrong place"?
What sort of solution are you proposing?

What I think we're discussing here is:
All tests without a triple be written in such a way as to pass for any
triple (& there would be test infrastructure to help ensure this) &
that the default would be a fixed "unknown" triple, or an arbitrary
(but constant/singular) concrete triple.

What is it you have in mind?

Hi David,

I think this is a valuable topic, but I also think it is somewhat off topic from my original post.

What you are primarily focusing on here, is, I believe, how we right tests that are more target independent (i.e. can focus more on “test the thing being tested” and less on other stuff). That is a great topic, but can we separate it out to a different discussion? I think it can be treated orthogonally to my goals which is just to have the test suite behave uniformly for all developers.

  • Daniel

Just to also reply to Renato’s concerns here, I’m on the same page with David here.

>>
>>>
>>> I'm ok with this in principle, but how about with the nuance that some
>>> tests (eg test/codegen) explicitly opt into march=native?
>>
>>
>> I'd really like the default behavior to be something that forces the
test
>> to either be independent of the targeted triple, or explicitly set a
target.
>> I like the default being unknown.
>>
>> I wonder, would the ability to run the entire test suite with all of the
>> 'default' triples (that lit sets to unknown in normal runs) instead set
to
>> the host, or to a specific triple maybe be a useful extra form of
checking?
>> This would let both humans and build bots find bugs and discrepancies
>> specific to a particular target.
>
>
> I'd prefer not to do this universally (i.e., run the whole test suite
that
> way),

I'm not entirely sure which distinction you're drawing here when you
say "universally". I assume we're specifically talking about only
those tests in the lit-based regression suite that don't already have
an explicit triple. Only these tests would get the default ("unknown")
triple and be affected by some lit-parameter to override the default
with any other triple or possibly a list of triples.

Icing would be to have that list be autodetected from the supported
targets in the current build.

What I want is for this to not be "implicit" behavior. Many tests have no
value being run with multiple triples.

What I want is for the author of the test to explicitly declare what they
are trying to do. If a test is useful on multiple triples, then they should
write something that says so, and the test suite will take the extra time
to do it.

My opinion on tests is that they are significantly better when they are
written with an explicit idea of what coverage they are trying to get (and
hopefully a comment explaining that too).

This would not be the default (the default would just be to run them

once, with the "unknown" triple) but would be an easily accessible
target for developers and would be the target that any respectable
buildbot would use. (without enforcement by bots the feature would be
useless to users since the results would not be clean).

> but what I do think would be useful is to add enough test suite
> support to be able to easily run the same test on multiple triples (or
even
> all configured ones).
>
> My primary goal is to have the tests that individual developers be
> equivalent (independent of the target they are running on).

Did you a word? ("have the tests that individual developers <write?
run?> be equivalent")

Hah. Yes, "tests that individual developers write".

- Daniel

>>
>>>
>>> I'm ok with this in principle, but how about with the nuance that some
>>> tests (eg test/codegen) explicitly opt into march=native?
>>
>>
>> I'd really like the default behavior to be something that forces the
>> test
>> to either be independent of the targeted triple, or explicitly set a
>> target.
>> I like the default being unknown.
>>
>> I wonder, would the ability to run the entire test suite with all of
>> the
>> 'default' triples (that lit sets to unknown in normal runs) instead set
>> to
>> the host, or to a specific triple maybe be a useful extra form of
>> checking?
>> This would let both humans and build bots find bugs and discrepancies
>> specific to a particular target.
>
>
> I'd prefer not to do this universally (i.e., run the whole test suite
> that
> way),

I'm not entirely sure which distinction you're drawing here when you
say "universally". I assume we're specifically talking about only
those tests in the lit-based regression suite that don't already have
an explicit triple. Only these tests would get the default ("unknown")
triple and be affected by some lit-parameter to override the default
with any other triple or possibly a list of triples.

Icing would be to have that list be autodetected from the supported
targets in the current build.

What I want is for this to not be "implicit" behavior. Many tests have no
value being run with multiple triples.

What I want is for the author of the test to explicitly declare what they
are trying to do. If a test is useful on multiple triples, then they should
write something that says so, and the test suite will take the extra time to
do it.

My opinion on tests is that they are significantly better when they are
written with an explicit idea of what coverage they are trying to get (and
hopefully a comment explaining that too).

I agree wholeheartedly, I perhaps got a little carried away trying to
express other (Chandler) people's ideas in this thread on the
assumption that my personal position wasn't necessarily
correct/strong.

I'll leave it to Chandler to champion his idea of multi-target tests
if he so desires.

The example I was giving was a “thinly generalized” version of the devirtualisation issue: what was happening was that a destructor, which on ARM ABI is returns “this”, was being devirtualised and a function with the return type of the base class prototype was being added whereas it ought to have been the return type of the actual function being inlined. (Devirt can’t do general covariant returns, but I gather that no-one actually uses destructor return types anyway.) On x86_64 the return type of a destructor is always void, so getting the type from the base class prototype gives the same result. I know it’s a matter of semantics: was that code actually “a latent bug” on x86_64 when it would always be guaranteed to produce the right code (on that platform)?

I know this is a ridiculously specific example, and I don’t know if similar effects can happen elsewhere, I just have this feeling that picking the simplest platform it’s more likely for stuff to pass even if there’s a minor latent issue. But I can’t back that up with any facts.

As for proposed solutions, I can’t say I really have anything that counts as a solution. I think I need to see what “unknown” actually implies in terms of behaviour/code-paths, it’s just that if it’s equivalent to “the simplest, most vanilla target” it may be less effective in finding issues. But again, that’s just a feeling.

Regards,
Dave