ARM failures

The following failures are consistent on buildbot (and my local box).

The Clang one, I think, is assuming an Intel box; the other two look like the FileCheck patterns are not robust enough.

http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a9/builds/4305

Clang :: CodeGen/compound-assign-overflow.c
LLVM :: Transforms/LoopStrengthReduce/post-inc-icmpzero.ll
LLVM :: Transforms/LoopStrengthReduce/2012-07-18-LimitReassociate.ll

cheers,
--renato

Usually the best way to get traction on such things is to reply to the
commit that caused the regression. Whoever broke things is usually
more invested in making sure the change is solid (& doesn't get
reverted).

[slightly ranting point: Whoever owns this builder should be in a
position of authority/autonomy within the project to be able to maintain
its passing status. That means someone who cares about this bot
should have the rights (both technically & culturally) to revert
patches if it becomes necessary (an uncooperative committer). That's
not to say that patches should be reverted without consideration or
without taking steps to help the author reproduce the issue. In the case of a
test that's not platform agnostic & needs a triple, that should be
easy/obvious & once notified the committer should be able to make the
change quickly (or the bot maintainer can make such a commit (simply
adding the triple that was assumed, or even generalizing the test to
be target neutral if that looks viable) & leave it to the original committer
to choose an alternative fix when they have time).]
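The "adding the triple that was assumed" fix is a one-line change. As a sketch (the RUN lines below are illustrative, not taken from the actual failing tests), a test that implicitly depends on the host target gets pinned like this:

```llvm
; Before: llc defaults to the host target, so CHECK lines written
; against x86 output fail on a native ARM builder.
; RUN: llc < %s | FileCheck %s

; After: pin the triple the CHECK lines were actually written against.
; RUN: llc -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s
```

The alternative, generalizing the CHECK lines to be target neutral, is usually a bit more work but keeps the test running everywhere.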

- David

Hi David,

Good point. The build bot has been broken for a while and I assumed the person
who made that commit would spot it better than I would, but I shouldn't have
assumed that the person would receive my email. I'll try to track down the
commit and re-send, copying the author.

Sorry for the noise.

--renato

Usually the best way to get traction on such things is to reply to the
commit that caused the regression. Whoever broke things is usually
more invested in making sure the change is solid (& doesn't get
reverted).

Hi David,

Good point. The build bot has been broken for a while and I assumed the person who
made that commit would spot it better than I would,

If the bot isn't configured to send fail-mail to the blame list,
people probably won't notice. That's how the buildmaster/bots ended up
in the rather multicolored state they are in now.

but I shouldn't have
assumed that the person would receive my email.

They'll no-doubt get your email (everyone with commit rights should be
on the dev list) but might not read it soon, nor realize it was their
commit that caused the failure. (either because they don't remember
that particular test file change, or think someone else might've
touched it, etc)

I'll try to track down the commit
and re-send, copying the author.

Specifically replying to the -commits mailing that committed the break
is the most useful - it provides the context & keeps the discussion
close to the code that it's related to.

Sorry for the noise.

Not a problem. Good that people are looking at these things (& I've
done the same thing you've done here in the past - because I had no
idea what broke & I wanted to see if anyone had ideas/cared).

Are you the owner (or at least a strongly invested party) in this bot,
or just interested in getting bots green generally? Do you know who
the owner is?

- David

The following failures are consistent on buildbot (and my local box).

The Clang one, I think, is assuming an Intel box; the other two look like
the FileCheck patterns are not robust enough.

http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a9/builds/4305

Clang :: CodeGen/compound-assign-overflow.c

r171853 for this one. Build not finished yet.
http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a9/builds/4312

LLVM :: Transforms/LoopStrengthReduce/post-inc-icmpzero.ll

Looks like a better regex can fix this.
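For what it's worth, the usual way to loosen such a test is to replace hard-coded SSA value names with FileCheck regexes and variables. The lines below are a made-up illustration of the pattern, not the actual CHECK lines from this test:

```llvm
; Over-specific: hard-codes a value name that can differ across
; targets or after unrelated changes to value numbering.
; CHECK: %lsr.iv1 = getelementptr i8* %t, i64 %lsr.iv

; Generalized: match any value name and capture it in a variable
; so later CHECK lines can still insist it's the same value.
; CHECK: [[IV:%[a-zA-Z0-9.]+]] = getelementptr i8* %t, i64
; CHECK: icmp eq i64 [[IV]]
```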

Dmitri

If the bot isn't configured to send fail-mail to the blame list,
people probably won't notice. That's how the buildmaster/bots ended up
in the rather multicolored state they are in now.

I'm supposing this is done in Zorg... :wink:

Specifically replying to the -commits mailing that committed the break

is the most useful - it provides the context & keeps the discussion
close to the code that it's related to.

Yup, will do.

Are you the owner (or at least a strongly invested party) in this bot,

or just interested in getting bots green generally? Do you know who
the owner is?

I'm not the owner, but certainly a very interested party. I know other
people are also monitoring this bot for failures, only less verbose than I
am. I could be wrong, but I think Galina is the owner.

--renato

r171853 for this one. Build not finished yet.
http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a9/builds/4312

Thanks!!

LLVM :: Transforms/LoopStrengthReduce/post-inc-icmpzero.ll

Looks like a better regex can fix this.

I think both of them are just fragile FileCheck patterns...

This is the first build that fails:

http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a9/builds/4294

And none of those commits introduced the failing tests... Most of them are
debug info changes, there's a MIPS change, and one "test" change:

http://llvm.org/viewvc/llvm-project/?view=rev&revision=171705

--renato

If the bot isn't configured to send fail-mail to the blame list,
people probably won't notice. That's how the buildmaster/bots ended up
in the rather multicolored state they are in now.

I'm supposing this is done in Zorg... :wink:

Yes - again, patches welcome :slight_smile:

Specifically replying to the -commits mailing that committed the break
is the most useful - it provides the context & keeps the discussion
close to the code that it's related to.

Yup, will do.

Are you the owner (or at least a strongly invested party) in this bot,
or just interested in getting bots green generally? Do you know who
the owner is?

I'm not the owner, but certainly a very interested party. I know other
people are also monitoring this bot for failures, only less verbose than I
am. I could be wrong, but I think Galina is the owner.

Galina manages/monitors the lab as a whole, but generally each slave
is contributed along with a builder task to run on it by someone who
cares about that particular workload.

Though, indeed, in this case it looks like Galina introduced this in
https://llvm.org/viewvc/llvm-project?view=rev&revision=132144 (& not
via a patch contribution, so far as I can tell from the commit list,
etc)

Galina - are you actively interested in the state of this builder?
Is there someone else in the community who contributed it/is invested
in ensuring that it passes?
Should we add fail-mail for this builder so that the people
contributing these patches would know about the breakage?
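For reference, wiring fail-mail to the blame list is a small stanza in the buildbot master config (Zorg, in LLVM's case). The fragment below is only a sketch of what such a stanza generally looks like; the sender address and lookup domain are assumptions, not the actual Zorg configuration:

```python
# Hypothetical master.cfg fragment (buildbot 0.8.x style).
from buildbot.status.mail import MailNotifier

c['status'].append(MailNotifier(
    fromaddr='buildbot@lab.llvm.org',       # assumed sender address
    mode='problem',                         # mail only on success -> failure
    sendToInterestedUsers=True,             # i.e. send to the blame list
    lookup='example.org',                   # assumed committer-name -> email mapping
    builders=['clang-native-arm-cortex-a9']))
```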

r171853 for this one. Build not finished yet.
http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a9/builds/4312

Thanks!!

> LLVM :: Transforms/LoopStrengthReduce/post-inc-icmpzero.ll

Looks like a better regex can fix this.

I think both of them are just bad FileChecks...

This is the first build that fails:

http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a9/builds/4294

That failure points to "member-pointers.ll" which you didn't mention
in your original email. The blame is pretty clear (me):
http://llvm.org/viewvc/llvm-project/?view=rev&revision=171698 (which I
fixed by removing the target info for that test in a subsequent commit
maybe 30 minutes later).

Actually it is this build:

http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a9/builds/4299

(r171735) that has failures on these tests.

The member-pointers.ll failure is unrelated.

Dmitri

Good point. The build bot has been broken for a while and I assumed the person who
made that commit would spot it better than I would,

If the bot isn't configured to send fail-mail to the blame list,
people probably won't notice. That's how the buildmaster/bots ended up
in the rather multicolored state they are in now.

I think what happens with the buildbots depends on how many commits land between successful builds. During peak commit time (working hours in the US) there can be 10-15 commits between builds. (Conversely, it's not too unusual to see 1 commit between builds early in the morning UK time.) I think automated emails are generally only enabled for bots where the average number of commits to be blamed is lower. Otherwise it's manual analysis, but a couple of times I've received emails from Galina when I've committed something that's increased the failures.

but I shouldn't have
assumed that the person would receive my email.

I'll try to track down the commit
and re-send, copying the author.

Specifically replying to the -commits mailing that committed the break
is the most useful - it provides the context & keeps the discussion
close to the code that it's related to.

Yes, although there are occasionally instances where multiple commits break tests they don't touch, so it's non-obvious which is responsible.

Sorry for the noise.

Not a problem. Good that people are looking at these things (& I've
done the same thing you've done here in the past - because I had no
idea what broke & I wanted to see if anyone had ideas/cared).

I think the biggest issue is that if a committer is unlucky (committing just after a buildbot kicks off) it can be 2.25+2.25=4.5 hours (the in-progress build plus the next full cycle) before the buildbot turns red. I wish I had a magic suggestion to cure that, but I can't think of one.

Regards,
Dave

-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

Good point. The build bot has been broken for a while and I assumed the person who
made that commit would spot it better than I would,

>If the bot isn't configured to send fail-mail to the blame list,
>people probably won't notice. That's how the buildmaster/bots ended up
>in the rather multicolored state they are in now.

I think what happens with the buildbots depends on how many commits land between successful builds. During peak commit time (working hours in the US) there can be 10-15 commits between builds. (Conversely, it's not too unusual to see 1 commit between builds early in the morning UK time.) I think automated emails are generally only enabled for bots where the average number of commits to be blamed is lower. Otherwise it's manual analysis, but a couple of times I've received emails from Galina when I've committed something that's increased the failures.

but I shouldn't have
assumed that the person would receive my email.

I'll try to track down the commit
and re-send, copying the author.

>Specifically replying to the -commits mailing that committed the break
>is the most useful - it provides the context & keeps the discussion
>close to the code that it's related to.

Yes, although there are occasionally instances where multiple commits break tests they don't touch, so it's non-obvious which is responsible.

Indeed - that's part of the reason why builders need owners who care
about them. I think it's always going to be up to the owners to
investigate in a situation like this where any individual contributor,
not being on/having access to/personally being invested in the
builder, can't really be expected to go out of their way to sift
through commits & decide whether they're to blame. In cases like that
each individual will just assume it's "not their problem" so it must
fall to someone to ensure it doesn't just get dropped on the floor.

The owner should be on any fail-mail thread and, if the issue is not
addressed in a timely manner, should take steps to ensure that the
responsible party is identified (& made aware) and unblocked (clear
repro steps - usually with regression tests like LLVM's, this can be
done by anyone simply by specifying the relevant target triple &
watching the failure - no need to have access to special hardware,
etc). The owner can either fix it themselves (add an explicit triple,
generalize a CHECK line, etc) or wait a reasonable amount of time
(where reasonable depends on the issue, time of day, etc) for a fix
from the author. If no fix is forthcoming, it's not unreasonable to
revert the patch to get the builder back to green.

This needs to be how things happen or bots end up red for too long &
then the buildmaster page is useless as a clear signal of "is Clang
broken" (because, hey, it's always 'broken' - & people won't know
which builders matter & which ones don't).

Sorry for the noise.

> Not a problem. Good that people are looking at these things (& I've
> done the same thing you've done here in the past - because I had no
> idea what broke & I wanted to see if anyone had ideas/cared).

I think the biggest issue is that if a committer is unlucky (committing just after a buildbot kicks off) it can be 2.25+2.25=4.5 hours (the in-progress build plus the next full cycle) before the buildbot turns red. I wish I had a magic suggestion to cure that, but I can't think of one.

Certainly slow builders are problematic. The phase-based building
system David Dean is setting up may help mitigate some of this (it
should make better use of the resources we have, as well as allowing
us to benefit (in the form of smaller blame lists, though not
necessarily lower buildbot result latency) from additional resources
by allowing greater parallelism).

Even at 4.5 hours of turnaround, we don't break these things /that/
often, so a builder broken even for a whole day is not the end of the
world. It's the builders broken for weeks & weeks (well beyond the
history/backlog on the build master's console page) that I think we
should seek to avoid/resolve. That being said, yes, shorter turnaround
& more fine-grained blame would be great.

- David

Certainly slow builders are problematic. The phase-based building
system David Dean is setting up may help mitigate some of this (it
should make better use of the resources we have, as well as allowing
us to benefit (in the form of smaller blame lists, though not
necessarily lower buildbot result latency) from additional resources
by allowing greater parallelism).

There's something I've always meant to ask: when you've got a stable buildbot setup (same compiler, etc.)
all the buildbots are still set up to do a configure and "make clean". I can easily understand that, if
any other part of the system might also be changing, you'd want to be sure failures were due to repo changes,
and so do a make clean. But those changes aren't too frequent and you could do a manual clean at those times.

So all the make clean appears to be doing is guarding against an error in llvm's build system dependency checking. Is the tiny
probability of that worth the effect on the build times? (Personally I'd say no, but maybe there's an argument for it I haven't spotted.)
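Concretely, the incremental scheme described above would look something like the following builder step. This is a sketch under assumed paths (source in ../llvm, building in the current objdir); actual zorg build steps differ:

```shell
# Configure only on a fresh objdir; afterwards rely on make's
# dependency tracking instead of running 'make clean' every build.
if [ ! -f config.status ]; then
    ../llvm/configure --enable-optimized
fi
make -j4        # incremental build; no 'make clean' step
make check-all  # run the tests as usual
```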

Thanks,
Dave


I don't believe so, no, and in fact when I went to set up my own
builders (for the GDB 7.5 test suite - I've got it running on the
public builder, but also my own internal one with a bit more hardware
so I can get results sooner (not ideal - I'd prefer to have things
public, but I had the hardware lying around so figured I'd use
it)) I tried to do this. I seemed to have trouble with the 'configure'
step invalidating the whole build anyway - I don't want the configure
step to be something someone has to do manually on a new builder, but
I'm not sure how to run it in such a way that it doesn't cause an
otherwise incremental build to become a full rebuild either. Any ideas
would be most welcome.

(I believe Takumi's bots do incremental rebuilds though, so I guess he
has some way of doing that)

More hardware! :smiley:

--renato

The following failures are consistent on buildbot (and my local box).

[...]

LLVM :: Transforms/LoopStrengthReduce/post-inc-icmpzero.ll
LLVM :: Transforms/LoopStrengthReduce/2012-07-18-LimitReassociate.ll

It is interesting that I don't see this on my ARM box. Instead I see these:

Failing Tests (5):
    LLVM :: ExecutionEngine/MCJIT/2003-01-04-ArgumentBug.ll
    LLVM :: ExecutionEngine/MCJIT/pr13727.ll
    LLVM :: ExecutionEngine/MCJIT/test-common-symbols.ll
    LLVM :: ExecutionEngine/MCJIT/test-fp-no-external-funcs.ll
    LLVM :: ExecutionEngine/MCJIT/test-fp.ll

I configure with:
--build=armv7l-unknown-linux-gnueabihf
--host=armv7l-unknown-linux-gnueabihf
--target=armv7l-unknown-linux-gnueabihf
--with-cpu=cortex-a9 --with-fpu=neon
--with-float=hard --enable-optimized

$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 0 (v7l)
processor : 0
BogoMIPS : 1992.29

processor : 1
BogoMIPS : 1992.29

Features : swp half thumb fastmult vfp edsp vfpv3 vfpv3d16
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x1
CPU part : 0xc09
CPU revision : 0

Any ideas?

Dmitri

The obvious difference is that you're using --enable-optimized and implicitly --disable-assertions. If you run the tests with

make check-all VERBOSE=1 'LIT_ARGS=-v ' > logfile

and grep for FAILED in logfile, does what's listed there give any more details? (Quite possibly in a Release-Asserts build
it won't.)

Cheers,
Dave

All these tests fail with an 'illegal instruction' signal. For example:

******************** TEST 'LLVM ::
ExecutionEngine/MCJIT/2003-01-04-ArgumentBug.ll' FAILED

You can compare your configure/build arguments + environment with the build bot:

http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a9/builds/4313/steps/configure/logs/stdio

I'll check how I built LLVM on my Chromebook tomorrow, but it didn't look too different from yours.

--renato

Hello guys,

First of all, this builder is configured to send e-mails to the blame list.
I’m the owner of these 2 identical build slaves (which is listed in the builder’s properties :slight_smile: ).
I keep an eye on this builder. Though, since builds take ~3 hours, they combine quite a lot of changes, the blame lists are long, and it takes time for people to notice and fix issues.

We do need more hardware, indeed.

The build broke yesterday, then it was fixed, and then it broke again.
If I get it right, the revisions in question for the last break are r171734 and r171735.

Let me know if anyone needs access to one of the build slaves to debug.

And it is great that you care!

Thanks

Galina