Buildbot General Failure - Production Stop?

Folks,

As Nico and Diana investigated earlier [1], there was a change in Zorg
which made buildbots update one source directory (llvm.src) but build
from another (llvm), so *all* builds came from the same revision,
no matter what was updated.

Essentially, the bots were all lying when they said this or that
commit "passed", since they were still testing the same old commit.
All our bots were affected, and it seems many others were too: Windows,
PowerPC, s390, Atom, etc.

I have worked around the problem now by making "llvm" a symbolic
link to "llvm.src", so we build what we update and *many* of the bots
are coming back with a myriad of failures, which are most likely from
different commits in the last 4 days. This will take a while to clean
up... for all of us.
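
For other bot owners, a minimal sketch of that workaround in Python. It assumes a POSIX slave where "llvm" and "llvm.src" sit side by side in the builder directory; the helper name `link_build_dir` and the choice to rename the stale checkout to "llvm.stale" rather than delete it are illustrative assumptions, not anything Zorg does.

```python
import os


def link_build_dir(builder_dir, src_name="llvm.src", build_name="llvm"):
    """Make build_name a symlink to src_name inside builder_dir.

    Hypothetical helper: a stale real checkout is renamed, not deleted.
    """
    src = os.path.join(builder_dir, src_name)
    link = os.path.join(builder_dir, build_name)
    if not os.path.isdir(src):
        raise FileNotFoundError(src)
    if os.path.islink(link):
        os.unlink(link)  # replace an existing link
    elif os.path.exists(link):
        os.rename(link, link + ".stale")  # keep the old checkout around
    # Relative target, so the link survives moving builder_dir.
    os.symlink(src_name, link)
    return link
```

After this, whatever the update step writes into "llvm.src" is what the build step sees under "llvm".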

My question is: what do we do now?

The safest option would be to stop production, i.e. block commits,
until the bots are reverted and then green. In a way, with all those
bots not testing anything, whatever we commit is *not* going to be
tested at all in a large part of our infrastructure, so I don't really
think there is a point in assuming we can continue committing at
will...

I don't remember this ever happening in LLVM, that's why I'm
reluctant to propose it more strongly, but I see no better
alternative.

So, what now?

cheers,
--renato

[1] http://lists.llvm.org/pipermail/cfe-dev/2016-September/050651.html

Let's first see how bad it is once bots are fixed to build the latest revision. It's only been a few days and that includes a weekend.

-Krzysztof

Of course. I'll wait until all our bots return from the first round to
know the size of the damage.

I just wanted to warn all other buildbot owners and the general public
that we do have a tough situation going, and if their patches are not
essential, voluntarily holding off for a while would be a good idea.
(Though, now that I read my email, it does sound alarmist. I'm in
panic mode right now, so I apologise :).

But all other affected bots won't come back until Galina reverts and
restarts the master (or until their owners work around it like I did).
There's no way of knowing how bad it is if the bots continue chugging
bogus green status to every commit...

cheers,
--renato

Hi Renato, do we have an idea which was the latest commit that the buildbots (really) tested?

- V. Kalintiris

No idea. Galina said SVN was out of sync on the 2nd (18:00 her local
time), and I'm guessing the problem had already existed for a long
time by then.

Right now I'm trying to put out the fires, not worry about the range.
Our first bot [1] broke lots of tests, and I was lucky to find commits
in those areas since the 2nd, so I already fired off an email and I
expect those patches to be preemptively reverted.

Also, ARM can't self-host anymore [2], and I'm trying 280555 which is
around the time of her email. If that fails, I'll be going back ~100
commits or so at a time.
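
The "go back ~100 commits at a time" search can be sketched like this. It is only an illustration: `find_good_revision` and the `is_good` callback are hypothetical names, and the callback stands in for whatever actually checks out a revision and tries the self-host build.

```python
def find_good_revision(start, is_good, step=100, floor=0):
    """Walk back from start in strides of step until is_good(rev) is true.

    Returns the first good revision found, or None once floor is passed.
    is_good is a caller-supplied callback (hypothetical), e.g. one that
    checks out the revision and runs a self-host build.
    """
    rev = start
    while rev >= floor:
        if is_good(rev):
            return rev
        rev -= step
    return None
```

Once a good revision turns up, the breakage lies somewhere in the stride just above it, so that range of at most `step` commits can be bisected instead of testing every commit.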

Our other bots are falling down like rotten apples [3], and I'll be
going through one at a time... :(

--renato

[1] http://lab.llvm.org:8011/builders/clang-cmake-armv7-a15/builds/14952
[2] http://lab.llvm.org:8011/builders/clang-cmake-armv7-a15-selfhost/builds/7685
[3] http://llvm.tcwglab.linaro.org/monitor/

FYI, I created https://llvm.org/bugs/show_bug.cgi?id=30287 to track
the problems with all bots.

Bot owners, please CC yourselves and let's get those bots green! :)

--renato

I can tell you that at least for one of the buildbots I follow, this is the last build that was actually tested. You can see the slave losing connection with the master during the “ninja check 2” stage, and if you change the URL to look at build 15320 (the next build) you’ll see that the “build stage 1” drops to 0 seconds and remains that way on subsequent builds.

So, somewhere around SVN r280562 (Friday 17:38 Pacific Time) is going to be the last tested build, but each bot will be on a slightly different revision around that time.
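
Other bot owners could apply the same heuristic to their own builders. A rough sketch, assuming the per-build duration of the “build stage 1” step has already been collected into a dict (how it is gathered is left open, and `last_tested_build` is a hypothetical name). It also assumes the data stops before any workaround was applied, since builds after a fix take real time again.

```python
def last_tested_build(stage1_seconds, threshold=1.0):
    """Return the last build whose stage 1 step did real work.

    stage1_seconds maps build number -> duration in seconds of the
    "build stage 1" step (hypothetical input). Builds at or below
    threshold are treated as no-ops that rebuilt nothing.
    """
    last_real = None
    for build in sorted(stage1_seconds):
        if stage1_seconds[build] > threshold:
            last_real = build
    return last_real
```

The function returns None when no build in the data did any real work, and the threshold is a guess that may need raising on bots where a no-op build still takes a few seconds.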

Turns out Galina's refactoring has scrambled the source directories
around and our bots are completely nuts.

The commit we have identified as a possible breaker is r280435, but
there are potentially more. We need this refactoring reverted immediately
and the master restarted, so we can clean up our bot directories and
start fresh.

Any subsequent refactoring must be validated with buildbot owners
*before* going in.

Galina,

Can you please revert the changes and restart the master? Can anyone
else restart the master?

cheers,
--renato