Buildbots building one revision at a time

Hi everyone,
the short version - can I make a bot skip a bunch of revisions even if it is set up to build every revision?

The long version:

Since many of the PPC build bots were down for 8 days due to the winter storm in Austin, TX, a large number of build requests have queued up. For most of our bots, this isn’t really an issue since they just pick up the latest revision and make the jump.

However, our flang and mlir bots appear to be building a single revision at a time. Even though each build takes under 10 minutes, it will take quite some time for them to catch up on the few hundred requests. This appears to be because they have ‘collapseRequests’: False in their configuration.
I would like to keep that behaviour, but hopefully there is an override for special circumstances such as this. None of my attempts (Cancel whole queue and Force Build on https://lab.llvm.org/buildbot/#/builders/88) have done anything. Does someone know of a way to make these bots skip all these build requests?
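
To make the difference concrete, here is a toy model of the two behaviours described above. It is only a sketch and not Buildbot's actual scheduling code: the function name `revisions_built`, the backlog of 300 requests, and the 10 minutes per build are illustrative assumptions standing in for "a few hundred requests" and "under 10 minutes".

```python
from collections import deque


def revisions_built(pending, collapse_requests):
    """Toy model: which queued revisions end up being built.

    `pending` is an iterable of revisions, oldest first. This is a
    hypothetical helper for illustration, not part of Buildbot.
    """
    queue = deque(pending)
    built = []
    while queue:
        if collapse_requests:
            # All pending requests are merged into one build of the newest.
            built.append(queue[-1])
            queue.clear()
        else:
            # Every request gets its own build, oldest first.
            built.append(queue.popleft())
    return built


backlog = list(range(1, 301))   # assumed: 300 queued requests
minutes_per_build = 10          # assumed upper bound per build

one_at_a_time = revisions_built(backlog, collapse_requests=False)
collapsed = revisions_built(backlog, collapse_requests=True)

print(len(one_at_a_time) * minutes_per_build / 60)  # 50.0 (hours to catch up)
print(len(collapsed) * minutes_per_build)           # 10 (minutes to catch up)
```

Under these assumed numbers, the one-revision-at-a-time bots would need about two days to drain the backlog, which is why a one-off way to drop the queue is attractive.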

Not sure, but Galina might have some ideas.

Hi Nemanja,

I have cancelled the queue for both ppc64le-mlir-rhel-clang and ppc64le-flang-rhel-clang for you.
Hope this helps.

the short version - can I make a bot skip a bunch of revisions even if it is set up to build every revision?

In cases like this, you can manually cancel the queued build requests from the Web UI. Under the hood there could be multiple queues, so it might take more than one click on the “Cancel whole queue” button.

Barring unknown issues, every worker owner has permission to control the worker itself and its queue of build requests, provided the owner is logged in and the GitHub account they logged in with has an e-mail address matching the one in the worker's info. Yours must be powerllvm@ca.ibm.com. Was that the case?

I have also noticed that you have a worker with the wrong name, “ppc64le-flang+mlir-rhel-test”, trying to connect over and over again. It comes from the same IP address as the correctly named worker, “ppc64le-flang-mlir-rhel-test”. Could you locate and stop the wrong worker, please?

Stay safe and warm!

Galina

That's a good thing to know. I tried this myself and added
pollybot@meinersbur.de to GitHub's list of my email addresses. After
logging in with GitHub, it unfortunately does not work. Any action I
try is denied with an "unable to pause worker
polly-x86_64-fdcserver:you need to have role 'LLVM Lab team'" message.

The problem I currently have is that one of my buildbots (Buildbot)
live-locks after a few hours. The master thinks the worker is still
building, but the worker is doing nothing. Its twisted.log says:

2021-02-24 04:27:04-0600 [-] WorkerForBuilder.commandComplete <buildbot_worker.commands.shell.WorkerShellCommand object at 0x7f59b68def10>
2021-02-24 04:33:24-0600 [-] sending app-level keepalive
2021-02-24 04:33:24-0600 [Broker,client] Master replied to keepalive, everything's fine
2021-02-24 04:43:24-0600 [-] sending app-level keepalive
2021-02-24 04:43:24-0600 [Broker,client] Master replied to keepalive, everything's fine
[... lots of keepalive entries ...]

I have to restart the buildbot-worker to make it start working on the
next job, only for it to live-lock again in a few hours. For this
reason I put this worker on staging instead of production. Is this a
known problem?
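
In case it helps anyone spot the same symptom, here is a rough sketch of how the idle period could be measured from a twisted.log like the one above. It assumes every entry starts with a timestamp in the format shown in the excerpt, and it treats any timestamped line that does not mention "keepalive" as real activity. The function name `seconds_idle` is my own and not part of buildbot-worker.

```python
import re
from datetime import datetime

# A log entry starts with e.g. "2021-02-24 04:27:04-0600 [-] message".
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4}) \[[^\]]*\] (.*)$"
)


def seconds_idle(log_text):
    """Seconds between the last non-keepalive entry and the last keepalive.

    Returns None if the log contains no activity entry or no keepalive.
    Lines without a leading timestamp (wrapped continuations) are skipped.
    """
    last_activity = None
    last_keepalive = None
    for line in log_text.splitlines():
        match = ENTRY.match(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S%z")
        if "keepalive" in match.group(2):
            last_keepalive = stamp
        else:
            last_activity = stamp
    if last_activity is None or last_keepalive is None:
        return None
    # Clamp to zero when real activity is newer than the last keepalive.
    return max(0.0, (last_keepalive - last_activity).total_seconds())
```

On the excerpt above this gives 980 seconds (04:27:04 to 04:43:24). A value that keeps growing while the master still shows a running build would match the live-lock I am seeing.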

Michael

Only the primary e-mail is used for authorization purposes.
If yours was already the primary one, we can set up a time when we are both available and troubleshoot, if needed.
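
The rule can be sketched roughly as follows. This is only an illustration of the matching described in this thread, not the actual authorization code on the LLVM buildbot master: the `GitHubLogin` type, the `may_control_worker` function, the case-insensitive comparison, and the address michael@example.org are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class GitHubLogin:
    """Hypothetical view of a logged-in GitHub account."""
    primary_email: str
    other_emails: list = field(default_factory=list)


def may_control_worker(login, worker_admin_email):
    """True if the login's primary e-mail matches the worker's info.

    Only the primary address is compared; other_emails is deliberately
    ignored. Case-insensitive matching is an assumption of this sketch.
    """
    wanted = worker_admin_email.strip().casefold()
    return login.primary_email.strip().casefold() == wanted


# Adding the bot address as a secondary e-mail is not enough ...
secondary_only = GitHubLogin("michael@example.org", ["pollybot@meinersbur.de"])
print(may_control_worker(secondary_only, "pollybot@meinersbur.de"))  # False

# ... it has to be the primary one.
as_primary = GitHubLogin("pollybot@meinersbur.de")
print(may_control_worker(as_primary, "pollybot@meinersbur.de"))      # True
```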

From the buildbot server's perspective, the test step never ended, and the worker seemed online and responsive.
I have restarted staging to make sure all the connections were reset, and your bot now seems to build fine.

Thanks

Galina

Really sorry for the late reply.

Thank you so much for canceling the queues for these bots. I actually didn’t realize that the email for the logged-in user has to match the one we use for the bot. I was logged in with a different one. Thanks for explaining that.
As for the old bot with the invalid name, I suspect it started itself back up after the machine came back to life. We’ll be sure to disable it and nuke it once and for all.

Nemanja

Unfortunately, just restarting the server did not help. It is stuck again.

Michael