> This buildbot looks like it's been failing since Friday - does anyone
> know/own/care about it?
Yes, we're looking into it.
As you probably noticed, debugging ARM buildbots is neither easy nor
fast. Reverting commits at random also doesn't help with the problem,
and bisecting can take days, if not weeks. So the one-week rule for
disabling bots is too harsh on those bots.
Also, please know that I do care a lot about *all* ARM bots (including
AArch64) and I do check them multiple times a day, so if they're red,
I'm definitely aware and trying to fix it.
cheers,
--renato
> > This buildbot looks like it's been failing since Friday - does anyone
> > know/own/care about it?
> Yes, we're looking into it.
> As you probably noticed, debugging ARM buildbots is neither easy nor
> fast. Reverting commits at random also doesn't help with the problem,
> and bisecting can take days, if not weeks. So the one-week rule for
> disabling bots is too harsh on those bots.
Is it? While it's failing, the buildbot doesn't seem to be of any use to the
community at large - it's essentially the buildbot owner's problem at that
point and probably shouldn't be engaging with the community until it's
green again, I think?
Is the buildbot useful to you during this time? Or are you debugging
elsewhere/privately?
If the buildbot is useful to you, but not the community at large - perhaps
we could get in the habit of moving it into a "no email" pool whenever a
failure occurs, until it can be cleared up. (hopefully this pool is clearly
distinguished from the rest of the buildbots in the waterfall/grid view -
because it'd be helpful to be able to look at an easily distinguished
subset of the waterfall/grid and see the bots that are expected to be green
for any developer there)
> > This buildbot looks like it's been failing since Friday - does anyone
> > know/own/care about it?
> Yes, we're looking into it.
> As you probably noticed, debugging ARM buildbots is neither easy nor
> fast. Reverting commits at random also doesn't help with the problem,
> and bisecting can take days, if not weeks.
Also - if the blame list isn't short enough to provide effective/actionable
blame for the actual developer who caused the regression, sending email
seems noisy and unhelpful. This seems like a buildbot that should just be
emailing you (and anyone else tasked with/interested in investigating these
failures), not a long list of project contributors?
> Is it? While it's failing, the buildbot doesn't seem to be of any use to the
> community at large - it's essentially the buildbot owner's problem at that
> point and probably shouldn't be engaging with the community until it's green
> again, I think?
The bot is useful as it still shows if there are new bugs since the
initial problem, and can help bisect any further problems when they
come. If we disable that bot, when we fix the issue and bring it back,
there could be a number of new failures that we didn't monitor and
that will need a few more days/weeks to remove, especially if they're
cumulative. This way, it's likely that we'll never have that bot
online ever again. This is bad for the community.
> Is the buildbot useful to you during this time? Or are you debugging
> elsewhere/privately?
Both. As I described above, this bot is useful not just to me, but the
community, as they can cross check if their commits introduced bugs to
all ARM bots, not just one, and the slow bot will show that. I'm also
investigating elsewhere, since if I turn this bot off, what I said
above will happen. I'm also not alone in investigating this, Saleem is
helping me.
> If the buildbot is useful to you, but not the community at large - perhaps
> we could get in the habit of moving it into a "no email" pool whenever a
> failure occurs, until it can be cleared up. (hopefully this pool is clearly
> distinguished from the rest of the buildbots in the waterfall/grid view -
> because it'd be helpful to be able to look at an easily distinguished subset
> of the waterfall/grid and see the bots that are expected to be green for any
> developer there)
Any movement means restarting the buildmaster, which means stopping
all current builds and upsetting all other bots. If we start taking
the stance of moving things up and down the priority list, we'll have
more unstable buildbots and that's worse for the community. Our
agreement, at least from what I understood, was that we should move
unstable bots to offline if: they're broken for a while AND no one is
trying to or can fix it. "A while" is vague because it depends on the
hardware, and I'm definitely trying to fix it.
It's not because the hardware is slow that it has no value to the
community, unless you're arguing that we shouldn't test ARM at all,
which is a whole different story.
Not emailing bugs in this bot when it's green means it's probably
useless, so I wouldn't want to have any bots in there. I already have
a separate buildmaster which doesn't email where I test my prototypes,
but those are work in progress, while my production bots are not.
A neater solution would be to not email *any* buildbot that moves from
exception to failure if the previous non-exceptional status is also
failure. This way, we won't have the kind of email that upset you, but
we still have the value that a red bot provides.
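
The rule above can be sketched as a small decision function. This is only
an illustration under stated assumptions: the `Status` values and the
`should_email` helper are hypothetical names, not part of any buildbot API,
and the sketch also assumes that a plain failure-after-failure stays silent,
which the thread does not spell out.

```python
from enum import Enum
from typing import Sequence


class Status(Enum):
    """Hypothetical build outcomes; not taken from any buildbot API."""
    SUCCESS = "success"
    FAILURE = "failure"
    EXCEPTION = "exception"


def should_email(previous: Sequence[Status], current: Status) -> bool:
    """Decide whether a new build result should trigger a blame email.

    `previous` holds earlier results, oldest first. Exceptions are skipped
    when looking for the last meaningful result, so the sequence
    failure -> exception -> failure does not send a second email.
    """
    if current is not Status.FAILURE:
        # Only failures are reported in this sketch.
        return False
    for status in reversed(previous):
        if status is not Status.EXCEPTION:
            # Email only if the last non-exceptional build was not a failure.
            return status is not Status.FAILURE
    # No earlier non-exceptional build: treat the failure as new.
    return True
```

With this rule, a bot that was green, hit an exception, and then failed
still sends mail, while a bot that was already red before the exception
stays quiet.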
cheers,
--renato
> > Is it? While it's failing, the buildbot doesn't seem to be of any use to the
> > community at large - it's essentially the buildbot owner's problem at that
> > point and probably shouldn't be engaging with the community until it's
> > green again, I think?
> The bot is useful as it still shows if there are new bugs since the
> initial problem, and can help bisect any further problems when they
> come. If we disable that bot, when we fix the issue and bring it back,
> there could be a number of new failures that we didn't monitor and
> that will need a few more days/weeks to remove, especially if they're
> cumulative. This way, it's likely that we'll never have that bot
> online ever again. This is bad for the community.
The community generally doesn't pay attention to the bot once it goes red,
so this seems relevant only to the "we didn't monitor" part - and by "we"
I take it you mean you-and-other-people-who-care-about-the-bot, not the
community at large.
I certainly don't look beyond "oh, the bot was already red" and /maybe/ if
you're lucky "oh, a different thing is failing now", but I often don't get
that far owing to the high false positive rate (due to flakes and existing
errors) in the buildbots.
Maybe other people's experiences are different, but I don't have much
evidence to suggest that.
> > Is the buildbot useful to you during this time? Or are you debugging
> > elsewhere/privately?
> Both. As I described above, this bot is useful not just to me, but the
> community, as they can cross check if their commits introduced bugs to
> all ARM bots, not just one, and the slow bot will show that.
I don't know about other people, but I don't cross reference bots that
closely. I mostly ignore the low rumble of noise I get back from the
buildbots every time I commit. I have to measure by magnitude (& level of
trust with different bots); this is really not possible for newer
contributors - they won't know what to pay attention to or not. I don't
think it's a sustainable way to run the bots.
> I'm also investigating elsewhere, since if I turn this bot off, what I said
> above will happen. I'm also not alone in investigating this, Saleem is
> helping me.
> > If the buildbot is useful to you, but not the community at large - perhaps
> > we could get in the habit of moving it into a "no email" pool whenever a
> > failure occurs, until it can be cleared up. (hopefully this pool is clearly
> > distinguished from the rest of the buildbots in the waterfall/grid view -
> > because it'd be helpful to be able to look at an easily distinguished subset
> > of the waterfall/grid and see the bots that are expected to be green for any
> > developer there)
> Any movement means restarting the buildmaster, which means stopping
> all current builds and upsetting all other bots. If we start taking
> the stance of moving things up and down the priority list, we'll have
> more unstable buildbots and that's worse for the community. Our
> agreement, at least from what I understood, was that we should move
> unstable bots to offline if: they're broken for a while AND no one is
> trying to or can fix it. "A while" is vague because it depends on the
> hardware, and I'm definitely trying to fix it.
>
> It's not because the hardware is slow that it has no value to the
> community, unless you're arguing that we shouldn't test ARM at all,
> which is a whole different story.
If the failure mails are not actionable, they're not useful to the
community. If the blame list is too long (or too delayed) it's not likely
to be useful.
If a certain platform just takes a long time (though we could reduce that
with a hybrid approach - cross build the compiler on a fast platform, run
the tests on the other) then it's necessary to put more hardware (multiple
slaves) behind it to reduce the blame lists, I think.
> Not emailing bugs in this bot when it's green means it's probably
> useless,
It doesn't seem useless - it's still a signal to you and other developers
who care about the platform and will investigate failures.
> so I wouldn't want to have any bots in there. I already have
> a separate buildmaster which doesn't email where I test my prototypes,
> but those are work in progress, while my production bots are not.
>
> A neater solution would be to not email *any* buildbot that moves from
> exception to failure if the previous non-exceptional status is also
> failure. This way, we won't have the kind of email that upset you, but
> we still have the value that a red bot provides.
Sure, I'd be OK-ish with that, though it'd still make looking at the
waterfall/grid problematic as it is today (though I don't do that often, so
I don't personally care about that). It'd be the same as moving the
buildbot to a "no email" group until fixed, but without the need to cycle
the buildmaster (& with the benefit that it'd happen automatically - though
I'm only suggesting moving it off emailing when there's active
investigation, so the small manual task at the beginning and end of that
cycle doesn't seem too detrimental - no need to do it when someone just
checks in a buildbreak by mistake, etc)
- Dave
Yup. Best of both worlds.
cheers,
--renato