Buildbots timeout

Hi Galina,

I have observed that some of my bots time out for no apparent reason
(no external load, no network mishaps, no issues), and when I go to
check the buildslave process, I see that it has just been restarted.

Do you know what could be causing this?

Swap is untouched, plenty of disk left, CPUs aplenty. So, I'm
contemplating the possibility that the master lost connectivity to it
and didn't get it back until it was too late.

I've seen this happen with my local master: when I reload the config,
the bot doesn't die but doesn't do much either. What I then do is a
full restart of all slaves and the master.

This is impractical in the LLVM Lab, but it would be good to at least
understand what's going on before I get people accusing me of
"unstable behaviour". :)

cheers,
--renato

Hi Renato,

Could you please point me to them (which bots, and when did this happen)?

I will look.

Thanks

Galina

> Could you please point me to them (which bots, and when did this happen)?

http://lab.llvm.org:8011/builders/clang-cmake-armv7-a15/builds/6793

> I will look.

Thanks! I may be wrong, but this is not the first phantom issue I've had
to chase recently. :)

Just want to make sure there wasn't anything from your side before I
delve into the hardware parts.

--renato

I took a look at this log and the previous one that succeeded, and the thing that struck me was that both successfully completed 82 steps.

However, the build that failed thought it needed to do 106 steps. It hung on step 83, which didn’t exist.

Could this be a cmake/ninja issue?
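
A minimal sketch of how the two logs could be compared, assuming the step counts come from ninja's default progress prefix (`[finished/total]`) at the start of each line. The function name and the log excerpts are made up for illustration and are not taken from the real bot output.

```python
import re

# Ninja's default progress prefix looks like "[82/106]" (finished/total).
PROGRESS = re.compile(r"^\[(\d+)/(\d+)\]")


def last_progress(log_text):
    """Return (finished, total) from the last progress line, or None."""
    last = None
    for line in log_text.splitlines():
        match = PROGRESS.match(line)
        if match:
            last = (int(match.group(1)), int(match.group(2)))
    return last


# Hypothetical log excerpts, only to show the shape of the comparison.
good_log = "[81/82] Building CXX object a.o\n[82/82] Linking CXX executable clang\n"
bad_log = "[81/106] Building CXX object a.o\n[82/106] Linking CXX executable clang\n"

for name, log in (("good", good_log), ("bad", bad_log)):
    finished, total = last_progress(log)
    print(f"{name}: stopped at {finished} of {total}")
```

If the last line of the failing log reports fewer finished steps than its total while the passing log ends with the two numbers equal, the two builds disagreed about how much work there was, which would point at the build graph rather than at the slave itself.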

> However, the build that failed thought it needed to do 106 steps. It hung
> on step 83, which didn't exist.

Interesting...

> Could this be a cmake/ninja issue?

It could. Those bots are not cleaning between builds (or it would take
hours). If that's what's happening, I'll have to re-think my approach.

I'll investigate more, thanks!

--renato

I do not see anything disturbing in the master logs about this bot, but I completed a master reconfig today at 12:34.
This bot reported the failure at 13:00.

BTW, I did a master restart yesterday at about 6:10 PM, and before that the last master restart/reconfig was on October 2.

Hope this will be useful in understanding what could be wrong.

Thanks

Galina

Can you use ccache in your environment?

> I do not see anything disturbing in the master logs about this bot, but I
> completed a master reconfig today at 12:34.
> This bot reported the failure at 13:00.

Right, this is suspiciously similar to the problem I'm seeing on my own master.

> BTW, I did a master restart yesterday at about 6:10 PM, and before that the
> last master restart/reconfig was on October 2.

I'll check, thanks!

cheers,
--renato

I use dirty builds + ccache, in case either one gets it wrong.

Ccache is not perfect on its own, though. It still rebuilds a lot that
it didn't need to (I can't remember exactly what), and small builds can
take double the time.

cheers,
--renato