Test-Suite Bots failing randomly

Hi folks,

I noticed our test-suite bot is failing randomly with a timeout during the install phase:

http://lab.llvm.org:8011/builders/clang-native-arm-lnt/builds/666/steps/venv.lnt.install/logs/stdio

Downloading/unpacking Werkzeug>=0.6.1 (from Flask->LNT==0.4.1dev)
Exception:

File "/opt/buildbot/clang-native-arm-lnt/lnt.venv/local/lib/python2.7/site-packages/pip-1.2.1-py2.7.egg/pip/download.py", line 380, in _download_url
chunk = resp.read(4096)
(…)
timeout: timed out

The package that times out is not always the same; sometimes it also happens with SQLAlchemy. I'm guessing the server is loaded or the connection is somehow bad?

cheers,
--renato

Ping?

This is now happening as often as every other run... Is that process downloading from an internal server? If not, should we? If yes, should we not?

thanks,
--renato

Hey Renato,

I'll take a look this morning and move to fetching from an internal server so this doesn't happen again.

- Daniel

Thanks!

--renato

> Ping?
>
> This is now happening as often as every other run... Is that process downloading
> from an internal server? If not, should we? If yes, should we not?

Ok, so we are already downloading from a private server (lab.llvm.org),
which IMHO we should be doing because we want tight control over the
dependencies etc.

I forgot that your machine is outside the lab, so that may be involved
here. I don't fully understand the networking setup but it is very
different outside vs inside. That said, I ran some timings from our network
here to lab.llvm.org, and I couldn't reproduce any connectivity issues from
here.

Renato, can you try running some testing on your side to quantify where the
connectivity issue is? Things like:
$ ab -c 3 -n 100 http://lab.llvm.org/packages/SQLAlchemy-0.6.6.tar.gz
work fine from here.
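
For a slave that does not have `ab` installed, roughly the same measurement can be sketched in Python. This is only an illustration, not part of LNT or zorg: the helper names `time_download` and `summarize` are made up here, and it assumes a modern Python 3 rather than the Python 2.7 on the bot. It reads in 4096-byte chunks, like the pip code in the traceback, and records how long each fetch takes or that it failed.

```python
import time
import urllib.request


def time_download(url, timeout=30.0, opener=urllib.request.urlopen):
    """Fetch url once; return (seconds, bytes_read), or (seconds, None) on failure."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            size = 0
            while True:
                chunk = resp.read(4096)
                if not chunk:
                    break
                size += len(chunk)
    except OSError:
        # URLError and socket.timeout are both OSError subclasses.
        size = None
    return time.monotonic() - start, size


def summarize(results):
    """Count failures and report the slowest successful fetch."""
    ok = [secs for secs, size in results if size is not None]
    return {
        "attempts": len(results),
        "failures": len(results) - len(ok),
        "slowest": max(ok) if ok else None,
    }
```

Calling `summarize([time_download(url) for _ in range(100)])` with the SQLAlchemy URL above, from both inside and outside the lab, should show whether the failures are tied to one network.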

That said, it *would* be nice to harden this code against that failure mode
-- one reasonable way would be to have the buildbot slave mirror the
packages directory locally (using rsync or so) and do the install from
there. We can make the rsync command not cause a buildbot failure, since we
rarely change the packages. Any interest in making these changes to the
buildbot config?
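
A rough sketch of that idea, outside of buildbot, might look like the following. The function names and the 60-second rsync timeout are assumptions for illustration, and the rsync source for the packages directory is left as a parameter because the thread does not say how it would be exposed. The point is that a failed sync is reported but never raised, and that pip is told to resolve packages only from the local mirror directory.

```python
import subprocess


def mirror_packages(source, dest, run=subprocess.call):
    """Refresh the local package mirror; report failure instead of raising."""
    cmd = ["rsync", "-a", "--timeout=60", source, dest]
    try:
        return run(cmd) == 0
    except OSError:
        # rsync missing or not executable: treat as a soft failure too.
        return False


def pip_install_args(requirement, mirror_dir):
    """pip arguments that resolve packages only from the local mirror."""
    return ["pip", "install", "--no-index", "--find-links", mirror_dir, requirement]
```

In the buildbot config itself, the equivalent would presumably be a sync step with `haltOnFailure=False` and `flunkOnFailure=False`, followed by the install step pointed at the mirror directory.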

- Daniel

> Ok, so we are already downloading from a private server (lab.llvm.org),
> which IMHO we should be doing because we want tight control over the
> dependencies etc.

Agreed.

> I forgot that your machine is outside the lab, so that may be involved
> here. I don't fully understand the networking setup but it is very
> different outside vs inside. That said, I ran some timings from our network
> here to lab.llvm.org, and I couldn't reproduce any connectivity issues
> from here.

Around the time you replied, I found out that there were networking issues
in our lab, so it's probably our fault. Sorry for the noise. :(

In theory, it should be fixed, so let's wait until tomorrow and I'll follow
the bot closely to see if there's any problem.

> Renato, can you try running some testing on your side to quantify where the
> connectivity issue is? Things like:
> $ ab -c 3 -n 100 http://lab.llvm.org/packages/SQLAlchemy-0.6.6.tar.gz
> work fine from here.

Works in our lab, too. That is, now...

> That said, it *would* be nice to harden this code against that failure mode
> -- one reasonable way would be to have the buildbot slave mirror the
> packages directory locally (using rsync or so) and do the install from
> there. We can make the rsync command not cause a buildbot failure, since we
> rarely change the packages. Any interest in making these changes to the
> buildbot config?

I wonder if the Python code that downloads it doesn't already provide a
mechanism to cache the files. I don't think we need an external rsync; we
could just save the downloaded files in /tmp and retrieve them from there if
we fail to resolve the external host. A warning would probably be in order.
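
As a minimal sketch of that fallback, assuming nothing about what pip or LNT already do: the hypothetical helper `fetch_with_cache` below saves each download in a cache directory and, if a later download fails, warns and returns the previously cached file. It uses Python 3 standard-library calls only; the name, the `.part` suffix, and the 30-second timeout are illustrative choices.

```python
import os
import shutil
import urllib.request
import warnings


def fetch_with_cache(url, cache_dir, timeout=30.0, opener=urllib.request.urlopen):
    """Download url into cache_dir; on network failure reuse the cached copy."""
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, os.path.basename(url))
    partial = cached + ".part"
    try:
        with opener(url, timeout=timeout) as resp:
            # Write to a side file first so an interrupted download never
            # replaces a good cached copy.
            with open(partial, "wb") as out:
                shutil.copyfileobj(resp, out)
        os.replace(partial, cached)
    except OSError as exc:
        if not os.path.exists(cached):
            raise
        warnings.warn("download of %s failed (%s); using cached copy" % (url, exc))
    return cached
```

The first successful run populates the cache; a later run that times out still returns a usable file and leaves a warning in the log instead of failing the build.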

I can have a look at the script. Would be great if you could tell me where
I can find the basic stuff. ;)

cheers,
--renato

Great.

The buildbot scripts are in the 'zorg' repo, the one in question is here:

http://llvm.org/viewvc/llvm-project/zorg/trunk/zorg/buildbot/builders/LNTBuilder.py?view=log

- Daniel

All green so far... I think it was on our end.

I don't know Python well enough to know which module would be good to
download and cache, etc. but I feel that this should be shared among all
builders, especially if we have to build a new module from scratch.

cheers,
--renato