The AArch64 builder in the pre-commit CI has been timing out all day today. Is anyone working on fixing this?
It’s related to the AWS outage. @tstellar was looking at moving the jobs over to a different region. Based on AWS’s last status report though, it looks like things are recovering in the impacted region.
Everything seems to be back to normal now. I’m going to follow up with them to see what our options are in the future if something like this happens again.
An entire region from a major cloud provider going down should be rare, but this is the reason that the x86_64 Linux/Windows premerge is set up in an HA (high availability) configuration across two different clusters. These clusters are set up in independent cloud regions (us-west1 and us-central1-a in GCP terminology), which should make us a bit more redundant.
If splitting work across two AWS regions on the depot side is a possibility, that would be great to handle exactly this sort of case.
It has started happening again. E.g. [clang] Make 'fileScopeAsmDecl' matcher public · llvm/llvm-project@c416680 · GitHub
The self-hosted runner lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
Update: We’ve made some configuration changes that should hopefully fix some of the failed jobs we’ve been seeing.
Root Cause: Depot maintains a certain number of ‘standby’ instances that can be started very quickly when a new job comes in. If a large number of jobs depletes the pool of ‘standby’ instances, it needs to boot up new instances to refill the pool, which can take a few minutes for each machine. Since we had a low pool size (2), the system would only boot up at most 2 new instances at the same time. So if we had a big influx of jobs, the system would not create enough new instances in time, and jobs would start to fail due to GitHub’s 10-minute timeout for new job processing.
Solution: The pool size has now been upped to 10 which should hopefully eliminate this problem. If anyone sees any more issues with the AArch64 CI please let me know.
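To make the effect of the pool size concrete, here is a rough back-of-the-envelope sketch in Python. It is not how Depot actually schedules instances. It assumes a burst of jobs arriving all at once, a boot time of 3 minutes per instance (the thread only says "a few minutes"), at most `pool_size` instances booting at the same time, and instances that are not reused within the burst. The function name and all parameters are hypothetical.

```python
def count_timed_out_jobs(burst_size, pool_size, boot_minutes=3.0, timeout_minutes=10.0):
    """Estimate how many jobs in a burst wait longer than the pickup timeout.

    Simplified model (assumptions, not Depot's real behavior):
    - all jobs arrive at t = 0;
    - the first `pool_size` jobs get a standby instance immediately;
    - afterwards, instances boot in batches of `pool_size`, and each
      batch takes `boot_minutes` to become ready;
    - an instance serves one job and is not reused within the burst.
    """
    timed_out = 0
    for job_index in range(burst_size):
        # Batch 0 is the standby pool; batch k is ready after k boot cycles.
        batch = job_index // pool_size
        wait_minutes = batch * boot_minutes
        if wait_minutes > timeout_minutes:
            timed_out += 1
    return timed_out


if __name__ == "__main__":
    for pool_size in (2, 10):
        failed = count_timed_out_jobs(burst_size=20, pool_size=pool_size)
        print(f"pool size {pool_size}: {failed} of 20 jobs exceed the timeout")
```

Under these assumptions, a burst of 20 jobs with a pool size of 2 leaves 12 jobs waiting longer than 10 minutes, while a pool size of 10 leaves none. The real numbers depend on actual boot times and job arrival patterns.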