I’m not sure where you came up with the required list of capabilities, or the idea that GH Actions doesn’t meet them.
GH Actions can be triggered on every commit individually (which eliminates the need for a blame list), or they can be batch triggered based on setting build pool sizes. If they are batch triggered and fail, all committers since the last success get the failure email (notifications from GitHub Actions are controlled by user notification settings and pipeline configurations).
GH Actions publishes the results of the action into the commit history on GitHub’s page. For example, my team’s fork shows our status checks here in the commit view. I don’t see why we would need a separate dashboard or display of the blame list given that it is right there in the commit history. Many projects do post the current CI status in the project README, which we could also do if we wanted.
All of this aside, I’m not actually suggesting we implement and adopt GitHub Actions. My point in bringing cloud solutions up is that they do solve the scalability issues we need solved for pre-merge testing, and Buildbot has significant limitations.
If we want mandatory blocking pre-merge testing, I do not believe it can be built on Buildbot.
I also would really like to hear from the people who manage the current buildbot builders to know how many (and which) builders would opt-in to being available for this style of optional trybuild pre-merge testing. I think knowing which configurations are willing to opt-in plays a huge role in evaluating the utility of a buildbot-based solution for optional pre-merge testing.
I also think that if we want mandatory pre-merge testing, the scalability and speed concerns come to the forefront, and we absolutely need to think about how to solve those problems.
This was discussed a couple of times over the last couple of years on this forum; I’m just a bit too lazy to find the references.
I’m familiar with this part. IMO this is unusable, but I may have missed a view other than the commit history: when you have as many configs as we have on buildbot, you need to be able to track the history of a single config (can you do that on GitHub? On a per-branch basis?), or better, a group of configs (“show me the lld build history in a grid”). The “badge” attached to each commit in the history only gives a green/red signal based on the aggregation of all the builds (one single bot red for some time and you only ever see red!), so you need to click on each commit individually and find your build (assuming it ran on this commit).
Even if you want to know the status of a single commit, the batching would make it hard to track on GitHub: the build status shown there for a given commit will not show a build that batched this commit with subsequent ones (as far as I can tell).
Batching also brings the need for a blamelist: you can only avoid this part if your bot really runs on every single commit.
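To make the batching point concrete, here is a minimal sketch of how a blamelist falls out of batched builds. The `Build` record and the `blamelist` helper are hypothetical names for illustration; this is not Buildbot's or GitHub's API, just the bookkeeping any batching system has to do for a single config.

```python
from dataclasses import dataclass


@dataclass
class Build:
    """One build of a single config; `commits` is the batch it covered, oldest first."""
    commits: list[str]
    passed: bool


def blamelist(history: list[Build]) -> list[str]:
    """Return every commit built since the last passing build, oldest first.

    `history` is ordered oldest build first. If the latest build passed,
    nobody is to blame and the list is empty.
    """
    if not history or history[-1].passed:
        return []
    blamed: list[str] = []
    # Walk backwards over the trailing run of failed builds.
    for build in reversed(history):
        if build.passed:
            break
        blamed = build.commits + blamed
    return blamed
```

When every build covers exactly one commit and a failure is caught on the first red build, the list has a single entry, which is the case where a separate blamelist view stops being necessary.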
You seem to set “cloud solutions” in opposition to buildbot, but I’m not sure why?
I set up a scalable Kubernetes cluster where nodes were spawned on demand and connected to buildbot as workers (basically one Ubuntu VM with all the toolchain versions I wanted to test). I had multiple “builders” (in buildbot terminology) that could all target any of my workers.
The Kubernetes cluster could trigger a shutdown for a node/worker when the cluster load was low; the worker would then ask buildbot to stop scheduling new jobs on it and wait for the current build to finish before shutting down gracefully.
The only thing I was missing was a Kubernetes plugin to poll the REST API of buildbot for querying the job queue length instead of using the cluster CPU load to trigger resizing events.
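A rough sketch of what such a queue-based scaling hook might look like, in Python. The `buildrequests` endpoint and its `claimed`/`complete` filters are my reading of Buildbot's REST data API and should be checked against the deployed version; `desired_workers` and its parameters are hypothetical, and the sizing rule (one extra worker per pending request, capped) is only one possible policy.

```python
import json
import math
import urllib.request


def pending_request_count(base_url: str) -> int:
    """Count build requests that are neither claimed nor complete.

    Assumes the Buildbot master exposes /api/v2/buildrequests and accepts
    field filters in the query string (an assumption to verify).
    """
    url = f"{base_url.rstrip('/')}/api/v2/buildrequests?claimed=false&complete=false"
    with urllib.request.urlopen(url, timeout=10) as response:
        payload = json.load(response)
    return len(payload.get("buildrequests", []))


def desired_workers(
    queue_length: int,
    busy_workers: int,
    jobs_per_worker: int = 1,
    max_workers: int = 20,
) -> int:
    """Size the pool from the job queue instead of from CPU load.

    Keeps every busy worker and adds enough workers to drain the queue,
    without exceeding `max_workers`.
    """
    needed = busy_workers + math.ceil(queue_length / jobs_per_worker)
    return max(0, min(max_workers, needed))
```

The autoscaler would call `pending_request_count` periodically and feed the result into `desired_workers`; scaling down would still go through the graceful shutdown described above.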
I don’t disagree here, but I am also not convinced this says much about either the value of @kwk’s work or the post-merge flow. It seems quite orthogonal to me: didn’t you say in this thread that many of the buildbots couldn’t support the load of being in the default pre-merge set of builds? Then having an opt-in solution for these seems highly valuable to me, regardless of what’s happening on the “mandatory” pre-merge flow (as long as the same VM and build scripts are used in buildbot and in any pre-merge system).
I think this thread has gone off topic and we’re lost in the weeds confusing a conversation about hammers with screwdrivers again. One last thing I’m going to say is a bit of personal experience.
For the last 18 months I’ve been working on a project that uses exclusively cloud-based solutions. Mandatory pre-merge testing with post-merge additional verification. Most of it built on Azure DevOps and GitHub. It has been the most satisfying project infrastructure I’ve ever worked with. It stays out of my way when I don’t need it nagging me, and is there to provide me with details when I need it.
I’ve never missed a dashboard or any of the web views provided by Buildbot or Jenkins. I’ve found the blame emails and action logs to be more than sufficient, and I’ve loved that the infrastructure has 99.9% uptime with only a handful of infrastructure related failures.
I know that LLVM can’t move to a fully cloud solution like that because cloud hosting for our full testing matrix doesn’t exist. That doesn’t mean I don’t wish we had some of that great infrastructure where we could apply it.
I’m going to step back from this thread. I’ve said what I think needs to be said. I’m not going to stand in the way of people who want to build and roll something out. I think we can do better than buildbot, but I’ve thought that for a decade now, so maybe I’m wrong.
Sorry if I am missing something and this went completely over my head: what benefits are we getting from the buildbot infra here? I was looking at the zorg repo the other day and, as I understand it, we have some infra there to a) authenticate bots and store their settings (like how often we run), b) define what commands are run on every bot, and c) process bot results and send emails.
a) is purely infra dependent
b) seems quite narrow (cmake flags plus ninja commands, am I missing something?), plus there is no way to define custom logic such as “if these files are modified, run those tests”
c) does not work for the premerge scenario, needs rewriting
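For point b), the kind of custom logic meant by “if these files are modified, run those tests” could be sketched roughly as follows. The directory-to-target table is a made-up illustration, not what zorg or any existing pre-merge setup actually uses, and `targets_for_changes` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from a top-level directory to the check targets it affects.
TARGETS_BY_DIR = {
    "llvm": ["check-llvm", "check-clang", "check-lld"],
    "clang": ["check-clang"],
    "lld": ["check-lld"],
}


def targets_for_changes(changed_files: list[str]) -> list[str]:
    """Return the sorted set of check targets implied by the changed paths.

    Paths are repository-relative with forward slashes. Files outside the
    known directories contribute nothing.
    """
    targets: set[str] = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        if parts:
            targets.update(TARGETS_BY_DIR.get(parts[0], []))
    return sorted(targets)
```

A table like this lives naturally next to the build scripts, which is why it is awkward to express when a bot is defined only as a fixed list of cmake flags and ninja commands.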
So to me it looks like building something on top of buildbot will require a lot of custom tooling, while the existing codebase will not help much. It was also mentioned that the current fleet of buildbots does not have enough capacity.
Overall it seems to me that implementing something custom on top of the buildbot infra might be an “easy start”, as we already have experience with and an understanding of it, but long term we should try to move to something more up to date like GitHub Actions and offload some of the maintenance to service providers: they, like Buildkite, seem to be fine with supporting big OSS projects for free or with big discounts.
In theory you can run shell scripts or the like, and we even do that: https://github.com/llvm/llvm-zorg/tree/main/zorg/buildbot/builders/annotated .
The logic about when to run a test exists but at a higher per-project level if I’m not mistaken.
I see the buildbot infra that we have as providing a hub for builders and workers to be instructed when to run builds. That’s quite a big plus IMHO. I’m not sure whether we need a rewrite or rather an extension of this setup.
Buildbot simply builds a codebase at certain events. I proposed to add another event to its list: a GitHub PR comment. That’s not a lot of custom tooling in my mind. With respect to @beanz’s comments about not wanting to maintain an extra GitHub app, I totally understand that, and it was simply a matter of demoing what I had in mind. After giving it more thought, we could totally pull this off without an app in the middle. A GitHub Workflow or Action could be written that polls for new build events instead of listening for them and associates them with a particular GitHub check run.
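To give an idea of how small the comment-triggered part could be, here is a sketch that extracts build requests from a PR comment. The `/buildbot build <builder-name>` syntax, the `requested_builders` helper, and the opt-in set are all assumptions made up for this example, not an agreed-upon interface.

```python
import re

# Hypothetical command syntax: a comment line such as "/buildbot build <builder-name>".
COMMAND = re.compile(
    r"^/buildbot[ \t]+build[ \t]+(?P<builder>[A-Za-z0-9_.-]+)\s*$",
    re.MULTILINE,
)


def requested_builders(comment_body: str, opted_in: set[str]) -> list[str]:
    """Return the opted-in builders requested in a PR comment.

    Order of first mention is preserved, duplicates are dropped, and
    builders that have not opted in to pre-merge testing are ignored.
    """
    requested: list[str] = []
    for match in COMMAND.finditer(comment_body):
        name = match.group("builder")
        if name in opted_in and name not in requested:
            requested.append(name)
    return requested
```

Whatever polls or listens for comments would pass the resulting names on to the scheduler and report each build back on the matching check run.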
The issue was brought up by @mehdi_amini and @beanz that those additional builds ruin the view of a builder when going through the history. That is true and I’ve looked at the buildbot UI. Apparently there’s no way to fix this. But I think we can duplicate the builders when they opt-in to pre-merge testing and give them a distinct name. Then you can still view the old builder history without any interleaved pre-merge builds. In buildbot terms, a builder is really just a matter of steps (called factory) to execute on a worker (the actual machine). Duplicating a builder doesn’t cost much.
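The duplication idea can be sketched in a few lines. The `Builder` class here is a simplified stand-in written for this example, not Buildbot's own configuration class, and the `-premerge` suffix is an arbitrary naming choice.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Builder:
    """Simplified stand-in for a builder: a name, a step factory, and eligible workers."""
    name: str
    factory: str
    workers: tuple[str, ...]


def with_premerge_twins(
    builders: list[Builder],
    opted_in: set[str],
    suffix: str = "-premerge",
) -> list[Builder]:
    """Return the original builders plus a renamed copy of each opted-in one.

    The copy reuses the same factory and workers, so only the name differs
    and the original builder's history stays free of pre-merge builds.
    """
    result = list(builders)
    for builder in builders:
        if builder.name in opted_in:
            result.append(replace(builder, name=builder.name + suffix))
    return result
```

Because the twin shares the factory and the workers, the pre-merge builds run the same steps on the same machines as the post-merge ones.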
I’m all for an easy start with a long term goal in mind. If we bring on-demand pre-merge testing to github pull requests, then we could utilize this under the covers in the future as well. Expensive checks (no matter where they run, i.e. Azure, Buildbot, Buildkite…) could be executed on demand.
Contacting worker admins for potential opt-in to on-demand pre-merge
@beanz you’ve brought this up more than once now and I hope this is helpful. In order to get an opinion from the admins of the builders that we have in production, I’ve created a set of commands: Get list of builder admins for buildbot · GitHub.
A worker is associated with an admin. The worker knows which builder ID it is associated with, and the builder has a name. I’ve mashed all this up, given it a stir, and I hope the resulting list of admins and the builders they can be associated with helps us contact them.
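For readers who do not want to open the gist, the join it performs can be sketched roughly like this in Python. This is not the gist's actual code. The field names (`workerinfo.admin`, `configured_on`, `builderid`, `name`) follow my understanding of the JSON returned by Buildbot's `/api/v2/workers` and `/api/v2/builders` endpoints and should be treated as assumptions; `builders_by_admin` is a hypothetical helper.

```python
from collections import defaultdict


def builders_by_admin(workers: list[dict], builders: list[dict]) -> dict[str, list[str]]:
    """Group builder names by the admin of the workers they are configured on.

    `workers` and `builders` are the decoded lists from the two REST
    responses. Workers without admin information are grouped under "(unknown)".
    """
    names = {b["builderid"]: b["name"] for b in builders}
    grouped: dict[str, set[str]] = defaultdict(set)
    for worker in workers:
        admin = (worker.get("workerinfo") or {}).get("admin") or "(unknown)"
        for entry in worker.get("configured_on", []):
            name = names.get(entry["builderid"])
            if name is not None:
                grouped[admin].add(name)
    return {admin: sorted(found) for admin, found in sorted(grouped.items())}
```

The result maps each admin to the builders they could opt in, which is the list needed for contacting them.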
What we get from buildbot isn’t really a good mandatory pre-merge flow orchestration, but a good post-merge orchestration (that includes dashboard, blamelist, etc.).
Now when you have workers registered into a scheduler, you can’t really share them with another scheduler (you want a unified task queue for the jobs to run). Also, many bots in the post-merge scenario aren’t the kind of bots you want in a default pre-merge flow (slow HW / limited resources), however you still want to get a pre-merge opt-in there (think a comment on a pull request to target a RISCV bot for example, or a full MSAN build).
From there it seems quite natural to be able to reach the pool of post-merge resources on demand, which implies talking to the scheduler used for post-merge orchestration. I don’t see GitHub Actions replacing this any time soon, but I may be missing some pieces there as well (it is always evolving and hard to keep track!).