Ah, I saw this a bit late; I’m working on something loosely related to this.
We’ve found a way to reduce the build time of LLVM’s critical path to 5 seconds, at which point we are essentially bound only by network speed (realistically, building Clang/LLVM/libcxx in 2 minutes on a regular laptop).
We do this by using a relatively complex remote execution setup wrapped in reproducible nix environments and skipping the actual remote executors. This way we get an essentially 100% cache hit rate with a remote cache that is compatible across systems of the same architecture (we tested WSL2/Ubuntu, Arch, and Gentoo).
Unless some unexpected issue arises, I plan to create a release in the next few days, which will include proper documentation and a more user-friendly setup. I’ll probably open a separate post for it then, as I think this could significantly improve the dev experience for LLVM (or at least for those using the Bazel build).
Btw, regarding pre-commit hooks, an interesting option is wrapping those hooks in nix like this. That way the hooks are reproducible and essentially guaranteed to build. In the past we encountered issues with some tools not building properly on every user’s machine (e.g. wrong Rust or Python versions), which is what caused us to do things this way. Such a setup also reduces the GitHub Actions workflow to something like this.
Hi! I have been developing and maintaining the current setup that uses Buildkite. If you have resources to invest in this, I will be happy to talk about the current issues and the plans we had. I think it makes sense to have a joint effort here (and I also don’t have any bias toward keeping Buildkite, as long as we keep the features). I will read your proposal and comment today.
The main thing for me is not to put effort into integrating anything new with Phabricator (though I do respect the work done on the current Phabricator setup).
Once there is a prototype of each option we can make an educated decision (and all the Github command stuff can apply to either I expect).
I would like to hear what issues you hit running the buildkite side of things. Maybe start a new thread for that, so we will be able to reference it when talking about future systems, cherry-pick some of it, etc.
If you’re new to GitHub App development, you won’t be able to run the app on your own without more information.
While developing this I found it quite easy to implement new ideas, because you don’t need to roll out code somewhere; you just restart your local app server and there you go. And after a bit of exploration of the GitHub API, you know your way around.
Right now the buildbot-app is not very smart: it just understands the /buildbot command. But it can be extended easily, as we can see in Glasgow if anybody’s interested in maybe having a roundtable on this. Btw, is the deadline already over?
I still like the idea of using a buildbot try build for the purpose of pre-merge tests, so you’ll find it in the current iteration as well.
I’ve used the HTTPStatusPush service integrated in buildbot (see HttpStatusPush — Buildbot latest documentation) to signal back to my GitHub App when a build status changes. That means no additional build step is needed in buildbot to signal to the GitHub App that something has changed. The app then takes care of informing the user via check run status updates and comments.
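For illustration, the translation the app performs on each status push could be sketched roughly like this. The payload keys and property names below are my assumptions for the sketch, not Buildbot’s exact schema, so check them against the HttpStatusPush docs before relying on them:

```python
import json

# Buildbot encodes build results as small integers (0 = SUCCESS,
# 1 = WARNINGS, 2 = FAILURE, ...). This illustrative table maps them
# to GitHub check run conclusions; unknown codes fall back to neutral.
RESULT_TO_CONCLUSION = {0: "success", 1: "neutral", 2: "failure"}

def handle_status_push(body: bytes) -> dict:
    """Translate one (hypothetical) HttpStatusPush payload into a
    check-run update the app would send to GitHub."""
    build = json.loads(body)
    conclusion = RESULT_TO_CONCLUSION.get(build.get("results"), "neutral")
    # Properties were attached by the app when the build was triggered
    # and are forwarded back untouched, so we can recover the context.
    props = build.get("properties", {})
    return {
        "check_run_id": props.get("github_check_run_id"),
        "conclusion": conclusion,
    }
```

The important point is that no buildbot build step is involved: the app only consumes pushes and updates GitHub.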
In order to have more than one builder that can be triggered from GitHub, I thought about having only one try-scheduler that is called from the GitHub App. This try-scheduler runs a delegationBuilder, which then delegates the build to a target builder through a trigger step (see Trigger — Buildbot latest documentation). The decision about the target builder is currently hardcoded but could be made based on properties that we send from the app to buildbot with the buildbot try command.
So far I haven’t read anywhere that a builder can only be part of one scheduler. For LLVM, the idea could be to put existing builders that want to participate in pre-merge testing into the triggerable scheduler as well. Then they can still operate post-merge and can also be used pre-merge.
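The delegation setup described above might look roughly like this in a master.cfg. This is only a sketch: the builder, property, and credential names are illustrative, and the exact scheduler/step signatures should be checked against the Buildbot docs:

```python
# Sketch of the try-scheduler -> delegation builder -> trigger chain.
from buildbot.plugins import schedulers, steps, util

# One try-scheduler that the GitHub App calls via `buildbot try`.
c['schedulers'].append(schedulers.Try_Userpass(
    name="github-try",
    builderNames=["delegation-builder"],
    port=8031,
    userpass=[("github-app", "secret")]))

# Target builders participate by also joining a Triggerable scheduler;
# they can keep serving their normal post-merge schedulers in parallel.
c['schedulers'].append(schedulers.Triggerable(
    name="premerge-triggerable",
    builderNames=["target-builder"]))

# The delegation builder forwards the request through a Trigger step,
# passing the app's properties along to the target build.
delegation_factory = util.BuildFactory([
    steps.Trigger(schedulerNames=["premerge-triggerable"],
                  waitForFinish=True,
                  copy_properties=["github_pr", "github_check_run_id"]),
])
```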
To all skeptics I want to say that we don’t make anything mandatory. Just look at how I create a check run on the fly: none are present at the beginning of a pull request’s lifetime. This lets us create as many checks as we think we need, and we can even make some mandatory and ignore others. In theory, when you create a build in buildbot with a /buildbot PR comment, you could say /buildbot mandatory=yes or /buildbot mandatory=no from the get-go. But these are details we don’t have to deal with as long as there’s no consensus on whether this has a future or not.
I would like to continue talking about this a bit. We had a roundtable at EuroLLVM’23. Unfortunately I don’t recall everybody’s names. There were Chris Bieneman from Microsoft (any handle?), @boomanaiden154-1, someone from PlayStation, and another person whose name I forgot, sorry.
I was able to demo my project really quickly, and although the feedback was nice, there was something completely different that bothered me: we came back to the usual discussion about how fast builders need to be, how many commits we have per day, yada yada.
For the many commits, I suggest looking into merge queues (Managing a merge queue - GitHub Docs) to see if those might help. I knew that GitLab has this feature, but apparently GitHub has it as well.
My proposal and demo targeted Buildbot and GitHub, but then I learned from Chris that the foundation is looking into a potential future without Buildbot? Especially when it comes to lifting the legacy CI systems of various players not to be named, the foundation wants something more scalable and “cloudy”. But what is going to be run in the cloud then? Buildkite?
We can prototype the pre-commit checks with your bot and the buildbot. This could be done relatively easily, unless I’m missing something; it would reuse the existing infrastructure, and it has other obvious advantages. It would prove that things work as we think they should and provide good service. Later, if someone moves the whole build infrastructure to another solution, pre-commit checks will move along with many other things. I don’t see this as a show stopper either way, frankly.
Merge queues would definitely help if we want to move significantly more towards pre-commit CI in terms of compute requirements. It doesn’t alleviate the issue of having one failing commit bundled up with multiple commits causing them all to fail, but that’s definitely not a regression over what we have today.
I don’t think the current suggestion is to migrate away from Buildbot completely. If I’m understanding things correctly, I believe Chris was specifically suggesting moving to Github actions for mandatory pre-commit CI due to scaling requirements (which is somewhat of a separate discussion in my opinion). I don’t think a future without Buildbot is feasible anytime soon. There are too many configurations that can’t be tested with Github actions currently due to a lack of support from Github (i.e., anything other than ARM/x86).
I think currently the path of least resistance is a hybrid approach. There are already some small post-commit GitHub Actions that run. Having native GitHub Actions in addition to the custom scripting you’ve demoed for integrating GitHub with buildbot seems like something that should work. It allows for all of the benefits you’ve mentioned while still allowing easy scaling through GitHub Actions if at some point everyone decides to implement mandatory pre-commit testing on a couple of essential platforms.
I think moving completely to GitHub Actions is not particularly feasible at this point due to the platforms GitHub supports, but with self-hosted runners and appropriate scripting we could also have opt-in pre-commit CI that people self-host.
I think there are a couple concrete steps that we can take now if others are interested and no one has any major objections:
I want to clarify here. The LLVM Foundation is not investigating anything here. My point was that the demo involved setting up additional infrastructure including a web service to respond to HTTP requests in order to get the demo’d tools working with Buildbot. I was trying to suggest we should consider driving this tooling without relying on Buildbot if it can simplify the solution.
Also want to clarify, this is not the opinion of “The Foundation”. My personal opinion is that The Foundation should not take over management for the complex array of build hardware that is currently being operated. That could change if The Foundation decides to hire dedicated DevOps staff to support this infrastructure, but there are money concerns that need to be explored there.
In the absence of taking over the array of existing hardware, The Foundation could fund cloud based solutions that have lower maintenance costs.
I suggested we should consider various existing cloud based solutions. Buildkite, Google Cloud Builder, Azure DevOps, GitHub Actions, or whatever else is out there.
I did think that @kwk’s demo was really cool and showed a lot of promise.
One concern that I did express was that when we look to pre-merge testing, there are probably somewhere between 8 and 16 configurations we could come up with that could be built and tested “fast” (for some definition of fast) if we had cloud-based build infrastructure that could scale up to reasonable parallelism. In a world where we have that, the utility of what @kwk built is being able to test obscure configurations pre-merge. That would be super useful, but it does cause a resourcing issue. The more obscure buildbot configurations (like PPC-BE and ARM32) are also the slowest, so using them both on-demand pre-merge and post-merge would put significant strain on the hardware pool.
I wasn’t present at the roundtable and I can’t speak for those who were, but I would be against automatically sending patches to existing buildbot builders. From community feedback, we already know that buildbot builders are often slow and fail for unrelated reasons. I wouldn’t want to inflict that experience on the developer community at large by sending most patches to a broad selection of our existing buildbots. I think when we are talking about the standard pre-merge process that we expect everyone to run, we need to set a higher bar than what we’ve accepted so far for continuous integration buildbots.
If your proposal is more along the lines of tryjobs, or allowing folks to experiment with applying their patch and running it on specific builders of their choice, that seems reasonable to me. Developers often run into big endian or Windows-specific issues that are real, but are hard to debug without hardware access. We may not have enough big endian resources to test that configuration before merging, after all. Catching this after landing and debugging it with tryjobs seems like a reasonable compromise.
The workflow I’ve designed is strictly on-demand for now, so nothing happens automatically, and no existing buildbot builder would automatically participate. One has to manually make the buildbot builder triggerable first (see details below). Then I would gate who can trigger the bot through a /buildbot comment. This could be done by username, group, or something similar. Think of a YAML config like this:
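A hypothetical sketch of such a gating config (the key names are invented for illustration and are not the app’s actual schema):

```yaml
# Hypothetical gating config for the /buildbot command.
builders:
  clang-debian-x86_64:
    allow_users: ["kwk", "release-managers"]
  armv7-2stage:
    allow_users: ["armv7-bot-owners"]   # slow bot: owners opt in explicitly
default:
  allow: false                          # nobody may trigger unlisted builders
```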
I hope that my comment above clarifies why there’s no need for a concern.
My idea is not to make anything mandatory for now. The GitHub check runs are added dynamically as you request a certain builder. A check that fails will block the PR from being mergeable. That said, I translate a buildbot result state into a check run conclusion, and it is always possible to click a button on a check run’s details page and influence this translation to make the check run neutral, so that it no longer affects the mergeability of the PR.
Github App vs. Github Actions/Workflow: Bidirectional Communication
My proposal involves on-demand buildbot usage, controlled by username and builder matching, with nothing being run automatically. The infrastructure for buildbot is already there, hence the minimal adoption cost. As I said at the roundtable, I chose to implement this as a GitHub App and not a workflow file because it makes things simpler to test before deploying. I especially like that the app can be tested with a mocked GitHub server (GitHub - kwk/buildbot-app: Your GitHub App to make Buildbot a part of your Pull Request workflow.).
When I started my first experiment with this two years ago, I chose a native GitHub workflow file. The development was far more complex, for several reasons. Let’s first take a look at where the GitHub App sits:
Github <--> App <--> Buildbot
That means the app can talk to github and buildbot and receive messages from both of them. Talking to buildbot is easily done from a github workflow file, but receiving messages from it is not. Buildbot has quite a neat way to notify interested parties about build states through an HTTP push service (see the minimal change to the config: buildbot-app/master.cfg at main · kwk/buildbot-app · GitHub). The app, when served, can consume those state pushes and transform them into appropriate tasks in github, like adding a comment, creating a check run, etc.
When we don’t use a GitHub App but an action instead, things would look different:

Github <--> Workflow/Action --> Buildbot

I hope you see the unidirectional communication from the Workflow/Action side toward Buildbot.
Buildbot, or any other CI system for that matter, would have to talk back to github directly. This would require heavy modifications to buildbot, which I would like to avoid at all costs.
I know you’ve said you want to be agnostic to buildbot when developing this solution. But I fear that any CI system would need modification to fit our notion of pre-merge testing.
Maintaining an HTTP Service
The burden of maintaining an HTTP service is not that heavy, I guess; that’s essentially a solved problem. There’s not much load on such a system: there may be many requests going back and forth to GitHub, but nothing more; no compilation, etc.
Too many hats
I’m sorry to have mixed up your opinion and the Foundation’s. You switched “hats” too often in our discussion.
I hope I have clarified why I think using plain GitHub Actions is not a practical way to deal with state changes in builds (e.g. starting, running, failed). Those have to be reported back from a build system, and somebody needs to either listen for them (e.g. my GitHub App) or modify the build system (e.g. buildbot) to report and deal with check runs manually. In the case of buildbot you could think of the reporters doing this job, but then we have to either maintain a custom reporter or get it upstreamed. Nothing particularly fancy, IMHO.
Build execution agnostic
What I can tell you is that my proposal is agnostic to where a build is actually executed and where the compute happens (where the money is burned). As long as a build system can report back about a build’s status, we’re good. I don’t know about Google Cloud Builder so far, but I’m sure they have some HTTP hook to listen for builds.
Feedback on the demo
Thank you very much! And just to repeat: nothing in my proposal is mandatory. Only when you request a build is the result used to gate a PR, through a dynamically added check run:
And if you change your mind, you can translate the buildbot result state into a neutral check run conclusion to effectively have no effect on the PR’s mergeability.
The check run you see above is one of potentially many others on different builders. The mapping from a CI build result to a check run is also one thing that my GitHub App is doing.
Decouple resource discussion from process
I’d love to see us decouple the process from the implementation and from hardware scalability. IMHO, I’ve come up with a process and a demo implementation. The scalability of buildbot could be achieved using latent workers (2.5.6. Workers — Buildbot 3.6.0 documentation):
Another approach is to let the Buildbot master start workers when builds are ready, on-demand. Thanks to services such as Amazon Web Services’ Elastic Compute Cloud (“AWS EC2”), this is relatively easy to set up, and can be very useful for some situations.
In my demo infrastructure I have a finite set of workers running as containers, but through latent workers those could become on-demand workers backing a certain builder. For those less familiar with buildbot’s terminology: a builder is decoupled from a worker. The builder knows about a factory of steps to execute. Through this separation we can scale the number of workers to improve a certain builder’s speed.
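As a sketch, swapping the demo’s static container workers for EC2 latent workers might look like this in master.cfg. All parameter values are placeholders, and the full set of options should be checked against the Buildbot latent-worker documentation:

```python
# Sketch: an on-demand EC2 latent worker replacing a static container worker.
from buildbot.plugins import worker

c['workers'] = [
    worker.EC2LatentWorker(
        "premerge-ec2-worker", "worker-password",
        instance_type="c5.4xlarge",
        ami="ami-0123456789abcdef0",   # placeholder: image with toolchain preinstalled
        region="us-east-1",
        build_wait_timeout=300,        # keep the instance warm between builds
    ),
]
# A builder keeps its factory of steps; only the worker pool behind it
# changes, so adding workers speeds up a builder without touching its config.
```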
From my point of view, buildbot is equipped with a lot of stuff that we could use. Heck, if you want to see it in action, I could even demo the use of latent workers with AWS EC2 in my proposal. But for now I won’t invest time in it until we’re on the same page.
Just know that it is in our hands who can run a pre-merge check on a slow builder on-demand.
Prefacing this with the fact that I’m not going to stand in the way of your building and running this system. It isn’t what I want for the community, and it isn’t what I would build if I had the time to build something myself; but I don’t have that time, so I’m not going to get in the way of someone who is putting in the effort to build something that is pretty cool.
Right, but the reason I suggested that you consider options not based on Buildbot is that the most unusual configurations, where on-demand pre-merge building would be most useful, are also the most resource-limited configurations. I’m not sure if the people who supply the hardware for the arm32 bot are following this discussion, but I can’t imagine they’d want to opt in with their builder, since it would delay the builder’s ability to keep up with changes that are merged.
We could (using cloud solutions) set up pre-merge building independent of buildbot that has lower maintenance costs, doesn’t need to be opt-in, and covers the vast majority of common breakages.
This seems extremely odd to me. I don’t understand why this is a requirement as opposed to a design detail of your implementation. Why would we care if the workflow communicates to buildbot? Buildbot isn’t some single source of truth that everything needs to be captured in. Cloud builders can (and do) report directly on the PRs.
I would argue that reporting through Buildbot is a serious drawback of this approach. When I break a bot today one of the ways I know that I fixed the issue is by going back to that builder, viewing its history of jobs and seeing that the bot went back to green. If with your solution builder jobs for pre-commit builds get intermixed in the history with post-commit builds, the builder history becomes nearly useless.
If Buildbot can’t speak to GitHub we should not use Buildbot for this. Jenkins, Travis, AppVeyor, Azure DevOps, and Google Cloud Builders (and more) all support communicating back to GitHub by commenting in PRs, marking status checks, and more. If Buildbot doesn’t work with the most popular Git hosting site in the world, that seems like a pretty severe limitation.
Have you looked at other CI systems? Swift uses Jenkins with off-the-shelf plugins for pre-merge testing. My team uses a combination of Azure, GitHub Actions, and AppVeyor (the last we’re working to replace with Google Cloud). We do not maintain any services or build system extensions.
We have a lot of options. We also don’t need our post-commit and pre-commit solutions to be built on the same CI system. They have different requirements and we can consider alternate options.
No offense, but these all sound like “famous last words” in our industry. These things are never as easy as people hope, and bitrot is real as APIs and services shift or security fixes get rolled out. Just look at Phabricator and Arcanist. If you want to maintain this service, pay for hosting, and be on call to fix it when it goes down… more power to you.
No. I still don’t at all understand why we need Buildbot to communicate or track anything here. We need something to drive a build and report back to GitHub, but Buildbot doesn’t need to know anything.
I don’t think we can decouple these. They have intrinsic connection. Your implementation is based on buildbot, a system that operates based on volunteer provided hardware with pretty limited resources. Ignoring resourcing would be an extreme disservice.
This does not work for our buildbot. A large number of our buildbot configurations are running hardware and/or operating system combinations that are not supported by cloud hosting providers. More significant to this conversation, IMO those configurations that are not supported by cloud providers are the ones where your solution would provide the greatest benefit. So resourcing is a big deal.
One question I have for the community: Are the maintainers of the slowest and oddest builders willing to sign on to this on-demand model?
If the big-endian, 32-bit, and other obscure configurations are signing up for your service. Great! If the only bots signing up are FreeBSD, Ubuntu, Windows, and macOS running on x86 (maybe aarch64), I think we should evaluate true cloud-based solutions.
I think we still have a separate problem for mandatory pre-merge testing. We could treat that as a completely separate issue, but I also think the two issues are related.
If we have fast mandatory pre-merge testing that catches 90% of the common failures, how much do we need an opt-in pre-merge solution for the less common configurations?
I never said that this is a requirement. The way I’ve built the GitHub App, you can in theory handle a response from any CI system; Buildbot just happens to be the one right now. And Buildbot isn’t some single source of truth, for sure. It just posts a request with information about a build state to some HTTP endpoint, without any knowledge about what is supposed to happen. Reporting on a PR is good, and I’m sure many tools can do that already. But the question is how frequently you want to test a PR. For me it is fine to push commits, fixups, and reformats of a patch, and I’m happy to only test it when it’s ready. Now, github has the draft mode for a PR. I’m sure that could be used to trigger a build on any CI, but then every subsequent commit is built. Maybe we don’t want that; maybe we do. I don’t want to be the one who tells devs what to do. That’s why I think the on-demand process is nice. In fact, we could have an action that creates the comment for you automatically, and then you have your automatism.
You can surely view it this way. But why not go through the github status of commits on main to see when things turn red/green? There I’d expect buildbot post-merge jobs to report as well. Currently they don’t, but they should and could.
Please don’t twist my words. Buildbot can poll for github pull request changes out of the box today. But that would turn into a mandatory pre-merge check, which I don’t want. Sure, at some point (when the compute resources are ready) we can turn that on for every PR.
NOTE: I do get that you’re not in favor of buildbot.
Yes, I know my way around Azure Dev Ops, Jenkins and GitHub Actions. It’s been a while but I have configured them.
But none of these solutions can create multiple dynamic check runs for a GitHub PR based on my comment on a PR. They can react to comments, yes. They don’t let me create a build request on my self-provided worker.
Rhetorical question: But why have two when one is enough and could do the job?
I hear ya and I admit it might be too good to be true and easy.
Buildbot does not know anything but the build request. When triggering a build on buildbot, I pass it so-called properties. Buildbot just hands them back to the GitHub App through the HTTP Status Push service. This is similar to X-headers in HTTP: forward what you don’t know about. When buildbot, or any other service for that matter, comes back, it hands me the properties again. Those let me identify which PR in which repo this is about, which comment triggered the build, which check run it is associated with, yada yada. Buildbot isn’t a single source of truth in this regard; it is merely a messenger. If the GitHub App doesn’t understand what buildbot says (e.g. properties are missing), it does nothing. Right now I hard-code the properties in buildbot, but this could be automated, and I’m sure there are more elegant solutions for forwarding properties to a triggered build, but it is good enough for now to demonstrate the concept: buildbot-app/master.cfg at main · kwk/buildbot-app · GitHub
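As a toy simulation of that round-trip (all function and property names here are made up for illustration, not the app’s actual code): the app attaches properties when triggering, the build system echoes them back in the status push, and the app recovers its context without the build system ever interpreting the values.

```python
def trigger_build(properties: dict) -> dict:
    """Stand-in for `buildbot try --property key=value ...`:
    the build system stores the properties and later echoes them back."""
    return {"results": 0, "properties": dict(properties)}

def on_status_push(push: dict):
    """The GitHub App's handler: ignore pushes it cannot attribute."""
    props = push.get("properties", {})
    if "github_pr" not in props or "github_check_run_id" not in props:
        return None  # not one of ours; do nothing
    return {"pr": props["github_pr"], "check_run": props["github_check_run_id"]}

# The messenger never interprets the properties; the app round-trips them.
push = trigger_build({"github_pr": 42, "github_check_run_id": 7})
assert on_status_push(push) == {"pr": 42, "check_run": 7}
assert on_status_push({"results": 0, "properties": {}}) is None
```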
I hope this showcases that buildbot is not aware of how it is being used. This was a deliberate design decision on my part.
Like I said before, just use latent workers on EC2 to do the work and you are in the cloud.
You’ve lost me. How can you advocate for looking into the cloud and then say that some configurations are too complicated to be built in the cloud? And yes, those exotic configurations are indeed the ones benefiting most from my approach. The rest can use EC2 right now, if somebody pays for the latent workers. But as far as I understand, we’re not talking money yet. I still think that I’ve provided a workflow that works for both: exotic configurations, with maybe a limited set of allowed users, and the official pre-merge configuration on Ubuntu, macOS and Windows, or whatever the foundation decides is worth an official release binary.
Why do you care so much about these odd configurations? They don’t have to agree; that’s the beauty in all of this. It’s opt-in. Of course, you could ask what the value is for them.
But what about those odd configs then? How are those supposed to be testable pre-merge? Should they be testable pre-merge? I’d say yes.
Valid point. But we don’t have mandatory pre-merge testing right now, and enforcing it from the get-go is really hard. From a community point of view, we could see my proposal as a transitional tool toward mandatory pre-merge testing: let people opt in and see over time who considers their build config worth testing pre-merge. Then we can come up with real numbers that people have to meet for their config to be considered for mandatory pre-merge testing (e.g. build times). Of course, the foundation can also decide what is needed for mandatory pre-merge checks (e.g. x86 Ubuntu and the like), and that can be executed by a true cloud provider (as you described) from the start.
I think providing user control of when trybuild/on-demand builds run is totally fine, but also just one part of the problem. GitHub draft PRs are great from a workflow perspective. Not only can you control which automated tests run on them, but they also signal to reviewers when a PR is ready for in-depth review.
Sure… but your proposal does break the usability of the buildbot UI as it exists today. Is that what we want? I use that view all the time, and I’m not entirely sure I’d want to switch to viewing the bot history via GitHub’s status. I really want to be able to quickly answer the question “did my latest change fix this bot that I broke?”. Since not every bot will report on every commit and we have loads of bots, the GitHub status-per-commit view will be extremely noisy.
I’m not trying to twist your words. You said, “This requires heavy modifications in buildbot…” when talking about communicating back to GitHub, something that other systems do out of the box. Other solutions can set up non-mandatory pre-merge checks.
Did I say that? I think buildbot does a very effective job as our post-commit CI. I’m not convinced that we should use it for pre-commit CI, but that doesn’t mean I’m not in favor of it.
We do this with Azure Dev Ops pipelines on my team. Azure Dev Ops also supports self-hosted workers.
I can use a hammer to put a screw through a piece of wood, but a screwdriver works better. We shouldn’t force ourselves to build a Rube Goldberg contraption just to avoid a dependency on another tool.
So… I don’t really want to dig into all the technical details of how this was built. You built something cool, and you built it using the tools we already use in the community. I get that.
We can accomplish a similar end result with GitHub Actions that doesn’t require buildbot or a web service acting in the middle. We can support self-hosted machines if we use Azure Pipelines (although there are some limits on what platforms the agent runs on).
I think the key question for the community here is which bot maintainers would sign up to include their bots in this service. If the answer is a few Ubuntu bots… we should just fund a cloud solution. If the answer is bots we can’t host in the cloud, your solution is likely a great starting point.
Does EC2 support RISC-V hosts? How about arm32 or Big Endian PowerPC? Do they have instances running FreeBSD or Arch Linux? What about AIX?
The operating systems supported by EC2 line up pretty well with the operating systems supported by Azure DevOps, Buildkite, GitHub Actions, and Google Cloud Builders. Why not just use those services for cloud scaling and simplify our maintenance?
I’m not advocating for a 100% cloud-based solution. I’m advocating for a mixed approach. We need both a hammer and a screwdriver.
The odd configurations are the ones that break the most, are the hardest to reproduce, and provide huge unique value in our CI. They are also the most resource constrained, so they will be most impacted positively and negatively by your proposal.
I actually think they shouldn’t be. They certainly shouldn’t be part of mandatory pre-merge testing. For example, look at the armv7-2stage builder. This build takes 2-3 hours to run, so today we get trunk verifications 10-12 times per day. If one pull request pulls the bot away for a PR verification, that could be a 10% reduction in trunk monitoring, and could basically mean that someone who breaks the bot doesn’t find out during their work day. That seems like a huge regression in service quality, and it likely means this bot isn’t something anyone would want to run pre-merge (which I think is totally fine, and probably desirable).
Due to the complexity of the test matrix for LLVM/Clang/LLDB/MLIR/… I think we likely will want to come up with a core set of platforms and configurations that are mandatory pre-merge, and a long tail of everything else as post-merge verified.
A trybuild mechanism like yours could fill a gap between the two, but resourcing is a consideration that factors in. Ignoring it or trying to separate it isn’t helpful to us getting to a good solution for the community.
This is only partially correct. We do have mandatory pre-merge testing right now on Phabricator. It is not blocking pre-merge testing, but all commits posted on Phabricator do get tested.
I don’t think we’ve actually made a decision on whether or not moving to PRs means we’ll also block direct pushing to main (my guess is we’ll do that as a separate process change). As long as we allow pushes to main we can’t support fully mandatory pre-merge testing, but we can (and do) have mandatory pre-merge testing without it being blocking today.
I don’t actually agree with this. I might be underestimating our community’s ability to bike shed, but I think if we setup non-blocking mandatory pre-merge checking with PRs and it was accurate enough and fast enough, it becomes an easy sell to the community to switch it to blocking.
I don’t know that it is the foundation’s place to decide what is needed for mandatory blocking pre-merge checks. I think that’s an area that the community gets to decide and the foundation gets to try and help execute. The foundation does get a voice if the foundation’s money is funding it, but there’s a bit of a gray area there.
I mentioned this at the dev meeting when you talked about reducing load on post-merge builders. I don’t think this reduces compute load on post-merge builders; we effectively do this already, since the post-merge builders grab batches of changes at a time.
Having merge queues with branches being verified pre-merge may allow for a higher success rate on post-merge builders, which has benefits like preventing the storm of emails we all get (sometimes for days) after breaking a build or committing too close in time to someone else who broke it.
I don’t see how merge queues change the number of commits or the rate at which commits stream into the post-merge builders. That means regardless of using merge queues many of our post-merge builders will run nearly 24 hours a day keeping up with the inflow of changes.
Just briefly, and as a viewer from the sidelines (upstream glibc maintainer), I wanted to note that I agree with both @beanz and @kwk on a variety of points. Reading the entire thread from the start leaves me feeling that you both agree on a lot of the same things. I was moved to highlight the important points that both of you are making. I want to encourage @kwk to continue exploring the solution space.
@kwk and I have talked for a long time about pre-merge CI for systems engineering, and for glibc it is a topic that I’ve been trying to solve for the upstream community. We have been running non-mandatory pre-merge CI on every patch for 2 years (since May 2021). In glibc our volume is way lower than for llvm, but general concepts still apply (and libm testing could scale up without limit).
I agree with @beanz that pre-merge CI needs are going to be different and it is worth exploring the solution space there. Limiting yourself to buildbot as a choice because you already have buildbots is an artificial design constraint.

In glibc we explored the thinnest possible build and test infrastructure that I expected could be lifted into the cloud, and we avoided everything except basic workload containerization as a framework requirement. Even after two years of experience running pre-merge CI I don’t feel I have enough data to make it mandatory (which also requires some kind of well-tested waiver process). It has caught a lot of silly mistakes and real bugs.

Downstream we have been running post-merge CI/CD into Fedora Rawhide for glibc with a batching interval of 7 days for 3 years, and that catches an even larger number of systems integration issues than we could ever have expected (we break seccomp jails frequently, and the recent ENOSYS vs EPERM seccomp changes were driven by glibc developers pushing for a less fragile architecture). This to me highlights that pre-merge and post-merge are likely to need completely different systems with different goals and different tradeoffs.
I also feel that @kwk has a spectacularly good idea with on-demand pre-merge CI testing. I plan to leverage @kwk’s concepts and ideas in glibc to trigger testing that is in the abstract “expensive.” Many kinds of tests fall into this category, including: those that use cloud infra but cost significant money, those that use special configurations with limited resources (which can include brand new hardware like we’re seeing for RISC-V, or legacy 32-bit compatibility), or anything with a non-trivial cost/value tradeoff. As an example, in glibc we have a bootstrap test that builds all ABI variants, and the single build cost in AWS EC2 for this is ~$20 USD per build at a “speedy” runtime of 2h. That quickly approaches $50K USD a year just for that one test in compute cost. This test should only be triggered if an authorized reviewer identifies something that might have global ABI impact (sometimes hard to quantify), so it should be reviewer triggered.
Pre-merge CI should be as lightweight as possible, covering 90% of cases with common hardware and configurations.
On-demand pre-merge CI solves the testing of difficult to review patches with a human making a cost/value tradeoff to engage additional testing resources.
Post-merge and Pre-merge CI have different cost/value tradeoffs and mixing the two is not something I would recommend.
@kwk Please continue exploring in this space. The ideas you’ve had have helped me make conceptual forward progress in this area. Thank you!
This was already answered (I asked the same thing a while back): there is no mixing of the history, as buildbot will keep a “per branch” history: main will be isolated from these “one off” runs.
I agree this is the way to go, and a natural evolution of the system!
The main elements for me would be 1) zero flakiness and 2) speed. That is, I need to hit merge and never have to come back to it unless my code actually breaks something. I usually don’t care if, after hitting “merge”, the code lands in main with a slight delay.
While technically true, I would argue that good pre-merge testing should also run post-merge with a high priority on “revert to green”: one issue with the Phabricator system is that it was never put in Buildbot, and so it can be broken with no one tracking it.
To your point: that does not mean that we need to use buildbot to orchestrate the pre-merge flow, we can directly use GitHub actions or whatever. However we likely need the exact same runners (and build/test invocation) to also be exercised post-merge (that is, “triggered by buildbot” until we find a suitable similar post-merge system).
Totally agree, and actions can run post-merge too. In fact it is just one extra line in a YAML file.
Or we just have the action run post-merge too, and leverage GitHub Actions for that whole pipeline. That gives us a cloud-based solution that scales with the commits and can track both PR and post-merge builds, flagging failing commits.
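To make the “one extra line” point concrete, here is a minimal sketch of what a shared pre-merge/post-merge workflow could look like. The workflow name, job name, and build invocation are all hypothetical, not the actual LLVM premerge configuration; the point is only that the same job runs on both triggers:

```yaml
# Hypothetical sketch: the same checks run pre-merge (on pull requests)
# and post-merge (on pushes to main). Job contents are illustrative.
name: ci
on:
  pull_request:
  push:
    branches: [main]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure
        run: cmake -G Ninja -S llvm -B build -DLLVM_ENABLE_PROJECTS=clang
      - name: Build and test
        run: ninja -C build check-llvm
```

Adding the `push:` trigger alongside `pull_request:` is essentially the one-line change being described, and both event types share the same job definition.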
Right, but I haven’t yet seen any post-merge CI driven by GH Actions which has the basic capabilities needed (blame list, dashboard, email notification to the blame list, …). Any pointers?
(Buildkite is missing these as well last time I checked)
(I have used post-merge actions, mostly for automating deployment kind of scenario)
I’m not sure where you came up with the required list of capabilities, or the idea that GH Actions doesn’t meet them.
GH Actions can be triggered on every commit individually (which eliminates the need for a blame list), or they can be batch triggered based on setting build pool sizes. If they are batch triggered and fail, all committers since the last success get the failure email (notifications from GitHub Actions are controlled by user notification settings and pipeline configurations).
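One way to approximate that batching behavior in a workflow (a sketch, not the actual setup; the group name is arbitrary) is a concurrency group that cancels superseded runs, so when commits land faster than the builder can keep up, only the newest commit in a burst gets fully verified:

```yaml
# Illustrative sketch: when pushes arrive faster than a run completes,
# older in-flight runs are cancelled, so the builder effectively tests
# batches of commits rather than every commit individually.
name: post-merge
on:
  push:
    branches: [main]
concurrency:
  group: post-merge-main
  cancel-in-progress: true
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ninja -C build check-all  # hypothetical build invocation
```

The tradeoff is the one mentioned above: with batching, a failure implicates every commit since the last green run rather than a single culprit.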
GH Actions publishes the results of the action into the commit history on GitHub’s page. For example, my team’s fork shows our status checks here in the commit view. I don’t see why we would need a separate dashboard or display of the blame list given that it is right there in the commit history. Many projects do post the current CI status in the project README, which we could also do if we wanted.
All of this aside. I’m not actually suggesting we implement and adopt GitHub Actions. My point in bringing cloud solutions up is that they do solve the scalability issues we need for pre-merge testing and Buildbot has significant limitations.
If we want mandatory blocking pre-merge testing, I do not believe it can be built on Buildbot.
I also would really like to hear from the people who manage the current buildbot builders to know how many (and which) builders would opt-in to being available for this style of optional trybuild pre-merge testing. I think knowing which configurations are willing to opt-in plays a huge role in evaluating the utility of a buildbot-based solution for optional pre-merge testing.
I also think that if we want mandatory pre-merge testing, the scalability and speed concerns come to the forefront, and we absolutely need to think about how to solve those problems.