[RFC] AI-assisted Bazel Fixer Bot

Sure, this is not a replacement for experienced engineers looking at Bazel PRs and validating everything is in order. All these AI-suggested changes are subject to PR review by Bazel maintainers and owners who are experienced and can request changes. If a PR needs more iterations based on review feedback, the inexperienced engineer on rotation can own that and will already have something to iterate on. I expect such iterations to be rare based on looking at sample AI fixes. Overall, I think we do end up with time savings that make all of this worth it for us (and, as addressed before, also for other users of Bazel outside of Google).

I’m really happy to have something like this. Fixing the bazel build is usually not difficult, but it’s a source of toil that distracts from doing more interesting work.

To clarify the proposal: this is not automating any commits to main. It is still a manual step for a human to review the changes and merge them. The people reviewing these changes are the same people that currently fix the build manually; this isn’t a request for others to step up and fix the bazel build.

There are also some classes of issues, like Self-contained Headers and Library Layering, that this RFC specifically avoids addressing so that the scope is contained to only touching files in utils/bazel. Commits that violate these policies often only break the bazel build, but the fix would have to touch non-bazel files (e.g. adding #include “…” to some .h file if the header is not self-contained). Those will continue to be fixed manually.

Since changes are limited to Bazel files, it’s harder to come up with scenarios where this would be the case. It wouldn’t be adding code, just fixing the build for existing targets. Do you have any other examples in mind? I guess the AI suggested fix could delete the test rule in the build file as its way of “fixing” the build, but I would hope that would be an obvious mistake that humans notice when deciding to create/accept a PR.

Yeah, this is a valid concern. I think it’s a general LLM issue too. My idealistic perspective is that the engineer should be able to stand behind the change on their own, e.g. defend it with “we need a new cc_library target because of this new CMake file/library” and not “we need a new cc_library target because the AI told me so”. This would be in line with the latest draft of the general LLVM AI policy, but I think it’s an open question on how we ensure that newcomers understand the Bazel changes instead of blindly accepting them. Maybe the bot could leave some notes in the commit message for why it made each change.

I don’t know anything about GitHub logistics/ACLs, but this makes sense to me too. It would eliminate a lot of concerns around what access the fixer bot has. It wouldn’t necessarily be a google/llvm-project fork, since there are non-Google organizations that use the Bazel configs too.

This changes nothing about the community’s obligation to deal w/ bazel failures – specifically the lack of obligation. Please continue to only update cmake files in your PRs (unless you care about bazel builds), and if someone asks you to update the bazel configs, politely tell them no.

3 Likes

Regarding user branches: given the overwhelming support for not creating additional user branches, we are okay with using a fork.

Regarding merging AI patches, @waffles_the_dog raised similar concerns in the above threads. Would a human-in-the-loop approach (a human creating PRs instead of the bot creating them automatically) make you more comfortable, since there’s now someone taking responsibility for the AI output? This would be similar to someone using AI features in their IDE and sending PRs.

Now for some of my own questions:

  • Where can we see the status of things? If I notice the bazel build is broken but there’s no PR created yet, is there a page that shows if it has given up trying to fix it vs if it’s still working on it? If it attempted to fix something but gave up, can I at least see what it attempted / where it got stuck?
  • What’s the playbook for dealing w/ it if something goes wrong? e.g. let’s say it starts generating multiple PRs for the same commit, how do we turn it off?
  • Does “verified (by running local bazel builds)” mean it will verify via bazel build or bazel test? I expect most issues can be verified w/ just bazel build, but bazel test would add more confidence to the verification. Though then we have to worry more about things like test flakiness.
  • Is there a way to annotate these fixes for visibility, e.g. create a new “ai-assisted” GitHub label to tag on the PR? This is more of a general AI-in-LLVM question, as this would apply to other LLVM commits.

This is not a preference, this is a requirement. The author of a PR is responsible for the PR. Generating a PR from a glorified text predictor does not make it good, or even just correct. You are the author; you are responsible for the totality of the PR: correctness, being an actual good patch, etc.

If you want to post a PR, you are putting a burden on reviewers; the person posting that PR is expected to have already done all the correctness review and testing they would be required to do if they had written the PR themselves. We’re already dealing with slop; automating that spam is not reasonable.

If the overhead of taking responsibility for these PRs is considered too high, please consider the cost it places on reviewers.

2 Likes

I have read the work-in-progress AI policy. We are asking for an exception in this RFC because I don’t believe it puts a review burden on the rest of the community. I addressed this in my previous comment here showing the review load of such PRs – except for a few, all PRs were reviewed by the same set of people at Google. To clarify again, we are only changing files in one peripheral component, utils/bazel, and nothing outside of this directory is going to be touched by the automation proposed in this RFC. This is going to be ensured by deterministic, non-AI code, rather than by asking the AI not to do it in its prompt.

Having said that, we are open to dropping this request together with user branch creation, if it helps alleviate other concerns like licensing and ownership of the code generated by the LLM.

I interpreted the exception to be “we’ll largely generate code with LLMs but humans will always be in the loop”. This would already be an exception because larger patches would still put the burden on reviewers to validate, which is a key point in the AI policy.

I don’t think we should open an exception to allow LLM-generated PRs even for low-impact changes within such a separate sub-community. Hallucinations in the PR description, combined with similar effects in code comments, can seriously mislead reviewers.

These changes are not always clear-cut. Rules that depend on rules that affect other rules can muddy the waters enough that even experienced developers miss things, especially when the PR text claims things like “the number of tests is the same”, “all tests pass”, “no tests were removed”.

LLMs often make these wild claims even when it’s evident that this is not the case. When it’s not evident, reviewers are more likely to trust than verify. A human saying those things in a public PR would hopefully have actually checked that they were true beforehand.

We can either expose this on a web page or, given we are going to be using our own fork based on the discussion here, push such things to branches; I’d prefer the latter. These branches can contain a failed attempt, and similarly, a work-in-progress fix can be detected by the presence of a branch in the fork. This is the easiest way to do it. A web page would be nicer, but I’m not sure it would be worth the effort.

We only have to worry about this if we move forward with automatically creating PRs. If we do, the GitHub app installed in the LLVM org has all the control: revoke its permissions and the bot can’t do anything. Another way would be to ping someone who has access to the Bazel Kubernetes cluster to stop the bot.

Verified using bazel build, not bazel test. It’s easy to flip the switch to test, but I think that may complicate things, e.g. a test may be failing due to a non-Bazel breakage. The PR will be running bazelbot anyway, which does the testing, so we should have that covered.

Adding an ai-assisted tag sounds great.

Regarding the AI policy, I would really like to enable this kind of application, personally. I think it’s really a question of who the audience is for these PRs, and whether the Bazel bot is “asking the community to review” these generated PRs. The AI policy is meant to protect the time of reviewers, but if we don’t require community review for such changes, then I’m not sure it should apply here.

@pranavk was clear that the only people reviewing these PRs should be Bazel build stakeholders. In the framing of the policy, we could consider these the “humans in the loop”, and then it becomes more of a question of whether we want to insist on two-party human review.

Maybe the right takeaway is that we need to make our code review policy clearer. For some categories of repetitive, boilerplate changes, we should grant contributors the right to approve and merge generated changes. If the automated change doesn’t require community review, there’s no incentive mismatch. The point of the AI policy is to protect maintainer time, but if we don’t require maintainer approval, then it’s moot and we don’t need to apply it to this category of changes.

LLVM doesn’t have many dependencies (although I glanced at lldb-dap and found a pile of JavaScript with a bunch of NPM dependencies… ew), but it’s extremely common for other projects to have automations (renovate) that auto-upgrade pinned dependencies when CI passes. Sometimes these systems are even fully automated, so they don’t require even one-party review.

I guess that’s where I’ve ended up. I feel like the AI policy makes sense as written if you’re asking for community review. However, LLVM has never made it clear when community review is required, and it’s an existing common practice to make exceptions to the review requirements for automated changes with a limited scope (renovate, llvmgnsyncbot), so we should clarify that policy and cross-reference them.

1 Like

It did happen. @thevinster and I discussed this with @brooksmoses at the LLVM developer meeting as well. We have a translation infrastructure set up to adapt the Bazel builds to Buck, and having the Bazel builds in-tree and regularly maintained is super helpful for us.

This isn’t quite true for Meta’s use cases; Buck itself was originally inspired by Blaze, but the projects have diverged significantly since then. No one on my team has any former Google affiliation, and we’re working purely on LLVM. We still get a lot of value from the Bazel builds though.

2 Likes

To be clear, these are my personal opinions, not any sort of area team or project council position.

I don’t believe that a human-in-the-loop would practically address the copyright concerns because I have no idea how a human would actually go about validating the changes aren’t copyrighted. This goes beyond this Bazel RFC and is a more general concern with our AI policy (both existing and proposed policies).

We rely on a social contract which assumes good faith in the human who wrote the code for a PR. There is ample evidence that LLMs are trained on inputs incompatible with our copyright and that this can lead to copyrighted material being reproduced. An LLM follows no social contract and therefore it’s wrong to assume good faith in its output like there is with human-generated content. So when reviewing LLM-generated code, I believe a human has no realistic way to efficiently determine whether the code is or is not copyrighted elsewhere. If you believe I’m wrong about that, I’d be happy to provide ten code snippets so folks can tell me which ones contain copyrighted code, explain what they did to determine that, and how long it took. Instead, I expect “it built and passed tests so it’s fine” will be the bar in practice because the validation required when a human isn’t the one writing the patch in the first place is too difficult, and that makes me uncomfortable. Again, this is broader than just Bazel, this is for any LLM-generated non-trivial change. Whether Bazel changes are non-trivial or not is probably debatable, but I consider fixing a broken build to be something other than an NFC change.

3 Likes

If we’re talking about LLMs in the abstract, these are valid concerns. LLMs are a black box, and I would be surprised if I were able to pass your test of correctly identifying 10 snippets. I won’t disagree with anything you said.

However, Bazel fixes are utterly boring and mechanical. There isn’t a lot of variance in the fixes that people create. For example, look at https://github.com/pranavk/llvm-project/pull/54/changes (linked from the spreadsheet above). The commit it fixes moves something from the Fuchsia clang-tidy module to the “:misc” build target, so a dep needs to be added for that. You certainly could find a way to do it differently, but I’m reasonably confident that 20 different engineers would produce the exact same fix as that PR. Even a potential detail like me adding “:misc” to the end of the list and you adding it to the beginning is not something that would differ – buildifier (the Bazel formatter) sorts deps lists.
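To make the shape of that kind of fix concrete, here is a minimal BUILD sketch; the target names, paths, and deps are illustrative only (not the actual overlay contents or the contents of that PR):

```starlark
cc_library(
    name = "fuchsia",
    srcs = glob(["fuchsia/*.cpp"]),
    hdrs = glob(["fuchsia/*.h"]),
    deps = [
        ":lib",
        ":misc",  # the whole fix: the moved code now lives in the ":misc" target
        ":utils",
    ],  # buildifier keeps this list sorted, so there is only one place to put it
)
```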

Some less trivial changes are [Bazel] Fixes 9af00e6 by pranavk · Pull Request #31 · pranavk/llvm-project · GitHub or [Bazel] Fixes c9f5734 by pranavk · Pull Request #26 · pranavk/llvm-project · GitHub. But even though those are more complicated than just adding an item to a list, they’re still very mechanical. There might be some variance, like where a new build target is inserted, but IMO that is a trivial detail.

When a fix requires creating a new build target, there’s theoretically freedom to name the build target whatever you want. But in practice we just base it off the CMake library name or the directory name. So even then, people will almost always choose the same thing.
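As a rough sketch of what “creating a new build target” amounts to (the library name, paths, and deps here are made up for illustration, not taken from the actual overlay), a hypothetical new CMake library clangFooAnalysis under clang/lib/FooAnalysis would typically just become:

```starlark
cc_library(
    name = "foo_analysis",  # derived from the CMake library / directory name
    srcs = glob(["lib/FooAnalysis/*.cpp"]),
    hdrs = glob(["include/clang/FooAnalysis/*.h"]),
    includes = ["include"],
    deps = [
        ":ast",
        ":basic",
        "//llvm:Support",
    ],
)
```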

Relevant anecdote: I once created a Bazel fix PR, but when I merged it, GitHub converted it into an empty commit (just the PR message), because someone had submitted an identical commit in the meantime.

I’m still having a hard time picturing this specifically in the context of fixing a bazel build. For making code changes in general, sure. But to remove bazel tests from a PR, you would have to make an obvious change, like removing the build target entirely or adding a tag to skip the test (uncommon; this change would stand out). Also, “all tests pass” would be verified by the bazel fixer bot, as well as by the Bazel precommit CI that runs on changes under utils/bazel.

The first guideline in the BUILD style guide is “Prefer DAMP BUILD files over DRY”, which might explain the Bazel philosophy better, and why I think these concerns aren’t as big as they would be for a general fixer bot. If a build configuration has high complexity, e.g. common deps lists refactored into variables to make it more DRY, then it’s more likely that a subtle change could have a surprising effect, like tests not running. But the vast majority of build file configuration is DAMP, which favors readability at the cost of some verbosity, so it’s more obvious what’s going on when someone makes a change.
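For illustration only (all names below are made up), the DAMP/DRY difference looks roughly like this – the two cc_test rules are alternative spellings of the same target, not two rules in one file:

```starlark
# DAMP: the target spells out its own deps, so a review diff shows exactly
# which target changed and how.
cc_test(
    name = "parser_test",
    srcs = ["ParserTest.cpp"],
    deps = [
        ":lexer",
        ":parser",
        ":test_main",
    ],
)

# DRY: the same target written against a shared variable. Shorter, but a
# one-line edit to COMMON_TEST_DEPS now silently affects every target using it.
COMMON_TEST_DEPS = [
    ":lexer",
    ":parser",
    ":test_main",
]

cc_test(
    name = "parser_test",
    srcs = ["ParserTest.cpp"],
    deps = COMMON_TEST_DEPS,
)
```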

2 Likes

It seems to me like people are fixated on the mechanics of creating an LLVM PR. Generally, creating an LLVM PR implies that it’s available for review, and you want community feedback. However, I think that’s definitely not the intention for these Bazel build fixes.

If we take automated PR creation off the table, does anyone here object to humans using LLMs to create changes, which they then turn into a PR that is auto-merged to main without review? If nobody objects to that pattern, I could imagine a process like the following:

  • Automation detects Bazel build failures
  • Automation creates fixes, pushes to a branch of some other llvm-project git fork, notifies the relevant stakeholders using some mechanism (email, chat ping, creating a PR on the fork)
  • A human Bazel stakeholder adopts the fix, creates a PR with their GitHub account on the main LLVM repo, and merges it once the checks pass. This can be scripted with the GitHub API.

With this workflow, the question is not whether a human is in the loop, it’s a question of whether we need to require two-party review for Bazel build files, and I hope we agree that we shouldn’t require 2p review to fix the build.

The Bazel stakeholder who adopts the change is the author, they own it, they are responsible for ensuring the PR doesn’t contain copyrighted content, just like any other change anyone uploads.

The key difference with this workflow is that now all PR authors are humans.

I think this is marginally less convenient than setting up a role account and github app which directly creates PRs in the llvm-project, but if it makes people happy, reduces policy exceptions, and reduces the number of bots on people’s PR dashboards, maybe it’s worth it.

2 Likes

But then that conflicts with the idea that all commits need a pull request, which is also being discussed.

@rnk this is basically what I was trying to suggest upthread in [RFC] AI-assisted Bazel Fixer Bot - #29 by asb. I think as you say, this is all doable without any exception to the standard guidelines and it has the advantage that it maintains the invariant that all PRs posted to the LLVM repo are posted because a human decided they should be. The person adopting the fix can make a judgement call on whether it can be merged without review by someone else or not, as our current developer policy allows.

See also @nikic’s comment a few posts down after that - I think there’s rough agreement that the two approaches are functionally very similar, and we maybe have different views on what to prefer based on aspects like 1) whether we feel it’s helpful to maintain the invariant (for now) that PRs are always human-directed, or 2) whether having more of the human/bot interaction exposed is important (and probably more). For 2), my personal view is that it’s not very important as long as a human has adopted and is vouching for the work. I actively prefer people not to expose irrelevant history. If there were important things learned while developing the patch, then say them in the patch description rather than making me step through the history. Plus, if people really wanted to expose the back and forth, they could push a PR that contains those commits.

Given these changes are

  1. not adding to the reviewer load for the rest of the community, and
  2. trivial the majority of the time, rarely requiring us to patch them,

waiting for a human to be online to press the “Create a pull request” button seems like an unnecessary hurdle when we could have the PR ready, with all the PR checks finished, by the time a human looks at it. All the comments against this so far seem to ignore (1) or (2).

I also think it’s a more faithful representation of what’s going on (as @nikic mentioned in [RFC] AI-assisted Bazel Fixer Bot - #33 by nikic).

2 Likes

I certainly wasn’t intentionally ignoring anything :slight_smile: I had forgotten that there is a utils/bazel specific pre-commit test flow and I agree that having the PR posted earlier to let that run is a point in favour of having the bot-generated PRs posted directly. So I’ll change my weak negative stance to neutral. Thanks for raising that point.

1 Like

I’ve always felt that going through PRs should be a requirement - people can’t ping the person who landed changes when the changes are pushed directly, such pushed commits have not gone through the various test and validation bots, etc.

Note I’m not saying “trivial build fixes should require review” I’m saying PRs should be the route through which all changes should be landed.

1 Like

The problem here is that text predictors are, by their very nature, incredibly good at making changes that look reasonable but are subtly wrong. They’re like a very new engineer or a summer intern, who quite reasonably make mistakes or errors, but who are at the same time much, much better at hiding those errors.

So far I’ve only encountered text-predictor-produced content in the form of bogus security vulnerability reports against LLVM, and directly in code only once - and I only noticed that because it produced a weird code pattern that triggered a new warning. When looking at that problem, by complete chance the end of the prior function was in the visible section of the window. The section I saw had what was, to me, an obvious error (and a good indication of a new warning I need to add). That made me read the entire function, which then resulted in me having to file a series of security bugs.

As a quality indicator, the issue that originally had me looking at the code was it doing what was meant to be a page allocation via posix_memalign - “correct”, but not what the code should have been doing. You might say “well, reviewers should have caught this”, but again the problem is that text predictors are designed to create plausible-looking code. If some of the code does something outside of what the “author” or reviewer usually does, and the code appears reasonable - posix_memalign absolutely will return you page-aligned memory if you ask it to - it slips through; the only reason anyone found out about it was an independent mechanism that identified the weird use of posix_memalign. If that had not been detected, the completely independent vulnerabilities in the generated code would have survived undetected.

I think if we are going to allow this kind of patch, the PRs need to contain the name of the predictor used, the exact version of the predictor, the input to that predictor, and ideally a link to a reproducible version of that query and output.

The reason for the last point is related to the “plausible looking but incorrect” code these predictors create. If we are getting plausible-looking changes that claim to be from a predictor, it becomes trivial to start inserting subtle malicious code into those PRs, and if they are noticed, the author can just blame the predictor to disguise ill intent. I recognize that this is also hard: again, the predictors are designed specifically to present the content they generate as if thought or understanding was involved, and randomness is intentionally introduced for the specific purpose of preventing reproducible output (not out of malice, just by design).

A PR model that permits code generated by text predictors - which are already extremely well documented as producing clearly nonsensical output (“making things up”, misleadingly labelled as hallucinations) and doing the wrong thing, while also being designed specifically to make whatever they produce seem reasonable - needs more robust protections than simply relying on author honesty and review.

1 Like

I want to start out by saying I agree that LLMs are good at producing output that looks reasonable with no regard for the actual correctness of that output. That’s exactly what they’re trained to produce.

However, I don’t think this point is as relevant for this specific proposal. For compiler work (and low-level systems work in general), correctness on a multitude of fronts is paramount and there can be extremely subtle issues. But I do not think that is the case here. We’re talking specifically about patches that do build rule updates: either adding/removing dependencies on existing build targets or adding new build targets as needed. These changes in themselves should be quite unlikely to introduce subtle correctness issues and almost certainly not security issues. Any obvious issues with the build will be caught by tooling outside of the LLM that runs the build and tests. Any security issues would also likely be out of scope, as the colloquial LLVM threat model (which I do not think has been formalized yet) does not consider adversarial attacks on tools like clang to be in scope.

We want to make the process as transparent as possible (e.g., upstreaming all the code). But I don’t think all of these requests are possible exactly as stated, given how current LLM APIs work. I’m not familiar with the exact specifics of how these APIs work internally, but it’s generally not possible to get a full guarantee that the model you’re querying will be exactly the same despite specifying the same name (e.g., if a different quantization is used under the hood). Results are also typically non-deterministic between API calls, purposefully, due to non-zero temperatures, and setting the temperature to zero typically degrades LLM output. We want to make this as reproducible as possible, and all prompts/queries/model names will be in the open in the code, but we can’t guarantee token-for-token deterministic output.

I don’t think ensuring that the results from the LLM/agent are verifiable significantly mitigates this threat model. The author could presumably also just claim, in most cases, that they didn’t understand the implications of such a choice and achieve the same effect.

1 Like